Somatic Variant Calling: Mutect2 Tumor-Normal

BWA-MEM alignment · BQSR · Mutect2 tumor-normal · Panel of Normals · contamination estimation · FilterMutectCalls · COSMIC annotation · mutational signatures

~180 min Advanced Bash / Python / R
Byte

Overview

Somatic variant calling identifies mutations that arose in a tumor but are absent from the patient’s normal tissue — these are the driver and passenger mutations of cancer. Unlike germline calling (where allele frequency is ~50% or ~100%), somatic mutations have variable allele frequency (often 5–40%) because tumors are mixtures of normal cells, tumor clones, and sub-clones. Mutect2 is the GATK gold-standard for this task.

Byte
Somatic vs germline calling — key differences:
  • Allele frequency: Germline het = ~50% AF; Somatic can be 3–50% (depends on tumor purity and clonality)
  • Matched normal: Required to subtract germline variants and sequencing artifacts
  • Panel of Normals (PoN): Collection of normal samples used to filter recurrent technical artifacts
  • Orientation bias: FFPE-preserved tumor DNA has characteristic OxoG and FFPE artifacts that must be corrected
  • Tumor purity: Must estimate what fraction of sequenced cells are tumor vs contaminating normal

Dataset

We use the TCGA-A8-A08B breast cancer tumor-normal pair from TCGA — one of the most widely used somatic variant calling benchmark datasets, with validated mutations from multiple orthogonal platforms.

PropertyValue
Cancer typeBreast invasive carcinoma (BRCA)
SampleTCGA-A8-A08B (tumor + matched blood normal)
AccessionGDC Data Portal: TCGA-A8-A08B
SequencingWES (whole exome sequencing), ~100× tumor, ~50× normal
ReferenceGRCh38 / hg38
Expected mutations~200–500 somatic SNVs (typical breast cancer WES)
Dominant signatureCOSMIC SBS1 (deamination of 5-methylcytosine) + SBS3 (BRCA1/2 HR deficiency)

Step 1 — Alignment & Base Quality Score Recalibration

Bash — BWA-MEM alignment, duplicate marking, BQSR
######################################################################
## STEP 1: Align reads and prepare analysis-ready BAM files
## The GATK best practices pre-processing pipeline applies to BOTH
## the tumor and the matched normal sample identically.
## Key steps: align → sort → mark duplicates → BQSR → index
######################################################################

module load bwa/0.7.17
module load gatk/4.4.0
module load samtools/1.17
module load picard/3.0.0

REF="/path/to/hg38/Homo_sapiens_assembly38.fasta"
DBSNP="/path/to/hg38/dbsnp_146.hg38.vcf.gz"
MILLS="/path/to/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"

## Process both TUMOR and NORMAL samples with the same pipeline
for SAMPLE in TUMOR NORMAL; do

    R1="0_raw/${SAMPLE}_R1.fastq.gz"
    R2="0_raw/${SAMPLE}_R2.fastq.gz"
    SM="${SAMPLE}"   # sample name for read group tags

    ## --- Step 1a: BWA-MEM alignment with read group tags ---
    ## Read groups (RG) are MANDATORY for GATK tools
    ## @RG tags: ID=run identifier, SM=sample name, PL=platform, PU=platform unit
    ## -R : read group string in SAM/BAM format
    ## -t 16 : 16 threads
    bwa mem \
        -t 16 \
        -R "@RG\tID:${SM}\tSM:${SM}\tPL:ILLUMINA\tPU:${SM}.1" \
        ${REF} ${R1} ${R2} | \
      samtools sort \
        -@ 8 \
        -O BAM \
        -o 1_aligned/${SM}_sorted.bam
    samtools index 1_aligned/${SM}_sorted.bam

    ## --- Step 1b: Mark duplicates with Picard MarkDuplicates ---
    ## For somatic calling: DO NOT remove duplicates (just mark them)
    ## Mutect2 is aware of duplicates and uses them to detect strand artifacts
    ## REMOVE_DUPLICATES=false : only mark, not remove
    ## CREATE_INDEX=true : automatically create BAM index
    picard MarkDuplicates \
        I=1_aligned/${SM}_sorted.bam \
        O=1_aligned/${SM}_markdup.bam \
        M=1_aligned/${SM}_markdup_metrics.txt \
        REMOVE_DUPLICATES=false \
        CREATE_INDEX=true \
        VALIDATION_STRINGENCY=LENIENT

    ## --- Step 1c: Base Quality Score Recalibration (BQSR) ---
    ## BQSR corrects systematic errors in the base quality scores from the sequencer
    ## Step 1: BaseRecalibrator builds a recalibration table
    ##   --known-sites : known variant sites (used as true positives to exclude from error model)
    ##   These known SNPs/indels should NOT be flagged as sequencing errors
    gatk BaseRecalibrator \
        -I 1_aligned/${SM}_markdup.bam \
        -R ${REF} \
        --known-sites ${DBSNP} \
        --known-sites ${MILLS} \
        -O 1_aligned/${SM}_recal.table

    ## Step 2: ApplyBQSR applies the recalibration table to the BAM
    ## --bqsr-recal-file : the recalibration table from step above
    gatk ApplyBQSR \
        -I 1_aligned/${SM}_markdup.bam \
        -R ${REF} \
        --bqsr-recal-file 1_aligned/${SM}_recal.table \
        -O 1_aligned/${SM}_final.bam

    echo "Pre-processing complete: ${SM}"
    samtools flagstat 1_aligned/${SM}_final.bam
done
Download:

Step 2 — Panel of Normals (PoN)

Bash — Build a Panel of Normals to filter technical artifacts
######################################################################
## STEP 2: Create a Panel of Normals (PoN)
## A PoN is a VCF containing variant sites that appear in multiple
## normal (non-tumor) samples. These are either:
##   1. Germline variants that are common in the population
##   2. Recurrent sequencing/alignment artifacts at specific sites
## Variants in the PoN are filtered OUT of somatic calls.
## The GATK recommends building a PoN from ≥40 normal samples;
## here we use the GATK public hg38 PoN as an alternative.
######################################################################

## --- Option A: Use GATK public PoN (recommended for most users) ---
## Download the GATK4 Mutect2 PoN for hg38 from GATK resource bundle
gsutil cp \
  gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz \
  0_raw/1000g_pon.hg38.vcf.gz
gsutil cp \
  gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz.tbi \
  0_raw/1000g_pon.hg38.vcf.gz.tbi

PON="0_raw/1000g_pon.hg38.vcf.gz"

## --- Option B: Build your own PoN from normal samples ---
## If you have multiple normal samples from the same sequencing protocol,
## building a custom PoN captures protocol-specific artifacts better.
## Step 1: Run Mutect2 in tumor-only mode on each normal sample
for NORMAL_SAMPLE in normal1 normal2 normal3; do  # replace with your normals
    gatk Mutect2 \
        -I 1_aligned/${NORMAL_SAMPLE}_final.bam \
        -R ${REF} \
        --max-mnp-distance 0 \     # important: disable MNP merging for PoN creation
        -O 2_pon/${NORMAL_SAMPLE}_for_pon.vcf.gz
done

## Step 2: Combine all normal VCFs into a GenomicsDB workspace
gatk GenomicsDBImport \
    -R ${REF} \
    -L 0_raw/exome_capture_targets.interval_list \
    --genomicsdb-workspace-path 2_pon/pon_db \
    $(for N in normal1 normal2 normal3; do echo "-V 2_pon/${N}_for_pon.vcf.gz"; done)

## Step 3: Create the PoN VCF from the GenomicsDB
gatk CreateSomaticPanelOfNormals \
    -R ${REF} \
    -V gendb://2_pon/pon_db \
    -O 2_pon/custom_pon.vcf.gz

echo "Panel of Normals ready: ${PON}"
Download:
Panel of Normals decoded
📌Why use a PoN?Even with a matched normal, some recurrent technical artifacts appear in both tumor and normal at low allele frequency. These could be mapping artifacts at highly repetitive regions, systematic oxidative damage from FFPE preservation, or sequencer-specific noise. The PoN captures these recurrent artifacts from many normal samples and flags them for removal from somatic calls.
🎯PoN size mattersGATK recommends ≥40 normal samples for a PoN, sequenced with the same protocol (same exome capture kit, same sequencer). Fewer normals = lower sensitivity for artifact detection. Using a PoN from a different protocol can introduce false negatives (filtering real mutations that happen to overlap artifact sites in your PoN).

Step 3 — Mutect2 Tumor-Normal Calling

Bash — Mutect2 somatic variant calling with matched normal
######################################################################
## STEP 3: Run Mutect2 in tumor-normal mode
## Mutect2 uses a Bayesian model to call variants that are present
## in the tumor but absent (or at very low frequency) in the normal.
## It simultaneously filters germline variants using the matched normal
## and artifact sites using the Panel of Normals.
######################################################################

REF="/path/to/hg38/Homo_sapiens_assembly38.fasta"
GNOMAD="/path/to/hg38/af-only-gnomad.hg38.vcf.gz"  # germline variant allele frequencies
PON="0_raw/1000g_pon.hg38.vcf.gz"

## --- Run Mutect2 ---
## -I TUMOR_BAM  + --tumor-sample   : tumor BAM and sample name
## -I NORMAL_BAM + --normal-sample  : matched normal BAM and sample name
##   Sample names must match the SM tag in the BAM @RG header
## --germline-resource : gnomAD allele frequency database
##   Used to calculate prior probability that a variant is germline
##   Sites with high gnomAD AF are more likely germline → assigned lower somatic score
## --panel-of-normals : PoN VCF; variants appearing in PoN are flagged
## --af-of-alleles-not-in-resource : prior AF for sites NOT in gnomAD
##   0.0000025 ≈ 1/(2 × 200,000 diploid genomes) — default for gnomAD v3
## -L : limit to exome capture regions (much faster; avoids off-target noise)
## --f1r2-tar-gz : output file for orientation bias model (needed for Step 4)
## --tmp-dir : temp directory (use a fast local disk)
gatk Mutect2 \
    -I 1_aligned/TUMOR_final.bam \
    --tumor-sample TUMOR \
    -I 1_aligned/NORMAL_final.bam \
    --normal-sample NORMAL \
    -R ${REF} \
    --germline-resource ${GNOMAD} \
    --panel-of-normals ${PON} \
    --af-of-alleles-not-in-resource 0.0000025 \
    -L 0_raw/exome_capture_targets.interval_list \
    -O 3_calls/tumor_normal_raw.vcf.gz \
    --f1r2-tar-gz 3_calls/f1r2.tar.gz \
    --tmp-dir /tmp/ \
    -nct 16

echo "Raw somatic calls: $(bcftools view -H 3_calls/tumor_normal_raw.vcf.gz | wc -l)"

## --- Get pileup summaries for contamination estimation ---
## GetPileupSummaries: counts alleles at known variant sites for each sample
## Used by CalculateContamination (Step 4)
for SAMPLE in TUMOR NORMAL; do
    gatk GetPileupSummaries \
        -I 1_aligned/${SAMPLE}_final.bam \
        -V ${GNOMAD} \
        -L ${GNOMAD} \
        -O 3_calls/${SAMPLE}_pileup.table
done
Download:

Step 4 — Contamination Estimation & FilterMutectCalls

Bash — LearnReadOrientationModel, CalculateContamination, FilterMutectCalls
######################################################################
## STEP 4: Estimate contamination and orientation bias, then filter
## This is the GATK best practice filtering pipeline for somatic calls.
## Three filtering steps:
##   1. LearnReadOrientationModel : learn FFPE/OxoG artifact model
##   2. CalculateContamination    : estimate cross-sample contamination
##   3. FilterMutectCalls         : apply all filters and tag PASS/FAIL
######################################################################

## --- Step 4a: Learn read orientation (OxoG / FFPE artifact) model ---
## LearnReadOrientationModel: uses the f1r2 counts from Mutect2
## to build a model of orientation bias artifacts.
## OxoG artifact: G→T (or C→A) mutations only on the read strand facing
## the oxidized guanine — distinguishable from true mutations
## FFPE artifact: C→T mutations from cytosine deamination during fixation
gatk LearnReadOrientationModel \
    -I 3_calls/f1r2.tar.gz \
    -O 3_calls/read_orientation_model.tar.gz

## --- Step 4b: Calculate cross-sample contamination ---
## CalculateContamination: estimates what fraction of the tumor sample
## is contaminated with DNA from another individual
## Cross-contamination creates false positives (the contaminant's germline
## variants appear as low-AF "somatic" mutations)
## -matched-normal : use matched normal to improve contamination estimate
gatk CalculateContamination \
    -I 3_calls/TUMOR_pileup.table \
    -matched-normal 3_calls/NORMAL_pileup.table \
    -O 3_calls/contamination.table \
    --tumor-segmentation 3_calls/segments.table

cat 3_calls/contamination.table
## contamination column: should be < 0.05 (< 5%) for clean samples

## --- Step 4c: FilterMutectCalls ---
## Applies all filters to the raw Mutect2 calls:
##   1. Orientation bias filter (using the read orientation model)
##   2. Contamination filter (removes variants explained by contamination)
##   3. Panel of Normals filter
##   4. Read support filters (min reads, strand bias, etc.)
## PASS variants = somatic SNV/indel candidates passing all filters
gatk FilterMutectCalls \
    -R ${REF} \
    -V 3_calls/tumor_normal_raw.vcf.gz \
    --contamination-table 3_calls/contamination.table \
    --tumor-segmentation 3_calls/segments.table \
    --ob-priors 3_calls/read_orientation_model.tar.gz \
    -O 3_calls/tumor_normal_filtered.vcf.gz

## Count PASS variants
echo "PASS somatic variants:"
bcftools view -f PASS 3_calls/tumor_normal_filtered.vcf.gz | bcftools stats | grep "^SN"

## Extract PASS variants to a clean VCF
bcftools view -f PASS 3_calls/tumor_normal_filtered.vcf.gz \
    -O z -o 3_calls/tumor_normal_PASS.vcf.gz
bcftools index --tbi 3_calls/tumor_normal_PASS.vcf.gz
Download:
🎯
Typical PASS rate: Mutect2 typically calls 20,000–100,000 raw variants before filtering. After FilterMutectCalls, a typical WES tumor should yield 200–2,000 PASS somatic variants (depending on tumor mutation burden). If you get <100 PASS variants, check contamination estimate and PoN; if >10,000 PASS variants, check for FFPE artifacts and ensure matched normal was used.

Step 5 — Variant Annotation with Ensembl VEP

Bash — Annotate somatic variants with VEP + COSMIC
######################################################################
## STEP 5: Annotate somatic variants with Ensembl VEP
## VEP assigns each variant:
##   - Gene name and Ensembl gene/transcript ID
##   - Consequence (missense, nonsense, splice site, frameshift, etc.)
##   - Protein change (e.g. p.Val600Glu for BRAF V600E)
##   - Population frequency in gnomAD
##   - COSMIC cancer mutation database annotation
##   - ClinVar pathogenicity annotation
##   - SIFT/PolyPhen functional prediction scores
######################################################################

module load vep/108

## --- Run VEP annotation ---
## --input_file   : input VCF (filtered PASS variants)
## --output_file  : annotated VCF output
## --vcf          : output in VCF format (adds INFO fields)
## --everything   : enable all annotation features
##   (same as combining: --sift b --polyphen b --ccds --hgvs --symbol etc.)
## --offline      : use local VEP cache (much faster than online API)
## --cache_version 108 : VEP database version
## --assembly GRCh38 : genome assembly version
## --fork 8       : run 8 parallel threads
## --pick          : pick one canonical transcript consequence per variant
##   (reduces output size; use --per_gene for all transcripts)
## --custom COSMIC.vcf.gz,COSMIC,vcf,exact,0 :
##   add COSMIC annotations from a local VCF file
##   "exact" = require exact position + allele match
vep \
    --input_file 3_calls/tumor_normal_PASS.vcf.gz \
    --output_file 4_annotated/tumor_normal_vep.vcf.gz \
    --vcf \
    --compress_output bgzip \
    --everything \
    --offline \
    --cache_version 108 \
    --assembly GRCh38 \
    --fork 8 \
    --pick \
    --custom /path/to/CosmicCodingMuts.vcf.gz,COSMIC,vcf,exact,0 \
    --stats_file 4_annotated/vep_stats.html

## --- Parse annotation to a table ---
bcftools +split-vep \
    4_annotated/tumor_normal_vep.vcf.gz \
    -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Consequence\t%HGVSp\t%SIFT\t%PolyPhen\t%gnomADe_AF\t%COSMIC\n' \
    -d > 4_annotated/somatic_mutations_annotated.tsv

echo "Annotated variants: $(wc -l < 4_annotated/somatic_mutations_annotated.tsv)"
echo "Check vep_stats.html for annotation summary"
Download:

Step 6 — Mutational Signature Analysis

Python — SigProfilerExtractor mutational signatures
######################################################################
## STEP 6: Decompose somatic mutations into COSMIC mutational signatures
## Mutational signatures = patterns of base substitutions that reflect
## the biological processes that generated those mutations.
## Examples:
##   SBS1  = clock-like deamination (age)
##   SBS4  = tobacco/smoking (C>A at GpCpX)
##   SBS7a/b = UV exposure (CC>TT dipyrimidines)
##   SBS3  = HR deficiency (BRCA1/2 mutations) — enriched in breast cancer
######################################################################

## Install: pip install SigProfilerExtractor
## Also install reference genome: python -c "from SigProfilerMatrixGenerator import install; install.install('GRCh38')"

from SigProfilerExtractor import sigpro as sig
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
import pandas as pd
import matplotlib.pyplot as plt

## --- Step 6a: Generate the SBS96 mutation count matrix ---
## SBS96 = 96 trinucleotide context categories for single-base substitutions
## Each mutation is described by: REF base (with 5' and 3' context) + ALT base
## e.g. A[C>T]G = a C→T substitution where the 5' base is A and 3' base is G
path_to_vcf = "3_calls/"   # directory containing the PASS VCF file

## SigProfilerMatrixGenerator reads all VCFs in the directory
matGen.SigProfilerMatrixGeneratorFunc(
    "TCGA_BRCA",       # project name
    "GRCh38",          # genome assembly
    path_to_vcf,       # directory with VCF files
    plot=True,         # create spectrum plots
    exome=True         # restrict to exome (our data is WES)
)
## Output: SBS96 matrix in path_to_vcf/output/SBS/

## --- Step 6b: Extract de novo signatures ---
## SigProfilerExtractor decomposes the SBS96 matrix into N signatures
## by NMF (non-negative matrix factorization)
## Then compares to COSMIC reference signatures
sig.sigProfilerExtractor(
    input_type    = "matrix",       # input is an SBS96 matrix file
    output        = "5_signatures/",
    input_data    = "3_calls/output/SBS/TCGA_BRCA.SBS96.all",
    reference_genome = "GRCh38",
    minimum_signatures = 1,
    maximum_signatures = 5,        # test 1 to 5 de novo signatures
    nmf_replicates    = 100,       # 100 NMF runs for stability assessment
    cpu               = 4
)

## --- Step 6c: Read and plot signature contributions ---
## Load the decomposed signature proportions
sig_activities = pd.read_csv("5_signatures/SBS96/Suggested_Solution/COSMIC_SBS96_Activities/Activities_refit.txt",
                              sep="\t", index_col=0)

## Plot signature contributions as a stacked bar chart
ax = sig_activities.T.plot(kind="bar", stacked=True, figsize=(10, 5),
                            colormap="Set2")
ax.set_xlabel("Sample")
ax.set_ylabel("Number of mutations attributed to signature")
ax.set_title("Mutational signature decomposition (COSMIC SBS)")
ax.legend(bbox_to_anchor=(1.01, 1), loc="upper left", fontsize=8)
plt.tight_layout()
plt.savefig("5_signatures/signature_contributions.png", dpi=150)
print(sig_activities)
Download:
🌊
Expected BRCA result: The TCGA-A8-A08B breast tumor should show a mixture of SBS1 (age-related clock-like deamination — almost every tumor has this), SBS3 (BRCA1/2 homologous recombination deficiency), and possibly SBS5 (another clock-like signature). Detecting SBS3 suggests this tumour has impaired DNA repair, which predicts sensitivity to PARP inhibitors — a clinically actionable finding.