Population genomics variant calling on four Arabidopsis halleri populations using the full GATK4 BQSR + HaplotypeCaller + joint genotyping pipeline.
The GATK Best Practices workflow for germline short variant discovery is the gold standard for SNP and indel calling from short-read DNA sequencing data. This tutorial follows the workflow exactly as described for diploid organisms.
GATK best practices workflow overview. Image from the Bioinformatics Workbook (Arun Seetharam).
GATK workflows generate many intermediate files. Strict folder discipline prevents overwriting and makes debugging easier.
Four populations of Arabidopsis halleri subsp. halleri sequenced to estimate genomic diversity and population differentiation (Fischer et al., 2016). Reference genome: A. halleri Ahal2.2 (Ensembl Plants).
| Sample | ENA ID | Read Length | Insert Size | Total Bases |
|---|---|---|---|---|
| Aha18 | ERR1760144 | 202 bp | 250 bp | 21.4 Gb |
| AhaN1 | ERR1760145 | 200 bp | 150 bp | 30.1 Gb |
| AhaN3 | ERR1760146 | 200 bp | 150 bp | 29.6 Gb |
| AhaN4 | ERR1760147 | 200 bp | 150 bp | 31.2 Gb |
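A quick sanity check before downloading: expected coverage is total bases divided by assembly length. The 200 Mb genome size below is a placeholder, not the real Ahal2.2 figure — sum column 2 of the `.fai` index (built later) for the exact value.

```shell
# Back-of-envelope per-sample coverage: total bases / assembly length.
# GENOME_BP is a placeholder; replace with the true assembly size.
GENOME_BP=200000000
for entry in "Aha18 21.4" "AhaN1 30.1" "AhaN3 29.6" "AhaN4 31.2"; do
  set -- $entry   # $1 = sample name, $2 = total Gb from the table above
  awk -v s="$1" -v gb="$2" -v g="$GENOME_BP" \
      'BEGIN { printf "%s: ~%.0fx coverage\n", s, gb * 1e9 / g }'
done
```

At ~100–150× per sample, all four libraries are comfortably above the ~20–30× typically needed for confident diploid genotyping.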
mkdir -p {0_index,1_data,2_fastqc,3_pre-processing,4_gatk-round-1,5_recalibration,6_gatk_round-2,7_filtering}
# Download reads from ENA (faster than SRA)
cd 1_data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/004/ERR1760144/ERR1760144_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/004/ERR1760144/ERR1760144_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/005/ERR1760145/ERR1760145_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/005/ERR1760145/ERR1760145_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/006/ERR1760146/ERR1760146_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/006/ERR1760146/ERR1760146_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/007/ERR1760147/ERR1760147_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/007/ERR1760147/ERR1760147_2.fastq.gz
cd ..
# Download A. halleri reference genome
cd 0_index
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-43/fasta/arabidopsis_halleri/dna/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa.gz
gunzip Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa.gz
cd ..
| 📁mkdir -p {0_index,...} | Brace expansion creates all 8 directories in one command. The -p flag means "no error if they already exist". Running this at the start ensures your pipeline never fails due to a missing output folder. |
| 🔒samtools faidx | Creates a .fai index of the reference FASTA. GATK requires this to quickly jump to any genomic coordinate. Think of it as adding page numbers to a book. |
| 📖gatk CreateSequenceDictionary | Creates a .dict file listing all chromosomes and their lengths. GATK uses this as a "table of contents" to validate that your BAM and reference have matching chromosomes before running expensive analyses. |
| 🗺️bwa index | Builds a BWA index of the reference (the drop-in replacement bwa-mem2 is faster if your cluster has it). Like hisat2-build but for DNA — required before alignment. Takes ~15 min for a plant-sized genome. |
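The `.fai` index that `samtools faidx` produces is just a five-column text table: sequence name, length, byte offset, bases per line, bytes per line. A mock entry (scaffold names and numbers invented here) shows how to recover the total assembly length from it:

```shell
# Mock .fai with two invented scaffolds; the real file has one row per sequence
printf 'scaffold_1\t33132539\t12\t80\t81\nscaffold_2\t19320864\t33546707\t80\t81\n' > mock.fai
# Assembly length = sum of the length column (column 2)
awk -F'\t' '{ total += $2 } END { print total " bp total" }' mock.fai
```

Run the same awk one-liner on the real `.fai` to get the exact genome size for coverage calculations.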
cd 2_fastqc
for fq in ../1_data/*.fastq.gz; do
ln -s "$fq" .
done
module load parallel
module load fastqc
parallel "fastqc {}" ::: *.fastq.gz
cd ..
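If GNU parallel is not installed on your cluster, `xargs -P` gives the same fan-out with no extra module; a sketch (swap the `echo` for `fastqc {}` in real use):

```shell
# Run up to 4 jobs concurrently over a list of files.
# The filenames here are just the ENA names from the download step.
printf '%s\n' ERR1760144_1.fastq.gz ERR1760144_2.fastq.gz ERR1760145_1.fastq.gz \
  | xargs -P 4 -I{} echo "would run: fastqc {}"
```

`-P 4` caps concurrency at four processes and `-I{}` substitutes each input line into the command, mirroring parallel's `{}` placeholder.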
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=48:00:00
#SBATCH --job-name=bwa_align
#SBATCH --output=bwa_align.%j.out
#SBATCH --error=bwa_align.%j.err
module load bwa
module load samtools
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
OUTDIR="3_pre-processing"
# Index the reference (run once)
bwa index ${REF}
samtools faidx ${REF}
# Declare sample metadata (name, ERR ID)
declare -A SAMPLES=(
[Aha18]=ERR1760144
[AhaN1]=ERR1760145
[AhaN3]=ERR1760146
[AhaN4]=ERR1760147
)
for SAMPLE in "${!SAMPLES[@]}"; do
ERR="${SAMPLES[$SAMPLE]}"
R1="1_data/${ERR}_1.fastq.gz"
R2="1_data/${ERR}_2.fastq.gz"
echo "Aligning: ${SAMPLE} (${ERR})"
# Read Group tag is CRITICAL for GATK - identifies sample in multi-sample VCF
RG="@RG\tID:${ERR}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib1\tPU:${ERR}.1"
bwa mem -t 16 \
-R "${RG}" \
${REF} ${R1} ${R2} \
| samtools view -b - \
| samtools sort -o ${OUTDIR}/${SAMPLE}.sorted.bam -
samtools index ${OUTDIR}/${SAMPLE}.sorted.bam
done
| 🧵-t 16 | Use 16 CPU threads. BWA-MEM scales well — 16 threads is ~12× faster than single-threaded for these large WGS datasets. |
| 🏷️-R "@RG\t..." | Read Group tag — the most important GATK requirement. It embeds sample metadata into every alignment record. SM:=sample name (appears in VCF header), ID:=run ID, PL:=platform (ILLUMINA), LB:=library. GATK will refuse to run HaplotypeCaller without these tags. |
| 🔀pipe to samtools | BWA writes to stdout, which we pipe directly to samtools view (SAM→BAM) then samtools sort. This saves ~15 GB of temporary SAM disk space per sample by never writing the SAM at all. |
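One subtlety worth checking: the `\t` in the `-R` string is a literal backslash-t in the shell; bwa itself expands it into real tabs when writing the BAM header. `printf '%b'` mimics that expansion, which lets you verify the tag before submitting a 48-hour job:

```shell
# Build the read-group string exactly as the alignment loop does
SAMPLE=Aha18
ERR=ERR1760144
RG="@RG\tID:${ERR}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib1\tPU:${ERR}.1"
# %b expands \t into real tabs, as bwa does internally
printf '%b\n' "$RG" | awk -F'\t' '{ print NF " tab-separated fields" }'
printf '%b\n' "$RG" | tr '\t' '\n' | grep "^SM:"
```

A well-formed tag splits into 6 fields (@RG plus ID, SM, PL, LB, PU), and the `SM:` line should show the sample name you expect in the final VCF.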
@RG\tID:ERR1760144\tSM:Aha18\tPL:ILLUMINA\tLB:Aha18_lib1\tPU:ERR1760144.1 — think of it as a barcode label on every read saying "I came from sample Aha18, sequenced on Illumina, in run ERR1760144". Without @RG tags, GATK can't distinguish which sample a read belongs to during joint genotyping, and HaplotypeCaller will crash or produce incorrect results. The SM: tag becomes the sample name in the final VCF; the ID: tag uniquely identifies each sequencing run. Always add read groups at the alignment step.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --job-name=markdup
#SBATCH --output=markdup.%j.out
#SBATCH --mem=32G
module load picard
module load samtools
INDIR="3_pre-processing"
for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
echo "Marking duplicates: ${SAMPLE}"
picard MarkDuplicates \
INPUT=${INDIR}/${SAMPLE}.sorted.bam \
OUTPUT=${INDIR}/${SAMPLE}.dedup.bam \
METRICS_FILE=${INDIR}/${SAMPLE}.dup_metrics.txt \
REMOVE_DUPLICATES=false \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=SILENT
samtools index ${INDIR}/${SAMPLE}.dedup.bam
echo " Duplication metrics:"
grep -A2 "ESTIMATED_LIBRARY_SIZE" ${INDIR}/${SAMPLE}.dup_metrics.txt | tail -2
done
| 📥INPUT / OUTPUT | Input sorted BAM and output deduplicated BAM. The output BAM keeps all reads — duplicates are flagged (not deleted). GATK will automatically ignore flagged duplicates during variant calling. |
| 📊METRICS_FILE | Writes a duplication statistics file. Open this to see your library complexity — high duplication (>50%) means low-complexity library or over-sequencing, not a pipeline error. |
| 🚫REMOVE_DUPLICATES=false | Keeps duplicates in the BAM with a flag bit set rather than physically removing them. This is the GATK-recommended approach — it preserves the BAM for other analyses and lets you change the duplicate-handling strategy later. |
| ✅ASSUME_SORTED=true | Skips the sort check (we already sorted with samtools). Speeds up processing by ~20%. |
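"Flagged, not deleted" means a marked duplicate keeps all its data; only SAM flag bit 0x400 (decimal 1024) is set, and GATK skips reads carrying it. The bit test is plain integer arithmetic, sketched here in bash with invented flag values:

```shell
# Returns "duplicate" if SAM flag bit 0x400 (1024) is set, "unique" otherwise
is_duplicate() { (( $1 & 1024 )) && echo duplicate || echo unique; }
is_duplicate 1187   # 163 + 1024: a properly-paired read flagged as duplicate
is_duplicate 163    # the same pairing flags without the duplicate bit
```

The same logic is what `samtools view -F 1024` uses to exclude duplicates when you want a deduplicated view of the BAM.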
BQSR corrects systematic errors in base quality scores assigned by the sequencer. It requires a set of known variant sites to identify non-variant positions for modeling.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=48:00:00
#SBATCH --job-name=bqsr
#SBATCH --output=bqsr.%j.out
#SBATCH --mem=32G
module load gatk/4.2.0.0
module load samtools
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
INDIR="3_pre-processing"
OUTDIR="5_recalibration"
# Create GATK sequence dictionary (required)
gatk CreateSequenceDictionary -R ${REF}
# NOTE: For A. halleri, download known SNPs from Ensembl:
# wget http://ftp.ensemblgenomes.org/pub/plants/release-43/vcf/arabidopsis_halleri/Arabidopsis_halleri.vcf.gz
# The known-sites VCF must be indexed first: gatk IndexFeatureFile -I <vcf>
KNOWN_SITES="0_index/Arabidopsis_halleri.vcf.gz"
for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
echo "BQSR for: ${SAMPLE}"
# Step 1: Build recalibration table
gatk BaseRecalibrator \
-I ${INDIR}/${SAMPLE}.dedup.bam \
-R ${REF} \
--known-sites ${KNOWN_SITES} \
-O ${OUTDIR}/${SAMPLE}.recal_data.table
# Step 2: Apply the recalibration
gatk ApplyBQSR \
-I ${INDIR}/${SAMPLE}.dedup.bam \
-R ${REF} \
--bqsr-recal-file ${OUTDIR}/${SAMPLE}.recal_data.table \
-O ${OUTDIR}/${SAMPLE}.bqsr.bam
samtools index ${OUTDIR}/${SAMPLE}.bqsr.bam
echo " Done: ${SAMPLE}.bqsr.bam"
done
| 📐BaseRecalibrator | Scans every base in the BAM and builds a statistical model of quality score errors. It uses known variant sites as "ground truth" — positions where a mismatch from the reference is a real SNP, not a sequencing error. Output: a recalibration table. |
| 🔧ApplyBQSR | Reads the recalibration table and rewrites the quality scores in the BAM. Bases that were systematically underestimated get higher scores; overestimated bases get lower scores. The result is a BAM where quality scores accurately reflect actual error probabilities. |
| 📍--known-sites | A VCF of known variants for the species. These sites are excluded from error modeling because a mismatch there is a real SNP, not a sequencing error. Without this, GATK would "learn" from real variants and over-correct quality scores. |
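The scores BQSR rewrites are Phred-scaled: Q = -10·log10(P_error), stored in FASTQ/BAM as ASCII(Q + 33). Decoding a single quality character takes two lines of bash, which is handy for eyeballing what a recalibrated score actually means:

```shell
# Convert one FASTQ quality character to its Phred score (Phred+33 encoding)
phred() {
  local ascii
  ascii=$(printf '%d' "'$1")   # ASCII code point of the character
  echo $(( ascii - 33 ))       # subtract the Phred+33 offset
}
phred I     # prints 40  ('I' = ASCII 73; 1 error in 10,000)
phred '#'   # prints 2   ('#' = ASCII 35; error probability ~0.63)
```

A run of `#` characters at read ends is the classic signature of quality dropoff that BQSR's error model accounts for.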
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=72:00:00
#SBATCH --job-name=haplotypecaller
#SBATCH --output=hc.%j.out
#SBATCH --mem=32G
module load gatk/4.2.0.0
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
INDIR="5_recalibration"
OUTDIR="4_gatk-round-1"
for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
echo "HaplotypeCaller: ${SAMPLE}"
# -ERC GVCF = emit all sites for later joint genotyping
# This allows combining samples without re-running the slow HC step
gatk HaplotypeCaller \
-R ${REF} \
-I ${INDIR}/${SAMPLE}.bqsr.bam \
-O ${OUTDIR}/${SAMPLE}.g.vcf.gz \
-ERC GVCF \
--native-pair-hmm-threads 8
echo " gVCF created: ${SAMPLE}.g.vcf.gz"
done
| 🧬-ERC GVCF | Emit Reference Confidence in GVCF mode. Records evidence at every genomic position (not just variant sites). This enables efficient joint genotyping: add a new sample later by just adding its gVCF — no need to re-run the slow HC step on old samples. |
| 🧵--native-pair-hmm-threads 8 | HaplotypeCaller's most compute-intensive step (pairHMM likelihood calculation) runs on 8 threads. This alone can 4–8× speed up the per-sample runtime. |
| 📄-O sample.g.vcf.gz | Output a bgzip-compressed GVCF. The .g.vcf.gz extension convention signals this is a GVCF (not a regular VCF) to downstream tools. |
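What makes a gVCF compact is that non-variant stretches collapse into "reference blocks" spanning many positions, marked with an END tag in the INFO column. A mock record (columns abridged, coordinates invented) shows how one line can cover hundreds of sites:

```shell
# Mock gVCF reference-block record; field 8 carries the END tag
printf 'scaffold_1\t1001\t.\tA\t<NON_REF>\t.\t.\tEND=1250\n' \
  | awk -F'\t' '{
      if (match($8, /END=[0-9]+/)) {
        end = substr($8, RSTART + 4, RLENGTH - 4)
        print "reference block " $2 "-" end " covers " end - $2 + 1 " sites in one record"
      }
    }'
```

GenotypeGVCFs later uses these blocks to assign confident homozygous-reference genotypes at positions where another sample carries a variant.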
-ERC GVCF creates per-sample gVCF files that record evidence for ALL sites (variant and non-variant). This enables efficient joint genotyping by combining gVCFs without re-running the expensive HaplotypeCaller step when new samples are added later.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=48:00:00
#SBATCH --job-name=joint_genotype
#SBATCH --mem=64G
module load gatk/4.2.0.0
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
GVCF_DIR="4_gatk-round-1"
OUTDIR="6_gatk_round-2"
DB_DIR="genomicsdb"
# Get the chromosome/scaffold list from the reference dictionary (one interval per line).
# GATK accepts a .list file as an interval argument; for very large genomes you can
# instead pass one scaffold at a time to limit GenomicsDB temp space (see notes below).
grep "^@SQ" ${REF%.fa}.dict | awk '{print $2}' | sed 's/SN://' > intervals.list
# Step 1: Consolidate gVCFs into GenomicsDB
gatk GenomicsDBImport \
-V ${GVCF_DIR}/Aha18.g.vcf.gz \
-V ${GVCF_DIR}/AhaN1.g.vcf.gz \
-V ${GVCF_DIR}/AhaN3.g.vcf.gz \
-V ${GVCF_DIR}/AhaN4.g.vcf.gz \
--genomicsdb-workspace-path ${DB_DIR} \
--intervals intervals.list \
--reader-threads 8
# Step 2: Joint genotyping across all samples
gatk GenotypeGVCFs \
-R ${REF} \
-V gendb://${DB_DIR} \
-O ${OUTDIR}/all_samples.raw.vcf.gz
echo "Raw VCF created: ${OUTDIR}/all_samples.raw.vcf.gz"
# Quick stats
gatk CountVariants -V ${OUTDIR}/all_samples.raw.vcf.gz
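The grep/awk/sed pipeline that extracts intervals from the sequence dictionary is easy to test on a mock `.dict` before pointing it at the real one (scaffold names and lengths below are invented; your Ahal2.2 `.dict` supplies the real entries):

```shell
# A .dict is a SAM-style header: @SQ lines carry SN: (name) and LN: (length)
printf '@HD\tVN:1.6\n@SQ\tSN:scaffold_1\tLN:33132539\n@SQ\tSN:scaffold_2\tLN:19320864\n' > mock.dict
# Keep only @SQ lines, take column 2, strip the SN: prefix
grep "^@SQ" mock.dict | awk '{print $2}' | sed 's/SN://' > mock_intervals.list
cat mock_intervals.list
```

The resulting list should print one scaffold name per line — exactly the format GATK expects for a `.list` interval file.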
For organisms without high-quality known variant databases for VQSR training, use hard filters as an alternative:
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=16G
module load gatk/4.2.0.0
module load bcftools
RAW="6_gatk_round-2/all_samples.raw.vcf.gz"
OUTDIR="7_filtering"
# Extract SNPs only
gatk SelectVariants -V ${RAW} --select-type-to-include SNP \
-O ${OUTDIR}/raw.snps.vcf.gz
# Apply GATK recommended hard filters for SNPs
# QD < 2.0 : Variant Confidence/Quality by Depth
# MQ < 40.0 : RMS Mapping Quality
# FS > 60.0 : Fisher Strand bias
# SOR > 3.0 : Strand Odds Ratio
# MQRankSum < -12.5 : Mapping Quality Rank Sum Test
# ReadPosRankSum < -8.0 : Read Position Rank Sum Test
gatk VariantFiltration \
-V ${OUTDIR}/raw.snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O ${OUTDIR}/filtered.snps.vcf.gz
# Keep only PASS variants
bcftools view -f PASS ${OUTDIR}/filtered.snps.vcf.gz \
-O z -o ${OUTDIR}/final.snps.pass.vcf.gz
bcftools index ${OUTDIR}/final.snps.pass.vcf.gz
echo "Final SNP count:"
bcftools stats ${OUTDIR}/final.snps.pass.vcf.gz | grep "^SN" | grep "SNPs"
# Also filter indels
gatk SelectVariants -V ${RAW} --select-type-to-include INDEL \
-O ${OUTDIR}/raw.indels.vcf.gz
gatk VariantFiltration \
-V ${OUTDIR}/raw.indels.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "SOR > 10.0" --filter-name "SOR10" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O ${OUTDIR}/filtered.indels.vcf.gz
bcftools view -f PASS ${OUTDIR}/filtered.indels.vcf.gz \
-O z -o ${OUTDIR}/final.indels.pass.vcf.gz
| 📉QD < 2.0 | Quality by Depth — variant quality score divided by read depth. Low QD means weak variant signal relative to coverage. Real variants typically have QD > 10. |
| 🧭MQ < 40.0 | Mapping Quality — RMS mapping quality of reads. Low MQ means reads aligned to repetitive regions where the mapping is uncertain. Variants there are unreliable. |
| 🎣FS > 60.0 | Fisher Strand bias — measures if variant reads preferentially appear on one strand. PCR artifacts and errors often show extreme strand bias. Real SNPs are strand-balanced. |
| ✅bcftools view -f PASS | Extracts only variants where the FILTER column is "PASS" (i.e., passed all filter expressions). Variants that failed any filter have their filter name written instead of PASS. |
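The QD filter is just QUAL divided by the combined depth of the variant samples, so you can reason about it with arithmetic. Two invented sites checked against the QD < 2.0 cutoff:

```shell
# Decide whether a site survives the QD < 2.0 hard filter
qd_check() {
  awk -v q="$1" -v d="$2" \
      'BEGIN { qd = q / d; printf "QD=%.2f -> %s\n", qd, (qd < 2.0 ? "filtered" : "kept") }'
}
qd_check 450 35   # strong variant signal for its coverage
qd_check 30 40    # QUAL barely above the noise for 40x depth
```

The second site illustrates why QD beats a raw QUAL threshold: QUAL 30 looks respectable in isolation, but spread over 40 reads it is weak evidence.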
| Output | File | Expected Value |
|---|---|---|
| Alignment rates | *.log | >95% for all 4 samples |
| Duplication rate | *.dup_metrics.txt | 10–30% (normal for short reads) |
| Raw variants | all_samples.raw.vcf.gz | ~2–5M variants (SNPs + indels) |
| Filtered SNPs | final.snps.pass.vcf.gz | ~1–3M high-quality SNPs |
| Filtered indels | final.indels.pass.vcf.gz | ~200–500k high-quality indels |
GATK requires read groups on all BAM files. Check with samtools view -H sample.bam | grep "^@RG". If missing, add with Picard's AddOrReplaceReadGroups.
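Since the header `samtools view -H` prints is plain text, the check itself is easy to rehearse on a mock header (the @RG line below reuses the Aha18 metadata from the alignment step):

```shell
# Mock BAM header text, as samtools view -H would emit it
printf '@HD\tVN:1.6\n@RG\tID:ERR1760144\tSM:Aha18\tPL:ILLUMINA\n' > mock_header.txt
if grep -q "^@RG" mock_header.txt; then
  echo "read groups present"
else
  echo "missing - add with Picard AddOrReplaceReadGroups"
fi
```

If the real BAM prints the "missing" branch, fix it with AddOrReplaceReadGroups before any GATK step rather than re-aligning.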
Process one chromosome/scaffold at a time using --intervals. GenomicsDB can consume 10–50× the original gVCF size in temporary disk space. Use --tmp-dir to point to a partition with sufficient space.
For organisms without a dbSNP equivalent, do a "bootstrap" BQSR: (1) Call raw variants, (2) Hard-filter to high-confidence variants, (3) Use those as known sites for BQSR, (4) Re-call variants on the recalibrated BAMs.
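The loop structure of that bootstrap can be sketched with echo placeholders standing in for the gatk/bcftools commands shown earlier; iterate until the PASS variant count stabilizes between rounds:

```shell
# Skeleton only: each echo marks where a real pipeline stage runs
for ROUND in 1 2; do
  echo "round ${ROUND}: HaplotypeCaller -> raw calls"
  echo "round ${ROUND}: hard-filter -> high-confidence PASS sites"
  echo "round ${ROUND}: BaseRecalibrator/ApplyBQSR using those sites as --known-sites"
done
```

Two rounds are often enough in practice, but convergence of the variant count is the real stopping criterion.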