GATK4 Best Practices SNP & Indel Calling

Population genomics variant calling on four Arabidopsis halleri populations using the full GATK4 BQSR + HaplotypeCaller + joint genotyping pipeline.

~180 min · PRJEB18647 · Advanced · Bash

Overview

The GATK Best Practices workflow for germline short variant discovery is the gold standard for SNP and indel calling from short-read DNA sequencing data. This tutorial follows the workflow for diploid organisms step by step, substituting hard filtering for VQSR because A. halleri lacks the curated truth sets VQSR requires.

Pipeline Steps
  1. FastQC quality check
  2. BWA-MEM alignment to reference
  3. Sort, mark duplicates (Picard)
  4. BQSR round 1 — base quality recalibration
  5. HaplotypeCaller — per-sample gVCF
  6. GenomicsDBImport + joint genotyping
  7. Variant filtering (hard filters; VQSR needs truth sets unavailable for A. halleri)
Tools Required
  • BWA (bwa-mem2 or bwa)
  • samtools
  • Picard (MarkDuplicates)
  • GATK4 (BaseRecalibrator, HaplotypeCaller, GenotypeGVCFs)
  • bcftools / VCFtools (filtering)
GATK Workflow

GATK best practices workflow overview. Image from the Bioinformatics Workbook (Arun Seetharam).

Ideal Directory Structure

GATK workflows generate many intermediate files. Strict folder discipline prevents overwriting and makes debugging easier.

gatk_ahalleri_project/
├── 0_index/             # reference FASTA + BWA index + .dict + .fai
├── 1_data/              # raw FASTQ (read-only!)
│   ├── ERR1760144_1.fastq.gz
│   └── ... (8 files, 4 samples × 2)
├── 2_fastqc/            # FastQC + MultiQC reports
├── 3_pre-processing/    # sorted BAM, MarkDuplicates output
│   ├── Aha18.sorted.bam
│   └── Aha18.dedup.bam  # duplicates marked
├── 4_gatk-round-1/      # per-sample gVCF files (HaplotypeCaller output)
│   └── Aha18.g.vcf.gz
├── 5_recalibration/     # BQSR tables + recalibrated BAMs
│   └── Aha18.bqsr.bam
├── 6_gatk_round-2/      # joint-genotyped raw VCF
│   └── all_samples.raw.vcf.gz
└── 7_filtering/         # hard-filtered SNP/indel VCFs
    ├── final.snps.pass.vcf.gz
    └── final.indels.pass.vcf.gz
🧰
Why all these folders? GATK is a multi-step pipeline where each step takes the previous step's output. If a later step fails, you want to restart from the last successful step — not from scratch! Keeping outputs in separate numbered folders lets you pick up exactly where you left off.
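One way to exploit this layout is a "skip if the output already exists" guard around each step, so a restarted job fast-forwards past finished work. A minimal sketch — the `run_step` helper and the placeholder command are illustrative, not part of GATK:

```shell
# Run a command only if its target file is missing or empty.
run_step() {
    local target=$1; shift
    if [ -s "$target" ]; then
        echo "SKIP: $target already exists"
    else
        "$@" && echo "DONE: $target"
    fi
}

# Placeholder command standing in for a real pipeline step (e.g. bwa mem):
mkdir -p demo_out
run_step demo_out/sample.txt bash -c 'echo aligned > demo_out/sample.txt'
run_step demo_out/sample.txt bash -c 'echo aligned > demo_out/sample.txt'  # skipped: file exists
```

Because each numbered folder holds exactly one step's outputs, the target path doubles as the checkpoint marker.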

Dataset

Four populations of Arabidopsis halleri subsp. halleri sequenced to estimate genomic diversity and population differentiation (Fischer et al., 2016). Reference genome: A. halleri Ahal2.2 (Ensembl Plants).

Sample | ENA ID     | Read Length | Insert Size | Total Bases
Aha18  | ERR1760144 | 202 bp      | 250 bp      | 21.4 Gb
AhaN1  | ERR1760145 | 200 bp      | 150 bp      | 30.1 Gb
AhaN3  | ERR1760146 | 200 bp      | 150 bp      | 29.6 Gb
AhaN4  | ERR1760147 | 200 bp      | 150 bp      | 31.2 Gb

Setup & Directory Organization

Bash — Create directory structure and download data
mkdir -p {0_index,1_data,2_fastqc,3_pre-processing,4_gatk-round-1,5_recalibration,6_gatk_round-2,7_filtering}

# Download reads from ENA (faster than SRA)
cd 1_data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/004/ERR1760144/ERR1760144_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/004/ERR1760144/ERR1760144_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/005/ERR1760145/ERR1760145_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/005/ERR1760145/ERR1760145_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/006/ERR1760146/ERR1760146_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/006/ERR1760146/ERR1760146_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/007/ERR1760147/ERR1760147_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/007/ERR1760147/ERR1760147_2.fastq.gz
cd ..

# Download A. halleri reference genome
cd 0_index
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-43/fasta/arabidopsis_halleri/dna/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa.gz
gunzip Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa.gz
cd ..
Setup commands decoded
📁 mkdir -p {0_index,...} — Brace expansion creates all 8 directories in one command. The -p flag means "no error if they already exist". Running this at the start ensures your pipeline never fails due to a missing output folder.
🔒 samtools faidx — Creates a .fai index of the reference FASTA. GATK requires this to quickly jump to any genomic coordinate. Think of it as adding page numbers to a book.
📖 gatk CreateSequenceDictionary — Creates a .dict file listing all chromosomes and their lengths. GATK uses this as a "table of contents" to validate that your BAM and reference have matching chromosomes before running expensive analyses.
🗺️ bwa index — Builds a BWA index of the reference (swap in bwa-mem2 index for the faster modern rewrite). Like HISAT2-build but for DNA — required before alignment. Takes ~15 min for a plant-sized genome.

Step 0 — Quality Check

Bash — FastQC on all samples
cd 2_fastqc
for fq in ../1_data/*.fastq.gz; do
    ln -s "$fq" .
done

module load parallel
module load fastqc
parallel "fastqc {}" ::: *.fastq.gz
cd ..

Step 1 — BWA-MEM Alignment

Bash — BWA index + alignment with read groups (SLURM)
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=48:00:00
#SBATCH --job-name=bwa_align
#SBATCH --output=bwa_align.%j.out
#SBATCH --error=bwa_align.%j.err

module load bwa
module load samtools

REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
OUTDIR="3_pre-processing"

# Index the reference (run once)
bwa index ${REF}
samtools faidx ${REF}

# Declare sample metadata (name, ERR ID)
declare -A SAMPLES=(
    [Aha18]=ERR1760144
    [AhaN1]=ERR1760145
    [AhaN3]=ERR1760146
    [AhaN4]=ERR1760147
)

for SAMPLE in "${!SAMPLES[@]}"; do
    ERR="${SAMPLES[$SAMPLE]}"
    R1="1_data/${ERR}_1.fastq.gz"
    R2="1_data/${ERR}_2.fastq.gz"

    echo "Aligning: ${SAMPLE} (${ERR})"

    # Read Group tag is CRITICAL for GATK - identifies sample in multi-sample VCF
    RG="@RG\tID:${ERR}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib1\tPU:${ERR}.1"

    bwa mem -t 16 \
        -R "${RG}" \
        ${REF} ${R1} ${R2} \
        | samtools view -bS - \
        | samtools sort -o ${OUTDIR}/${SAMPLE}.sorted.bam -

    samtools index ${OUTDIR}/${SAMPLE}.sorted.bam
done
BWA-MEM flags decoded
🧵 -t 16 — Use 16 CPU threads. BWA-MEM scales well — 16 threads is roughly 12× faster than single-threaded on these large WGS datasets.
🏷️ -R "@RG\t..." — Read Group tag, the most important GATK requirement. It embeds sample metadata into every alignment record: SM: = sample name (appears in the VCF header), ID: = run ID, PL: = platform (ILLUMINA), LB: = library. GATK will refuse to run HaplotypeCaller without these tags.
🔀 pipe | to samtools — BWA writes SAM to stdout, which we pipe directly to samtools view (SAM→BAM) and then samtools sort. This saves ~15 GB of temporary SAM disk space per sample by never writing the SAM at all.
🏷️
Read Group anatomy: @RG\tID:ERR1760144\tSM:Aha18\tPL:ILLUMINA\tLB:Aha18_lib1\tPU:ERR1760144.1 — think of it as a barcode label on every read saying "I came from sample Aha18, sequenced on Illumina, in run ERR1760144". Without it, GATK can't distinguish which sample a read belongs to during joint genotyping!
Read Groups are mandatory for GATK. Without @RG tags, HaplotypeCaller will crash or produce incorrect results. The SM: tag becomes the sample name in the final VCF. The ID: tag uniquely identifies each sequencing run. Always add read groups at the alignment step.
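The RG string is just tab-separated key:value pairs, so you can sanity-check it in the shell before launching an expensive alignment job. A small sketch using the Aha18 string from above (this mimics what you would later confirm with samtools view -H):

```shell
# The read-group string exactly as passed to bwa mem -R, with literal \t escapes:
RG='@RG\tID:ERR1760144\tSM:Aha18\tPL:ILLUMINA\tLB:Aha18_lib1\tPU:ERR1760144.1'

# printf '%b' expands the \t escapes into real tabs; then pull out the SM: field,
# i.e. the sample name that will appear in the final multi-sample VCF.
SM=$(printf '%b\n' "$RG" | tr '\t' '\n' | grep '^SM:' | cut -d: -f2)
echo "sample name in VCF: $SM"
```

If the echo prints an empty string, the escapes were mangled (a common cause: using double quotes that let the shell eat the backslashes before bwa sees them).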

Step 2 — Mark Duplicates with Picard

Bash — Picard MarkDuplicates
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --job-name=markdup
#SBATCH --output=markdup.%j.out
#SBATCH --mem=32G

module load picard
module load samtools

INDIR="3_pre-processing"

for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
    echo "Marking duplicates: ${SAMPLE}"

    picard MarkDuplicates \
        INPUT=${INDIR}/${SAMPLE}.sorted.bam \
        OUTPUT=${INDIR}/${SAMPLE}.dedup.bam \
        METRICS_FILE=${INDIR}/${SAMPLE}.dup_metrics.txt \
        REMOVE_DUPLICATES=false \
        ASSUME_SORTED=true \
        VALIDATION_STRINGENCY=SILENT

    samtools index ${INDIR}/${SAMPLE}.dedup.bam

    echo "  Duplication metrics:"
    grep -A2 "ESTIMATED_LIBRARY_SIZE" ${INDIR}/${SAMPLE}.dup_metrics.txt | tail -2
done
Picard MarkDuplicates parameters
📥 INPUT / OUTPUT — Input sorted BAM and output deduplicated BAM. The output BAM keeps all reads — duplicates are flagged, not deleted. GATK automatically ignores flagged duplicates during variant calling.
📊 METRICS_FILE — Writes a duplication statistics file. Open it to see your library complexity — high duplication (>50%) means a low-complexity library or over-sequencing, not a pipeline error.
🚫 REMOVE_DUPLICATES=false — Keeps duplicates in the BAM with a flag bit set rather than physically removing them. This is the GATK-recommended approach — it preserves the BAM for other analyses and lets you change the duplicate-handling strategy later.
⏩ ASSUME_SORTED=true — Skips the sort check (we already sorted with samtools). Speeds up processing by ~20%.
🎭
What IS a duplicate? When the same DNA fragment gets sequenced more than once (common with PCR amplification), the copies are "duplicates". They look like extra evidence for a variant but are really the same molecule. GATK marks them so it only counts each unique molecule once — otherwise, a PCR artifact would look like a real high-frequency variant!
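The "flag bit" that MarkDuplicates sets is bit 0x400 (decimal 1024) of the SAM FLAG field — the same bit samtools view -f 1024 selects on. A tiny sketch of the bit test itself (the flag values are just examples):

```shell
# SAM FLAG bit 0x400 (1024) marks a read as a PCR/optical duplicate.
is_duplicate() {
    local flag=$1
    if (( flag & 1024 )); then echo "duplicate"; else echo "unique"; fi
}

is_duplicate 99    # typical paired, properly-mapped read: bit not set
is_duplicate 1123  # 99 + 1024: the same read after MarkDuplicates flagged it
```

This is why REMOVE_DUPLICATES=false is safe: downstream tools that honor the FLAG field simply skip these reads, while the data stays in the BAM.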

Step 3 — Base Quality Score Recalibration (BQSR)

BQSR corrects systematic errors in base quality scores assigned by the sequencer. It requires a set of known variant sites to identify non-variant positions for modeling.

Bash — GATK BaseRecalibrator + ApplyBQSR
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=48:00:00
#SBATCH --job-name=bqsr
#SBATCH --output=bqsr.%j.out
#SBATCH --mem=32G

module load gatk/4.2.0.0
module load samtools

REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
INDIR="3_pre-processing"
OUTDIR="5_recalibration"

# Create GATK sequence dictionary (run once; GATK errors if the .dict already exists)
gatk CreateSequenceDictionary -R ${REF}

# NOTE: For A. halleri, download known SNPs from Ensembl:
# wget http://ftp.ensemblgenomes.org/pub/plants/release-43/vcf/arabidopsis_halleri/Arabidopsis_halleri.vcf.gz
KNOWN_SITES="0_index/Arabidopsis_halleri.vcf.gz"

for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
    echo "BQSR for: ${SAMPLE}"

    # Step 1: Build recalibration table
    gatk BaseRecalibrator \
        -I ${INDIR}/${SAMPLE}.dedup.bam \
        -R ${REF} \
        --known-sites ${KNOWN_SITES} \
        -O ${OUTDIR}/${SAMPLE}.recal_data.table

    # Step 2: Apply the recalibration
    gatk ApplyBQSR \
        -I ${INDIR}/${SAMPLE}.dedup.bam \
        -R ${REF} \
        --bqsr-recal-file ${OUTDIR}/${SAMPLE}.recal_data.table \
        -O ${OUTDIR}/${SAMPLE}.bqsr.bam

    samtools index ${OUTDIR}/${SAMPLE}.bqsr.bam
    echo "  Done: ${SAMPLE}.bqsr.bam"
done
BQSR commands decoded
📐 BaseRecalibrator — Scans every base in the BAM and builds a statistical model of quality-score errors. It uses known variant sites as "ground truth" — positions where a mismatch from the reference is a real SNP, not a sequencing error. Output: a recalibration table.
🔧 ApplyBQSR — Reads the recalibration table and rewrites the quality scores in the BAM. Bases that were systematically underestimated get higher scores; overestimated bases get lower scores. The result is a BAM whose quality scores accurately reflect actual error probabilities.
📍 --known-sites — A VCF of known variants for the species. These sites are excluded from error modeling because a mismatch there is a real SNP, not a sequencing error. Without this, GATK would "learn" from real variants and over-correct quality scores.
🔬
Why does BQSR matter? Illumina assigns base quality scores (Phred scale) during sequencing, but these scores have systematic biases — e.g., bases after a GGC context are consistently over-scored. BQSR fixes these biases, giving GATK more accurate error probabilities, which improves variant detection sensitivity and reduces false positives.
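The Phred scale BQSR rewrites is Q = -10·log10(p_error), stored in FASTQ/BAM as the ASCII character chr(Q+33). A quick shell sketch of the conversion (the helper names are ours):

```shell
# ASCII code of a character via the POSIX printf "'c" trick, minus the +33 offset.
phred_of() {
    local ascii
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

# Invert the Phred formula: p_error = 10^(-Q/10).
err_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10^(-q/10) }'
}

phred_of I                 # 'I' (ASCII 73) encodes Q40
err_prob "$(phred_of I)"   # Q40 -> 1 error in 10,000 bases
```

So when BQSR nudges a base from Q40 down to Q30, it is saying the sequencer was ten times too optimistic about that base — exactly the kind of correction that changes borderline variant calls.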

Step 4 — HaplotypeCaller (per-sample gVCF)

Bash — HaplotypeCaller in GVCF mode
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=72:00:00
#SBATCH --job-name=haplotypecaller
#SBATCH --output=hc.%j.out
#SBATCH --mem=32G

module load gatk/4.2.0.0

REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
INDIR="5_recalibration"
OUTDIR="4_gatk-round-1"

for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
    echo "HaplotypeCaller: ${SAMPLE}"

    # -ERC GVCF = emit all sites for later joint genotyping
    # This allows combining samples without re-running the slow HC step
    gatk HaplotypeCaller \
        -R ${REF} \
        -I ${INDIR}/${SAMPLE}.bqsr.bam \
        -O ${OUTDIR}/${SAMPLE}.g.vcf.gz \
        -ERC GVCF \
        --native-pair-hmm-threads 8

    echo "  gVCF created: ${SAMPLE}.g.vcf.gz"
done
HaplotypeCaller flags decoded
🧬 -ERC GVCF — Emit Reference Confidence in GVCF mode. Records evidence at every genomic position, not just variant sites. This enables efficient joint genotyping: add a new sample later by just adding its gVCF — no need to re-run the slow HC step on old samples.
🧵 --native-pair-hmm-threads 8 — Runs HaplotypeCaller's most compute-intensive step (the pairHMM likelihood calculation) on 8 threads. This alone can speed up the per-sample runtime 4–8×.
📄 -O sample.g.vcf.gz — Output a bgzip-compressed gVCF. The .g.vcf.gz extension convention signals to downstream tools that this is a gVCF, not a regular VCF.
Why GVCF mode? Running HaplotypeCaller with -ERC GVCF creates per-sample gVCF files that record evidence for ALL sites (variant and non-variant). This enables efficient joint genotyping by combining gVCFs without re-running the expensive HaplotypeCaller step when new samples are added later.
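What "all sites" looks like in practice: a gVCF compresses non-variant stretches into reference blocks (ALT is just <NON_REF>, with an END= tag), while variant sites carry real ALT alleles. A sketch on a made-up three-record excerpt — coordinates and values are invented for illustration:

```shell
# Toy gVCF body (whitespace-separated here for readability; real gVCFs are tab-separated).
# Columns: CHROM POS ID REF ALT QUAL FILTER INFO
cat > toy.gvcf <<'EOF'
scaffold_1 100  . A <NON_REF>   .   . END=1520
scaffold_1 1521 . G T,<NON_REF> 812 . .
scaffold_1 1522 . C <NON_REF>   .   . END=2999
EOF

# Column 5 separates the two record types: pure <NON_REF> = reference block.
awk '$5 == "<NON_REF>" { blocks++ } $5 != "<NON_REF>" { vars++ }
     END { printf "ref blocks=%d variant sites=%d\n", blocks, vars }' toy.gvcf
```

Joint genotyping later reads the reference blocks to distinguish "confidently homozygous reference" from "no data" at sites where another sample has a variant.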

Step 5 — Joint Genotyping

Bash — GenomicsDBImport + GenotypeGVCFs
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=48:00:00
#SBATCH --job-name=joint_genotype
#SBATCH --mem=64G

module load gatk/4.2.0.0

REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
GVCF_DIR="4_gatk-round-1"
OUTDIR="6_gatk_round-2"
DB_DIR="genomicsdb"

# Build an intervals list of every chromosome/scaffold from the reference .dict
# (Ahal2.2 is a scaffold-level assembly, so there is no "chr1" to name directly)
grep "^@SQ" ${REF%.fa}.dict | awk '{print $2}' | sed 's/SN://' > intervals.list

# Step 1: Consolidate gVCFs into GenomicsDB
gatk GenomicsDBImport \
    -V ${GVCF_DIR}/Aha18.g.vcf.gz \
    -V ${GVCF_DIR}/AhaN1.g.vcf.gz \
    -V ${GVCF_DIR}/AhaN3.g.vcf.gz \
    -V ${GVCF_DIR}/AhaN4.g.vcf.gz \
    --genomicsdb-workspace-path ${DB_DIR} \
    -L intervals.list \
    --reader-threads 8

# Step 2: Joint genotyping across all samples
gatk GenotypeGVCFs \
    -R ${REF} \
    -V gendb://${DB_DIR} \
    -O ${OUTDIR}/all_samples.raw.vcf.gz

echo "Raw VCF created: ${OUTDIR}/all_samples.raw.vcf.gz"

# Quick stats
gatk CountVariants -V ${OUTDIR}/all_samples.raw.vcf.gz
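The grep/awk/sed chain that pulls scaffold names out of the sequence dictionary is easy to dry-run on a mock .dict before pointing it at the real reference. The scaffold names and lengths below are invented:

```shell
# Minimal mock sequence dictionary with two made-up scaffolds:
cat > mock.dict <<'EOF'
@HD VN:1.6
@SQ SN:scaffold_1 LN:31234567 M5:abc
@SQ SN:scaffold_2 LN:19876543 M5:def
EOF

# Keep only @SQ lines, take the second field, strip the SN: prefix:
grep "^@SQ" mock.dict | awk '{print $2}' | sed 's/SN://' > intervals.list
cat intervals.list
```

Running this on the real Ahal2.2 .dict gives the interval list GenomicsDBImport needs; if the list is empty, the usual culprit is pointing the grep at the FASTA instead of the .dict.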

Step 6 — Variant Filtering

For organisms without high-quality known variant databases for VQSR training, use hard filters as an alternative:

Bash — Hard Filtering (SNPs and Indels)
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=16G

module load gatk/4.2.0.0
module load bcftools

RAW="6_gatk_round-2/all_samples.raw.vcf.gz"
OUTDIR="7_filtering"

# Extract SNPs only
gatk SelectVariants -V ${RAW} --select-type-to-include SNP \
    -O ${OUTDIR}/raw.snps.vcf.gz

# Apply GATK recommended hard filters for SNPs
# QD < 2.0 : Variant Confidence/Quality by Depth
# MQ < 40.0 : RMS Mapping Quality
# FS > 60.0 : Fisher Strand bias
# SOR > 3.0 : Strand Odds Ratio
# MQRankSum < -12.5 : Mapping Quality Rank Sum Test
# ReadPosRankSum < -8.0 : Read Position Rank Sum Test
gatk VariantFiltration \
    -V ${OUTDIR}/raw.snps.vcf.gz \
    --filter-expression "QD < 2.0"            --filter-name "QD2" \
    --filter-expression "MQ < 40.0"           --filter-name "MQ40" \
    --filter-expression "FS > 60.0"           --filter-name "FS60" \
    --filter-expression "SOR > 3.0"           --filter-name "SOR3" \
    --filter-expression "MQRankSum < -12.5"   --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O ${OUTDIR}/filtered.snps.vcf.gz

# Keep only PASS variants
bcftools view -f PASS ${OUTDIR}/filtered.snps.vcf.gz \
    -O z -o ${OUTDIR}/final.snps.pass.vcf.gz
bcftools index ${OUTDIR}/final.snps.pass.vcf.gz

echo "Final SNP count:"
bcftools stats ${OUTDIR}/final.snps.pass.vcf.gz | grep "^SN" | grep "SNPs"

# Also filter indels
gatk SelectVariants -V ${RAW} --select-type-to-include INDEL \
    -O ${OUTDIR}/raw.indels.vcf.gz

gatk VariantFiltration \
    -V ${OUTDIR}/raw.indels.vcf.gz \
    --filter-expression "QD < 2.0"   --filter-name "QD2" \
    --filter-expression "FS > 200.0" --filter-name "FS200" \
    --filter-expression "SOR > 10.0" --filter-name "SOR10" \
    --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    -O ${OUTDIR}/filtered.indels.vcf.gz

bcftools view -f PASS ${OUTDIR}/filtered.indels.vcf.gz \
    -O z -o ${OUTDIR}/final.indels.pass.vcf.gz
Hard filter expressions decoded
📉 QD < 2.0 — Quality by Depth: variant quality score divided by read depth. Low QD means a weak variant signal relative to coverage. Real variants typically have QD > 10.
🧭 MQ < 40.0 — Mapping Quality: RMS mapping quality of the supporting reads. Low MQ means reads aligned to repetitive regions where the mapping is uncertain — variants there are unreliable.
🎣 FS > 60.0 — Fisher Strand bias: measures whether variant-supporting reads preferentially appear on one strand. PCR artifacts and errors often show extreme strand bias; real SNPs are strand-balanced.
✅ bcftools view -f PASS — Extracts only variants whose FILTER column is "PASS" (i.e., passed all filter expressions). Variants that failed any filter have that filter's name written instead of PASS.
🎯
Hard filters vs VQSR: VQSR (Variant Quality Score Recalibration) is more powerful but needs large variant databases for training. For non-model organisms or small cohorts (<30 samples), hard filters are the recommended alternative — simpler and still effective.
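Mechanically, all VariantFiltration does is stamp the FILTER column: records that fail an expression get the filter's name, the rest get PASS. The QD < 2.0 rule can be emulated on two toy records to see the effect (rows and values invented; real VCFs are tab-separated):

```shell
# Toy VCF body: CHROM POS ID REF ALT QUAL FILTER INFO (FILTER starts as ".")
cat > toy.vcf <<'EOF'
scaffold_1 500 . A G 612 . QD=25.3;MQ=58.1;FS=1.2
scaffold_1 742 . C T 33 . QD=1.4;MQ=44.0;FS=2.8
EOF

# Pull QD out of the INFO field and rewrite column 7 the way VariantFiltration would.
awk '{
    qd = $8; sub(/.*QD=/, "", qd); sub(/;.*/, "", qd)
    $7 = (qd + 0 < 2.0) ? "QD2" : "PASS"
    print
}' toy.vcf
```

The first record keeps PASS, the second gets QD2 — and that QD2 label is exactly what the later bcftools view -f PASS step excludes.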

Expected Results

Output           | File                      | Expected Value
Alignment rates  | *.log                     | >95% for all 4 samples
Duplication rate | *.dup_metrics.txt         | 10–30% (normal for short reads)
Raw variants     | all_samples.raw.vcf.gz    | ~2–5M variants (SNPs + indels)
Filtered SNPs    | final.snps.pass.vcf.gz    | ~1–3M high-quality SNPs
Filtered indels  | final.indels.pass.vcf.gz  | ~200–500k high-quality indels
Quality check your VCF. After filtering, plot the distribution of variant quality metrics (QD, FS, MQ) using R or Python to verify your filters were appropriate. A good SNP set should have a bimodal QD distribution with a clear separation between real variants (high QD) and noise (low QD).
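Getting the QD values into a one-number-per-line file for plotting is a one-liner; on the real data bcftools query -f '%INFO/QD\n' does it directly. A self-contained awk equivalent, shown on two invented records:

```shell
# On real data:
#   bcftools query -f '%INFO/QD\n' 7_filtering/final.snps.pass.vcf.gz > qd.txt
# Pure-awk equivalent demonstrated on toy records (INFO is field 8 here):
printf '%s\n' \
  'scaffold_1 500 . A G 612 PASS QD=25.3;FS=1.2' \
  'scaffold_1 742 . C T 33 QD2 QD=1.4;FS=2.8' |
awk 'match($8, /QD=[0-9.]+/) { print substr($8, RSTART + 3, RLENGTH - 3) }' > qd.txt
cat qd.txt
```

Feed qd.txt to any histogram tool (R's hist(), matplotlib, even sort | uniq -c on binned values) to check for the bimodal shape described above.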

Troubleshooting

HaplotypeCaller: "SAM/BAM/CRAM file is malformed" or missing read group

GATK requires read groups on all BAM files. Check with samtools view -H sample.bam | grep "^@RG". If missing, add with Picard's AddOrReplaceReadGroups.

GenomicsDBImport: runs out of memory or disk space

Process one chromosome/scaffold at a time using --intervals. GenomicsDB can consume 10–50× the original gVCF size in temporary disk space. Use --tmp-dir to point to a partition with sufficient space.

BQSR: "No known sites provided" error

For organisms without a dbSNP equivalent, do a "bootstrap" BQSR: (1) Call raw variants, (2) Hard-filter to high-confidence variants, (3) Use those as known sites for BQSR, (4) Re-call variants on the recalibrated BAMs.