Population genomics variant calling on four Arabidopsis halleri populations using the full GATK4 BQSR + HaplotypeCaller + joint genotyping pipeline.
The GATK Best Practices workflow for germline short variant discovery is the gold standard for SNP and indel calling from short-read DNA sequencing data. This tutorial follows the workflow exactly as described for diploid organisms.
GATK best practices workflow overview. Image from the Bioinformatics Workbook (Arun Seetharam).
GATK workflows generate many intermediate files. Strict folder discipline prevents overwriting and makes debugging easier.
Four populations of Arabidopsis halleri subsp. halleri sequenced to estimate genomic diversity and population differentiation (Fischer et al., 2016). Reference genome: A. halleri Ahal2.2 (Ensembl Plants).
| Sample | ENA ID | Read Length | Insert Size | Total Bases |
|---|---|---|---|---|
| Aha18 | ERR1760144 | 202 bp | 250 bp | 21.4 Gb |
| AhaN1 | ERR1760145 | 200 bp | 150 bp | 30.1 Gb |
| AhaN3 | ERR1760146 | 200 bp | 150 bp | 29.6 Gb |
| AhaN4 | ERR1760147 | 200 bp | 150 bp | 31.2 Gb |
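A quick sanity check before downloading: expected coverage is total bases divided by assembly length. The 200 Mb genome size below is a placeholder, not the real Ahal2.2 figure — sum column 2 of the `.fai` index (built later) for the exact value.

```shell
# Back-of-envelope per-sample coverage: total bases / assembly length.
# GENOME_BP is a placeholder; replace with the true assembly size.
GENOME_BP=200000000
for entry in "Aha18 21.4" "AhaN1 30.1" "AhaN3 29.6" "AhaN4 31.2"; do
  set -- $entry   # $1 = sample name, $2 = total Gb from the table above
  awk -v s="$1" -v gb="$2" -v g="$GENOME_BP" \
      'BEGIN { printf "%s: ~%.0fx coverage\n", s, gb * 1e9 / g }'
done
```

At ~100–150× per sample, all four libraries are comfortably above the ~20–30× typically needed for confident diploid genotyping.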
mkdir -p {0_index,1_data,2_fastqc,3_pre-processing,4_gatk-round-1,5_recalibration,6_gatk_round-2,7_filtering}
# Download reads from ENA (faster than SRA)
cd 1_data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/004/ERR1760144/ERR1760144_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/004/ERR1760144/ERR1760144_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/005/ERR1760145/ERR1760145_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/005/ERR1760145/ERR1760145_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/006/ERR1760146/ERR1760146_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/006/ERR1760146/ERR1760146_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/007/ERR1760147/ERR1760147_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR176/007/ERR1760147/ERR1760147_2.fastq.gz
cd ..
# Download A. halleri reference genome
cd 0_index
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-43/fasta/arabidopsis_halleri/dna/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa.gz
gunzip Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa.gz
cd ..
| 📁mkdir -p {0_index,...} | Brace expansion creates all 8 directories in one command. The -p flag means "no error if they already exist". Running this at the start ensures your pipeline never fails due to a missing output folder. |
| 🔒samtools faidx | Creates a .fai index of the reference FASTA. GATK requires this to quickly jump to any genomic coordinate. Think of it as adding page numbers to a book. |
| 📖gatk CreateSequenceDictionary | Creates a .dict file listing all chromosomes and their lengths. GATK uses this as a "table of contents" to validate that your BAM and reference have matching chromosomes before running expensive analyses. |
| 🗺️bwa index | Builds a BWA index of the reference (the drop-in replacement bwa-mem2 is faster if your cluster has it). Like hisat2-build but for DNA — required before alignment. Takes ~15 min for a plant-sized genome. |
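The `.fai` index that `samtools faidx` produces is just a five-column text table: sequence name, length, byte offset, bases per line, bytes per line. A mock entry (scaffold names and numbers invented here) shows how to recover the total assembly length from it:

```shell
# Mock .fai with two invented scaffolds; the real file has one row per sequence
printf 'scaffold_1\t33132539\t12\t80\t81\nscaffold_2\t19320864\t33546707\t80\t81\n' > mock.fai
# Assembly length = sum of the length column (column 2)
awk -F'\t' '{ total += $2 } END { print total " bp total" }' mock.fai
```

Run the same awk one-liner on the real `.fai` to get the exact genome size for coverage calculations.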
cd 2_fastqc
for fq in ../1_data/*.fastq.gz; do
ln -s "$fq" .
done
module load parallel
module load fastqc
parallel "fastqc {}" ::: *.fastq.gz
cd ..
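If GNU parallel is not installed on your cluster, `xargs -P` gives the same fan-out with no extra module; a sketch (swap the `echo` for `fastqc {}` in real use):

```shell
# Run up to 4 jobs concurrently over a list of files.
# The filenames here are just the ENA names from the download step.
printf '%s\n' ERR1760144_1.fastq.gz ERR1760144_2.fastq.gz ERR1760145_1.fastq.gz \
  | xargs -P 4 -I{} echo "would run: fastqc {}"
```

`-P 4` caps concurrency at four processes and `-I{}` substitutes each input line into the command, mirroring parallel's `{}` placeholder.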
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=48:00:00
#SBATCH --job-name=bwa_align
#SBATCH --output=bwa_align.%j.out
#SBATCH --error=bwa_align.%j.err
module load bwa
module load samtools
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
OUTDIR="3_pre-processing"
# Index the reference (run once)
bwa index ${REF}
samtools faidx ${REF}
# Declare sample metadata (name, ERR ID)
declare -A SAMPLES=(
[Aha18]=ERR1760144
[AhaN1]=ERR1760145
[AhaN3]=ERR1760146
[AhaN4]=ERR1760147
)
for SAMPLE in "${!SAMPLES[@]}"; do
ERR="${SAMPLES[$SAMPLE]}"
R1="1_data/${ERR}_1.fastq.gz"
R2="1_data/${ERR}_2.fastq.gz"
echo "Aligning: ${SAMPLE} (${ERR})"
# Read Group tag is CRITICAL for GATK - identifies sample in multi-sample VCF
RG="@RG\tID:${ERR}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib1\tPU:${ERR}.1"
bwa mem -t 16 \
-R "${RG}" \
${REF} ${R1} ${R2} \
| samtools view -b - \
| samtools sort -o ${OUTDIR}/${SAMPLE}.sorted.bam -
samtools index ${OUTDIR}/${SAMPLE}.sorted.bam
done
| 🧵-t 16 | Use 16 CPU threads. BWA-MEM scales well — 16 threads is ~12× faster than single-threaded for these large WGS datasets. |
| 🏷️-R "@RG\t..." | Read Group tag — the most important GATK requirement. It embeds sample metadata into every alignment record. SM:=sample name (appears in VCF header), ID:=run ID, PL:=platform (ILLUMINA), LB:=library. GATK will refuse to run HaplotypeCaller without these tags. |
| 🔀pipe to samtools | BWA writes to stdout, which we pipe directly to samtools view (SAM→BAM) then samtools sort. This saves ~15 GB of temporary SAM disk space per sample by never writing the SAM at all. |
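One subtlety worth checking: the `\t` in the `-R` string is a literal backslash-t in the shell; bwa itself expands it into real tabs when writing the BAM header. `printf '%b'` mimics that expansion, which lets you verify the tag before submitting a 48-hour job:

```shell
# Build the read-group string exactly as the alignment loop does
SAMPLE=Aha18
ERR=ERR1760144
RG="@RG\tID:${ERR}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib1\tPU:${ERR}.1"
# %b expands \t into real tabs, as bwa does internally
printf '%b\n' "$RG" | awk -F'\t' '{ print NF " tab-separated fields" }'
printf '%b\n' "$RG" | tr '\t' '\n' | grep "^SM:"
```

A well-formed tag splits into 6 fields (@RG plus ID, SM, PL, LB, PU), and the `SM:` line should show the sample name you expect in the final VCF.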
@RG\tID:ERR1760144\tSM:Aha18\tPL:ILLUMINA\tLB:Aha18_lib1\tPU:ERR1760144.1 — think of it as a barcode label on every read saying "I came from sample Aha18, sequenced on Illumina, in run ERR1760144". Without @RG tags, GATK can't distinguish which sample a read belongs to during joint genotyping, and HaplotypeCaller will crash or produce incorrect results. The SM: tag becomes the sample name in the final VCF; the ID: tag uniquely identifies each sequencing run. Always add read groups at the alignment step.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --job-name=markdup
#SBATCH --output=markdup.%j.out
#SBATCH --mem=32G
module load picard
module load samtools
INDIR="3_pre-processing"
for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
echo "Marking duplicates: ${SAMPLE}"
picard MarkDuplicates \
INPUT=${INDIR}/${SAMPLE}.sorted.bam \
OUTPUT=${INDIR}/${SAMPLE}.dedup.bam \
METRICS_FILE=${INDIR}/${SAMPLE}.dup_metrics.txt \
REMOVE_DUPLICATES=false \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=SILENT
samtools index ${INDIR}/${SAMPLE}.dedup.bam
echo " Duplication metrics:"
grep -A2 "ESTIMATED_LIBRARY_SIZE" ${INDIR}/${SAMPLE}.dup_metrics.txt | tail -2
done
| 📥INPUT / OUTPUT | Input sorted BAM and output deduplicated BAM. The output BAM keeps all reads — duplicates are flagged (not deleted). GATK will automatically ignore flagged duplicates during variant calling. |
| 📊METRICS_FILE | Writes a duplication statistics file. Open this to see your library complexity — high duplication (>50%) means low-complexity library or over-sequencing, not a pipeline error. |
| 🚫REMOVE_DUPLICATES=false | Keeps duplicates in the BAM with a flag bit set rather than physically removing them. This is the GATK-recommended approach — it preserves the BAM for other analyses and lets you change the duplicate-handling strategy later. |
| ✅ASSUME_SORTED=true | Skips the sort check (we already sorted with samtools). Speeds up processing by ~20%. |
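"Flagged, not deleted" means a marked duplicate keeps all its data; only SAM flag bit 0x400 (decimal 1024) is set, and GATK skips reads carrying it. The bit test is plain integer arithmetic, sketched here in bash with invented flag values:

```shell
# Returns "duplicate" if SAM flag bit 0x400 (1024) is set, "unique" otherwise
is_duplicate() { (( $1 & 1024 )) && echo duplicate || echo unique; }
is_duplicate 1187   # 163 + 1024: a properly-paired read flagged as duplicate
is_duplicate 163    # the same pairing flags without the duplicate bit
```

The same logic is what `samtools view -F 1024` uses to exclude duplicates when you want a deduplicated view of the BAM.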
BQSR corrects systematic errors in base quality scores assigned by the sequencer. It requires a set of known variant sites to identify non-variant positions for modeling.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=48:00:00
#SBATCH --job-name=bqsr
#SBATCH --output=bqsr.%j.out
#SBATCH --mem=32G
module load gatk/4.2.0.0
module load samtools
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
INDIR="3_pre-processing"
OUTDIR="5_recalibration"
# Create GATK sequence dictionary (required)
gatk CreateSequenceDictionary -R ${REF}
# NOTE: For A. halleri, download known SNPs from Ensembl:
# wget http://ftp.ensemblgenomes.org/pub/plants/release-43/vcf/arabidopsis_halleri/Arabidopsis_halleri.vcf.gz
# The known-sites VCF must be indexed first: gatk IndexFeatureFile -I <vcf>
KNOWN_SITES="0_index/Arabidopsis_halleri.vcf.gz"
for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
echo "BQSR for: ${SAMPLE}"
# Step 1: Build recalibration table
gatk BaseRecalibrator \
-I ${INDIR}/${SAMPLE}.dedup.bam \
-R ${REF} \
--known-sites ${KNOWN_SITES} \
-O ${OUTDIR}/${SAMPLE}.recal_data.table
# Step 2: Apply the recalibration
gatk ApplyBQSR \
-I ${INDIR}/${SAMPLE}.dedup.bam \
-R ${REF} \
--bqsr-recal-file ${OUTDIR}/${SAMPLE}.recal_data.table \
-O ${OUTDIR}/${SAMPLE}.bqsr.bam
samtools index ${OUTDIR}/${SAMPLE}.bqsr.bam
echo " Done: ${SAMPLE}.bqsr.bam"
done
| 📐BaseRecalibrator | Scans every base in the BAM and builds a statistical model of quality score errors. It uses known variant sites as "ground truth" — positions where a mismatch from the reference is a real SNP, not a sequencing error. Output: a recalibration table. |
| 🔧ApplyBQSR | Reads the recalibration table and rewrites the quality scores in the BAM. Bases that were systematically underestimated get higher scores; overestimated bases get lower scores. The result is a BAM where quality scores accurately reflect actual error probabilities. |
| 📍--known-sites | A VCF of known variants for the species. These sites are excluded from error modeling because a mismatch there is a real SNP, not a sequencing error. Without this, GATK would "learn" from real variants and over-correct quality scores. |
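The scores BQSR rewrites are Phred-scaled: Q = -10·log10(P_error), stored in FASTQ/BAM as ASCII(Q + 33). Decoding a single quality character takes two lines of bash, which is handy for eyeballing what a recalibrated score actually means:

```shell
# Convert one FASTQ quality character to its Phred score (Phred+33 encoding)
phred() {
  local ascii
  ascii=$(printf '%d' "'$1")   # ASCII code point of the character
  echo $(( ascii - 33 ))       # subtract the Phred+33 offset
}
phred I     # prints 40  ('I' = ASCII 73; 1 error in 10,000)
phred '#'   # prints 2   ('#' = ASCII 35; error probability ~0.63)
```

A run of `#` characters at read ends is the classic signature of quality dropoff that BQSR's error model accounts for.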
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --time=72:00:00
#SBATCH --job-name=haplotypecaller
#SBATCH --output=hc.%j.out
#SBATCH --mem=32G
module load gatk/4.2.0.0
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
INDIR="5_recalibration"
OUTDIR="4_gatk-round-1"
for SAMPLE in Aha18 AhaN1 AhaN3 AhaN4; do
echo "HaplotypeCaller: ${SAMPLE}"
# -ERC GVCF = emit all sites for later joint genotyping
# This allows combining samples without re-running the slow HC step
gatk HaplotypeCaller \
-R ${REF} \
-I ${INDIR}/${SAMPLE}.bqsr.bam \
-O ${OUTDIR}/${SAMPLE}.g.vcf.gz \
-ERC GVCF \
--native-pair-hmm-threads 8
echo " gVCF created: ${SAMPLE}.g.vcf.gz"
done
| 🧬-ERC GVCF | Emit Reference Confidence in GVCF mode. Records evidence at every genomic position (not just variant sites). This enables efficient joint genotyping: add a new sample later by just adding its gVCF — no need to re-run the slow HC step on old samples. |
| 🧵--native-pair-hmm-threads 8 | HaplotypeCaller's most compute-intensive step (pairHMM likelihood calculation) runs on 8 threads. This alone can 4–8× speed up the per-sample runtime. |
| 📄-O sample.g.vcf.gz | Output a bgzip-compressed GVCF. The .g.vcf.gz extension convention signals this is a GVCF (not a regular VCF) to downstream tools. |
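What makes a gVCF compact is that non-variant stretches collapse into "reference blocks" spanning many positions, marked with an END tag in the INFO column. A mock record (columns abridged, coordinates invented) shows how one line can cover hundreds of sites:

```shell
# Mock gVCF reference-block record; field 8 carries the END tag
printf 'scaffold_1\t1001\t.\tA\t<NON_REF>\t.\t.\tEND=1250\n' \
  | awk -F'\t' '{
      if (match($8, /END=[0-9]+/)) {
        end = substr($8, RSTART + 4, RLENGTH - 4)
        print "reference block " $2 "-" end " covers " end - $2 + 1 " sites in one record"
      }
    }'
```

GenotypeGVCFs later uses these blocks to assign confident homozygous-reference genotypes at positions where another sample carries a variant.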
-ERC GVCF creates per-sample gVCF files that record evidence for ALL sites (variant and non-variant). This enables efficient joint genotyping by combining gVCFs without re-running the expensive HaplotypeCaller step when new samples are added later.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=48:00:00
#SBATCH --job-name=joint_genotype
#SBATCH --mem=64G
module load gatk/4.2.0.0
REF="0_index/Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa"
GVCF_DIR="4_gatk-round-1"
OUTDIR="6_gatk_round-2"
DB_DIR="genomicsdb"
# Get the chromosome/scaffold list from the reference dictionary (one interval per line).
# GATK accepts a .list file as an interval argument; for very large genomes you can
# instead pass one scaffold at a time to limit GenomicsDB temp space (see notes below).
grep "^@SQ" ${REF%.fa}.dict | awk '{print $2}' | sed 's/SN://' > intervals.list
# Step 1: Consolidate gVCFs into GenomicsDB
gatk GenomicsDBImport \
-V ${GVCF_DIR}/Aha18.g.vcf.gz \
-V ${GVCF_DIR}/AhaN1.g.vcf.gz \
-V ${GVCF_DIR}/AhaN3.g.vcf.gz \
-V ${GVCF_DIR}/AhaN4.g.vcf.gz \
--genomicsdb-workspace-path ${DB_DIR} \
--intervals intervals.list \
--reader-threads 8
# Step 2: Joint genotyping across all samples
gatk GenotypeGVCFs \
-R ${REF} \
-V gendb://${DB_DIR} \
-O ${OUTDIR}/all_samples.raw.vcf.gz
echo "Raw VCF created: ${OUTDIR}/all_samples.raw.vcf.gz"
# Quick stats
gatk CountVariants -V ${OUTDIR}/all_samples.raw.vcf.gz
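The grep/awk/sed pipeline that extracts intervals from the sequence dictionary is easy to test on a mock `.dict` before pointing it at the real one (scaffold names and lengths below are invented; your Ahal2.2 `.dict` supplies the real entries):

```shell
# A .dict is a SAM-style header: @SQ lines carry SN: (name) and LN: (length)
printf '@HD\tVN:1.6\n@SQ\tSN:scaffold_1\tLN:33132539\n@SQ\tSN:scaffold_2\tLN:19320864\n' > mock.dict
# Keep only @SQ lines, take column 2, strip the SN: prefix
grep "^@SQ" mock.dict | awk '{print $2}' | sed 's/SN://' > mock_intervals.list
cat mock_intervals.list
```

The resulting list should print one scaffold name per line — exactly the format GATK expects for a `.list` interval file.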
For organisms without high-quality known variant databases for VQSR training, use hard filters as an alternative:
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=16G
module load gatk/4.2.0.0
module load bcftools
RAW="6_gatk_round-2/all_samples.raw.vcf.gz"
OUTDIR="7_filtering"
# Extract SNPs only
gatk SelectVariants -V ${RAW} --select-type-to-include SNP \
-O ${OUTDIR}/raw.snps.vcf.gz
# Apply GATK recommended hard filters for SNPs
# QD < 2.0 : Variant Confidence/Quality by Depth
# MQ < 40.0 : RMS Mapping Quality
# FS > 60.0 : Fisher Strand bias
# SOR > 3.0 : Strand Odds Ratio
# MQRankSum < -12.5 : Mapping Quality Rank Sum Test
# ReadPosRankSum < -8.0 : Read Position Rank Sum Test
gatk VariantFiltration \
-V ${OUTDIR}/raw.snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O ${OUTDIR}/filtered.snps.vcf.gz
# Keep only PASS variants
bcftools view -f PASS ${OUTDIR}/filtered.snps.vcf.gz \
-O z -o ${OUTDIR}/final.snps.pass.vcf.gz
bcftools index ${OUTDIR}/final.snps.pass.vcf.gz
echo "Final SNP count:"
bcftools stats ${OUTDIR}/final.snps.pass.vcf.gz | grep "^SN" | grep "SNPs"
# Also filter indels
gatk SelectVariants -V ${RAW} --select-type-to-include INDEL \
-O ${OUTDIR}/raw.indels.vcf.gz
gatk VariantFiltration \
-V ${OUTDIR}/raw.indels.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "SOR > 10.0" --filter-name "SOR10" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O ${OUTDIR}/filtered.indels.vcf.gz
bcftools view -f PASS ${OUTDIR}/filtered.indels.vcf.gz \
-O z -o ${OUTDIR}/final.indels.pass.vcf.gz
| 📉QD < 2.0 | Quality by Depth — variant quality score divided by read depth. Low QD means weak variant signal relative to coverage. Real variants typically have QD > 10. |
| 🧭MQ < 40.0 | Mapping Quality — RMS mapping quality of reads. Low MQ means reads aligned to repetitive regions where the mapping is uncertain. Variants there are unreliable. |
| 🎣FS > 60.0 | Fisher Strand bias — measures if variant reads preferentially appear on one strand. PCR artifacts and errors often show extreme strand bias. Real SNPs are strand-balanced. |
| ✅bcftools view -f PASS | Extracts only variants where the FILTER column is "PASS" (i.e., passed all filter expressions). Variants that failed any filter have their filter name written instead of PASS. |
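The QD filter is just QUAL divided by the combined depth of the variant samples, so you can reason about it with arithmetic. Two invented sites checked against the QD < 2.0 cutoff:

```shell
# Decide whether a site survives the QD < 2.0 hard filter
qd_check() {
  awk -v q="$1" -v d="$2" \
      'BEGIN { qd = q / d; printf "QD=%.2f -> %s\n", qd, (qd < 2.0 ? "filtered" : "kept") }'
}
qd_check 450 35   # strong variant signal for its coverage
qd_check 30 40    # QUAL barely above the noise for 40x depth
```

The second site illustrates why QD beats a raw QUAL threshold: QUAL 30 looks respectable in isolation, but spread over 40 reads it is weak evidence.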
| Output | File | Expected Value |
|---|---|---|
| Alignment rates | *.log | >95% for all 4 samples |
| Duplication rate | *.dup_metrics.txt | 10–30% (normal for short reads) |
| Raw variants | all_samples.raw.vcf.gz | ~2–5M variants (SNPs + indels) |
| Filtered SNPs | final.snps.pass.vcf.gz | ~1–3M high-quality SNPs |
| Filtered indels | final.indels.pass.vcf.gz | ~200–500k high-quality indels |
GATK requires read groups on all BAM files. Check with samtools view -H sample.bam | grep "^@RG". If missing, add with Picard's AddOrReplaceReadGroups.
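Since the header `samtools view -H` prints is plain text, the check itself is easy to rehearse on a mock header (the @RG line below reuses the Aha18 metadata from the alignment step):

```shell
# Mock BAM header text, as samtools view -H would emit it
printf '@HD\tVN:1.6\n@RG\tID:ERR1760144\tSM:Aha18\tPL:ILLUMINA\n' > mock_header.txt
if grep -q "^@RG" mock_header.txt; then
  echo "read groups present"
else
  echo "missing - add with Picard AddOrReplaceReadGroups"
fi
```

If the real BAM prints the "missing" branch, fix it with AddOrReplaceReadGroups before any GATK step rather than re-aligning.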
Process one chromosome/scaffold at a time using --intervals. GenomicsDB can consume 10–50× the original gVCF size in temporary disk space. Use --tmp-dir to point to a partition with sufficient space.
For organisms without a dbSNP equivalent, do a "bootstrap" BQSR: (1) Call raw variants, (2) Hard-filter to high-confidence variants, (3) Use those as known sites for BQSR, (4) Re-call variants on the recalibrated BAMs.
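The loop structure of that bootstrap can be sketched with echo placeholders standing in for the gatk/bcftools commands shown earlier; iterate until the PASS variant count stabilizes between rounds:

```shell
# Skeleton only: each echo marks where a real pipeline stage runs
for ROUND in 1 2; do
  echo "round ${ROUND}: HaplotypeCaller -> raw calls"
  echo "round ${ROUND}: hard-filter -> high-confidence PASS sites"
  echo "round ${ROUND}: BaseRecalibrator/ApplyBQSR using those sites as --known-sites"
done
```

Two rounds are often enough in practice, but convergence of the variant count is the real stopping criterion.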