Variant Calling with GATK4 & bcftools

Discover SNPs and indels from whole-genome or exome sequencing data. A full GATK4 best-practices pipeline with the rationale behind each hard-filtering threshold, plus a parallel bcftools workflow.

~60 min Intermediate Bash / Command Line

What Is Variant Calling?

Variant calling identifies positions in the genome where your sample differs from a reference genome. These differences — SNPs (single nucleotide polymorphisms) and indels (insertions/deletions) — can explain phenotypic differences, disease susceptibility, or evolutionary adaptation.

The Pipeline at a Glance
  1. QC raw reads (FastQC, fastp)
  2. Align to reference (BWA-MEM2 / minimap2)
  3. Process BAM (sort, mark duplicates, BQSR)
  4. Call variants (GATK HaplotypeCaller / bcftools mpileup)
  5. Filter variants (VQSR or hard filters)
  6. Annotate variants (SnpEff, VEP, ANNOVAR)
Byte key point
GATK vs. bcftools: Both are excellent. GATK4 is the gold standard for clinical/human genomics with a sophisticated probabilistic model. bcftools is faster, lighter, and performs comparably well for many applications including non-model organisms. We cover both in this tutorial so you can choose.

0 Required Tools

Bash
# Install via conda (recommended)
conda create -n variant_calling -c bioconda -c conda-forge \
  fastqc fastp bwa-mem2 samtools gatk4 bcftools snpsift snpeff picard

conda activate variant_calling

# Verify installations
gatk --version
bcftools --version
bwa-mem2 version

You also need:

  • Reference genome (FASTA) — e.g., GRCh38 for human, GRCm39 for mouse
  • Known variant sites (for BQSR) — e.g., dbSNP, Mills indels, 1000G
  • Paired-end FASTQ files from your sequencing run

1 Read Quality Control & Trimming

Bash
# Step 1a: Quality check with FastQC
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports/ -t 4

# Step 1b: Adapter trimming + quality filtering with fastp
fastp \
  --in1 sample_R1.fastq.gz \
  --in2 sample_R2.fastq.gz \
  --out1 trimmed_R1.fastq.gz \
  --out2 trimmed_R2.fastq.gz \
  --qualified_quality_phred 20 \
  --length_required 50 \
  --detect_adapter_for_pe \
  --thread 8 \
  --html fastp_report.html
Key fastp parameters:
  • --qualified_quality_phred 20 — minimum base quality to keep. Q20 = 1% error rate; use 30 for stricter filtering (0.1% error). For variant calling, 20 is usually sufficient since GATK models base quality internally.
  • --length_required 50 — discard reads shorter than this after trimming. 50 bp suits 150 bp reads; for 100 bp reads, try 36. Very short reads map ambiguously.
  • --detect_adapter_for_pe — auto-detect Illumina adapters for paired-end data. Almost always use this: fastp detects adapters automatically without needing a sequence file.
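The quality cutoffs above come straight from the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). A quick sanity check you can run anywhere:

```shell
# Phred score Q encodes a per-base error probability p = 10^(-Q/10).
for q in 20 30 40; do
  awk -v q="$q" 'BEGIN {
    p = 10^(-q/10)
    printf "Q%d  ->  error rate %g  (%.3f%% of bases expected wrong)\n", q, p, 100*p
  }'
done
```

This is why Q20 trimming tolerates about 1 wrong base per 100, and Q30 about 1 per 1000.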

2 Align Reads to the Reference Genome

Bash
# Index the reference (one-time step)
bwa-mem2 index reference.fa

# Create sequence dictionary and fai index (required by GATK)
samtools faidx reference.fa
gatk CreateSequenceDictionary -R reference.fa

# Align with BWA-MEM2 (reads → SAM → sorted BAM)
SAMPLE="SampleA"
RGID="flowcell1.lane1"

bwa-mem2 mem \
  -t 16 \
  -R "@RG\tID:${RGID}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib1\tPU:unit1" \
  reference.fa \
  trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
| samtools sort -@ 8 -o ${SAMPLE}.sorted.bam

samtools index ${SAMPLE}.sorted.bam
Byte key point
Read Group Tags Are Critical! The -R flag adds read group information that GATK requires. Without it, GATK will refuse to run.
  • ID — unique read group ID (usually flowcell + lane)
  • SM — sample name (the biological sample)
  • PL — platform (ILLUMINA, ONT, PACBIO)
  • LB — library identifier
  • PU — platform unit (flowcell barcode + lane)
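Assembling the read group string from variables makes typos less likely. A small sketch with made-up flowcell and library names (substitute your own); the loop just checks that every required tag is present before you launch a long alignment:

```shell
# Hypothetical metadata -- substitute your own flowcell/library values.
SAMPLE="SampleA"; FLOWCELL="HLGFFDSXX"; LANE="1"; LIB="lib1"
RG="@RG\tID:${FLOWCELL}.L${LANE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${LIB}\tPU:${FLOWCELL}.${LANE}"

# Check every required tag is present before launching a long alignment.
for tag in ID: SM: PL: LB: PU:; do
  printf '%b\n' "$RG" | grep -q "$tag" || echo "missing read group tag: $tag"
done
printf '%b\n' "$RG"   # %b expands \t to real tabs, as bwa-mem2 will see them
```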

3 BAM Post-Processing

Mark Duplicates

PCR duplicates artificially inflate variant support. Marking (not removing) them tells GATK to ignore them during calling.

Bash
gatk MarkDuplicates \
  -I ${SAMPLE}.sorted.bam \
  -O ${SAMPLE}.dedup.bam \
  -M ${SAMPLE}.dedup_metrics.txt \
  --REMOVE_DUPLICATES false \
  --CREATE_INDEX true
Base Quality Score Recalibration (BQSR)

Machine-reported base qualities are often inaccurate. BQSR recalibrates them using known variant sites, improving downstream calling accuracy.
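The intuition behind BQSR fits in one formula: at sites known to be invariant, any mismatch is an error, so the empirical quality is -10·log10(errors/total). A toy calculation (numbers invented, not real recalibration output):

```shell
# At known-invariant sites every mismatch is an error, so the
# empirical quality is -10 * log10(errors / total). Toy numbers below.
errors=5; total=1000
awk -v e="$errors" -v n="$total" 'BEGIN {
  printf "observed %d errors in %d bases -> empirical Q%.1f\n", e, n, -10*log(e/n)/log(10)
}'
# If the machine claimed Q30 (1 error per 1000) for these bases, BQSR
# would adjust their scores downward toward the empirical value.
```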

Bash
# Step 3a: Build recalibration model
gatk BaseRecalibrator \
  -I ${SAMPLE}.dedup.bam \
  -R reference.fa \
  --known-sites dbsnp_146.vcf.gz \
  --known-sites Mills_and_1000G.indels.vcf.gz \
  -O ${SAMPLE}.recal_table

# Step 3b: Apply recalibration
gatk ApplyBQSR \
  -I ${SAMPLE}.dedup.bam \
  -R reference.fa \
  --bqsr-recal-file ${SAMPLE}.recal_table \
  -O ${SAMPLE}.recal.bam
Byte warns
No Known Sites for Your Organism? BQSR requires known variant databases. For non-model organisms without dbSNP, you can either: (1) skip BQSR entirely (bcftools handles this gracefully), (2) perform a first-pass variant call, use the high-confidence SNPs as "known sites," and bootstrap BQSR, or (3) use tools like DeepVariant which don't need BQSR at all.
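Option (2), the bootstrap, can be sketched as follows. This is a dry-run outline, not a tested pipeline: the run wrapper only echoes each command, and the file names and the QUAL>=50 cutoff for "high-confidence" sites are assumptions to tune for your data.

```shell
# Bootstrapped BQSR sketch (dry run): echo each command instead of executing it.
run() { echo "+ $*"; }
S="SampleA"   # hypothetical sample prefix

# Pass 1: call variants without BQSR, keep only very confident SNPs.
run gatk HaplotypeCaller -I ${S}.dedup.bam -R reference.fa -O ${S}.pass1.vcf.gz
run bcftools view -i 'QUAL>=50' ${S}.pass1.vcf.gz -Oz -o ${S}.known.vcf.gz
run bcftools index -t ${S}.known.vcf.gz

# Pass 2: use those calls as "known sites" for recalibration.
run gatk BaseRecalibrator -I ${S}.dedup.bam -R reference.fa \
    --known-sites ${S}.known.vcf.gz -O ${S}.recal_table
run gatk ApplyBQSR -I ${S}.dedup.bam -R reference.fa \
    --bqsr-recal-file ${S}.recal_table -O ${S}.recal.bam
```

Some groups iterate pass 1 and pass 2 until the recalibration tables converge.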

4 Variant Calling with GATK HaplotypeCaller

Bash
# Single-sample calling (produces GVCF for later joint genotyping)
gatk HaplotypeCaller \
  -I ${SAMPLE}.recal.bam \
  -R reference.fa \
  -O ${SAMPLE}.g.vcf.gz \
  -ERC GVCF \
  --native-pair-hmm-threads 8

# For a single sample, you can also call directly to VCF:
gatk HaplotypeCaller \
  -I ${SAMPLE}.recal.bam \
  -R reference.fa \
  -O ${SAMPLE}.raw.vcf.gz
Byte with networks
GVCF vs. VCF:
  • GVCF mode (-ERC GVCF): records confidence at every position, including reference-only sites. Required for multi-sample joint calling. Always use this for population studies.
  • Direct VCF: simpler, for single-sample analyses or when you just need variants quickly.
  • For cohorts, run HaplotypeCaller per sample in GVCF mode, then combine with GenomicsDBImport + GenotypeGVCFs.
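The cohort workflow in the last bullet looks roughly like this. Again a dry-run sketch (commands are echoed, not executed); the sample names and the -L chr1 interval are placeholders.

```shell
# Joint genotyping sketch (dry run): commands are echoed, not executed.
run() { echo "+ $*"; }
SAMPLES="SampleA SampleB SampleC"   # hypothetical cohort

# 1. Per-sample GVCFs (one HaplotypeCaller run each).
for s in $SAMPLES; do
  run gatk HaplotypeCaller -I ${s}.recal.bam -R reference.fa \
      -O ${s}.g.vcf.gz -ERC GVCF
done

# 2. Consolidate GVCFs into a GenomicsDB workspace (an interval is required).
V_ARGS=""; for s in $SAMPLES; do V_ARGS="$V_ARGS -V ${s}.g.vcf.gz"; done
run gatk GenomicsDBImport $V_ARGS -L chr1 --genomicsdb-workspace-path cohort_db

# 3. Joint-genotype the whole cohort from the workspace.
run gatk GenotypeGVCFs -R reference.fa -V gendb://cohort_db -O cohort.vcf.gz
```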

5 Variant Filtering

Raw variant calls contain many false positives, so filtering is essential. There are two approaches: VQSR (machine-learning-based, needs large cohorts) and hard filtering (rule-based, works at any cohort size).

Hard Filtering (Recommended for Small Cohorts)
Bash
# Separate SNPs and Indels
gatk SelectVariants -V ${SAMPLE}.raw.vcf.gz -select-type SNP -O snps.vcf.gz
gatk SelectVariants -V ${SAMPLE}.raw.vcf.gz -select-type INDEL -O indels.vcf.gz

# Hard filter SNPs (GATK recommended thresholds)
gatk VariantFiltration \
  -V snps.vcf.gz \
  -filter "QD < 2.0"                --filter-name "LowQD" \
  -filter "FS > 60.0"               --filter-name "StrandBias" \
  -filter "MQ < 40.0"               --filter-name "LowMQ" \
  -filter "MQRankSum < -12.5"       --filter-name "MQRankSum" \
  -filter "ReadPosRankSum < -8.0"   --filter-name "ReadPosRankSum" \
  -filter "SOR > 3.0"               --filter-name "HighSOR" \
  -O snps.filtered.vcf.gz

# Hard filter Indels
gatk VariantFiltration \
  -V indels.vcf.gz \
  -filter "QD < 2.0"                --filter-name "LowQD" \
  -filter "FS > 200.0"              --filter-name "StrandBias" \
  -filter "ReadPosRankSum < -20.0"  --filter-name "ReadPosRankSum" \
  -filter "SOR > 10.0"              --filter-name "HighSOR" \
  -O indels.filtered.vcf.gz
Understanding Each Hard Filter
  • QD < 2.0 — Quality by Depth: variant quality normalized by read depth. Low QD = poor quality call relative to coverage. 2.0 is GATK's recommendation; increase to 5.0 for higher stringency.
  • FS > 60.0 — Fisher Strand bias: are variant reads preferentially from one strand? High FS suggests a PCR or mapping artifact. 60 for SNPs, 200 for indels (indels tolerate more strand bias).
  • MQ < 40.0 — root mean square Mapping Quality of reads at the site. MQ 40 = average MAPQ of 40 (very confident mapping). Lower values suggest repetitive or ambiguous regions.
  • SOR > 3.0 — Strand Odds Ratio: similar to FS but handles large sample sizes better. 3.0 for SNPs, 10.0 for indels. More robust than FS for high-depth data.
  • MQRankSum — compares mapping quality of ref vs. alt reads. Large negative values mean alt reads map much worse than ref reads — likely a mapping artifact.
  • ReadPosRankSum — tests whether alt alleles cluster at read ends. Variants near read ends are enriched for sequencing errors.
Bash
# Keep only PASS variants
bcftools view -f PASS snps.filtered.vcf.gz -Oz -o snps.pass.vcf.gz
bcftools view -f PASS indels.filtered.vcf.gz -Oz -o indels.pass.vcf.gz

# Count variants
echo "PASS SNPs: $(bcftools view -H snps.pass.vcf.gz | wc -l)"
echo "PASS Indels: $(bcftools view -H indels.pass.vcf.gz | wc -l)"
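What -f PASS keeps is easy to see on a toy VCF body: bcftools simply tests column 7 (FILTER). The records below are invented for illustration:

```shell
# Three invented VCF records (FILTER is column 7 of the tab-separated body).
printf 'chr1\t101\t.\tA\tG\t812\tPASS\tQD=14.1\n'        > toy.vcf.body
printf 'chr1\t202\t.\tC\tT\t45\tLowQD\tQD=1.3\n'         >> toy.vcf.body
printf 'chr1\t303\t.\tG\tA\t230\tStrandBias\tFS=71.2\n'  >> toy.vcf.body

awk -F'\t' '$7 == "PASS"' toy.vcf.body                   # kept by -f PASS
awk -F'\t' '$7 != "PASS" { print $2, $7 }' toy.vcf.body  # removed, and why
rm toy.vcf.body
```

Note that VariantFiltration only labels records; nothing is discarded until this PASS-selection step.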

Alternative: bcftools Variant Calling

A lighter, faster pipeline that works excellently for non-human organisms or when GATK's infrastructure feels heavyweight.

Bash
# Pileup + Call in one pipeline
bcftools mpileup \
  --fasta-ref reference.fa \
  --max-depth 500 \
  --min-MQ 20 \
  --min-BQ 20 \
  --annotate FORMAT/AD,FORMAT/DP \
  --threads 8 \
  ${SAMPLE}.dedup.bam \
| bcftools call \
  --multiallelic-caller \
  --variants-only \
  --threads 8 \
  -Oz -o ${SAMPLE}.bcftools.raw.vcf.gz

bcftools index ${SAMPLE}.bcftools.raw.vcf.gz

# Quality filtering
bcftools filter \
  -i 'QUAL>=30 && DP>=10 && DP<=500 && MQ>=30' \
  -Oz -o ${SAMPLE}.bcftools.filtered.vcf.gz \
  ${SAMPLE}.bcftools.raw.vcf.gz

echo "Filtered variants: $(bcftools view -H ${SAMPLE}.bcftools.filtered.vcf.gz | wc -l)"
Filter expression guide:
  • QUAL >= 30 — Phred-scaled variant quality. 30 = 99.9% confidence. Use 20 for relaxed, 50 for stringent filtering.
  • DP >= 10 — minimum read depth at the site. 10 is a good minimum; for WGS at 30X you might use 15, for low-coverage data use 5.
  • DP <= 500 — maximum depth (removes collapsed repeats). Set to ~3× your expected mean coverage; abnormally high depth signals repetitive regions.
  • MQ >= 30 — mapping quality. 30 = 99.9% probability the reads are correctly mapped.
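The depth cap can be derived from your expected coverage rather than hard-coded. A tiny helper, assuming 30X WGS:

```shell
# Derive the depth cap from expected mean coverage (assumed 30X here).
MEAN_COV=30
MAX_DP=$(( MEAN_COV * 3 ))
MIN_DP=10
echo "bcftools filter -i 'QUAL>=30 && DP>=${MIN_DP} && DP<=${MAX_DP} && MQ>=30'"
```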

6 Variant Annotation

Annotation tells you the biological impact of each variant: is it synonymous, missense, nonsense? Which gene, transcript, or regulatory element is affected?

Bash
# SnpEff annotation (fast, local)
snpEff download GRCh38.105   # download the database for your genome build

snpEff ann \
  -stats snpeff_stats.html \
  GRCh38.105 \
  snps.pass.vcf.gz \
  > snps.annotated.vcf

# Extract high-impact variants (the conda recipe above installs a SnpSift wrapper)
SnpSift filter "(ANN[*].IMPACT = 'HIGH')" \
  snps.annotated.vcf \
  > high_impact_variants.vcf
Byte tip
Annotation Tools Compared:
  • SnpEff — fast, local, great for quick annotation; provides impact categories (HIGH, MODERATE, LOW, MODIFIER)
  • Ensembl VEP — most comprehensive; supports plugins for pathogenicity scores (CADD, SIFT, PolyPhen); preferred for clinical genomics
  • ANNOVAR — flexible, supports many databases; requires license for some databases
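Whichever tool you pick, the annotations land in the INFO field. SnpEff's ANN sub-fields are pipe-delimited (Allele|Annotation|Impact|Gene_Name|Gene_ID|...); the record below is a hand-written toy example, not real SnpEff output:

```shell
# Toy ANN value: Allele|Annotation|Impact|Gene_Name|Gene_ID (hand-written).
ann='G|missense_variant|MODERATE|BRCA1|ENSG00000012048'
echo "$ann" | awk -F'|' '{ print "effect:", $2, "impact:", $3, "gene:", $4 }'
```

Splitting on | like this is how the SnpSift expression ANN[*].IMPACT reaches the impact sub-field.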

Summary

Byte
What You Learned:
  1. QC and trim raw FASTQ reads with fastp
  2. Align reads to a reference genome with BWA-MEM2 and proper read groups
  3. Mark duplicates and recalibrate base qualities with GATK
  4. Call variants with GATK HaplotypeCaller (GVCF and direct modes)
  5. Apply hard filters with rationalized threshold choices for each metric
  6. Run an alternative pipeline using bcftools mpileup + call
  7. Annotate variants with SnpEff for biological impact interpretation