Discover SNPs and indels from whole-genome or exome sequencing data. This guide covers the full GATK4 best-practices pipeline, explains the reasoning behind each hard-filtering threshold, and includes a parallel bcftools workflow.
Variant calling identifies positions in the genome where your sample differs from a reference genome. These differences — SNPs (single nucleotide polymorphisms) and indels (insertions/deletions) — can explain phenotypic differences, disease susceptibility, or evolutionary adaptation.
# Install via conda (recommended)
conda create -n variant_calling -c bioconda -c conda-forge \
fastqc fastp bwa-mem2 samtools gatk4 bcftools snpsift snpeff picard
conda activate variant_calling

# Verify installations
gatk --version
bcftools --version
bwa-mem2 version
You also need a reference genome FASTA (reference.fa) and, for BQSR, VCFs of known variant sites (e.g., dbSNP and the Mills & 1000G indels set used below).
# Step 1a: Quality check with FastQC
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports/ -t 4

# Step 1b: Adapter trimming + quality filtering with fastp
fastp \
--in1 sample_R1.fastq.gz \
--in2 sample_R2.fastq.gz \
--out1 trimmed_R1.fastq.gz \
--out2 trimmed_R2.fastq.gz \
--qualified_quality_phred 20 \
--length_required 50 \
--detect_adapter_for_pe \
--thread 8 \
--html fastp_report.html
| Parameter | Purpose | How to Choose |
|---|---|---|
| --qualified_quality_phred 20 | Minimum base quality to keep | 20 = 1% error rate. Use 30 for stricter filtering (0.1% error). For variant calling, 20 is usually sufficient since GATK models base quality internally. |
| --length_required 50 | Discard reads shorter than this after trimming | 50 bp for 150 bp reads. For 100 bp reads, try 36. Very short reads map ambiguously. |
| --detect_adapter_for_pe | Auto-detect Illumina adapters for paired-end data | Almost always use this — fastp detects adapters automatically without needing a sequence file. |
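With more than a handful of samples, it is worth looping the fastp step over every read pair. A minimal sketch, assuming files live in a raw/ directory and follow a SAMPLE_R1.fastq.gz / SAMPLE_R2.fastq.gz naming convention (both assumptions, adjust to your layout):

```shell
mkdir -p trimmed qc_reports

for r1 in raw/*_R1.fastq.gz; do
  # Derive the sample name by stripping the _R1 suffix, then locate its mate
  sample=$(basename "$r1" _R1.fastq.gz)
  r2="raw/${sample}_R2.fastq.gz"
  fastp \
    --in1 "$r1" --in2 "$r2" \
    --out1 "trimmed/${sample}_R1.fastq.gz" \
    --out2 "trimmed/${sample}_R2.fastq.gz" \
    --qualified_quality_phred 20 \
    --length_required 50 \
    --detect_adapter_for_pe \
    --thread 8 \
    --html "qc_reports/${sample}.fastp.html"
done
```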
# Index the reference (one-time step)
bwa-mem2 index reference.fa
# Create sequence dictionary and fai index (required by GATK)
samtools faidx reference.fa
gatk CreateSequenceDictionary -R reference.fa
# Align with BWA-MEM2 (reads → SAM → sorted BAM)
SAMPLE="SampleA"
RGID="flowcell1.lane1"
bwa-mem2 mem \
-t 16 \
-R "@RG\tID:${RGID}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib1\tPU:unit1" \
reference.fa \
trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
| samtools sort -@ 8 -o ${SAMPLE}.sorted.bam
samtools index ${SAMPLE}.sorted.bam
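As a quick sanity check before the GATK steps, confirm the @RG line actually reached the BAM header; if a BAM arrived without read groups, Picard's AddOrReplaceReadGroups (bundled in GATK4) can retrofit them. A sketch using the sample names from above:

```shell
# Confirm the @RG line reached the sorted BAM's header
samtools view -H SampleA.sorted.bam | grep '^@RG'

# Retrofit read groups onto a BAM that lacks them
gatk AddOrReplaceReadGroups \
-I SampleA.sorted.bam \
-O SampleA.rg.bam \
--RGID flowcell1.lane1 --RGSM SampleA --RGPL ILLUMINA --RGLB lib1 --RGPU unit1
```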
The -R flag adds read group information that GATK requires. Without it, GATK will refuse to run. The fields are:

- ID — unique read group ID (usually flowcell + lane)
- SM — sample name (the biological sample)
- PL — platform (ILLUMINA, ONT, PACBIO)
- LB — library identifier
- PU — platform unit (flowcell barcode + lane)

PCR duplicates artificially inflate variant support. Marking (not removing) them tells GATK to ignore them during calling.
gatk MarkDuplicates \
-I ${SAMPLE}.sorted.bam \
-O ${SAMPLE}.dedup.bam \
-M ${SAMPLE}.dedup_metrics.txt \
--REMOVE_DUPLICATES false \
--CREATE_INDEX true
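The metrics file written by -M reports, among other things, the duplication rate. A small awk sketch to pull PERCENT_DUPLICATION out of the Picard-style metrics table (the table follows the "## METRICS CLASS" marker: a header row starting with LIBRARY, then a data row):

```shell
# Extract PERCENT_DUPLICATION from the MarkDuplicates metrics file
awk -F'\t' '
  /^LIBRARY/ { for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i; next }
  col && NF > 1 { print "Duplication rate: " $col; exit }
' ${SAMPLE}.dedup_metrics.txt
```

Duplication rates above roughly 20-30% are worth investigating (low library complexity or over-amplification).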
Machine-reported base qualities are often inaccurate. BQSR recalibrates them using known variant sites, improving downstream calling accuracy.
# Step 3a: Build recalibration model
gatk BaseRecalibrator \
-I ${SAMPLE}.dedup.bam \
-R reference.fa \
--known-sites dbsnp_146.vcf.gz \
--known-sites Mills_and_1000G.indels.vcf.gz \
-O ${SAMPLE}.recal_table
# Step 3b: Apply recalibration
gatk ApplyBQSR \
-I ${SAMPLE}.dedup.bam \
-R reference.fa \
--bqsr-recal-file ${SAMPLE}.recal_table \
-O ${SAMPLE}.recal.bam
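Before calling, a quick coverage check on the analysis-ready BAM is useful; the depth-based filters later in this guide assume you know your mean coverage. A sketch using samtools depth (BAM name follows the convention above; -a includes zero-coverage positions, so this is the mean across the whole reference):

```shell
# Mean depth across all reference positions
samtools depth -a ${SAMPLE}.recal.bam \
| awk '{ sum += $3; n++ } END { if (n) printf "Mean depth: %.1fx over %d positions\n", sum/n, n }'
```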
# Single-sample calling (produces GVCF for later joint genotyping)
gatk HaplotypeCaller \
-I ${SAMPLE}.recal.bam \
-R reference.fa \
-O ${SAMPLE}.g.vcf.gz \
-ERC GVCF \
--native-pair-hmm-threads 8
# For a single sample, you can also call directly to VCF:
gatk HaplotypeCaller \
-I ${SAMPLE}.recal.bam \
-R reference.fa \
-O ${SAMPLE}.raw.vcf.gz
GVCF mode (-ERC GVCF) records confidence at every position, including reference-only sites. It is required for multi-sample joint calling, so always use it for population studies; per-sample GVCFs are combined later with GenomicsDBImport + GenotypeGVCFs.

Raw variant calls contain many false positives, so filtering is essential. There are two approaches: VQSR (machine learning, needs large cohorts) and hard filtering (rule-based, works for cohorts of any size).
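For reference, the multi-sample path through GenomicsDBImport + GenotypeGVCFs looks roughly like this. The gvcfs/ directory and intervals.list are illustrative assumptions; this is a sketch, not a tuned production command:

```shell
# Collect every per-sample GVCF into -V arguments
GVCF_ARGS=""
for g in gvcfs/*.g.vcf.gz; do
  GVCF_ARGS="$GVCF_ARGS -V $g"
done

# Consolidate GVCFs into a GenomicsDB workspace over the given intervals
gatk GenomicsDBImport \
$GVCF_ARGS \
--genomicsdb-workspace-path cohort_db \
-L intervals.list

# Joint-genotype the cohort from the workspace
gatk GenotypeGVCFs \
-R reference.fa \
-V gendb://cohort_db \
-O cohort.raw.vcf.gz
```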
# Separate SNPs and Indels
gatk SelectVariants -V ${SAMPLE}.raw.vcf.gz -select-type SNP -O snps.vcf.gz
gatk SelectVariants -V ${SAMPLE}.raw.vcf.gz -select-type INDEL -O indels.vcf.gz
# Hard filter SNPs (GATK recommended thresholds)
gatk VariantFiltration \
-V snps.vcf.gz \
-filter "QD < 2.0" --filter-name "LowQD" \
-filter "FS > 60.0" --filter-name "StrandBias" \
-filter "MQ < 40.0" --filter-name "LowMQ" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum" \
-filter "SOR > 3.0" --filter-name "HighSOR" \
-O snps.filtered.vcf.gz
# Hard filter Indels
gatk VariantFiltration \
-V indels.vcf.gz \
-filter "QD < 2.0" --filter-name "LowQD" \
-filter "FS > 200.0" --filter-name "StrandBias" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum" \
-filter "SOR > 10.0" --filter-name "HighSOR" \
-O indels.filtered.vcf.gz
| Filter | What It Measures | Threshold Rationale |
|---|---|---|
| QD < 2.0 | Quality by Depth — variant quality normalized by read depth | Low QD = poor quality call relative to coverage. 2.0 is GATK's recommendation; increase to 5.0 for higher stringency. |
| FS > 60.0 | Fisher Strand bias — are variant reads preferentially from one strand? | High FS suggests PCR or mapping artifact. 60 for SNPs, 200 for indels (indels tolerate more strand bias). |
| MQ < 40.0 | Root mean square Mapping Quality of reads at the site | MQ 40 = RMS MAPQ of 40 (very confident mapping). Lower values suggest repetitive or ambiguous regions. |
| SOR > 3.0 | Strand Odds Ratio — similar to FS but better behaved at high depth | 3.0 for SNPs, 10.0 for indels. More robust than FS for high-depth data. |
| MQRankSum | Compares mapping quality of ref vs. alt reads | Large negative values mean alt reads map much worse than ref reads — likely a mapping artifact. |
| ReadPosRankSum | Tests whether alt alleles cluster at read ends | Variants near read ends are enriched for sequencing errors. |
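If you would rather tune thresholds than take the defaults, inspect the distribution of each annotation in your own raw calls first. A sketch for QD (the same pattern works for FS, MQ, and SOR; records missing the annotation appear as "." and are dropped):

```shell
# Median QD across raw SNP calls
bcftools query -f '%INFO/QD\n' snps.vcf.gz \
| grep -v '^\.$' \
| sort -n \
| awk '{ v[NR] = $1 } END { if (NR) print "median QD:", v[int((NR+1)/2)] }'
```

If the median sits close to a proposed cutoff, that cutoff will remove a large fraction of calls; defaults like QD < 2.0 are meant to clip only the low tail.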
# Keep only PASS variants
bcftools view -f PASS snps.filtered.vcf.gz -Oz -o snps.pass.vcf.gz
bcftools view -f PASS indels.filtered.vcf.gz -Oz -o indels.pass.vcf.gz

# Count variants
echo "PASS SNPs: $(bcftools view -H snps.pass.vcf.gz | wc -l)"
echo "PASS Indels: $(bcftools view -H indels.pass.vcf.gz | wc -l)"
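Beyond raw counts, bcftools stats gives a fuller picture: Ts/Tv ratio, per-quality counts, and the indel length spectrum. For human whole-genome SNPs, a Ts/Tv ratio around 2.0-2.1 is a common sanity check (a much lower value suggests residual false positives):

```shell
# Summary statistics on the final SNP call set
bcftools stats snps.pass.vcf.gz > snps.stats.txt

# The SN lines carry the headline numbers (records, SNPs, indels, Ts/Tv, ...)
grep '^SN' snps.stats.txt
```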
A lighter, faster pipeline that works well for non-human organisms or when GATK's infrastructure feels heavyweight.
# Pileup + Call in one pipeline
bcftools mpileup \
--fasta-ref reference.fa \
--max-depth 500 \
--min-MQ 20 \
--min-BQ 20 \
--annotate FORMAT/AD,FORMAT/DP \
--threads 8 \
${SAMPLE}.dedup.bam \
| bcftools call \
--multiallelic-caller \
--variants-only \
--threads 8 \
-Oz -o ${SAMPLE}.bcftools.raw.vcf.gz
bcftools index ${SAMPLE}.bcftools.raw.vcf.gz
# Quality filtering
bcftools filter \
-i 'QUAL>=30 && DP>=10 && DP<=500 && MQ>=30' \
-Oz -o ${SAMPLE}.bcftools.filtered.vcf.gz \
${SAMPLE}.bcftools.raw.vcf.gz
echo "Filtered variants: $(bcftools view -H ${SAMPLE}.bcftools.filtered.vcf.gz | wc -l)"
| bcftools Filter | Meaning | Guidance |
|---|---|---|
| QUAL >= 30 | Phred-scaled variant quality | 30 = 99.9% confidence. Use 20 for relaxed, 50 for stringent. |
| DP >= 10 | Minimum read depth at the site | 10 is a good minimum. For WGS at 30X, you might use 15. For low-coverage, use 5. |
| DP <= 500 | Maximum depth (removes collapsed repeats) | Set to ~3× your expected mean coverage. Abnormally high depth signals repetitive regions. |
| MQ >= 30 | Mapping quality | 30 = 99.9% probability the reads are correctly mapped. |
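To see how far the two pipelines agree, bcftools isec partitions the call sets into private and shared sites. A sketch: both inputs must be bgzipped and indexed, and note the bcftools set here still mixes SNPs and indels while the GATK file is SNPs only, so expect some disagreement from that alone:

```shell
# Partition the two call sets into private/shared subsets
bcftools isec \
-p isec_out \
snps.pass.vcf.gz \
${SAMPLE}.bcftools.filtered.vcf.gz

# 0000.vcf = first-file-only, 0001.vcf = second-file-only, 0002/0003.vcf = shared
for f in isec_out/000*.vcf; do
  echo "$f: $(grep -vc '^#' "$f") records"
done
```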
Annotation tells you the biological impact of each variant: is it synonymous, missense, nonsense? Which gene, transcript, or regulatory element is affected?
# SnpEff annotation (fast, local)
snpEff download GRCh38.105    # download the database for your genome build
snpEff ann -stats snpeff_stats.html GRCh38.105 snps.pass.vcf.gz \
> snps.annotated.vcf

# Extract high-impact variants (conda installs SnpSift as a wrapper command)
cat snps.annotated.vcf \
| SnpSift filter "(ANN[*].IMPACT = 'HIGH')" \
> high_impact_variants.vcf
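For downstream analysis it often helps to flatten SnpEff's ANN field into a plain table. A sketch with SnpSift extractFields, then a tally of impact classes (the output file name is illustrative; ANN[0] takes only the first, highest-priority annotation per variant):

```shell
# One row per variant: location, alleles, and the top annotation
SnpSift extractFields snps.annotated.vcf \
CHROM POS REF ALT "ANN[0].GENE" "ANN[0].EFFECT" "ANN[0].IMPACT" \
> variant_table.tsv

# Count variants per impact class (column 7 = IMPACT; skip the header row)
tail -n +2 variant_table.tsv | cut -f7 | sort | uniq -c | sort -rn
```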