Discover SNPs and indels from whole-genome or exome sequencing data. This guide covers the full GATK4 best-practices pipeline, explains the reasoning behind each hard-filtering threshold, and includes a parallel bcftools workflow.
Variant calling identifies positions in the genome where your sample differs from a reference genome. These differences — SNPs (single nucleotide polymorphisms) and indels (insertions/deletions) — can explain phenotypic differences, disease susceptibility, or evolutionary adaptation.
# Install via conda (recommended)
conda create -n variant_calling -c bioconda -c conda-forge \
fastqc fastp bwa-mem2 samtools gatk4 bcftools snpsift snpeff picard
conda activate variant_calling

# Verify installations
gatk --version
bcftools --version
bwa-mem2 version
You also need a reference genome FASTA (reference.fa) and, for BQSR, VCFs of known variant sites (e.g., dbSNP and the Mills & 1000G indels set used below).
# Step 1a: Quality check with FastQC
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports/ -t 4

# Step 1b: Adapter trimming + quality filtering with fastp
fastp \
--in1 sample_R1.fastq.gz \
--in2 sample_R2.fastq.gz \
--out1 trimmed_R1.fastq.gz \
--out2 trimmed_R2.fastq.gz \
--qualified_quality_phred 20 \
--length_required 50 \
--detect_adapter_for_pe \
--thread 8 \
--html fastp_report.html
| Parameter | Purpose | How to Choose |
|---|---|---|
| --qualified_quality_phred 20 | Minimum base quality to keep | 20 = 1% error rate. Use 30 for stricter filtering (0.1% error). For variant calling, 20 is usually sufficient since GATK models base quality internally. |
| --length_required 50 | Discard reads shorter than this after trimming | 50 bp for 150 bp reads. For 100 bp reads, try 36. Very short reads map ambiguously. |
| --detect_adapter_for_pe | Auto-detect Illumina adapters for paired-end data | Almost always use this — fastp detects adapters automatically without needing a sequence file. |
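With more than a handful of samples, it is worth looping the fastp step over every read pair. A minimal sketch, assuming files live in a raw/ directory and follow a SAMPLE_R1.fastq.gz / SAMPLE_R2.fastq.gz naming convention (both assumptions, adjust to your layout):

```shell
mkdir -p trimmed qc_reports

for r1 in raw/*_R1.fastq.gz; do
  # Derive the sample name by stripping the _R1 suffix, then locate its mate
  sample=$(basename "$r1" _R1.fastq.gz)
  r2="raw/${sample}_R2.fastq.gz"
  fastp \
    --in1 "$r1" --in2 "$r2" \
    --out1 "trimmed/${sample}_R1.fastq.gz" \
    --out2 "trimmed/${sample}_R2.fastq.gz" \
    --qualified_quality_phred 20 \
    --length_required 50 \
    --detect_adapter_for_pe \
    --thread 8 \
    --html "qc_reports/${sample}.fastp.html"
done
```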
# Index the reference (one-time step)
bwa-mem2 index reference.fa
# Create sequence dictionary and fai index (required by GATK)
samtools faidx reference.fa
gatk CreateSequenceDictionary -R reference.fa
# Align with BWA-MEM2 (reads → SAM → sorted BAM)
SAMPLE="SampleA"
RGID="flowcell1.lane1"
bwa-mem2 mem \
-t 16 \
-R "@RG\tID:${RGID}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib1\tPU:unit1" \
reference.fa \
trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
| samtools sort -@ 8 -o ${SAMPLE}.sorted.bam
samtools index ${SAMPLE}.sorted.bam
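As a quick sanity check before the GATK steps, confirm the @RG line actually reached the BAM header; if a BAM arrived without read groups, Picard's AddOrReplaceReadGroups (bundled in GATK4) can retrofit them. A sketch using the sample names from above:

```shell
# Confirm the @RG line reached the sorted BAM's header
samtools view -H SampleA.sorted.bam | grep '^@RG'

# Retrofit read groups onto a BAM that lacks them
gatk AddOrReplaceReadGroups \
-I SampleA.sorted.bam \
-O SampleA.rg.bam \
--RGID flowcell1.lane1 --RGSM SampleA --RGPL ILLUMINA --RGLB lib1 --RGPU unit1
```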
The -R flag adds read group information that GATK requires. Without it, GATK will refuse to run. The fields are:

- ID — unique read group ID (usually flowcell + lane)
- SM — sample name (the biological sample)
- PL — platform (ILLUMINA, ONT, PACBIO)
- LB — library identifier
- PU — platform unit (flowcell barcode + lane)

PCR duplicates artificially inflate variant support. Marking (not removing) them tells GATK to ignore them during calling.
gatk MarkDuplicates \
-I ${SAMPLE}.sorted.bam \
-O ${SAMPLE}.dedup.bam \
-M ${SAMPLE}.dedup_metrics.txt \
--REMOVE_DUPLICATES false \
--CREATE_INDEX true
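The metrics file written by -M reports, among other things, the duplication rate. A small awk sketch to pull PERCENT_DUPLICATION out of the Picard-style metrics table (the table follows the "## METRICS CLASS" marker: a header row starting with LIBRARY, then a data row):

```shell
# Extract PERCENT_DUPLICATION from the MarkDuplicates metrics file
awk -F'\t' '
  /^LIBRARY/ { for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i; next }
  col && NF > 1 { print "Duplication rate: " $col; exit }
' ${SAMPLE}.dedup_metrics.txt
```

Duplication rates above roughly 20-30% are worth investigating (low library complexity or over-amplification).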
Machine-reported base qualities are often inaccurate. BQSR recalibrates them using known variant sites, improving downstream calling accuracy.
# Step 3a: Build recalibration model
gatk BaseRecalibrator \
-I ${SAMPLE}.dedup.bam \
-R reference.fa \
--known-sites dbsnp_146.vcf.gz \
--known-sites Mills_and_1000G.indels.vcf.gz \
-O ${SAMPLE}.recal_table
# Step 3b: Apply recalibration
gatk ApplyBQSR \
-I ${SAMPLE}.dedup.bam \
-R reference.fa \
--bqsr-recal-file ${SAMPLE}.recal_table \
-O ${SAMPLE}.recal.bam
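Before calling, a quick coverage check on the analysis-ready BAM is useful; the depth-based filters later in this guide assume you know your mean coverage. A sketch using samtools depth (BAM name follows the convention above; -a includes zero-coverage positions, so this is the mean across the whole reference):

```shell
# Mean depth across all reference positions
samtools depth -a ${SAMPLE}.recal.bam \
| awk '{ sum += $3; n++ } END { if (n) printf "Mean depth: %.1fx over %d positions\n", sum/n, n }'
```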
# Single-sample calling (produces GVCF for later joint genotyping)
gatk HaplotypeCaller \
-I ${SAMPLE}.recal.bam \
-R reference.fa \
-O ${SAMPLE}.g.vcf.gz \
-ERC GVCF \
--native-pair-hmm-threads 8
# For a single sample, you can also call directly to VCF:
gatk HaplotypeCaller \
-I ${SAMPLE}.recal.bam \
-R reference.fa \
-O ${SAMPLE}.raw.vcf.gz
GVCF mode (-ERC GVCF) records confidence at every position, including reference-only sites. It is required for multi-sample joint calling, so always use it for population studies; per-sample GVCFs are combined later with GenomicsDBImport + GenotypeGVCFs.

Raw variant calls contain many false positives, so filtering is essential. There are two approaches: VQSR (machine learning, needs large cohorts) and hard filtering (rule-based, works for cohorts of any size).
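For reference, the multi-sample path through GenomicsDBImport + GenotypeGVCFs looks roughly like this. The gvcfs/ directory and intervals.list are illustrative assumptions; this is a sketch, not a tuned production command:

```shell
# Collect every per-sample GVCF into -V arguments
GVCF_ARGS=""
for g in gvcfs/*.g.vcf.gz; do
  GVCF_ARGS="$GVCF_ARGS -V $g"
done

# Consolidate GVCFs into a GenomicsDB workspace over the given intervals
gatk GenomicsDBImport \
$GVCF_ARGS \
--genomicsdb-workspace-path cohort_db \
-L intervals.list

# Joint-genotype the cohort from the workspace
gatk GenotypeGVCFs \
-R reference.fa \
-V gendb://cohort_db \
-O cohort.raw.vcf.gz
```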
# Separate SNPs and Indels
gatk SelectVariants -V ${SAMPLE}.raw.vcf.gz -select-type SNP -O snps.vcf.gz
gatk SelectVariants -V ${SAMPLE}.raw.vcf.gz -select-type INDEL -O indels.vcf.gz
# Hard filter SNPs (GATK recommended thresholds)
gatk VariantFiltration \
-V snps.vcf.gz \
-filter "QD < 2.0" --filter-name "LowQD" \
-filter "FS > 60.0" --filter-name "StrandBias" \
-filter "MQ < 40.0" --filter-name "LowMQ" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum" \
-filter "SOR > 3.0" --filter-name "HighSOR" \
-O snps.filtered.vcf.gz
# Hard filter Indels
gatk VariantFiltration \
-V indels.vcf.gz \
-filter "QD < 2.0" --filter-name "LowQD" \
-filter "FS > 200.0" --filter-name "StrandBias" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum" \
-filter "SOR > 10.0" --filter-name "HighSOR" \
-O indels.filtered.vcf.gz
| Filter | What It Measures | Threshold Rationale |
|---|---|---|
| QD < 2.0 | Quality by Depth — variant quality normalized by read depth | Low QD = poor quality call relative to coverage. 2.0 is GATK's recommendation; increase to 5.0 for higher stringency. |
| FS > 60.0 | Fisher Strand bias — are variant reads preferentially from one strand? | High FS suggests PCR or mapping artifact. 60 for SNPs, 200 for indels (indels tolerate more strand bias). |
| MQ < 40.0 | Root mean square Mapping Quality of reads at the site | MQ 40 = RMS MAPQ of 40 (very confident mapping). Lower values suggest repetitive or ambiguous regions. |
| SOR > 3.0 | Strand Odds Ratio — similar to FS but better behaved at high depth | 3.0 for SNPs, 10.0 for indels. More robust than FS for high-depth data. |
| MQRankSum | Compares mapping quality of ref vs. alt reads | Large negative values mean alt reads map much worse than ref reads — likely a mapping artifact. |
| ReadPosRankSum | Tests whether alt alleles cluster at read ends | Variants near read ends are enriched for sequencing errors. |
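If you would rather tune thresholds than take the defaults, inspect the distribution of each annotation in your own raw calls first. A sketch for QD (the same pattern works for FS, MQ, and SOR; records missing the annotation appear as "." and are dropped):

```shell
# Median QD across raw SNP calls
bcftools query -f '%INFO/QD\n' snps.vcf.gz \
| grep -v '^\.$' \
| sort -n \
| awk '{ v[NR] = $1 } END { if (NR) print "median QD:", v[int((NR+1)/2)] }'
```

If the median sits close to a proposed cutoff, that cutoff will remove a large fraction of calls; defaults like QD < 2.0 are meant to clip only the low tail.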
# Keep only PASS variants
bcftools view -f PASS snps.filtered.vcf.gz -Oz -o snps.pass.vcf.gz
bcftools view -f PASS indels.filtered.vcf.gz -Oz -o indels.pass.vcf.gz

# Count variants
echo "PASS SNPs: $(bcftools view -H snps.pass.vcf.gz | wc -l)"
echo "PASS Indels: $(bcftools view -H indels.pass.vcf.gz | wc -l)"
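Beyond raw counts, bcftools stats gives a fuller picture: Ts/Tv ratio, per-quality counts, and the indel length spectrum. For human whole-genome SNPs, a Ts/Tv ratio around 2.0-2.1 is a common sanity check (a much lower value suggests residual false positives):

```shell
# Summary statistics on the final SNP call set
bcftools stats snps.pass.vcf.gz > snps.stats.txt

# The SN lines carry the headline numbers (records, SNPs, indels, Ts/Tv, ...)
grep '^SN' snps.stats.txt
```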
A lighter, faster pipeline that works well for non-human organisms or when GATK's infrastructure feels heavyweight.
# Pileup + Call in one pipeline
bcftools mpileup \
--fasta-ref reference.fa \
--max-depth 500 \
--min-MQ 20 \
--min-BQ 20 \
--annotate FORMAT/AD,FORMAT/DP \
--threads 8 \
${SAMPLE}.dedup.bam \
| bcftools call \
--multiallelic-caller \
--variants-only \
--threads 8 \
-Oz -o ${SAMPLE}.bcftools.raw.vcf.gz
bcftools index ${SAMPLE}.bcftools.raw.vcf.gz
# Quality filtering
bcftools filter \
-i 'QUAL>=30 && DP>=10 && DP<=500 && MQ>=30' \
-Oz -o ${SAMPLE}.bcftools.filtered.vcf.gz \
${SAMPLE}.bcftools.raw.vcf.gz
echo "Filtered variants: $(bcftools view -H ${SAMPLE}.bcftools.filtered.vcf.gz | wc -l)"
| bcftools Filter | Meaning | Guidance |
|---|---|---|
| QUAL >= 30 | Phred-scaled variant quality | 30 = 99.9% confidence. Use 20 for relaxed, 50 for stringent. |
| DP >= 10 | Minimum read depth at the site | 10 is a good minimum. For WGS at 30X, you might use 15. For low-coverage, use 5. |
| DP <= 500 | Maximum depth (removes collapsed repeats) | Set to ~3× your expected mean coverage. Abnormally high depth signals repetitive regions. |
| MQ >= 30 | Mapping quality | 30 = 99.9% probability the reads are correctly mapped. |
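To see how far the two pipelines agree, bcftools isec partitions the call sets into private and shared sites. A sketch: both inputs must be bgzipped and indexed, and note the bcftools set here still mixes SNPs and indels while the GATK file is SNPs only, so expect some disagreement from that alone:

```shell
# Partition the two call sets into private/shared subsets
bcftools isec \
-p isec_out \
snps.pass.vcf.gz \
${SAMPLE}.bcftools.filtered.vcf.gz

# 0000.vcf = first-file-only, 0001.vcf = second-file-only, 0002/0003.vcf = shared
for f in isec_out/000*.vcf; do
  echo "$f: $(grep -vc '^#' "$f") records"
done
```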
Annotation tells you the biological impact of each variant: is it synonymous, missense, nonsense? Which gene, transcript, or regulatory element is affected?
# SnpEff annotation (fast, local)
snpEff download GRCh38.105    # download the database for your genome build
snpEff ann -stats snpeff_stats.html GRCh38.105 snps.pass.vcf.gz \
> snps.annotated.vcf

# Extract high-impact variants (conda installs SnpSift as a wrapper command)
cat snps.annotated.vcf \
| SnpSift filter "(ANN[*].IMPACT = 'HIGH')" \
> high_impact_variants.vcf
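For downstream analysis it often helps to flatten SnpEff's ANN field into a plain table. A sketch with SnpSift extractFields, then a tally of impact classes (the output file name is illustrative; ANN[0] takes only the first, highest-priority annotation per variant):

```shell
# One row per variant: location, alleles, and the top annotation
SnpSift extractFields snps.annotated.vcf \
CHROM POS REF ALT "ANN[0].GENE" "ANN[0].EFFECT" "ANN[0].IMPACT" \
> variant_table.tsv

# Count variants per impact class (column 7 = IMPACT; skip the header row)
tail -n +2 variant_table.tsv | cut -f7 | sort | uniq -c | sort -rn
```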