Universal Variant Filtering Pipeline

VQSR for human WGS · hard filters for small cohorts & non-model organisms · RNA-seq variant filters · bcftools filter · genotype-level filtering · automated Python pipeline · decision tree guide

~100 min · Intermediate–Advanced · Bash / Python

Filtering Decision Tree

The biggest mistake in variant calling is applying the wrong filtering strategy for your data. Use this decision tree before writing a single filter command:

| Your data | Strategy | Why |
|---|---|---|
| Human/mouse WGS, ≥30 samples | GATK VQSR | A Gaussian mixture model trained on known-variant truth sets. Most accurate for well-annotated genomes. |
| Human WES, ≥30 samples | GATK VQSR (reduced resources) | Same as WGS but with reduced truth sensitivity; WES has sparser coverage and fewer SNPs per sample for training. |
| Any organism, <30 samples, OR non-model organism without dbSNP/HapMap | Hard filters | VQSR requires training data (dbSNP, HapMap, 1000G) and enough variants to fit a model. Without these, hard filters are more reliable. |
| Non-model organism (crop, wildlife, marine, insect) | Hard filters + bcftools | No resource files exist. Hard filters with organism-appropriate thresholds are the standard approach. |
| RNA-seq variant calling | RNA-seq-specific hard filters | RNA editing artefacts, mapping bias near junctions, and strand-specific errors require different thresholds than DNA. |
| Single sample (solo) or trios | Hard filters | VQSR is unreliable below ~30 samples. Use hard filters + DeepVariant for best accuracy. |
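The decision tree above can be written as a small helper function. This is a sketch of my own (the function name, the `organism` labels, and the `has_truth_sets` flag are illustrative; the ~30-sample rule of thumb comes from the table):

```python
def choose_strategy(organism: str, n_samples: int, has_truth_sets: bool = False) -> str:
    """Mirror of the decision table: pick a filtering strategy for a dataset."""
    if organism == "rnaseq":
        return "rnaseq_hard_filters"
    if organism in ("human_wgs", "human_wes", "mouse_wgs"):
        # VQSR needs both truth/training resources and enough variants to fit
        if n_samples >= 30 and has_truth_sets:
            return "vqsr"
        return "hard_filters"          # solo samples, trios, small cohorts
    return "hard_filters"              # non-model organism: no resource files

print(choose_strategy("human_wgs", 120, has_truth_sets=True))   # vqsr
print(choose_strategy("human_wgs", 3, has_truth_sets=True))     # hard_filters
print(choose_strategy("wildlife", 200))                         # hard_filters
```

Note that truth-set availability is checked alongside cohort size: either one missing pushes you to hard filters, which is exactly the logic of rows 3 and 6 above.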

Key QC Metrics Explained

Every VCF INFO field used in filtering has a biological or statistical interpretation. Understanding these lets you tune thresholds intelligently rather than blindly copying defaults.

| Metric | Full name | What it measures | Filter direction | GATK default (SNP) |
|---|---|---|---|---|
| QD | Quality by Depth | QUAL normalised by the depth of reads supporting the allele; prevents high-depth sites from having spuriously high QUAL. | REMOVE if QD < 2 | < 2.0 |
| MQ | Mapping Quality | RMS mapping quality of reads at the variant site. Low MQ = reads are mapping poorly = likely false positive from misalignment. | REMOVE if MQ < 40 | < 40.0 |
| FS | Fisher Strand Bias | Phred-scaled p-value from Fisher's exact test for strand bias. True variants should appear on both strands equally; artefacts tend to appear on only one strand. | REMOVE if FS > 60 | > 60.0 (SNP), > 200.0 (indel) |
| SOR | Strand Odds Ratio | A strand-bias metric less sensitive to high coverage than FS. Measures the ratio of reads on each strand for ref vs alt allele. | REMOVE if SOR > 3 | > 3.0 |
| MQRankSum | MQ Rank Sum Test | Z-score from a Wilcoxon rank-sum test comparing mapping qualities of ref-supporting vs alt-supporting reads. Very negative = alt reads map poorly = likely misalignment artefact. | REMOVE if < −12.5 | < −12.5 |
| ReadPosRankSum | Read Position Rank Sum Test | Z-score comparing the position of ref vs alt alleles within reads. Artefacts tend to cluster at read ends (polymerase error); true variants appear throughout the read. | REMOVE if < −8 | < −8.0 |
| QUAL | Phred-scaled variant quality | Phred-scaled probability that the variant call is wrong. QUAL 30 = 0.1% error rate. | REMOVE if < 30 | Site-specific |
| DP | Depth | Total read depth across all samples at this site. Very high depth = repetitive region (paralogous mapping); very low = insufficient evidence. | REMOVE if DP < 5 or DP > 3× mean | Data-specific |
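To check whether a single variant's INFO values would pass or fail these SNP thresholds, the table can be applied mechanically. A sketch of my own (thresholds copied from the table; the example INFO values are made up):

```python
# GATK SNP hard-filter defaults from the table: (comparison, cutoff).
SNP_FILTERS = {
    "QD":             ("<", 2.0),
    "MQ":             ("<", 40.0),
    "FS":             (">", 60.0),
    "SOR":            (">", 3.0),
    "MQRankSum":      ("<", -12.5),
    "ReadPosRankSum": ("<", -8.0),
}

def check_snp(info):
    """Return metric -> 'PASS'/'FAIL'/'NA' for one variant's INFO values."""
    verdict = {}
    for metric, (op, cutoff) in SNP_FILTERS.items():
        val = info.get(metric)
        if val is None:                      # annotation absent (e.g. no het samples)
            verdict[metric] = "NA"
        elif (op == "<" and val < cutoff) or (op == ">" and val > cutoff):
            verdict[metric] = "FAIL"
        else:
            verdict[metric] = "PASS"
    return verdict

site = {"QD": 1.3, "MQ": 59.8, "FS": 61.2, "SOR": 1.4, "ReadPosRankSum": -0.6}
print(check_snp(site))   # QD and FS fail, MQRankSum is NA, the rest pass
```

Missing annotations are reported as `NA` rather than `FAIL`; GATK's `VariantFiltration` similarly treats a missing annotation as passing that expression (with a warning), which is why rank-sum filters never remove sites without heterozygous genotypes.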


Step 1 — GATK VQSR (Human WGS ≥30 Samples)

Bash — GATK4 VariantRecalibrator + ApplyVQSR for SNPs and indels
######################################################################
## STEP 1: GATK Variant Quality Score Recalibration (VQSR)
## VQSR trains a Gaussian mixture model using known-variant truth sets
## (HapMap, Omni, 1000G, dbSNP) to assign a recalibration score to
## each variant. This score reflects the probability that the variant
## is a true polymorphism rather than a sequencing/mapping artefact.
##
## MUST run SNPs and indels SEPARATELY with different truth resources
## and different annotation sets. SNPs: 6 annotations; indels: 5 (MQ dropped).
######################################################################

module load gatk/4.4.0

## Resource files (download from GATK Resource Bundle gs://gatk-best-practices)
HAPMAP="hapmap_3.3.hg38.vcf.gz"          # truth=true, training=true, prior=15
OMNI="1000G_omni2.5.hg38.vcf.gz"         # truth=true, training=true, prior=12
G1000="1000G_phase1.snps.high_confidence.hg38.vcf.gz"  # truth=false, training=true, prior=10
DBSNP="dbsnp_146.hg38.vcf.gz"            # truth=false, training=false, known=true (not used for training)
MILLS="Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"  # truth=true, training=true, prior=12 (indels)
AXIOM="Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz"  # indel training

JOINT_VCF="joint_genotyped.vcf.gz"
REF="Homo_sapiens_assembly38.fasta"

## ================================================================
## PART A: SNP RECALIBRATION
## ================================================================

## Step A1: Build the SNP recalibration model
## -resource:NAME,truth,training,known,prior=VALUE
##   truth    = used to evaluate sensitivity (are these truly real variants?)
##   training = used to train the Gaussian model
##   known    = used for titv calculation and reporting only
##   prior    = how much weight to give this resource (higher = more trusted)
## -an : annotations to use for the model
##   QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR are the 6 key annotations
## --mode SNP : this run is for SNPs only
## -tranche : which sensitivity tranches to output
##   99.5% tranche = retains 99.5% of truth-set SNPs (a sensitivity target, not a false-positive rate)
## --max-gaussians 6 : fit 6 Gaussian components (reduce to 4 if you get convergence errors)
gatk VariantRecalibrator \
    -R ${REF} \
    -V ${JOINT_VCF} \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${HAPMAP} \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 ${OMNI} \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${G1000} \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${DBSNP} \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    --mode SNP \
    --max-gaussians 6 \
    -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 \
    -O snp_recal.vcf.gz \
    --tranches-file snp_recal.tranches \
    --rscript-file snp_recal.R \
    2>&1 | tee snp_vqsr.log

## Step A2: Apply SNP recalibration
## --truth-sensitivity-filter-level 99.5 : PASS every variant scoring at or
##   above the VQSLOD threshold that captures 99.5% of truth-set SNPs
## Variants scoring below the threshold are labelled with the tranche they
##   fall into (e.g. VQSRTrancheSNP99.50to99.90), not removed
## --mode SNP : must match the mode used in VariantRecalibrator
gatk ApplyVQSR \
    -R ${REF} \
    -V ${JOINT_VCF} \
    --recal-file snp_recal.vcf.gz \
    --tranches-file snp_recal.tranches \
    --truth-sensitivity-filter-level 99.5 \
    --mode SNP \
    -O snp_vqsr_applied.vcf.gz

## ================================================================
## PART B: INDEL RECALIBRATION
## ================================================================
## Indels use fewer annotations (-an): MQ is dropped because it is not reliable for indels
gatk VariantRecalibrator \
    -R ${REF} \
    -V snp_vqsr_applied.vcf.gz \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 ${MILLS} \
    -resource:axiom,known=false,training=true,truth=false,prior=10.0 ${AXIOM} \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${DBSNP} \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    --mode INDEL \
    --max-gaussians 4 \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -O indel_recal.vcf.gz \
    --tranches-file indel_recal.tranches

gatk ApplyVQSR \
    -R ${REF} \
    -V snp_vqsr_applied.vcf.gz \
    --recal-file indel_recal.vcf.gz \
    --tranches-file indel_recal.tranches \
    --truth-sensitivity-filter-level 99.0 \
    --mode INDEL \
    -O final_vqsr_filtered.vcf.gz

## Count PASS variants
echo "PASS SNPs after VQSR:"
bcftools view -f PASS --types snps -H final_vqsr_filtered.vcf.gz | wc -l
echo "PASS indels after VQSR:"
bcftools view -f PASS --types indels -H final_vqsr_filtered.vcf.gz | wc -l
VQSR tranche sensitivity decoded
📈 99.5% vs 99.0% sensitivity: The tranche sensitivity is NOT the false positive rate. It means "99.5% of the variants in our truth set (HapMap + Omni) are called at or above this score." Setting --truth-sensitivity-filter-level 99.5 keeps every variant whose score is at least that of the lowest-scoring variant in the best 99.5% of the truth set. For SNPs, 99.5% is standard; for indels, 99.0%, because indel calling is inherently less accurate.
🎯 --max-gaussians convergence errors: If VariantRecalibrator fails with "NaN" or "convergence" errors, reduce --max-gaussians from 6 to 4 (SNPs) or from 4 to 2 (indels). This happens when there are too few variants to fit the model (WES or a small cohort) or when the variant set overlaps the training data too sparsely. If it still fails with 2 Gaussians, switch to hard filters instead.
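The tranche arithmetic can be made concrete: given VQSLOD scores of truth-set variants, the 99.5% threshold is simply the score of the last variant inside the best-scoring 99.5%. A sketch of my own with synthetic scores (the real tool also fits the model per tranche, but the cutoff logic is this):

```python
def tranche_cutoff(truth_scores, sensitivity=0.995):
    """Score threshold that retains `sensitivity` of truth-set variants."""
    ranked = sorted(truth_scores, reverse=True)       # best score first
    keep = max(1, round(sensitivity * len(ranked)))   # how many truth variants to keep
    return ranked[keep - 1]                           # score of the worst kept variant

scores = list(range(1, 1001))           # stand-in VQSLOD scores for 1000 truth SNPs
print(tranche_cutoff(scores, 0.995))    # 6  -> PASS everything scoring >= 6
print(tranche_cutoff(scores, 0.99))     # 11 -> lower sensitivity = stricter cutoff
```

This also shows why dropping from 99.5% to 99.0% sensitivity raises the score threshold: you are willing to lose more truth variants, so the bar for PASS moves up.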

Step 2 — Hard Filters (Non-model Organisms & Small Cohorts)

Bash — GATK hard filters for SNPs and indels; organism-specific thresholds
######################################################################
## STEP 2: Hard filters — the universal alternative to VQSR
## Use when: non-model organism, <30 samples, no resource files
## Strategy:
##   - Filter SNPs and indels separately (different annotation distributions)
##   - Use GATK's recommended thresholds as STARTING POINTS
##   - Always EXAMINE your annotation distributions first (histograms!)
##   - Then adjust thresholds based on where YOUR data has clear bimodal splits
##   NEVER copy-paste hard filter thresholds without examining your own data.
######################################################################

module load gatk/4.4.0
module load bcftools/1.17

JOINT_VCF="joint_genotyped.vcf.gz"
REF="genome.fasta"

## ================================================================
## CRITICAL: Examine annotation distributions BEFORE setting thresholds
## ================================================================

## Extract INFO fields to a table for distribution plots
gatk VariantsToTable \
    -V ${JOINT_VCF} \
    -F CHROM -F POS -F TYPE -F QUAL \
    -F QD -F MQ -F FS -F SOR -F MQRankSum -F ReadPosRankSum \
    -O annotation_distributions.tsv

## THEN plot distributions in R to choose thresholds:
cat << 'RSCRIPT' > plot_distributions.R
library(tidyverse)
df <- read.delim("annotation_distributions.tsv") %>%
    separate(TYPE, into=c("TYPE"), sep=",", extra="drop")
## Plot each annotation for SNPs and indels
metrics <- c("QD","MQ","FS","SOR","MQRankSum","ReadPosRankSum")
for (m in metrics) {
    if (!m %in% colnames(df)) next
    df_plot <- df %>% filter(!is.na(.data[[m]]))
    p <- ggplot(df_plot, aes(.data[[m]], fill=TYPE)) +
        geom_histogram(bins=100, alpha=0.6, position="identity") +
        facet_wrap(~TYPE, scales="free") +
        labs(title=paste("Distribution:", m)) +
        theme_bw()
    ggsave(paste0("dist_", m, ".png"), p, width=8, height=4)
}
cat("Distribution plots saved.\n")
RSCRIPT
Rscript plot_distributions.R
echo "Open dist_*.png to examine annotation distributions before choosing thresholds"

## ================================================================
## APPLY HARD FILTERS — SNPs
## ================================================================
## Step 1: Extract SNPs
gatk SelectVariants -R ${REF} -V ${JOINT_VCF} --select-type-to-include SNP -O snps_raw.vcf.gz

## Step 2: Apply filters
## VariantFiltration: adds FILTER tags to failing variants (does NOT remove them)
## Each --filter-expression gets a --filter-name tag
## Variants not failing any filter remain: FILTER=PASS
## IMPORTANT: these thresholds are GATK defaults for human WGS
## For plant genomes (lower heterozygosity): consider MQ > 50 instead of 40
## For highly inbred lines: consider QD > 5 instead of 2
gatk VariantFiltration \
    -R ${REF} \
    -V snps_raw.vcf.gz \
    --filter-expression "QD < 2.0"            --filter-name "QD2" \
    --filter-expression "MQ < 40.0"           --filter-name "MQ40" \
    --filter-expression "FS > 60.0"           --filter-name "FS60" \
    --filter-expression "SOR > 3.0"           --filter-name "SOR3" \
    --filter-expression "MQRankSum < -12.5"   --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O snps_hardfiltered.vcf.gz

## Step 3: Keep only PASS
bcftools view -f PASS snps_hardfiltered.vcf.gz -O z -o snps_PASS.vcf.gz
tabix -p vcf snps_PASS.vcf.gz
echo "PASS SNPs: $(bcftools view -H snps_PASS.vcf.gz | wc -l)"

## ================================================================
## APPLY HARD FILTERS — Indels
## ================================================================
gatk SelectVariants -R ${REF} -V ${JOINT_VCF} --select-type-to-include INDEL -O indels_raw.vcf.gz

## Indel-specific thresholds: a more lenient FS cutoff (200 vs 60) because true
## indels often show elevated strand bias; MQ-based filters are dropped (unreliable for indels)
gatk VariantFiltration \
    -R ${REF} \
    -V indels_raw.vcf.gz \
    --filter-expression "QD < 2.0"              --filter-name "QD2" \
    --filter-expression "FS > 200.0"            --filter-name "FS200" \
    --filter-expression "SOR > 10.0"            --filter-name "SOR10" \
    --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    -O indels_hardfiltered.vcf.gz

bcftools view -f PASS indels_hardfiltered.vcf.gz -O z -o indels_PASS.vcf.gz
tabix -p vcf indels_PASS.vcf.gz
echo "PASS indels: $(bcftools view -H indels_PASS.vcf.gz | wc -l)"

## ================================================================
## MERGE FILTERED SNPs AND INDELS
## ================================================================
bcftools concat -a snps_PASS.vcf.gz indels_PASS.vcf.gz | \
    bcftools sort -O z -o final_hardfiltered.vcf.gz
tabix -p vcf final_hardfiltered.vcf.gz
echo "Final PASS variants (SNP + indel): $(bcftools view -H final_hardfiltered.vcf.gz | wc -l)"
📌 Never copy-paste thresholds blindly! The GATK hard-filter thresholds were derived from human WGS data. For plants with very low heterozygosity (selfing crops like rice, soybean, barley), the FS and MQ distributions shift. Always plot dist_*.png first, look for bimodal distributions, and place your threshold at the valley between the two peaks; that is the optimal cutoff for your specific data.
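The "valley between two peaks" rule can be automated roughly. A sketch of my own (equal-width binning, no smoothing, ties broken leftward; for real data, eyeball the dist_*.png histograms rather than trusting this blindly):

```python
def valley_threshold(values, bins=50):
    """Midpoint of the emptiest bin between the two tallest histogram peaks."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return lo                                  # degenerate: all values identical
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    # the two modes = the two tallest bins, ordered left to right
    p1, p2 = sorted(sorted(range(bins), key=counts.__getitem__)[-2:])
    # emptiest bin strictly between the modes (ties -> leftmost)
    valley = min(range(p1 + 1, p2), key=counts.__getitem__, default=p1)
    return lo + (valley + 0.5) * width

# Synthetic bimodal QD-like values: an artefact cluster near 2, a real cluster near 30
qd = [2 + 0.01 * i for i in range(100)] + [30 + 0.01 * i for i in range(100)]
print(round(valley_threshold(qd), 2))   # lands between the two clusters
```

A clearly unimodal metric has no meaningful valley, so always confirm bimodality visually first; this helper will happily return a threshold either way.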

Step 3 — Genotype-Level Filtering

Bash — Filter individual genotypes by depth and GQ, then remove low-call-rate sites
######################################################################
## STEP 3: Genotype-level filtering (per-sample, not per-site)
## Site-level filters (VQSR/hard filters) assess the VARIANT as a whole.
## Genotype-level filters assess EACH SAMPLE's genotype call:
##   - A site may PASS site-level filters but have some sample genotypes
##     supported by only 2 reads — those individual genotypes should be set to missing
##   - After masking low-quality genotypes, re-filter sites with too much missingness
##
## This two-step approach (genotype filter → missingness filter) is
## more powerful than site-only filtering.
######################################################################

## --- Step 3a: Mask low-quality individual genotypes ---
## gatk VariantFiltration --genotype-filter:
##   GQ < 20 : genotype quality < 20 (< 99% confident in this specific genotype)
##   DP < 5  : fewer than 5 reads supporting this genotype
##   These individual genotypes are set to MISSING (./.)
##   The variant SITE is NOT removed — just that sample's call is masked
gatk VariantFiltration \
    -V final_hardfiltered.vcf.gz \
    --genotype-filter-expression "GQ < 20"  --genotype-filter-name "lowGQ" \
    --genotype-filter-expression "DP < 5"   --genotype-filter-name "lowDP" \
    --set-filtered-gt-to-nocall \
    -O genotype_filtered.vcf.gz

## --- Step 3b: Remove sites with too much missingness ---
## After masking low-quality genotypes, some sites may now have:
##   - 30%+ of samples with missing genotype
##   - These sites are unreliable — too few samples support the allele frequency
## --max-missing 0.8 : keep sites where 80%+ of samples have a genotype
##   (= allow up to 20% missing data per site; adjust based on project)
## --maf 0.01 : remove variants with MAF < 1% (very rare, unreliable)
## --minQ 30  : remove sites with QUAL < 30
vcftools \
    --gzvcf genotype_filtered.vcf.gz \
    --max-missing 0.8 \
    --maf 0.01 \
    --minQ 30 \
    --recode --recode-INFO-all \
    --out final_cleaned
bgzip final_cleaned.recode.vcf
tabix -p vcf final_cleaned.recode.vcf.gz
echo "Final cleaned variants: $(bcftools view -H final_cleaned.recode.vcf.gz | wc -l)"

## --- Step 3c: Check per-sample missingness after filtering ---
## Any sample with >20% missing genotypes is likely a poor-quality sample
vcftools --gzvcf final_cleaned.recode.vcf.gz \
    --missing-indv \
    --out sample_missingness
echo "=== Samples with >20% missing genotypes (consider removing) ==="
awk 'NR>1 && $5 > 0.2 {print $1, "Missing:", $5*100"%"}' sample_missingness.imiss
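The mask-then-missingness logic of Step 3 in miniature, as a sketch of my own (genotypes as `(GT, GQ, DP)` tuples; GQ/DP/missingness thresholds match the commands above):

```python
def mask_and_filter(sites, min_gq=20, min_dp=5, max_missing=0.2):
    """Mask genotypes failing GQ/DP, then drop sites with too many missing calls."""
    kept = []
    for genotypes in sites:
        masked = [gt if gq >= min_gq and dp >= min_dp else "./."
                  for gt, gq, dp in genotypes]
        if masked.count("./.") / len(masked) <= max_missing:
            kept.append(masked)
    return kept

sites = [
    # 1 of 5 genotypes masked (20% missing) -> site kept
    [("0/1", 30, 12), ("1/1", 50, 20), ("0/0", 10, 3), ("0/1", 25, 8), ("0/0", 40, 15)],
    # 2 of 5 genotypes masked (40% missing) -> site dropped
    [("0/1", 30, 12), ("0/0", 12, 4), ("1/1", 8, 30), ("0/1", 25, 8), ("0/0", 40, 15)],
]
print(len(mask_and_filter(sites)))   # 1
```

The key point the sketch makes explicit: masking happens per genotype, and only afterwards is the site judged on its remaining call rate, which is why the two-step order matters.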

Step 4 — RNA-seq Variant Filtering

Bash — RNA-seq-specific hard filters for SNPs called from RNA-seq data
######################################################################
## STEP 4: Special filters for RNA-seq-derived variants
## RNA-seq variant calling (e.g. the GATK RNA-seq short-variant best practices)
## requires different filters than DNA because:
##   1. RNA editing (A-to-I): adenosine-to-inosine RNA editing creates
##      apparent A>G variants. Must be recognised and filtered.
##   2. Allele-specific expression: the REF allele may be expressed more
##      than ALT due to allelic imbalance, not true heterozygosity.
##   3. Splicing artefacts: reads spanning splice junctions have
##      soft-clipped bases that can look like insertions/deletions.
##   4. Strand bias: RNA-seq is directional — some strand bias is expected.
##      Use higher FS threshold than DNA (FS > 30 not FS > 60).
######################################################################

## GATK RNA-seq variant calling best practices:
## 1. STAR alignment with 2-pass mapping
## 2. SplitNCigarReads (handles spliced reads)
## 3. BQSR (same as DNA)
## 4. HaplotypeCaller with --dont-use-soft-clipped-bases
## 5. Hard filters (VQSR not recommended for RNA-seq)

## RNA-seq-specific hard filters:
## FS > 30.0          : lower strand bias tolerance (RNA-seq already has directionality)
## DP < 10            : require more evidence (RNA-seq has higher depth at expressed regions)
## QUAL < 20          : stricter quality threshold
## Cluster of variants (--cluster-size 3 --cluster-window-size 35):
##   Remove variants within 35 bp windows that contain 3+ SNPs
##   (clusters of variants are often RT-PCR/splicing artefacts in RNA)
gatk VariantFiltration \
    -V rnaseq_raw_variants.vcf.gz \
    --filter-expression "QD < 2.0"            --filter-name "QD2" \
    --filter-expression "FS > 30.0"           --filter-name "FS30" \
    --filter-expression "DP < 10"             --filter-name "DP10" \
    --cluster-size 3 \
    --cluster-window-size 35 \
    -O rnaseq_filtered.vcf.gz

## --- Remove known RNA editing sites ---
## Download RNA editing database (REDIportal for human):
## wget http://srv00.recas.ba.infn.it/atlas/hg38.txt.gz
## Convert to VCF format for filtering:
## (this is a simplified illustration — in practice, convert REDIportal to VCF)
if [ -f rna_editing_sites.vcf.gz ]; then
    ## -C : keep sites private to the first file (i.e. drop sites that
    ##      are present in the editing database)
    bcftools isec \
        -C rnaseq_filtered.vcf.gz \
        rna_editing_sites.vcf.gz \
        -p rna_editing_removed/
    echo "Variants after RNA editing site removal:"
    bcftools view -H rna_editing_removed/0000.vcf | wc -l
fi

## --- Count biallelic SNPs (most useful for downstream analysis) ---
bcftools view \
    -f PASS \
    -v snps \
    --max-alleles 2 \
    rnaseq_filtered.vcf.gz \
    -O z -o rnaseq_biallelic_snps.vcf.gz
tabix -p vcf rnaseq_biallelic_snps.vcf.gz
echo "Biallelic PASS SNPs from RNA-seq: $(bcftools view -H rnaseq_biallelic_snps.vcf.gz | wc -l)"
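Before intersecting with a database like REDIportal, A-to-I candidates can be pre-flagged by substitution signature alone. A sketch of my own (A>G on the transcribed strand, or T>C on the opposite strand in unstranded data; with stranded libraries you can restrict to A>G):

```python
def is_editing_candidate(ref: str, alt: str) -> bool:
    """True if REF>ALT matches the A-to-I RNA editing signature."""
    return (ref.upper(), alt.upper()) in {("A", "G"), ("T", "C")}

calls = [("A", "G"), ("T", "C"), ("C", "A"), ("A", "T")]
print([is_editing_candidate(r, a) for r, a in calls])   # [True, True, False, False]
```

This signature check is deliberately permissive: genuine A>G SNPs share it, so it should only triage sites for database lookup, never remove them outright.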

Step 5 — bcftools Universal Filter (Works with Any Caller)

Bash — bcftools filter for FreeBayes, DeepVariant, Strelka2, GATK output
######################################################################
## STEP 5: bcftools filter — works with VCFs from ANY variant caller
## bcftools filter has its own expression language over INFO/FORMAT fields.
## Useful when you have output from:
##   - FreeBayes (does not use GATK annotation fields)
##   - DeepVariant (uses QUAL but different INFO fields)
##   - Strelka2 (uses EVS/SomaticEVS scores)
##   - GATK (when VQSR/VariantFiltration not available)
######################################################################

## --- Generic bcftools filter for any caller ---
## -s LOWQUAL   : tag failing variants with LOWQUAL (not remove)
## -m +         : use soft filter (add to existing FILTER column)
## -e           : exclude expression (variants matching this are FILTERED)
## --SnpGap 3   : filter SNPs within 3 bp of an indel (common artefact)
## --IndelGap 10: filter indels within 10 bp of each other
bcftools filter \
    -s "LOWQUAL" \
    -m "+" \
    -e "QUAL<30 || INFO/DP<5" \
    --SnpGap 3 \
    --IndelGap 10 \
    input.vcf.gz | \
  bcftools filter \
    -s "STDBIAS" \
    -m "+" \
    -e "INFO/MQ<40 || INFO/FS>60" \
    -O z -o filtered.vcf.gz
tabix -p vcf filtered.vcf.gz

## --- Keep only PASS variants ---
bcftools view -f PASS filtered.vcf.gz -O z -o PASS_only.vcf.gz
tabix -p vcf PASS_only.vcf.gz

## --- FreeBayes-specific filter ---
## FreeBayes uses different INFO fields:
##   QUAL         : variant quality
##   INFO/DP      : total depth
##   INFO/AF      : allele frequency (not in GATK standard output)
##   INFO/AO      : alt allele observations
##   INFO/QA      : sum of quality for alt observations
## Apply a minimum quality + depth + alt observation filter
bcftools filter \
    -e "QUAL<20 || INFO/DP<10 || INFO/AO<5" \
    freebayes_output.vcf.gz | \
  bcftools view -f PASS -O z -o freebayes_filtered.vcf.gz

## --- Remove multiallelic sites (often artefacts) ---
## --max-alleles 2 : keep only biallelic sites
## Multiallelic sites should be checked manually — they may be:
##   1. Real multiallelic variants (common in humans/outbred populations)
##   2. Alignment artefacts (spurious third allele from misalignment)
## (to split multiallelic records into biallelic ones instead of dropping
## them, run bcftools norm -m -any before this step)
bcftools view --max-alleles 2 PASS_only.vcf.gz -O z -o biallelic_PASS.vcf.gz
tabix -p vcf biallelic_PASS.vcf.gz
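What the two chained `-s … -m "+"` passes above do to the FILTER column can be modelled in a few lines. A sketch of my own of the append semantics as I understand them (check `man bcftools` for edge cases before relying on this):

```python
def soft_filter(current: str, tag: str, fails: bool) -> str:
    """Apply one soft-filter pass (like bcftools filter -s TAG -m +) to a FILTER value."""
    if not fails:
        return "PASS" if current == "." else current
    if current in (".", "PASS"):
        return tag                         # replace a bare PASS/missing value
    return current + ";" + tag             # append to tags set by earlier passes

f = soft_filter("PASS", "LOWQUAL", fails=True)       # first pass tags the record
f = soft_filter(f, "STDBIAS", fails=True)            # second pass appends
print(f)                                             # LOWQUAL;STDBIAS
print(soft_filter("PASS", "LOWQUAL", fails=False))   # PASS
```

This is why the order of chained filters does not change which records survive a final `bcftools view -f PASS`: any non-PASS tag from any pass excludes the record.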

Step 6 — Automated Python Filtering Pipeline

Python — Smart auto-detection of organism type + adaptive threshold selection
#!/usr/bin/env python3
######################################################################
## AUTOMATED VARIANT FILTERING PIPELINE
## auto_filter_vcf.py - intelligently chooses filtering strategy
## based on: number of samples, organism type, and available resource files
##
## Usage: python3 auto_filter_vcf.py --vcf input.vcf.gz \
##                 --organism non_model \
##                 --output filtered.vcf.gz
######################################################################
import argparse, subprocess, os, sys, json
from pathlib import Path

def count_samples(vcf):
    """Count number of samples in VCF."""
    result = subprocess.run(["bcftools", "query", "-l", vcf],
                            capture_output=True, text=True)
    return len(result.stdout.strip().split("\n")) if result.stdout.strip() else 0

def get_annotation_stats(vcf, n_sites=10000):
    """Extract annotation statistics from the first N sites to auto-determine thresholds."""
    import statistics
    # bcftools query needs the INFO/ prefix for INFO tags; missing values print "."
    result = subprocess.run([
        "bcftools", "query", "-f",
        "%INFO/QD\t%INFO/MQ\t%INFO/FS\t%INFO/SOR\n",
        vcf
    ], capture_output=True, text=True)
    
    stats = {"QD": [], "MQ": [], "FS": [], "SOR": []}
    for line in result.stdout.strip().split("\n")[:n_sites]:
        vals = line.split("\t")
        if len(vals) == 4 and all(v != "." for v in vals):
            try:
                stats["QD"].append(float(vals[0]))
                stats["MQ"].append(float(vals[1]))
                stats["FS"].append(float(vals[2]))
                stats["SOR"].append(float(vals[3]))
            except ValueError:
                pass
    
    thresholds = {}
    for metric, values in stats.items():
        if values:
            mean_val = statistics.mean(values)
            sd_val   = statistics.stdev(values) if len(values) > 1 else 0
            thresholds[metric] = {"mean": round(mean_val, 2), "sd": round(sd_val, 2)}
    return thresholds

def build_filter_expression(organism_type, n_samples):
    """Return GATK VariantFiltration filter expressions tuned for organism type."""
    
    ## Base thresholds
    thresholds = {
        "human_wgs":    {"QD": 2.0, "MQ": 40.0, "FS": 60.0, "SOR": 3.0, "MQRankSum": -12.5, "ReadPosRankSum": -8.0},
        "human_wes":    {"QD": 2.0, "MQ": 40.0, "FS": 60.0, "SOR": 3.0, "MQRankSum": -12.5, "ReadPosRankSum": -8.0},
        "plant_inbred": {"QD": 2.0, "MQ": 50.0, "FS": 60.0, "SOR": 3.0, "MQRankSum": -12.5, "ReadPosRankSum": -8.0},
        "non_model":    {"QD": 2.0, "MQ": 40.0, "FS": 60.0, "SOR": 3.0, "MQRankSum": -12.5, "ReadPosRankSum": -8.0},
        "rnaseq":       {"QD": 2.0, "MQ": 40.0, "FS": 30.0, "SOR": 3.0, "MQRankSum": -12.5, "ReadPosRankSum": -8.0},
    }
    
    t = thresholds.get(organism_type, thresholds["non_model"])
    filters = [
        (f"QD < {t['QD']}",                 "QD_filter"),
        (f"MQ < {t['MQ']}",                 "MQ_filter"),
        (f"FS > {t['FS']}",                 "FS_filter"),
        (f"SOR > {t['SOR']}",               "SOR_filter"),
        (f"MQRankSum < {t['MQRankSum']}",   "MQRankSum_filter"),
        (f"ReadPosRankSum < {t['ReadPosRankSum']}", "ReadPosRankSum_filter"),
    ]
    return filters

def run_hard_filter(vcf_in, vcf_out, organism_type, n_samples, gatk_jar, ref):
    """Run GATK hard filters with organism-appropriate thresholds."""
    filters = build_filter_expression(organism_type, n_samples)
    # accept either a GATK jar path or the gatk4 wrapper executable
    if gatk_jar.endswith(".jar"):
        cmd = ["java", "-Xmx4g", "-jar", gatk_jar]
    else:
        cmd = [gatk_jar]
    cmd += ["VariantFiltration", "-R", ref, "-V", vcf_in]
    for expr, name in filters:
        cmd += ["--filter-expression", expr, "--filter-name", name]
    cmd += ["-O", vcf_out]
    print(f"Running: {' '.join(cmd[:8])}...")
    subprocess.run(cmd, check=True)
    return vcf_out

def main():
    parser = argparse.ArgumentParser(description="Automated VCF filtering pipeline")
    parser.add_argument("--vcf",      required=True, help="Input VCF.gz")
    parser.add_argument("--organism", default="non_model",
                        choices=["human_wgs","human_wes","plant_inbred","non_model","rnaseq"],
                        help="Organism/data type for threshold selection")
    parser.add_argument("--output",   required=True, help="Output filtered VCF.gz")
    parser.add_argument("--ref",      default="genome.fasta", help="Reference FASTA")
    parser.add_argument("--gatk",     default="gatk", help="GATK4 jar or executable")
    args = parser.parse_args()
    
    n_samples = count_samples(args.vcf)
    print(f"Samples in VCF: {n_samples}")
    print(f"Organism type:  {args.organism}")
    
    ## Determine strategy
    if args.organism in ("human_wgs", "human_wes") and n_samples >= 30:
        print("Strategy: VQSR recommended — run 01_gatk_vqsr.sh")
        print("Falling back to hard filters for now...")
    
    print("Strategy: Hard filters")
    filtered = run_hard_filter(
        vcf_in=args.vcf, vcf_out=args.output,
        organism_type=args.organism, n_samples=n_samples,
        gatk_jar=args.gatk, ref=args.ref
    )
    print(f"Done! Filtered VCF: {filtered}")

if __name__ == "__main__":
    main()

Step 7 — Validation: Ti/Tv Ratio & dbSNP Overlap

Bash + R — Validate filtering quality using Ti/Tv and known variant overlap
######################################################################
## STEP 7: Validate your filtering quality
## Two universal quality checks after filtering:
##   1. Ti/Tv ratio: Transition/Transversion ratio
##      Good filtering → Ti/Tv near expected value for your data type
##      Bad filtering → Ti/Tv deviates significantly from expected
##   2. dbSNP overlap: What % of your variants are in dbSNP?
##      Good filtering → 90-99% of SNPs are known for well-studied organisms
##      Bad filtering → low overlap indicates many false positives remain
######################################################################

## --- Ti/Tv ratio using bcftools stats ---
## Expected Ti/Tv values:
##   Human WGS:  ~2.0-2.1
##   Human WES:  ~2.5-3.0 (coding regions enriched for CpG transitions)
##   Mouse WGS:  ~2.0
##   Plant WGS:  ~2.0-2.5 (variable)
##   Non-model organisms: 1.5-2.5 (unknown; use raw unfiltered for baseline)
echo "=== Ti/Tv ratio BEFORE filtering ==="
bcftools stats joint_genotyped.vcf.gz | grep "Ts/Tv"

echo "=== Ti/Tv ratio AFTER filtering ==="
bcftools stats final_hardfiltered.vcf.gz | grep "Ts/Tv"
## Ti/Tv should INCREASE after filtering (removing artefacts enriches transitions)

## --- More detailed stats with bcftools stats ---
bcftools stats final_hardfiltered.vcf.gz > filtering_stats.txt
grep -E "^SN|^TSTV" filtering_stats.txt

## --- dbSNP overlap check (for model organisms) ---
if [ -f dbsnp_146.hg38.vcf.gz ]; then
    ## Use bcftools isec to find intersection with dbSNP
    bcftools isec \
        final_hardfiltered.vcf.gz \
        dbsnp_146.hg38.vcf.gz \
        -p isec_dbsnp/ \
        -n "~11"   # keep only sites present in BOTH files

    ## with -n~11 and -p, isec_dbsnp/0000.vcf holds OUR records at shared sites
    N_TOTAL=$(bcftools view -H final_hardfiltered.vcf.gz | wc -l)
    N_KNOWN=$(bcftools view -H isec_dbsnp/0000.vcf 2>/dev/null | wc -l)
    PCT=$(echo "scale=1; ${N_KNOWN} * 100 / ${N_TOTAL}" | bc)
    echo "=== dbSNP overlap ==="
    echo "Total variants: ${N_TOTAL}"
    echo "Known in dbSNP: ${N_KNOWN} (${PCT}%)"
    echo "Novel variants: $((N_TOTAL - N_KNOWN))"
    ## For human WGS: expect 85-95% known. <80% = likely too many false positives remain.
fi

## --- R: Plot filtering waterfall (how many variants removed at each step) ---
cat << 'RSCRIPT' > plot_filtering_waterfall.R
library(ggplot2)

## Record variant counts at each step (fill in your actual numbers)
steps <- data.frame(
    step = c("Raw genotyped", "Site-level PASS", "Genotype GQ/DP mask",
             "Missingness + MAF filter"),
    n_variants = c(
        as.numeric(system("bcftools view -H joint_genotyped.vcf.gz | wc -l", intern=TRUE)),
        as.numeric(system("bcftools view -H final_hardfiltered.vcf.gz | wc -l", intern=TRUE)),
        as.numeric(system("bcftools view -H genotype_filtered.vcf.gz | wc -l", intern=TRUE)),
        as.numeric(system("bcftools view -H final_cleaned.recode.vcf.gz | wc -l", intern=TRUE))
    )
)
steps <- steps[!is.na(steps$n_variants), ]
ggplot(steps, aes(x=reorder(step, -n_variants), y=n_variants/1000, fill=step)) +
    geom_col(width=0.6, show.legend=FALSE) +
    geom_text(aes(label=paste0(round(n_variants/1000,1), "k")), vjust=-0.3, size=3.5) +
    labs(x=NULL, y="Variants (thousands)",
         title="Variant count at each filtering step") +
    theme_bw(base_size=12) + theme(axis.text.x=element_text(angle=30, hjust=1))
ggsave("filtering_waterfall.png", width=9, height=5, dpi=150)
RSCRIPT
Rscript plot_filtering_waterfall.R 2>/dev/null || echo "Run R script manually: Rscript plot_filtering_waterfall.R"
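Ti/Tv itself is simple arithmetic over (REF, ALT) pairs, which is handy for sanity-checking a handful of sites by hand. A sketch of my own (`bcftools stats` remains the tool of record):

```python
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T);
# every other biallelic SNP change is a transversion. Each base has 1 possible
# transition and 2 transversions, so purely random errors give Ti/Tv ~0.5.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv(pairs):
    """Transition/transversion ratio for a list of biallelic (REF, ALT) SNPs."""
    ti = sum(1 for p in pairs if p in TRANSITIONS)
    tv = len(pairs) - ti
    return ti / tv if tv else float("inf")

snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv(snps))   # 2.0
```

Because random errors sit near 0.5 while real human WGS variants sit near 2.0, a filtered callset whose Ti/Tv drops toward 0.5 is accumulating artefacts, which is exactly the signal the before/after comparison above is looking for.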