Build custom databases · annotate any VCF · understand impact tiers · avoid common pitfalls · SnpSift filter expressions · automated pipeline scripts · downstream analysis
SnpEff predicts the functional effect of every variant in a VCF by comparing its genomic position to a gene annotation database. It answers: “If this SNP changes amino acid X to Y in gene Z, what is the likely biological impact?” SnpSift is the companion tool for filtering and manipulating the annotated VCF using a powerful SQL-like expression language.
| SnpEff Impact Tier | Meaning | Examples | Action |
|---|---|---|---|
| HIGH | Presumed disruption of protein function | stop_gained (nonsense), frameshift, splice_site_acceptor/donor, start_lost | Strong candidate — validate experimentally |
| MODERATE | Non-disruptive variant — may change protein effectiveness | missense_variant, in-frame insertion/deletion, codon change | Context-dependent — check conservation + domain |
| LOW | Assumed mostly harmless | synonymous_variant, stop_retained, splice_region (non-core) | Usually ignored unless studying synonymous evolution |
| MODIFIER | Non-coding — effect unknown | intron_variant, upstream/downstream, 3'/5'UTR, intergenic | Filter out for coding-focused analyses; keep for regulatory studies |
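When one variant overlaps several transcripts, downstream tools usually keep the most severe tier. A minimal plain-Python sketch of that "highest impact wins" ordering (an illustrative helper, not part of SnpEff itself):

```python
# SnpEff's four impact tiers, ranked from most to least severe.
IMPACT_RANK = {"HIGH": 4, "MODERATE": 3, "LOW": 2, "MODIFIER": 1}

def highest_impact(impacts):
    """Return the most severe impact tier from a list of per-transcript annotations."""
    return max(impacts, key=lambda i: IMPACT_RANK.get(i, 0))

# A variant annotated against three transcripts of the same gene:
print(highest_impact(["MODIFIER", "MODERATE", "LOW"]))  # MODERATE
```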
We use the Rice 3000 Genomes Project (3K-RG) — the largest publicly available multi-sample VCF for a crop species. We demonstrate both a pre-built database (rice GFF3 from RAP-DB) and a custom database build from scratch to show the complete workflow for non-model organisms.
| Property | Value |
|---|---|
| Organism | Oryza sativa (Asian rice) |
| Accession | PRJEB6180 — 3000 Rice Genomes Project |
| Reference genome | IRGSP-1.0 (Os-Nipponbare-Reference-IRGSP-1.0) |
| Annotation | RAP-DB GFF3 — Rice Annotation Project |
| Why rice? | Well-validated annotation, known functional variants (e.g. Rc gene for red pericarp), and free public data |
######################################################################
## STEP 1: Install SnpEff and build a custom annotation database
## SnpEff needs THREE things per organism:
##   1. Reference genome FASTA
##   2. Gene annotation (GFF3 or GTF)
##   3. A config entry pointing SnpEff to these files
##
## For many common organisms (human, mouse, Arabidopsis, rice, maize...):
##   java -jar snpEff.jar download <database_name>
##   (downloads a pre-built database automatically)
##
## For any organism NOT in the pre-built list:
##   you MUST build the database manually — shown below
######################################################################
## --- Download SnpEff ---
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
## snpEff.jar and SnpSift.jar are in snpEff/
## --- Check available pre-built databases ---
java -jar snpEff/snpEff.jar databases | grep -i "oryza\|rice"
## Expect an entry for Oryza_sativa (IRGSP-1.0)
## --- Option A: Download pre-built rice database (simplest) ---
java -jar snpEff/snpEff.jar download Oryza_sativa
## Downloads to snpEff/data/Oryza_sativa/
## =================================================================
## Option B: Build a CUSTOM database (required for non-model organisms)
## Use this if your organism is not in the pre-built list,
## OR if you want to use a more recent/specific annotation version.
## =================================================================
GENOME_NAME="Oryza_sativa_IRGSP1"   # your database name (no spaces, no special chars)
GENOME_FASTA="Oryza_sativa.IRGSP1.dna.toplevel.fa"
ANNOTATION_GFF="Oryza_sativa.IRGSP1.60.gff3.gz"
## Step B1: Download the genome FASTA and GFF3 annotation
wget -O ${GENOME_FASTA}.gz \
  "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP1.dna.toplevel.fa.gz"
gunzip ${GENOME_FASTA}.gz
wget -O ${ANNOTATION_GFF} \
  "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/oryza_sativa/Oryza_sativa.IRGSP1.60.gff3.gz"
## Keep the GFF3 gzipped — SnpEff reads gzipped GFF3 directly
## Step B2: Create the database directory structure
## SnpEff expects:
##   snpEff/data/${GENOME_NAME}/sequences.fa   (genome FASTA)
##   snpEff/data/${GENOME_NAME}/genes.gff      (annotation GFF3)
mkdir -p snpEff/data/${GENOME_NAME}/
cp ${GENOME_FASTA} snpEff/data/${GENOME_NAME}/sequences.fa
cp ${ANNOTATION_GFF} snpEff/data/${GENOME_NAME}/genes.gff.gz
## Step B3: Add a config entry to snpEff.config
## This tells SnpEff where to find the database
echo "" >> snpEff/snpEff.config
echo "# ${GENOME_NAME} - Custom Oryza sativa IRGSP-1.0" >> snpEff/snpEff.config
echo "${GENOME_NAME}.genome : Oryza sativa" >> snpEff/snpEff.config
echo "${GENOME_NAME}.chromosomes : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12" >> snpEff/snpEff.config
## CRITICAL: chromosome names in your config MUST match the chromosome names
## in your FASTA and VCF. If your VCF has "chr1" but the FASTA has "1", SnpEff
## will annotate ZERO variants. This is the #1 source of SnpEff failures.
## Step B4: Build the database
##   -v    : verbose output (shows progress and any errors in the GFF3)
##   -gff3 : specify GFF3 format (not GTF)
java -Xmx8g -jar snpEff/snpEff.jar build \
    -gff3 \
    -v \
    ${GENOME_NAME} \
    2>&1 | tee snpEff_build.log
## Check the build log for warnings:
##   "WARNING: Cannot find protein for transcript" = GFF3 lacks CDS features (problem!)
##   "WARNING: Codon X is not a stop codon" = non-standard genetic code (set -codonTable)
##   "Gene not found" = transcript lacks a parent gene feature in the GFF3
echo "Database build complete. Check snpEff_build.log for warnings."
ls -lh snpEff/data/${GENOME_NAME}/
| 📌GFF3 vs GTF | SnpEff supports both GFF3 (-gff3) and GTF (-gtf22). GFF3 is the standard for Ensembl Plants/NCBI RefSeq. GTF is the standard for GENCODE (human/mouse). The critical difference: GFF3 uses ID=/Parent= attributes; GTF uses gene_id/transcript_id. If you see “gene not found” errors, check that your GFF3 has gene, mRNA, CDS, and exon features in the correct parent-child hierarchy. |
| 🎯Chromosome name matching | This is the single most common SnpEff problem. Your VCF, reference FASTA, and SnpEff config must all use identical chromosome names. UCSC uses “chr1”; Ensembl uses “1”; NCBI uses “NC_007112.7”. Check with: bcftools view -h your.vcf | grep contig and grep '>' genome.fa | head. Use bcftools annotate --rename-chrs to fix chromosome name mismatches. |
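The naming check in the callout above can also be scripted before a long run. A stdlib-only sketch (the header lines here are made-up examples) that compares VCF `##contig` IDs against FASTA sequence names:

```python
def vcf_contigs(header_lines):
    """Extract contig IDs from VCF ##contig header lines."""
    out = set()
    for line in header_lines:
        if line.startswith("##contig=<ID="):
            out.add(line.split("ID=")[1].split(",")[0].rstrip(">\n"))
    return out

def fasta_contigs(fasta_lines):
    """Extract sequence names from FASTA '>' header lines (name = first token)."""
    return {line[1:].split()[0] for line in fasta_lines if line.startswith(">")}

# Toy example: UCSC-style VCF vs Ensembl-style FASTA
vcf_hdr = ["##contig=<ID=chr1,length=43270923>", "##contig=<ID=chr2,length=35937250>"]
fasta = [">1 dna:chromosome", "ACGT", ">2 dna:chromosome", "ACGT"]
only_vcf = vcf_contigs(vcf_hdr) - fasta_contigs(fasta)
print(sorted(only_vcf))  # ['chr1', 'chr2'] -> every VCF contig is absent from the FASTA
```

If the difference set is non-empty, fix the names with `bcftools annotate --rename-chrs` before annotating.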
######################################################################
## STEP 2: Annotate variants with SnpEff
## SnpEff adds an ANN field to each VCF record containing:
## Allele | Effect | Impact | Gene | GeneID | Feature_type |
## Feature_ID | Transcript_biotype | Rank | HGVS.c | HGVS.p |
## cDNA_pos/cDNA_len | CDS_pos/CDS_len | AA_pos/AA_len | Distance | Errors
## Multiple transcripts per gene = multiple ANN entries (semicolon-separated)
######################################################################
DB="${GENOME_NAME}" # database name from Step 1
INPUT_VCF="variants_filtered.vcf.gz"
OUTPUT_VCF="variants_annotated.vcf.gz"
STATS_HTML="snpeff_stats.html"
## --- Run SnpEff ---
## -Xmx8g : give SnpEff 8 GB RAM (large genomes/many transcripts need more)
## ann : annotation subcommand
## -v : verbose (shows annotation progress)
## -stats : output HTML statistics report
## -csvStats : also output stats as CSV (for scripting)
## -canon : use only the canonical transcript per gene
## HIGHLY RECOMMENDED: without -canon, each variant gets annotated once per transcript
## A gene with 15 isoforms = 15 ANN entries per variant (massively inflates output)
## -canon picks the Ensembl canonical transcript (longest CDS / most exons)
## -no-intergenic : skip annotation of intergenic variants
## Use this when you only care about protein-coding effects
## -ud 2000 : define upstream/downstream region size (default 5000 bp)
## Variants within this distance of a gene are called "upstream" or "downstream"
## Set smaller if your annotation is dense (plant genomes with small intergenic regions)
## | bgzip : compress the output VCF
## 2> snpeff_run.log : capture SnpEff's progress messages (stderr) in a log file
java -Xmx8g -jar snpEff/snpEff.jar ann \
-v \
-stats ${STATS_HTML} \
-csvStats snpeff_stats.csv \
-canon \
-ud 2000 \
${DB} \
${INPUT_VCF} \
2> snpeff_run.log | \
bgzip > ${OUTPUT_VCF}
tabix -p vcf ${OUTPUT_VCF}
echo "Annotation complete."
echo "Open ${STATS_HTML} in a browser to see the effect summary."
## --- Inspect the ANN field ---
## Show the first few annotated variants
bcftools view -H ${OUTPUT_VCF} | head -5 | \
cut -f8 | tr ';' '\n' | grep "^ANN=" | head -3
## ANN field format (pipe-delimited):
## ANN=C|missense_variant|MODERATE|Os01g0100200|Os01g0100200|transcript|Os01t0100200-01|mRNA|3/6|c.452G>A|p.Arg151Gln|452/1890|452/1233|151/410||
##     allele | effect | impact | gene name | gene ID | feature type | transcript ID | biotype | exon rank | cDNA change | protein change | positions
## --- Count variants by impact ---
echo "=== Variant counts by impact ==="
bcftools view -H ${OUTPUT_VCF} | \
grep -oP 'ANN=[^;]+' | \
grep -oP '\|[A-Z]+\|' | \
sort | uniq -c | sort -rn
## --- Count variants by effect type ---
echo "=== Top 20 effect types ==="
bcftools view -H ${OUTPUT_VCF} | \
grep -oP 'ANN=[^\t]+' | \
awk -F'|' '{print $2}' | \
sort | uniq -c | sort -rn | head -20
Before diving into parsing, sanity-check the annotation run itself. The Ti/Tv ratio (transitions vs transversions), reported in the SnpEff stats page, is the first check: well-filtered whole-genome SNP calls typically sit near Ti/Tv ≈ 2.0, whereas random noise pushes the ratio toward 0.5 (four possible transitions vs eight possible transversions).
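The Ti/Tv ratio is simple enough to compute directly from REF/ALT pairs. A small illustrative sketch (SNVs only, toy data):

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv(snvs):
    """Ti/Tv ratio for a list of (REF, ALT) single-nucleotide pairs."""
    ti = sum(1 for p in snvs if p in TRANSITIONS)
    tv = sum(1 for p in snvs
             if p not in TRANSITIONS and len(p[0]) == len(p[1]) == 1)
    return ti / tv if tv else float("inf")

# Toy call set: three transitions, one transversion
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(round(titv(calls), 2))  # 3.0
```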
######################################################################
## STEP 3: Extract and interpret SnpEff impact predictions
## The ANN field is complex — here we parse it into a clean table
## and explain what each impact level really means biologically.
######################################################################
## --- Python script to parse ANN field into a clean TSV ---
cat << 'PYEOF' > parse_snpeff_ann.py
#!/usr/bin/env python3
"""
Parse SnpEff ANN field from a VCF into a clean, analysis-ready TSV.
Handles multiple transcripts per variant (takes highest-impact annotation).
"""
import gzip, sys, csv
## Impact priority: HIGH > MODERATE > LOW > MODIFIER
IMPACT_PRIORITY = {"HIGH": 4, "MODERATE": 3, "LOW": 2, "MODIFIER": 1}
def parse_ann(ann_str):
"""Parse the ANN field and return the highest-impact annotation."""
if not ann_str or ann_str == ".":
return None
## Multiple transcripts are comma-separated within ANN=
entries = ann_str.replace("ANN=", "").split(",")
best = None
best_priority = 0
for entry in entries:
fields = entry.split("|")
if len(fields) < 4:
continue
allele, effect, impact, gene = fields[0], fields[1], fields[2], fields[3]
gene_id = fields[4] if len(fields) > 4 else ""
transcript = fields[6] if len(fields) > 6 else ""
hgvs_c = fields[9] if len(fields) > 9 else ""
hgvs_p = fields[10] if len(fields) > 10 else ""
priority = IMPACT_PRIORITY.get(impact, 0)
if priority > best_priority:
best_priority = priority
best = {
"allele": allele, "effect": effect, "impact": impact,
"gene": gene, "gene_id": gene_id, "transcript": transcript,
"hgvs_c": hgvs_c, "hgvs_p": hgvs_p
}
return best
vcf_file = sys.argv[1] # input annotated VCF
out_file = sys.argv[2] # output TSV
opener = gzip.open if vcf_file.endswith(".gz") else open
with opener(vcf_file, "rt") as vcf, open(out_file, "w", newline="") as out:
writer = csv.writer(out, delimiter="\t")
writer.writerow(["CHROM","POS","REF","ALT","QUAL","FILTER",
"IMPACT","EFFECT","GENE","GENE_ID","TRANSCRIPT",
"HGVS_C","HGVS_P"])
for line in vcf:
if line.startswith("#"):
continue
cols = line.strip().split("\t")
if len(cols) < 8:
continue
chrom, pos, vid, ref, alt, qual, filt, info = cols[:8]
## Extract ANN field from INFO
ann_str = ""
for field in info.split(";"):
if field.startswith("ANN="):
ann_str = field
break
ann = parse_ann(ann_str)
if ann:
writer.writerow([chrom, pos, ref, alt, qual, filt,
ann["impact"], ann["effect"], ann["gene"],
ann["gene_id"], ann["transcript"],
ann["hgvs_c"], ann["hgvs_p"]])
else:
writer.writerow([chrom, pos, ref, alt, qual, filt,
".", ".", ".", ".", ".", ".", "."])
print(f"Parsed VCF written to: {out_file}")
PYEOF
python3 parse_snpeff_ann.py variants_annotated.vcf.gz variants_annotated_parsed.tsv
echo "Parsed annotation table: variants_annotated_parsed.tsv"
## --- Summarise the parsed table ---
echo "=== Impact summary ==="
tail -n+2 variants_annotated_parsed.tsv | cut -f7 | sort | uniq -c | sort -rn
echo ""
echo "=== HIGH impact variants ==="
awk -F'\t' '$7=="HIGH" {print $1,$2,$3,$4,$8,$9,$13}' \
variants_annotated_parsed.tsv | head -20 | column -t
######################################################################
## STEP 4: Common SnpEff pitfalls and fixes
##
## PITFALL 1: All variants annotated as "intergenic" or MODIFIER
## Cause: Chromosome name mismatch between VCF and database
## Symptom: SnpEff reports 0 coding variants
## Fix: Check and fix chromosome names (shown below)
##
## PITFALL 2: "WARNING: Cannot find protein for transcript XXXX"
## Cause: GFF3 is missing CDS features (only has exon/mRNA features)
## Symptom: SnpEff cannot predict amino acid changes
## Fix: Use a GFF3 that includes CDS features, or download GTF instead
##
## PITFALL 3: Massive output — hundreds of ANN entries per variant
## Cause: -canon not used; annotating all transcripts
## Fix: Always use -canon flag (canonical transcript only)
##
## PITFALL 4: Ti/Tv ratio way off (e.g. Ti/Tv = 0.5 or 4.0)
## Cause: VCF was not quality-filtered before SnpEff
## Fix: Apply VQSR or hard filters BEFORE annotation
##
## PITFALL 5: SnpEff runs for hours and never finishes
## Cause: Not enough RAM (-Xmx too small); genome/annotation is huge
## Fix: Increase -Xmx; use -no-intergenic; split by chromosome
##
## PITFALL 6: Wrong protein change — p.? or p.Met1? for coding variants
## Cause: Wrong reference FASTA used (different assembly version)
## Fix: Ensure VCF, FASTA, and GFF3 all use the same assembly version
######################################################################
## =================================================================
## FIX 1: Chromosome name mismatch
## =================================================================
## Check chromosome names in VCF
echo "=== Chromosome names in VCF ==="
bcftools view -H variants_raw.vcf.gz | cut -f1 | sort -u | head -20
## Check chromosome names in FASTA
echo "=== Chromosome names in FASTA ==="
grep '>' genome.fa | head -20
## Check chromosome names SnpEff is using
echo "=== Chromosome names in SnpEff database ==="
grep "${DB}.chromosomes" snpEff/snpEff.config
## Common mismatch: VCF has "chr1", FASTA/SnpEff has "1"
## Fix: strip the "chr" prefix from the VCF (the chromosome list below is the
## human set; for rice, use "seq 1 12" with no X/Y/MT)
bcftools annotate \
--rename-chrs <(for i in $(seq 1 22) X Y MT; do echo "chr${i} ${i}"; done) \
-O z -o variants_renamed.vcf.gz \
variants_raw.vcf.gz
tabix -p vcf variants_renamed.vcf.gz
## OR: add "chr" prefix (if VCF has "1" but SnpEff needs "chr1")
bcftools annotate \
--rename-chrs <(for i in $(seq 1 22) X Y MT; do echo "${i} chr${i}"; done) \
-O z -o variants_chr_prefix.vcf.gz \
variants_raw.vcf.gz
## =================================================================
## FIX 2: GFF3 missing CDS features
## =================================================================
## Check whether your GFF3 has CDS features
echo "=== GFF3 feature type counts ==="
zcat annotation.gff3.gz | grep -v '^#' | cut -f3 | sort | uniq -c | sort -rn
## You MUST see "CDS" in the output. If not, download from a different source.
## Ensembl Plants GFF3 always has CDS features.
## Some NCBI GFF3 files omit CDS — use the _genomic.gff.gz version instead.
## =================================================================
## FIX 3: GFF3 with non-standard transcript biotypes
## =================================================================
## Some GFF3 files use "transcript" instead of "mRNA" for the feature type.
## SnpEff requires "mRNA" or "transcript" depending on the version.
## Fix: replace "transcript" with "mRNA" if needed
zcat annotation.gff3.gz | \
sed 's/\ttranscript\t/\tmRNA\t/' | \
bgzip > annotation_fixed.gff3.gz
## =================================================================
## FIX 4: SnpEff running out of memory
## =================================================================
## Size the Java heap (-Xmx) to your genome and annotation complexity:
## Typical: human/large genomes need 6-8 GB
## bacteria/small genomes: 1-2 GB
## polyploid crops (wheat 17 Gb genome): 16-32 GB
## =================================================================
## FIX 5: Split large VCF by chromosome for parallel SnpEff
## =================================================================
## For VCFs with millions of variants, annotate each chromosome separately:
CHROMS=$(bcftools view -H variants.vcf.gz | cut -f1 | sort -u)
for CHR in ${CHROMS}; do
bcftools view -r ${CHR} -O z -o temp_${CHR}.vcf.gz variants.vcf.gz
java -Xmx4g -jar snpEff/snpEff.jar ann -canon ${DB} temp_${CHR}.vcf.gz | \
bgzip > annotated_${CHR}.vcf.gz &
done
wait # wait for all background jobs to finish
## Merge all annotated chromosomes back into one VCF
bcftools concat -O z -o variants_annotated_all.vcf.gz annotated_*.vcf.gz
tabix -p vcf variants_annotated_all.vcf.gz
rm temp_*.vcf.gz annotated_*.vcf.gz
######################################################################
## STEP 5: SnpSift — powerful VCF filtering and field extraction
## SnpSift filter uses a SQL-like expression language to filter variants.
## You can access ANY field in the VCF INFO or FORMAT columns.
## Special syntax for SnpEff ANN field: ANN[*].IMPACT, ANN[0].GENE etc.
## ANN[*] = "any annotation" (at least one matches)
## ANN[0] = "first annotation" (use with -canon to get the canonical one)
######################################################################
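The difference between `ANN[*]` and `ANN[0]` is just "any annotation" vs "first annotation". A plain-Python sketch of the two selector semantics (illustrative, not SnpSift internals):

```python
# A variant's annotations, ordered as SnpEff emits them:
# most severe first, so index 0 is the top annotation.
anns = [
    {"impact": "MODERATE", "effect": "missense_variant"},
    {"impact": "LOW", "effect": "synonymous_variant"},
]

# ANN[*].IMPACT = 'LOW'  -> true if ANY annotation matches
any_low = any(a["impact"] == "LOW" for a in anns)
# ANN[0].IMPACT = 'LOW'  -> true only if the FIRST annotation matches
first_low = anns[0]["impact"] == "LOW"
print(any_low, first_low)  # True False
```

This is why impact filters usually use `ANN[*]` (keep the variant if any transcript is hit hard), while table extraction uses `ANN[0]` (one row per variant).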
## ================================================================
## COMMON FILTER RECIPES
## ================================================================
## --- Recipe 1: Extract only HIGH + MODERATE impact variants ---
java -jar snpEff/SnpSift.jar filter \
"(ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE')" \
variants_annotated.vcf.gz | \
bgzip > variants_HIGH_MOD.vcf.gz
## --- Recipe 2: Extract HIGH impact only (loss-of-function candidates) ---
java -jar snpEff/SnpSift.jar filter \
"ANN[*].IMPACT = 'HIGH'" \
variants_annotated.vcf.gz | \
bgzip > variants_HIGH_only.vcf.gz
## --- Recipe 3: Filter by specific effect type ---
## Get only stop_gained (nonsense) mutations
java -jar snpEff/SnpSift.jar filter \
"ANN[*].EFFECT has 'stop_gained'" \
variants_annotated.vcf.gz | \
bgzip > variants_nonsense.vcf.gz
## --- Recipe 4: Filter by quality + depth + impact (combined) ---
## This is the most practically useful recipe for downstream analyses:
## - QUAL > 30 : Phred-scaled variant quality
## - FORMAT DP > 10 : minimum read depth per sample (genotype field)
## - HIGH or MODERATE impact
## NOTE: For multi-sample VCFs, GEN[0] = first sample, GEN[*] = any sample
java -jar snpEff/SnpSift.jar filter \
"(QUAL >= 30) & \
(GEN[0].DP >= 10) & \
((ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE'))" \
variants_annotated.vcf.gz | \
bgzip > variants_filtered_annotated.vcf.gz
tabix -p vcf variants_filtered_annotated.vcf.gz
## --- Recipe 5: Filter by gene name ---
## Extract all variants in a specific gene of interest
java -jar snpEff/SnpSift.jar filter \
"ANN[*].GENE = 'Os01g0700300'" \
variants_annotated.vcf.gz > gene_of_interest_variants.vcf
## --- Recipe 6: Filter by functional class (missense, nonsense, silent) ---
## MISSENSE = missense variants only (MODERATE impact coding)
java -jar snpEff/SnpSift.jar filter \
"ANN[*].EFFECT = 'missense_variant'" \
variants_annotated.vcf.gz | \
bgzip > missense_only.vcf.gz
## ================================================================
## SNPSIFT EXTRACTFIELDS — convert VCF to a TSV table
## ================================================================
## This is extremely useful for R/Python analysis — convert annotated VCF
## to a flat table that's easy to read in any analysis tool.
##
## Field syntax:
## CHROM, POS, ID, REF, ALT, QUAL, FILTER = standard VCF columns
## INFO.FIELD = any INFO field (e.g. INFO.DP, INFO.AF)
## GEN[0].FIELD = FORMAT field for sample 0 (e.g. GEN[0].GT, GEN[0].DP)
## ANN[0].IMPACT = highest-priority annotation impact (with -canon)
## ANN[0].EFFECT = effect type
## ANN[0].GENE = gene symbol
## ANN[0].HGVS_C = cDNA change (e.g. c.452G>A)
## ANN[0].HGVS_P = protein change (e.g. p.Arg151Gln)
java -jar snpEff/SnpSift.jar extractFields \
-s "," \
-e "." \
variants_filtered_annotated.vcf.gz \
CHROM POS REF ALT QUAL FILTER \
"ANN[0].IMPACT" "ANN[0].EFFECT" "ANN[0].GENE" \
"ANN[0].HGVS_C" "ANN[0].HGVS_P" \
"GEN[0].GT" "GEN[0].DP" \
> variants_table.tsv
echo "Flat table created: variants_table.tsv"
echo "Rows: $(wc -l < variants_table.tsv)"
head -3 variants_table.tsv | column -t
## ================================================================
## ANNOTATE WITH EXTERNAL DATABASES (dbSNP, ClinVar, gnomAD)
## ================================================================
## SnpSift annotate: add ID and INFO fields from an external VCF database
## The external VCF must be bgzipped and tabix-indexed
## Add dbSNP rsIDs to your variants
java -jar snpEff/SnpSift.jar annotate \
dbsnp_146.hg38.vcf.gz \
variants_annotated.vcf.gz | \
bgzip > variants_annotated_dbsnp.vcf.gz
## Add gnomAD allele frequencies
java -jar snpEff/SnpSift.jar annotate \
-info AF,AF_afr,AF_eas,AF_nfe \
gnomad.genomes.v3.1.2.sites.vcf.gz \
variants_annotated_dbsnp.vcf.gz | \
bgzip > variants_annotated_full.vcf.gz
#!/usr/bin/env bash
######################################################################
## STEP 6: AUTOMATED SNPEFF ANNOTATION PIPELINE
## Usage: bash snpeff_pipeline.sh -v input.vcf.gz -d DATABASE -o outdir
## This script:
## 1. Validates input (file exists, chromosome name check)
## 2. Runs SnpEff annotation
## 3. Runs SnpSift filter (HIGH + MODERATE)
## 4. Extracts a flat TSV table
## 5. Generates a summary report
## Designed to work with any organism supported by SnpEff.
######################################################################
set -euo pipefail # exit on error, undefined variable, or pipe failure
## ================================================================
## Parse command-line arguments
## ================================================================
usage() { echo "Usage: $0 -v VCF.gz -d DATABASE -o OUTDIR [-m MEMORY_GB]"; exit 1; }
VCF=""; DB=""; OUTDIR=""; MEM="8"
while getopts "v:d:o:m:" opt; do
case ${opt} in
v) VCF="${OPTARG}" ;;
d) DB="${OPTARG}" ;;
o) OUTDIR="${OPTARG}" ;;
m) MEM="${OPTARG}" ;;
*) usage ;;
esac
done
[[ -z "${VCF}" || -z "${DB}" || -z "${OUTDIR}" ]] && usage
mkdir -p "${OUTDIR}"
LOG="${OUTDIR}/pipeline.log"
exec > >(tee -a "${LOG}") 2>&1 # log all output to file + screen
echo "[$(date)] Starting SnpEff annotation pipeline"
echo " VCF: ${VCF}"
echo " Database: ${DB}"
echo " Output: ${OUTDIR}"
echo " Memory: ${MEM}g"
## ================================================================
## STEP 1: Validate inputs
## ================================================================
echo "[$(date)] STEP 1: Validating inputs..."
[[ ! -f "${VCF}" ]] && echo "ERROR: VCF not found: ${VCF}" && exit 1
## Check bcftools is available
command -v bcftools &>/dev/null || { echo "ERROR: bcftools not in PATH"; exit 1; }
command -v java &>/dev/null || { echo "ERROR: java not in PATH"; exit 1; }
## Check VCF chromosome names vs SnpEff database
VCF_CHROMS=$(bcftools view -H "${VCF}" | cut -f1 | sort -u | head -5 | tr '\n' ',')
echo " VCF chromosomes (first 5): ${VCF_CHROMS}"
## Sanity check: count variants
N_VAR=$(bcftools view -H "${VCF}" | wc -l)
echo " Total variants in input: ${N_VAR}"
## ================================================================
## STEP 2: Run SnpEff
## ================================================================
echo "[$(date)] STEP 2: Running SnpEff annotation..."
ANN_VCF="${OUTDIR}/annotated.vcf.gz"
java -Xmx${MEM}g -jar snpEff/snpEff.jar ann \
-v \
-canon \
-ud 2000 \
    -no-intergenic \
-stats "${OUTDIR}/snpeff_stats.html" \
-csvStats "${OUTDIR}/snpeff_stats.csv" \
"${DB}" \
"${VCF}" \
2> "${OUTDIR}/snpeff_ann.log" | \
bgzip > "${ANN_VCF}"
tabix -p vcf "${ANN_VCF}"
N_ANN=$(bcftools view -H "${ANN_VCF}" | wc -l)
echo " Annotated variants: ${N_ANN}"
## ================================================================
## STEP 3: SnpSift — filter HIGH + MODERATE impact
## ================================================================
echo "[$(date)] STEP 3: Filtering HIGH + MODERATE impact variants..."
FILT_VCF="${OUTDIR}/annotated_HIGH_MOD.vcf.gz"
java -jar snpEff/SnpSift.jar filter \
"(ANN[*].IMPACT = 'HIGH') | (ANN[*].IMPACT = 'MODERATE')" \
"${ANN_VCF}" | \
bgzip > "${FILT_VCF}"
tabix -p vcf "${FILT_VCF}"
N_FILT=$(bcftools view -H "${FILT_VCF}" | wc -l)
echo " HIGH + MODERATE variants: ${N_FILT}"
## ================================================================
## STEP 4: Extract flat TSV table
## ================================================================
echo "[$(date)] STEP 4: Extracting TSV annotation table..."
java -jar snpEff/SnpSift.jar extractFields \
-s "," -e "." \
"${FILT_VCF}" \
CHROM POS REF ALT QUAL FILTER \
    "ANN[0].IMPACT" "ANN[0].EFFECT" "ANN[0].GENE" "ANN[0].GENEID" \
"ANN[0].HGVS_C" "ANN[0].HGVS_P" \
> "${OUTDIR}/variants_table.tsv"
echo " Table rows: $(wc -l < "${OUTDIR}/variants_table.tsv")"
## ================================================================
## STEP 5: Generate summary report
## ================================================================
echo "[$(date)] STEP 5: Generating summary report..."
python3 << PYEOF
import pandas as pd
df = pd.read_csv("${OUTDIR}/variants_table.tsv", sep="\t")
print("\n=== IMPACT SUMMARY ===")
print(df["ANN[0].IMPACT"].value_counts().to_string())
print("\n=== TOP 10 EFFECTS ===")
print(df["ANN[0].EFFECT"].value_counts().head(10).to_string())
print("\n=== TOP 10 AFFECTED GENES ===")
print(df["ANN[0].GENE"].value_counts().head(10).to_string())
df.to_csv("${OUTDIR}/variants_table_clean.tsv", sep="\t", index=False)
PYEOF
echo "[$(date)] Pipeline complete. Results in: ${OUTDIR}/"
echo " annotated.vcf.gz : full annotated VCF"
echo " annotated_HIGH_MOD.vcf.gz : HIGH + MODERATE variants only"
echo " variants_table.tsv : flat TSV for R/Excel"
echo " snpeff_stats.html : open in browser for summary charts"
bash snpeff_pipeline.sh -v mydata.vcf.gz -d Oryza_sativa -o results/ — works for any organism in the SnpEff database. Change -d to match your species (e.g. GRCh38.p14 for human, Arabidopsis_thaliana for Arabidopsis, Zea_mays for maize).
######################################################################
## STEP 7: Downstream analysis of annotated variants in R
## The variants_table.tsv from SnpSift extractFields is clean TSV —
## read it directly into R for statistical analysis and plotting.
######################################################################
library(tidyverse)
library(ggplot2)
## --- Load the annotation table ---
## Read the TSV output from SnpSift extractFields
df <- read.delim("results/variants_table.tsv",
header=TRUE, sep="\t", na.strings=".")
colnames(df) <- c("CHROM","POS","REF","ALT","QUAL","FILTER",
"IMPACT","EFFECT","GENE","GENE_ID","HGVS_C","HGVS_P")
cat("Total variants loaded:", nrow(df), "\n")
cat("Columns:", paste(colnames(df), collapse=", "), "\n")
## --- Figure 1: Variant impact distribution ---
impact_counts <- df %>%
count(IMPACT) %>%
filter(!is.na(IMPACT)) %>%
arrange(desc(n))
ggplot(impact_counts, aes(x=reorder(IMPACT, n), y=n, fill=IMPACT)) +
geom_col(width=0.6) +
geom_text(aes(label=scales::comma(n)), hjust=-0.1, size=3.5) +
scale_fill_manual(values=c(
HIGH="firebrick", MODERATE="orange", LOW="steelblue", MODIFIER="grey70")) +
coord_flip() +
scale_y_continuous(expand=expansion(mult=c(0,0.15))) +
labs(title="Variant counts by functional impact (SnpEff)",
x=NULL, y="Number of variants") +
theme_bw(base_size=13) + theme(legend.position="none")
ggsave("impact_distribution.png", width=7, height=4, dpi=150)
## --- Figure 2: Effect type breakdown (top 15) ---
top_effects <- df %>%
count(EFFECT) %>% arrange(desc(n)) %>% head(15)
ggplot(top_effects, aes(x=reorder(EFFECT, n), y=n)) +
geom_col(fill="steelblue", width=0.6) +
coord_flip() +
labs(title="Top 15 variant effect types", x=NULL, y="Count") +
theme_bw(base_size=12)
ggsave("effect_types.png", width=9, height=6, dpi=150)
## --- Table: Genes with most HIGH-impact variants ---
## Useful for prioritising candidate genes for experimental follow-up
high_impact_genes <- df %>%
filter(IMPACT == "HIGH") %>%
count(GENE, sort=TRUE) %>%
head(20)
cat("\nTop 20 genes with HIGH-impact variants:\n")
print(high_impact_genes)
write.csv(high_impact_genes, "top_high_impact_genes.csv", row.names=FALSE)
## --- Export to Excel-friendly CSV ---
## Remove | characters that confuse Excel
df_clean <- df %>%
mutate(across(everything(), ~gsub("\\|", " ", as.character(.))))
write.csv(df_clean, "variants_for_excel.csv", row.names=FALSE)
cat("\nExcel-ready CSV: variants_for_excel.csv\n")