Build a custom Kraken2 database for nematode-associated bacteria and viruses, classify reads, estimate abundances with Bracken, and visualize with Krona — a workflow applied to nematode RNA-seq data.
Kraken2 uses exact k-mer matches against a reference database to classify DNA (or RNA) reads to taxonomic nodes. It is extremely fast — capable of processing millions of reads per minute — and highly accurate when the database contains the organism of interest.
| Tool | Purpose |
|---|---|
| Kraken2 | K-mer based read classification |
| Bracken | Bayesian abundance re-estimation at species level |
| KrakenTools | Filter, merge, and manipulate Kraken outputs |
| Krona | Interactive taxonomic pie chart visualization |
Kraken2 databases are large (8–60 GB). Keep them separate from your project data so you can reuse the same database across multiple projects.
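One convenient pattern for reuse is a single shared copy symlinked into each project directory (the paths below are examples, not a convention Kraken2 requires):

```shell
# One shared copy of the database, outside any project (example path)
SHARED_DB="$HOME/databases/kraken2_db"
mkdir -p "$SHARED_DB"

# Inside a project directory, point kraken2_db at the shared copy
ln -sfn "$SHARED_DB" kraken2_db
```

Every `--db kraken2_db` in this workflow then resolves to the shared copy.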
Run df -h . to check free disk space before starting a database build!

conda create -n kraken2_env -c bioconda -c conda-forge \
kraken2 bracken krakentools krona
conda activate kraken2_env
# Verify
kraken2 --version
bracken --version
ktImportTaxonomy --help | head -3
Kraken2 provides standard databases (bacteria, viral, archaea, human), but for specialized applications like nematode microbiome studies you need a custom database that includes the appropriate reference genomes.
mkdir -p kraken2_db
# Step 1: Download NCBI taxonomy (required for all databases)
kraken2-build --download-taxonomy --db kraken2_db
# IMPORTANT: The download scripts use ftp.ncbi.nih.gov
# If your cluster blocks FTP, edit download_taxonomy.sh (shipped with
# kraken2-build) and change ftp:// to https://
# Example fix (helper-script location varies by install; check near `which kraken2-build`):
SCRIPTS_DIR=$(dirname "$(realpath "$(which kraken2-build)")")
sed -i 's|ftp://ftp.ncbi.nih.gov|https://ftp.ncbi.nih.gov|g' \
    "${SCRIPTS_DIR}"/download_taxonomy.sh
# Step 2: Add standard bacterial library
kraken2-build --download-library bacteria --db kraken2_db
# Step 3: Add viral library
kraken2-build --download-library viral --db kraken2_db
# Step 4: Add human library (to remove host contamination)
kraken2-build --download-library human --db kraken2_db
echo "Standard libraries downloaded. Directory size:"
du -sh kraken2_db/library/
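Since the libraries alone approach ~50 GB, a quick free-space guard before downloading can save a half-finished build. A minimal sketch (the 60 GB threshold is an example, not a Kraken2 requirement):

```shell
# Succeed only if the filesystem holding $1 has at least $2 GB free
check_space() {
    local dir=$1 need_gb=$2
    local avail_kb
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')  # column 4 = available KB
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Usage:
# check_space kraken2_db 60 || { echo "need ~60 GB free" >&2; exit 1; }
```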
| Command | What it does |
|---|---|
| 🌳--download-taxonomy | Downloads NCBI taxonomy files (nodes.dmp, names.dmp, accession2taxid). These map sequence accession numbers to taxonomic IDs. Must be run first — all subsequent steps depend on it. |
| 🧫--download-library bacteria | Downloads all complete bacterial RefSeq genomes. Takes 1–6 hours and uses ~50 GB. Available libraries: bacteria, viral, human, fungi, protozoa, plant, archaea, UniVec. |
| 🔨kraken2-build --build | Compiles all library sequences into the Kraken2 k-mer hash index. This is the most compute-intensive step — uses all available threads and requires large RAM (up to 100 GB for the full database). The resulting hash.k2d is the file Kraken2 loads at runtime. |
The key step: add genome assemblies of bacteria known to associate with nematodes. Each genome needs an NCBI taxonomy ID in its FASTA headers.
mkdir -p custom_genomes
# Download specific bacterial genomes associated with nematodes
# Example: Wolbachia (common nematode endosymbiont)
# Taxon ID must be added to FASTA headers as: |kraken:taxid|XXXX
# Download Wolbachia genome (TaxID: 953)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/025/GCF_000008025.1_ASM802v1/GCF_000008025.1_ASM802v1_genomic.fna.gz
gzip -d GCF_000008025.1_ASM802v1_genomic.fna.gz
# Add taxonomy ID to FASTA headers
# Format: >seq_id|kraken:taxid|953 description
# (the tag must sit inside the first whitespace-delimited token)
sed 's/^>\([^ ]*\)/>\1|kraken:taxid|953/' GCF_000008025.1_ASM802v1_genomic.fna \
    > custom_genomes/Wolbachia_taxid953.fna
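To sanity-check the header tagging, here is a sed of the same shape run on a toy FASTA (sequence names and bases are invented):

```shell
# Two-record toy FASTA
printf '>contig1 example organism\nACGTACGT\n>contig2 example organism\nTTTTAAAA\n' > toy.fna

# Append the taxid tag to the first token of each header line
sed 's/^>\([^ ]*\)/>\1|kraken:taxid|953/' toy.fna > toy_tagged.fna

grep '^>' toy_tagged.fna
# >contig1|kraken:taxid|953 example organism
# >contig2|kraken:taxid|953 example organism
```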
# Repeat for other organisms:
# Xanthomonas (TaxID: 338) - common soil bacterium
# Bacillus thuringiensis (TaxID: 1428) - nematode pathogen
# Microbotryum (TaxID: 5936) - plant fungal pathogen
# Add all custom genomes to the Kraken2 library
for genome in custom_genomes/*.fna; do
echo "Adding: $genome"
kraken2-build --add-to-library ${genome} --db kraken2_db
done
# Build the final database index
# --kmer-len 35 : k-mer length (default 35 for Kraken2)
# --minimizer-len 31 : minimizer length
# --minimizer-spaces 6 : spaces in minimizer (reduces memory)
kraken2-build --build \
--db kraken2_db \
--kmer-len 35 \
--minimizer-len 31 \
--minimizer-spaces 6 \
--threads 24
echo "Database build complete!"
# Inspect the finished database: per-taxon minimizer counts
kraken2-inspect --db kraken2_db | head
| Command / flag | What it does |
|---|---|
| 🏷️sed 's/^>\([^ ]*\)/>\1|kraken:taxid|953/' | Appends the Kraken taxon ID tag to the sequence ID in every FASTA header line (lines starting with >). The |kraken:taxid|953 format is the magic string Kraken2's --add-to-library command reads to associate sequences with an NCBI taxonomy node. Without it, Kraken2 can't classify reads against your custom genome. |
| 💉--add-to-library | Registers a FASTA file into the Kraken2 library folder without building the index yet. Run this for each custom genome. Then run --build once at the end to compile everything together. |
| 🔢--kmer-len 35 | K-mer length for the hash table. Longer k-mers = more specific but require more memory. Default 35 is optimal for 150 bp reads. For very short reads (<100 bp) use 25. |
| 💾--minimizer-spaces 6 | Number of positions to skip in the minimizer window. Higher value = smaller database footprint with minor accuracy tradeoff. Set to 6 to reduce the standard database from ~70 GB to ~40 GB. |
If RAM is tight, add the --memory-mapping flag during classification to read the index from disk instead of loading it into RAM — 2–3× slower, but the index no longer has to fit in memory.

conda activate kraken2_env
DB="kraken2_db"
OUTDIR="kraken2_results"
mkdir -p ${OUTDIR}
# Classify single-end RNA-seq reads
for fq in data/*.fastq.gz; do
SAMPLE=$(basename ${fq} .fastq.gz)
echo "Classifying: ${SAMPLE}"
kraken2 \
--db ${DB} \
--threads 16 \
--output ${OUTDIR}/${SAMPLE}.kraken \
--report ${OUTDIR}/${SAMPLE}.report \
--gzip-compressed \
${fq}
echo " Report: ${OUTDIR}/${SAMPLE}.report"
done
# For paired-end reads:
# kraken2 --db ${DB} --threads 16 \
# --paired reads_R1.fastq.gz reads_R2.fastq.gz \
# --output sample.kraken --report sample.report
# Show top classifications from the last sample processed
echo "Top 20 classifications:"
sort -rn -k3,3 ${OUTDIR}/${SAMPLE}.report | head -20
# Report format columns:
# 1: % reads covered by clade
# 2: # reads covered by clade (direct + children)
# 3: # reads directly assigned to taxon
# 4: taxonomic rank code (U,R,D,P,C,O,F,G,S)
# 5: taxonomy ID
# 6: scientific name
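With those columns in mind, species-level hits can be pulled straight from a report with awk. A sketch against a mock three-line report (tab-separated, with invented counts and taxa):

```shell
# Mock Kraken2 report: pct, clade reads, direct reads, rank, taxid, name
printf ' 90.00\t900\t10\tR\t1\troot\n'                      >  mock.report
printf ' 80.00\t800\t50\tG\t953\tWolbachia\n'               >> mock.report
printf ' 60.00\t600\t600\tS\t163164\tWolbachia pipientis\n' >> mock.report

# Rank code S = species; sort by directly assigned reads (column 3)
awk -F'\t' '$4 == "S"' mock.report | sort -rn -k3,3
# → prints only the Wolbachia pipientis row
```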
| Flag | What it does |
|---|---|
| 📄--output sample.kraken | Per-read output: one line per read with classification status (C/U), read ID, taxonomy ID, read length, and k-mer hit string. Large file! You can omit this with --output /dev/null if you only need the summary report. |
| 📊--report sample.report | Summary report: one line per taxon, showing percentage and read counts. This is the key output — used by Bracken for abundance re-estimation and by Krona for visualization. |
| 🔐--gzip-compressed | Tells Kraken2 to decompress .gz files on the fly. Older Kraken2 versions fail or misread gzipped FASTQs without it (recent releases auto-detect compression, but the explicit flag is unambiguous). Use --bzip2-compressed for .bz2 files. |
| 👫--paired R1 R2 | Process paired-end reads together. Kraken2 concatenates the two reads with an N spacer and classifies them as one unit, improving classification accuracy over single-end mode. |
Report columns: % reads, reads in clade, reads assigned directly, rank code (U=unclassified, D=domain, P=phylum, C=class, O=order, F=family, G=genus, S=species), taxID, name. Sort by column 3 (direct assignments) to find the most abundant specific taxa.

Kraken2 assigns reads to the lowest common ancestor (LCA) of all matching genomes — many reads therefore land at genus or family level. Bracken redistributes these reads using a Bayesian model to give species-level abundance estimates.
# Step 1: Build Bracken database (once per Kraken2 database)
bracken-build \
-d kraken2_db \
-t 16 \
-l 150 \
-k 35
# -l = read length (match your actual read length)
# -k = k-mer length used for Kraken2 database
# Step 2: Run Bracken on each sample report
mkdir -p bracken_results
for report in kraken2_results/*.report; do
SAMPLE=$(basename ${report} .report)
echo "Bracken: ${SAMPLE}"
bracken \
-d kraken2_db \
-i ${report} \
-o bracken_results/${SAMPLE}.bracken \
-r 150 \
-l S \
-t 10
# -l S = species level
# -t 10 = minimum read threshold for re-estimation
done
# Combine all Bracken results into a single count table
python3 $(which combine_bracken_outputs.py) \
--files bracken_results/*.bracken \
--output combined_bracken.txt
echo "Combined table: combined_bracken.txt"
head -5 combined_bracken.txt
| Option | What it does |
|---|---|
| 🔨bracken-build -l 150 | Builds Bracken's distribution files for a specific read length (150 bp here). Bracken needs to know the expected k-mer distribution for each species at your read length. Run once per database per read length. |
| 🎯-l S (species level) | Re-estimate abundances at Species level. Options: D=Domain, P=Phylum, C=Class, O=Order, F=Family, G=Genus, S=Species. Species is usually the goal. |
| 📉-t 10 | Minimum read threshold — taxa with fewer than 10 reads are not re-estimated (too unreliable). Increase to 50–100 for stricter confidence. Low-read taxa keep their Kraken2 assignment. |
| 🔀combine_bracken_outputs.py | Merges per-sample Bracken files into a single matrix (samples × species). This table is directly importable into R (DESeq2, vegan) for downstream microbiome statistics. |
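As a sketch of downstream filtering, assuming the combined table's layout of name/taxid/level followed by per-sample _num and _frac columns (check your own header — the exact column names come from your sample filenames):

```shell
# Mock combined Bracken table with two samples (tab-separated; counts invented)
printf 'name\ttaxonomy_id\ttaxonomy_lvl\ts1_num\ts1_frac\ts2_num\ts2_frac\n' >  mock_combined.txt
printf 'Wolbachia pipientis\t163164\tS\t600\t0.60\t100\t0.10\n'              >> mock_combined.txt
printf 'Bacillus thuringiensis\t1428\tS\t200\t0.20\t700\t0.70\n'             >> mock_combined.txt

# Species reaching >= 50% relative abundance in at least one sample
awk -F'\t' 'NR > 1 && ($5 >= 0.5 || $7 >= 0.5) { print $1 }' mock_combined.txt
# → prints both species here
```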
mkdir -p krona_plots
# Convert Kraken2 report format to Krona input
for report in kraken2_results/*.report; do
SAMPLE=$(basename ${report} .report)
# Convert Kraken2 report to Krona format
kreport2krona.py \
-r ${report} \
-o krona_plots/${SAMPLE}.krona.txt
# Generate interactive HTML Krona chart
ktImportText \
krona_plots/${SAMPLE}.krona.txt \
-o krona_plots/${SAMPLE}.krona.html
echo "Krona chart: krona_plots/${SAMPLE}.krona.html"
done
# Or combine all samples into one multi-sample Krona chart
# (allows switching between samples in the browser)
ALL_INPUTS=$(for r in kraken2_results/*.report; do
SAMPLE=$(basename ${r} .report)
kreport2krona.py -r ${r} -o /tmp/${SAMPLE}.krona.txt
echo "/tmp/${SAMPLE}.krona.txt,${SAMPLE}"
done | tr '\n' ' ')
# Create multi-sample Krona HTML
ktImportText ${ALL_INPUTS} -o krona_all_samples.html
echo "Multi-sample Krona chart: krona_all_samples.html"
echo "Open in any browser to explore the interactive taxonomy wheel."
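Under the hood, Krona's text importer just wants a count followed by a tab-separated lineage on each line, so toy inputs are easy to fabricate (counts and lineages invented):

```shell
# Each line: count<TAB>level1<TAB>level2<TAB>...
printf '600\tBacteria\tProteobacteria\tWolbachia\n' >  toy.krona.txt
printf '200\tBacteria\tFirmicutes\tBacillus\n'      >> toy.krona.txt

# ktImportText toy.krona.txt -o toy.krona.html  # then open the HTML in a browser
```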
Many HPC clusters block outbound FTP. Edit the download_taxonomy.sh script inside the Kraken2 installation to replace all ftp://ftp.ncbi.nih.gov URLs with https://ftp.ncbi.nih.gov. Find the installation path with which kraken2-build, then look for the helper scripts alongside it.
Common causes: (1) Wrong read type — Kraken2 matches genomic k-mers, so spliced RNA-seq reads that span exon junctions may fail against a genome-only database; no extra flag is needed for RNA input, but the references must cover the sequences you expect. (2) The organism is simply not in your database. (3) Reads are very short — Kraken2 performs poorly on reads <50 bp.
Bracken database files (*.kmer_distrib) must match the read length and k-mer size used in Kraken2 build. Run bracken-build with -l matching your actual read length after building the Kraken2 database.
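A small guard catches the mismatch before a long run; it assumes bracken-build's naming scheme of database<LEN>mers.kmer_distrib inside the database directory:

```shell
# Return success only if the Bracken distribution for this read length exists
check_distrib() {
    local db=$1 read_len=$2
    [ -f "${db}/database${read_len}mers.kmer_distrib" ]
}

# Usage:
# check_distrib kraken2_db 150 || echo "run: bracken-build -d kraken2_db -l 150"
```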