Taxonomic Classification with Kraken2

Build a custom Kraken2 database for nematode-associated bacteria and viruses, classify reads, estimate abundances with Bracken, and visualize with Krona — a workflow applied to nematode RNA-seq data.

~90 min · Intermediate · Bash

Overview

Kraken2 uses exact k-mer matches against a reference database to classify DNA (or RNA) reads to taxonomic nodes. It is extremely fast — capable of processing millions of reads per minute — and highly accurate when the database contains the organism of interest.

Application: This tutorial addresses a practical challenge — nematode RNA-seq libraries often contain reads from bacterial endosymbionts or viral pathogens that co-exist with the nematode. Identifying these contaminants (or genuine biological signals) requires a custom database including nematode-clade bacteria and viruses, not just the standard RefSeq bacterial database. This workflow was developed at Iowa State University GIF.
Tool          Purpose
Kraken2       K-mer based read classification
Bracken       Bayesian abundance re-estimation at species level
KrakenTools   Filter, merge, and manipulate Kraken outputs
Krona         Interactive taxonomic pie chart visualization

Ideal Directory Structure

Kraken2 databases are large (8–60 GB). Keep them separate from your project data so you can reuse the same database across multiple projects.

kraken2_nematode_project/
├── kraken2_db/              # Kraken2 database (shared, reusable)
│   ├── taxonomy/            # NCBI taxonomy nodes + names
│   ├── library/             # downloaded reference sequences
│   ├── hash.k2d             # compiled k-mer hash (largest file!)
│   ├── opts.k2d
│   └── taxo.k2d
├── custom_genomes/          # FASTA files for non-standard genomes
├── reads/                   # input FASTQ files (read-only!)
├── kraken2_output/          # per-sample Kraken2 outputs
│   ├── sample1.kraken2      # per-read classification
│   └── sample1.report       # summary report (used by Bracken)
├── bracken_output/          # Bracken re-estimated abundance tables
│   └── sample1.bracken
└── krona/                   # interactive Krona HTML charts
    └── taxonomy.krona.html
💾
Database size reality check: The standard bacteria+viral+human Kraken2 database is ~55 GB (standard) or ~8 GB (minikraken). Building it requires ~100 GB temporary disk space during compilation. Always check available disk space with df -h . before starting a database build!
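Because the database is reusable across projects, it is convenient to keep it in one shared location. Kraken2 honors the KRAKEN2_DB_PATH environment variable when resolving --db names, so each project can refer to the database by name rather than an absolute path — a sketch, with a hypothetical shared directory:

```shell
# Hypothetical shared location holding one or more Kraken2 databases
export KRAKEN2_DB_PATH="/shared/databases/kraken2"

# --db can now be a bare database name; Kraken2 searches KRAKEN2_DB_PATH
kraken2 --db kraken2_db --threads 16 --report sample1.report sample1.fastq.gz
```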

Step 1 — Install Kraken2 and Dependencies

Bash — Kraken2 installation
conda create -n kraken2_env -c bioconda -c conda-forge \
    kraken2 bracken krakentools krona
conda activate kraken2_env

# Verify
kraken2 --version
bracken --version
ktImportTaxonomy --help | head -3

Step 2 — Build a Custom Kraken2 Database

Kraken2 provides standard databases (bacteria, viral, archaea, human) but for specialized applications like nematode microbiome studies, you need a custom database that includes appropriate reference genomes.

Bash — Download taxonomy and standard libraries
mkdir -p kraken2_db

# Step 1: Download NCBI taxonomy (required for all databases)
kraken2-build --download-taxonomy --db kraken2_db

# IMPORTANT: the download scripts fetch from ftp.ncbi.nih.gov
# If your cluster blocks outbound FTP, edit download_taxonomy.sh (shipped
# alongside kraken2-build) and change ftp:// to https://
# Example fix (the scripts directory varies by install):
SCRIPTS_DIR=$(dirname "$(readlink -f "$(which kraken2-build)")")
sed -i 's|ftp://ftp.ncbi.nih.gov|https://ftp.ncbi.nih.gov|g' \
    "${SCRIPTS_DIR}"/download_taxonomy.sh

# Step 2: Add standard bacterial library
kraken2-build --download-library bacteria --db kraken2_db

# Step 3: Add viral library
kraken2-build --download-library viral --db kraken2_db

# Step 4: Add human library (to remove host contamination)
kraken2-build --download-library human --db kraken2_db

echo "Standard libraries downloaded. Directory size:"
du -sh kraken2_db/library/
Kraken2 database build commands decoded
🌳--download-taxonomyDownloads NCBI taxonomy files (nodes.dmp, names.dmp, accession2taxid). These map sequence accession numbers to taxonomic IDs. Must be run first — all subsequent steps depend on it.
🧫--download-library bacteriaDownloads all complete bacterial RefSeq genomes. Takes 1–6 hours and uses ~50 GB. Available libraries: bacteria, viral, human, fungi, protozoa, plant, archaea, UniVec.
🔨kraken2-build --buildCompiles all library sequences into the Kraken2 k-mer hash index. This is the most compute-intensive step — uses all available threads and requires large RAM (up to 100 GB for the full database). The resulting hash.k2d is the file Kraken2 loads at runtime.
🔍
How Kraken2 classifies reads: It extracts all k-mers (default k=35) from a read and looks each one up in the pre-built hash table, where every k-mer maps to the lowest common ancestor (LCA) of all genomes containing it. The read is then assigned to the taxon whose root-to-leaf path collects the most k-mer hits. If 90% of a read's k-mers map to E. coli, the read is classified as E. coli. Fast because hash lookups are O(1)!
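The per-read output file from Step 4 makes this easy to check in aggregate: column 1 holds C (classified) or U (unclassified) for every read. A quick sketch for computing the overall classification rate, assuming a per-read output file named sample1.kraken:

```shell
# Tally the C/U flags in column 1 of the per-read Kraken2 output
awk -F'\t' '{n[$1]++}
     END {printf "classified: %d  unclassified: %d  rate: %.1f%%\n",
          n["C"], n["U"], 100 * n["C"] / NR}' sample1.kraken
```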

Step 3 — Add Custom Nematode-Associated Genomes

The key step: add genome assemblies of bacteria known to associate with nematodes. Each genome needs a NCBI taxonomy ID in the FASTA header.

Bash — Add custom genomes to database
mkdir -p custom_genomes

# Download specific bacterial genomes associated with nematodes
# Example: Wolbachia (common nematode endosymbiont)
# Taxon ID must be added to FASTA headers as: |kraken:taxid|XXXX

# Download Wolbachia genome (TaxID: 953)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/025/GCF_000008025.1_ASM802v1/GCF_000008025.1_ASM802v1_genomic.fna.gz
gzip -d GCF_000008025.1_ASM802v1_genomic.fna.gz

# Add taxonomy ID to FASTA headers
# Format: >seq_name|kraken:taxid|953
sed 's/^>/&|kraken:taxid|953 /' GCF_000008025.1_ASM802v1_genomic.fna \
    > custom_genomes/Wolbachia_taxid953.fna

# Repeat for other organisms:
# Xanthomonas (TaxID: 338) - common soil bacterium
# Bacillus thuringiensis (TaxID: 1428) - nematode pathogen
# Microbotryum (TaxID: 5936) - plant fungal pathogen

# Add all custom genomes to the Kraken2 library
for genome in custom_genomes/*.fna; do
    echo "Adding: $genome"
    kraken2-build --add-to-library ${genome} --db kraken2_db
done

# Build the final database index
# --kmer-len 35  : k-mer length (default 35 for Kraken2)
# --minimizer-len 31 : minimizer length
# --minimizer-spaces 6 : spaces in minimizer (reduces memory)
kraken2-build --build \
    --db kraken2_db \
    --kmer-len 35 \
    --minimizer-len 31 \
    --minimizer-spaces 6 \
    --threads 24

echo "Database build complete!"
# Inspect the finished database (per-taxon summary)
kraken2-inspect --db kraken2_db | head
Custom genome addition decoded
🏷️sed 's/^>/&|kraken:taxid|953 /'Adds the Kraken taxon ID tag to every FASTA header line (lines starting with >). The |kraken:taxid|953 format is the magic string Kraken2's --add-to-library command reads to associate sequences with a NCBI taxonomy node. Without it, Kraken2 can't classify reads against your custom genome.
💉--add-to-libraryRegisters a FASTA file into the Kraken2 library folder without building the index yet. Run this for each custom genome. Then run --build once at the end to compile everything together.
🔢--kmer-len 35K-mer length for the hash table. Longer k-mers = more specific but require more memory. Default 35 is optimal for 150 bp reads. For very short reads (<100 bp) use 25.
💾--minimizer-spaces 6Number of positions to skip in the minimizer window. Higher value = smaller database footprint with minor accuracy tradeoff. Set to 6 to reduce the standard database from ~70 GB to ~40 GB.
Memory requirement: A standard bacterial + viral + human Kraken2 database is ~70–100 GB when loaded into RAM. Use the --memory-mapping flag during classification to read the index from disk instead — typically 2–3× slower, but it avoids holding the full hash in RAM.
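When RAM is the bottleneck, the flag slots directly into the Step 4 command line — a sketch, reusing the database and sample names from this tutorial:

```shell
# Read the hash from disk on demand instead of loading it into RAM
# (slower, but runs on low-memory nodes)
kraken2 \
    --db kraken2_db \
    --memory-mapping \
    --threads 16 \
    --gzip-compressed \
    --report sample1.report \
    --output /dev/null \
    sample1.fastq.gz
```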

Step 4 — Classify Reads

Bash — Kraken2 classification (single and paired)
conda activate kraken2_env

DB="kraken2_db"
OUTDIR="kraken2_results"
mkdir -p ${OUTDIR}

# Classify single-end RNA-seq reads
for fq in data/*.fastq.gz; do
    SAMPLE=$(basename ${fq} .fastq.gz)
    echo "Classifying: ${SAMPLE}"

    kraken2 \
        --db ${DB} \
        --threads 16 \
        --output ${OUTDIR}/${SAMPLE}.kraken \
        --report ${OUTDIR}/${SAMPLE}.report \
        --gzip-compressed \
        ${fq}

    echo "  Report: ${OUTDIR}/${SAMPLE}.report"
done

# For paired-end reads:
# kraken2 --db ${DB} --threads 16 \
#   --paired reads_R1.fastq.gz reads_R2.fastq.gz \
#   --output sample.kraken --report sample.report

# Show top classifications from one sample
echo "Top 20 classifications:"
sort -rn -k3,3 ${OUTDIR}/${SAMPLE}.report | head -20

# Report format columns:
# 1: % reads covered by clade
# 2: # reads covered by clade (direct + children)
# 3: # reads directly assigned to taxon
# 4: taxonomic rank code (U,R,D,P,C,O,F,G,S)
# 5: taxonomy ID
# 6: scientific name
Kraken2 classification flags decoded
📄--output sample.krakenPer-read output: one line per read with classification status (C/U), read ID, taxonomy ID, read length, and k-mer hit string. Large file! You can omit this with --output /dev/null if you only need the summary report.
📊--report sample.reportSummary report: one line per taxon, showing percentage and read counts. This is the key output — used by Bracken for abundance re-estimation and by Krona for visualization.
🔐--gzip-compressedTells Kraken2 to decompress .gz files on the fly. Recent versions auto-detect compression for regular files named *.gz, but passing the flag makes the input format explicit. Use --bzip2-compressed for .bz2 files.
👫--paired R1 R2Process paired-end reads together. Kraken2 concatenates the two reads with an N spacer and classifies them as one unit, improving classification accuracy over single-end mode.
📊
Reading the Kraken2 report: The columns are: % reads, reads in clade, reads assigned directly, rank code (U=unclassified, D=domain, P=phylum, C=class, O=order, F=family, G=genus, S=species), taxID, name. Sort by column 3 (direct assignments) to find the most abundant specific taxa.
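The rank-code column makes it easy to isolate one level at a time. A sketch that keeps only species-level rows (rank code S) and ranks them by directly assigned reads, assuming a report file named sample1.report:

```shell
# Column 4 = rank code, column 3 = reads assigned directly to that taxon
awk -F'\t' '$4 == "S"' sample1.report | sort -t$'\t' -k3,3 -rn | head -10
```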

Step 5 — Bracken: Abundance Re-estimation

Kraken2 assigns reads to the lowest common ancestor (LCA) — many reads land at genus or family level. Bracken redistributes these reads using a Bayesian model to give species-level abundance estimates.

Bash — Bracken species-level abundance
# Step 1: Build Bracken database (once per Kraken2 database)
bracken-build \
    -d kraken2_db \
    -t 16 \
    -l 150 \
    -k 35
# -l = read length (match your actual read length)
# -k = k-mer length used for Kraken2 database

# Step 2: Run Bracken on each sample report
mkdir -p bracken_results
for report in kraken2_results/*.report; do
    SAMPLE=$(basename ${report} .report)
    echo "Bracken: ${SAMPLE}"

    bracken \
        -d kraken2_db \
        -i ${report} \
        -o bracken_results/${SAMPLE}.bracken \
        -r 150 \
        -l S \
        -t 10
    # -l S = species level
    # -t 10 = minimum read threshold for re-estimation
done

# Combine all Bracken results into a single count table
python3 $(which combine_bracken_outputs.py) \
    --files bracken_results/*.bracken \
    --output combined_bracken.txt

echo "Combined table: combined_bracken.txt"
head -5 combined_bracken.txt
Bracken parameters decoded
🔨bracken-build -l 150Builds Bracken's distribution files for a specific read length (150 bp here). Bracken needs to know the expected k-mer distribution for each species at your read length. Run once per database per read length.
🎯-l S (species level)Re-estimate abundances at Species level. Options: D=Domain, P=Phylum, C=Class, O=Order, F=Family, G=Genus, S=Species. Species is usually the goal.
📉-t 10Minimum read threshold — taxa with fewer than 10 reads are not re-estimated (too unreliable). Increase to 50–100 for stricter confidence. Low-read taxa keep their Kraken2 assignment.
🔀combine_bracken_outputs.pyMerges per-sample Bracken files into a single matrix (samples × species). This table is directly importable into R (DESeq2, vegan) for downstream microbiome statistics.
🌡️
Kraken2 vs Bracken: Think of Kraken2 as the classifier and Bracken as the statistician. Kraken2 assigns each read to its best LCA — many reads pile up at genus level. Bracken says "given how many reads landed at Genus X, and given the genome sizes of each species in that genus, here's my best estimate of actual species abundances." Use Bracken counts for comparative analysis!
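The two outputs are easy to compare side by side, because Bracken's table is tab-separated with the columns name, taxonomy_id, taxonomy_lvl, kraken_assigned_reads, added_reads, new_est_reads, and fraction_total_reads. A sketch that ranks species by the re-estimated counts, assuming a sample1.bracken from Step 5:

```shell
# Skip the header, sort by column 6 (new_est_reads), show name + estimate
tail -n +2 bracken_results/sample1.bracken | \
    sort -t$'\t' -k6,6 -rn | head -10 | cut -f1,6
```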

Step 6 — Krona Interactive Visualization

Bash — Krona taxonomic wheel charts
mkdir -p krona_plots

# Convert Kraken2 report format to Krona input
for report in kraken2_results/*.report; do
    SAMPLE=$(basename ${report} .report)

    # Convert Kraken2 report to Krona format
    kreport2krona.py \
        -r ${report} \
        -o krona_plots/${SAMPLE}.krona.txt

    # Generate interactive HTML Krona chart
    ktImportText \
        krona_plots/${SAMPLE}.krona.txt \
        -o krona_plots/${SAMPLE}.krona.html

    echo "Krona chart: krona_plots/${SAMPLE}.krona.html"
done

# Or combine all samples into one multi-sample Krona chart
# (allows switching between samples in the browser)
ALL_INPUTS=$(for r in kraken2_results/*.report; do
    SAMPLE=$(basename ${r} .report)
    kreport2krona.py -r ${r} -o /tmp/${SAMPLE}.krona.txt
    echo "/tmp/${SAMPLE}.krona.txt,${SAMPLE}"
done | tr '\n' ' ')

# Create multi-sample Krona HTML
ktImportText ${ALL_INPUTS} -o krona_all_samples.html
echo "Multi-sample Krona chart: krona_all_samples.html"
echo "Open in any browser to explore the interactive taxonomy wheel."
Interpreting Krona charts: The interactive wheel shows taxonomic hierarchy from root (center) to species (outer ring). Click any wedge to zoom in. Larger wedges = more reads classified there. Unclassified reads are shown as a grey outer segment. For nematode RNA-seq, you may find 5–30% of reads classify to bacteria — most of these are true endosymbionts or environmental contamination captured in the RNA extraction.
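The same numbers can be pulled from the Kraken2 report without opening a browser: domain-level rows (rank code D) carry the clade percentages directly. A sketch, assuming a sample1.report from Step 4:

```shell
# Rank code D = domain; column 1 = % of reads in that clade
awk -F'\t' '$4 == "D" {gsub(/^ +/, "", $1); gsub(/^ +/, "", $6);
            print $1 "%\t" $6}' sample1.report
```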

Troubleshooting

kraken2-build: "download_taxonomy.sh" fails with FTP errors

Many HPC clusters block outbound FTP. Edit the download_taxonomy.sh script inside the Kraken2 installation to replace all ftp://ftp.ncbi.nih.gov URLs with https://ftp.ncbi.nih.gov. Find the script path with which kraken2-build then look for the scripts directory alongside it.

Very low classification rate (<20%) despite a large database

Common causes: (1) The organisms in your sample are simply not in your database — add their genomes as in Step 3. (2) Reads are very short — Kraken2 performs poorly on reads <50 bp. (3) RNA-seq reads that span splice junctions will not match genomic k-mers; classification improves if the database also includes transcriptome or mRNA sequences for the taxa of interest.

Bracken: "No database found" or "Incompatible database"

Bracken database files (*.kmer_distrib) must match the read length and k-mer size used in Kraken2 build. Run bracken-build with -l matching your actual read length after building the Kraken2 database.
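When this error appears, it is worth checking which read lengths actually have distribution files built. bracken-build writes them into the Kraken2 database directory, named by read length (e.g. database150mers.kmer_distrib) — a quick check, assuming the kraken2_db layout from Step 2:

```shell
# One distribution file per read length bracken-build has been run for
ls -lh kraken2_db/database*mers.kmer_distrib
```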