Metagenomics & Microbiome Analysis

Profile microbial communities from 16S rRNA amplicon data with QIIME2 and from shotgun metagenomics with Kraken2. Diversity metrics, taxonomic composition, and differential abundance — all explained.

~60 min Intermediate Bash / Python / R
Byte with networks

? Amplicon vs. Shotgun Metagenomics

Metagenomics studies the genetic material from entire microbial communities without culturing individual organisms. Two main approaches exist:

16S rRNA Amplicon
  • Sequences a single marker gene (16S rRNA for bacteria, ITS for fungi)
  • Cheaper, lower computational demand
  • Reveals who is there (taxonomy) and relative abundances
  • Cannot tell you what they're doing (functional potential)
  • Tool of choice: QIIME2
Shotgun Metagenomics
  • Sequences all DNA in the sample (every organism, every gene)
  • More expensive, computationally intensive
  • Reveals who is there AND what they can do (functional pathways)
  • Better species/strain-level resolution
  • Tools: Kraken2, MetaPhlAn4, HUMAnN3
Byte key point
Which Should You Use? If your question is purely "what microbes are present and how do they differ between groups?" — 16S is cost-effective and well-validated. If you need species-level resolution, strain tracking, antibiotic resistance genes, or metabolic pathway profiling — invest in shotgun metagenomics.

16S rRNA Analysis with QIIME2

1 Install QIIME2 & Import Data

Bash
# Install QIIME2 via conda (check qiime2.org for latest version)
wget https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2024.5-py39-linux-conda.yml
conda env create -n qiime2 --file qiime2-amplicon-2024.5-py39-linux-conda.yml
conda activate qiime2

# Verify
qiime --version

QIIME2 uses artifacts (.qza files) and visualizations (.qzv files). Data must be imported into this artifact system before processing.

Bash
# Import paired-end demultiplexed FASTQ files
# Your files should be in Casava 1.8 format in a directory
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw_reads/ \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired.qza

# Visualize read quality
qiime demux summarize \
  --i-data demux-paired.qza \
  --o-visualization demux-summary.qzv

# View the visualization at https://view.qiime2.org or:
qiime tools view demux-summary.qzv
Byte tip
Quality Profile is Critical: The demux-summary.qzv shows per-base quality scores for your forward and reverse reads. You'll use this to decide where to truncate reads in the denoising step. Look for where the median quality drops below Q20 — that's your truncation point.

2 Denoising with DADA2

DADA2 corrects sequencing errors and produces Amplicon Sequence Variants (ASVs) — exact sequences resolved to single-nucleotide differences. ASVs have largely replaced OTUs (97% similarity clusters) as the modern standard.

Bash
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-max-ee-f 2.0 \
  --p-max-ee-r 2.0 \
  --p-n-threads 8 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

# Check how many reads passed each step
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv
How to Pick DADA2 Truncation Parameters
ParameterWhat It DoesHow to Choose
--p-trunc-len-fTruncate forward reads at this positionFrom the quality plot: where does median quality drop below ~Q25? Truncate there. For 300bp V3-V4 reads, typically 240–280.
--p-trunc-len-rTruncate reverse reads at this positionReverse reads are lower quality. Usually truncate 20–50bp shorter than forward. Must retain enough overlap for merging (~20bp minimum).
--p-trim-left-f/rTrim bases from the start of readsUse to remove primers if they weren't already removed. Set to primer length (e.g., 17–21bp for common 16S primers).
--p-max-eeMaximum expected errors — reads exceeding this are discarded2.0 is standard. Decrease to 1.0 for stricter quality; increase to 3.0 if you're losing too many reads.

Critical: After truncation, forward + reverse reads must still overlap by ≥ 20bp for DADA2 to merge them. Calculate: trunc_f + trunc_r - amplicon_length ≥ 20. For the V3-V4 region (~460bp), truncating at 240+200 = 440 gives about no overlap — you may need to adjust.

Byte waiting
DADA2 is slow! For a typical 16S dataset with millions of reads, expect 30 minutes to several hours. Using --p-n-threads 8 helps significantly. If you have many samples, consider running on a cluster.

3 Taxonomic Classification

Bash
# Download a pre-trained classifier (Silva 138, V3-V4 region)
# Check https://docs.qiime2.org for the latest classifiers
wget https://data.qiime2.org/2024.5/common/silva-138-99-nb-classifier.qza

# Classify ASVs
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs 4 \
  --o-classification taxonomy.qza

# Visualize taxonomy
qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

# Create a barplot of taxonomic composition
qiime taxa barplot \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv
Byte key point
Classifier Choice Matters:
  • Silva 138 — most comprehensive for bacteria and archaea; standard choice for 16S
  • Greengenes2 — newer, phylogenetically consistent; gaining popularity
  • UNITE — for ITS (fungal) amplicon data
  • Always use a classifier trained on your specific primer region (e.g., V3-V4, V4 only). A full-length 16S classifier on V4-only reads will perform worse.
Filter Out Non-Target Sequences
Bash
# Remove mitochondria, chloroplast, and unassigned sequences
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude "mitochondria,chloroplast,Unassigned" \
  --o-filtered-table table-filtered.qza

4 Diversity Analysis

Diversity analysis answers two fundamental ecological questions:

  • Alpha diversity — how diverse is each sample? (within-sample richness & evenness)
  • Beta diversity — how different are communities between samples? (between-sample distance)
Rarefaction & Core Metrics
Bash
# First, check the sampling depth distribution
qiime feature-table summarize \
  --i-table table-filtered.qza \
  --o-visualization table-summary.qzv

# Run core diversity metrics (rarefies to even depth)
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table-filtered.qza \
  --p-sampling-depth 10000 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-results
How to Choose Rarefaction Depth
  • Look at the table-summary.qzv — it shows the read count per sample
  • Goal: pick a depth that retains the most samples while capturing most of the diversity
  • Rule of thumb: use the lowest sample's read count, but drop samples below ~1,000 reads (they're unreliable)
  • Typical range: 5,000–50,000 depending on sequencing depth
  • Use a rarefaction curve (qiime diversity alpha-rarefaction) to confirm your chosen depth captures the diversity plateau
Statistical Tests
Bash
# Alpha diversity: compare richness between groups
qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results/shannon_vector.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization shannon-significance.qzv

# Beta diversity: PERMANOVA test (are communities different between groups?)
qiime diversity beta-group-significance \
  --i-distance-matrix core-metrics-results/weighted_unifrac_distance_matrix.qza \
  --m-metadata-column treatment \
  --m-metadata-file sample-metadata.tsv \
  --p-method permanova \
  --p-pairwise \
  --o-visualization permanova-results.qzv
Byte explains
Alpha Diversity Metrics Explained:
  • Observed Features — simple richness: how many unique ASVs are present
  • Shannon Index — accounts for both richness and evenness; higher = more diverse
  • Faith's PD — phylogenetic diversity; sums branch lengths on the phylogenetic tree; captures evolutionary breadth
Beta Diversity Metrics:
  • Bray-Curtis — abundance-based, non-phylogenetic; most common for general use
  • Weighted UniFrac — incorporates phylogeny and abundance; best for detecting changes in dominant taxa
  • Unweighted UniFrac — phylogenetic but presence/absence only; sensitive to rare taxa

Shotgun Metagenomics with Kraken2

5 Taxonomic Classification with Kraken2

Bash
# Install
conda install -c bioconda kraken2 bracken

# Download a standard database (requires ~100GB disk for standard, or ~8GB for minikraken)
kraken2-build --standard --db kraken2_standard_db --threads 16

# Or use a smaller pre-built database
# wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz

# Run Kraken2 classification
kraken2 \
  --db kraken2_standard_db \
  --paired \
  --threads 12 \
  --output sample1.kraken.out \
  --report sample1.kraken.report \
  --minimum-hit-groups 3 \
  --confidence 0.1 \
  trimmed_R1.fastq.gz trimmed_R2.fastq.gz
ParameterPurposeHow to Choose
--confidence 0.1Minimum fraction of k-mers that must map to a taxon0 = classify everything (more false positives). 0.1 = moderate. 0.2–0.5 = stringent. Start at 0.1 and increase if you see too many low-abundance false hits.
--minimum-hit-groups 3Minimum distinct k-mer groups matching a taxonHigher values reduce false positives at the cost of sensitivity. 2–3 is standard. Use 5 for high-confidence results.
Refine Abundances with Bracken

Kraken2 assigns reads to the lowest common ancestor, which inflates higher taxonomic levels. Bracken redistributes reads to species level using a Bayesian approach.

Bash
# Build Bracken database (one-time, needs the Kraken2 DB)
bracken-build -d kraken2_standard_db -t 16 -l 150

# Run Bracken to estimate species-level abundances
bracken \
  -d kraken2_standard_db \
  -i sample1.kraken.report \
  -o sample1.bracken.out \
  -w sample1.bracken.report \
  -r 150 \
  -l S \
  -t 10
Bracken Parameters
  • -r 150 — read length (match to your actual read length)
  • -l S — level to estimate: S = species, G = genus, F = family. Species-level requires sufficient database coverage.
  • -t 10 — minimum reads at the taxonomic level to report. Increase to 50–100 to remove very low-abundance noise.

6 Visualization with Krona & R

Bash
# Interactive Krona chart
conda install -c bioconda krona
ktUpdateTaxonomy.sh

# Convert Kraken report to Krona format
cut -f2,3 sample1.kraken.out > sample1.krona.input
ktImportTaxonomy sample1.krona.input -o sample1.krona.html

# Open the interactive HTML in a browser
Stacked Bar Chart in R
R
library(tidyverse)

# Read Bracken output for multiple samples (combine into one table)
files <- list.files(pattern = "*.bracken.out", full.names = TRUE)
bracken_df <- map_dfr(files, ~ read_tsv(.x) %>%
  mutate(sample = gsub(".bracken.out", "", basename(.x))))

# Top 15 species by mean abundance
top_species <- bracken_df %>%
  group_by(name) %>%
  summarize(mean_frac = mean(fraction_total_reads)) %>%
  slice_max(mean_frac, n = 15) %>%
  pull(name)

bracken_df %>%
  mutate(taxon = ifelse(name %in% top_species, name, "Other")) %>%
  group_by(sample, taxon) %>%
  summarize(abundance = sum(fraction_total_reads), .groups = "drop") %>%
  ggplot(aes(x = sample, y = abundance, fill = taxon)) +
  geom_col() +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Taxonomic Composition (Species Level)",
       x = NULL, y = "Relative Abundance", fill = "Species")

7 Differential Abundance Testing

Which taxa are significantly enriched or depleted between your conditions? This requires compositional-aware methods because microbiome data is inherently relative (proportions that sum to 1).

R
# ANCOM-BC2 (recommended for microbiome differential abundance)
BiocManager::install("ANCOMBC")
library(ANCOMBC)
library(phyloseq)

# Build a phyloseq object from your data
# (OTU/ASV table, taxonomy table, sample metadata)
ps <- phyloseq(
  otu_table(count_matrix, taxa_are_rows = TRUE),
  tax_table(taxonomy_matrix),
  sample_data(sample_metadata)
)

# Run ANCOM-BC2
result <- ancombc2(
  data     = ps,
  fix_formula = "treatment",
  p_adj_method = "BH",
  alpha    = 0.05,
  group    = "treatment",
  struc_zero = TRUE,     # handle structural zeros
  neg_lb   = TRUE
)

# View significant taxa
sig_taxa <- result$res %>%
  filter(diff_treatment == TRUE)
print(sig_taxa)
Byte warns
Avoid These Common Mistakes:
  • Don't use standard t-tests or DESeq2 directly on relative abundance data — microbiome data is compositional; an increase in one taxon forces a decrease in others.
  • Use compositional methods: ANCOM-BC2, ALDEx2, MaAsLin2, or LinDA — all handle the compositionality issue.
  • Don't over-interpret rare taxa (< 0.1% abundance) — they're below the noise floor of most sequencing experiments.

Summary

Byte
What You Learned:
  1. The difference between 16S amplicon and shotgun metagenomics approaches
  2. Import, denoise (DADA2), and classify 16S data with QIIME2
  3. Choose DADA2 truncation parameters from quality profiles
  4. Perform alpha and beta diversity analyses with appropriate metrics
  5. Run Kraken2 + Bracken for shotgun metagenomic classification
  6. Visualize taxonomic composition with Krona and ggplot2
  7. Perform differential abundance testing with ANCOM-BC2
  8. Avoid common pitfalls like using non-compositional statistics on microbiome data