Profile microbial communities from 16S rRNA amplicon data with QIIME2 and from shotgun metagenomics with Kraken2. Diversity metrics, taxonomic composition, and differential abundance — all explained.
Metagenomics studies the genetic material from entire microbial communities without culturing individual organisms. Two main approaches exist:
# Install QIIME2 via conda (check qiime2.org for latest version) wget https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2024.5-py39-linux-conda.yml conda env create -n qiime2 --file qiime2-amplicon-2024.5-py39-linux-conda.yml conda activate qiime2 # Verify qiime --version
QIIME2 uses artifacts (.qza files) and visualizations (.qzv files). Data must be imported into this artifact system before processing.
# Import paired-end demultiplexed FASTQ files # Your files should be in Casava 1.8 format in a directory qiime tools import \ --type 'SampleData[PairedEndSequencesWithQuality]' \ --input-path raw_reads/ \ --input-format CasavaOneEightSingleLanePerSampleDirFmt \ --output-path demux-paired.qza # Visualize read quality qiime demux summarize \ --i-data demux-paired.qza \ --o-visualization demux-summary.qzv # View the visualization at https://view.qiime2.org or: qiime tools view demux-summary.qzv
demux-summary.qzv shows per-base quality scores for your forward and reverse reads. You'll use this to decide where to truncate reads in the denoising step. Look for where the median quality drops below Q20 — that's your truncation point.
DADA2 corrects sequencing errors and produces Amplicon Sequence Variants (ASVs) — exact sequences resolved to single-nucleotide differences. ASVs have largely replaced OTUs (97% similarity clusters) as the modern standard.
qiime dada2 denoise-paired \ --i-demultiplexed-seqs demux-paired.qza \ --p-trunc-len-f 240 \ --p-trunc-len-r 200 \ --p-trim-left-f 0 \ --p-trim-left-r 0 \ --p-max-ee-f 2.0 \ --p-max-ee-r 2.0 \ --p-n-threads 8 \ --o-table table.qza \ --o-representative-sequences rep-seqs.qza \ --o-denoising-stats denoising-stats.qza # Check how many reads passed each step qiime metadata tabulate \ --m-input-file denoising-stats.qza \ --o-visualization denoising-stats.qzv
| Parameter | What It Does | How to Choose |
|---|---|---|
--p-trunc-len-f | Truncate forward reads at this position | From the quality plot: where does median quality drop below ~Q25? Truncate there. For 300bp V3-V4 reads, typically 240–280. |
--p-trunc-len-r | Truncate reverse reads at this position | Reverse reads are lower quality. Usually truncate 20–50bp shorter than forward. Must retain enough overlap for merging (~20bp minimum). |
--p-trim-left-f/r | Trim bases from the start of reads | Use to remove primers if they weren't already removed. Set to primer length (e.g., 17–21bp for common 16S primers). |
--p-max-ee | Maximum expected errors — reads exceeding this are discarded | 2.0 is standard. Decrease to 1.0 for stricter quality; increase to 3.0 if you're losing too many reads. |
Critical: After truncation, forward + reverse reads must still overlap by ≥ 20bp for DADA2 to merge them. Calculate: trunc_f + trunc_r - amplicon_length ≥ 20. For the V3-V4 region (~460bp), truncating at 240+200 = 440 gives about no overlap — you may need to adjust.
--p-n-threads 8 helps significantly. If you have many samples, consider running on a cluster.
# Download a pre-trained classifier (Silva 138, V3-V4 region) # Check https://docs.qiime2.org for the latest classifiers wget https://data.qiime2.org/2024.5/common/silva-138-99-nb-classifier.qza # Classify ASVs qiime feature-classifier classify-sklearn \ --i-classifier silva-138-99-nb-classifier.qza \ --i-reads rep-seqs.qza \ --p-n-jobs 4 \ --o-classification taxonomy.qza # Visualize taxonomy qiime metadata tabulate \ --m-input-file taxonomy.qza \ --o-visualization taxonomy.qzv # Create a barplot of taxonomic composition qiime taxa barplot \ --i-table table.qza \ --i-taxonomy taxonomy.qza \ --m-metadata-file sample-metadata.tsv \ --o-visualization taxa-bar-plots.qzv
# Remove mitochondria, chloroplast, and unassigned sequences qiime taxa filter-table \ --i-table table.qza \ --i-taxonomy taxonomy.qza \ --p-exclude "mitochondria,chloroplast,Unassigned" \ --o-filtered-table table-filtered.qza
Diversity analysis answers two fundamental ecological questions:
# First, check the sampling depth distribution qiime feature-table summarize \ --i-table table-filtered.qza \ --o-visualization table-summary.qzv # Run core diversity metrics (rarefies to even depth) qiime diversity core-metrics-phylogenetic \ --i-phylogeny rooted-tree.qza \ --i-table table-filtered.qza \ --p-sampling-depth 10000 \ --m-metadata-file sample-metadata.tsv \ --output-dir core-metrics-results
table-summary.qzv — it shows the read count per sampleqiime diversity alpha-rarefaction) to confirm your chosen depth captures the diversity plateau# Alpha diversity: compare richness between groups qiime diversity alpha-group-significance \ --i-alpha-diversity core-metrics-results/shannon_vector.qza \ --m-metadata-file sample-metadata.tsv \ --o-visualization shannon-significance.qzv # Beta diversity: PERMANOVA test (are communities different between groups?) qiime diversity beta-group-significance \ --i-distance-matrix core-metrics-results/weighted_unifrac_distance_matrix.qza \ --m-metadata-column treatment \ --m-metadata-file sample-metadata.tsv \ --p-method permanova \ --p-pairwise \ --o-visualization permanova-results.qzv
# Install conda install -c bioconda kraken2 bracken # Download a standard database (requires ~100GB disk for standard, or ~8GB for minikraken) kraken2-build --standard --db kraken2_standard_db --threads 16 # Or use a smaller pre-built database # wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz # Run Kraken2 classification kraken2 \ --db kraken2_standard_db \ --paired \ --threads 12 \ --output sample1.kraken.out \ --report sample1.kraken.report \ --minimum-hit-groups 3 \ --confidence 0.1 \ trimmed_R1.fastq.gz trimmed_R2.fastq.gz
| Parameter | Purpose | How to Choose |
|---|---|---|
--confidence 0.1 | Minimum fraction of k-mers that must map to a taxon | 0 = classify everything (more false positives). 0.1 = moderate. 0.2–0.5 = stringent. Start at 0.1 and increase if you see too many low-abundance false hits. |
--minimum-hit-groups 3 | Minimum distinct k-mer groups matching a taxon | Higher values reduce false positives at the cost of sensitivity. 2–3 is standard. Use 5 for high-confidence results. |
Kraken2 assigns reads to the lowest common ancestor, which inflates higher taxonomic levels. Bracken redistributes reads to species level using a Bayesian approach.
# Build Bracken database (one-time, needs the Kraken2 DB) bracken-build -d kraken2_standard_db -t 16 -l 150 # Run Bracken to estimate species-level abundances bracken \ -d kraken2_standard_db \ -i sample1.kraken.report \ -o sample1.bracken.out \ -w sample1.bracken.report \ -r 150 \ -l S \ -t 10
-r 150 — read length (match to your actual read length)-l S — level to estimate: S = species, G = genus, F = family. Species-level requires sufficient database coverage.-t 10 — minimum reads at the taxonomic level to report. Increase to 50–100 to remove very low-abundance noise.# Interactive Krona chart conda install -c bioconda krona ktUpdateTaxonomy.sh # Convert Kraken report to Krona format cut -f2,3 sample1.kraken.out > sample1.krona.input ktImportTaxonomy sample1.krona.input -o sample1.krona.html # Open the interactive HTML in a browser
library(tidyverse)
# Read Bracken output for multiple samples (combine into one table)
files <- list.files(pattern = "*.bracken.out", full.names = TRUE)
bracken_df <- map_dfr(files, ~ read_tsv(.x) %>%
mutate(sample = gsub(".bracken.out", "", basename(.x))))
# Top 15 species by mean abundance
top_species <- bracken_df %>%
group_by(name) %>%
summarize(mean_frac = mean(fraction_total_reads)) %>%
slice_max(mean_frac, n = 15) %>%
pull(name)
bracken_df %>%
mutate(taxon = ifelse(name %in% top_species, name, "Other")) %>%
group_by(sample, taxon) %>%
summarize(abundance = sum(fraction_total_reads), .groups = "drop") %>%
ggplot(aes(x = sample, y = abundance, fill = taxon)) +
geom_col() +
scale_fill_brewer(palette = "Set3") +
theme_minimal(base_size = 14) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Taxonomic Composition (Species Level)",
x = NULL, y = "Relative Abundance", fill = "Species")
Which taxa are significantly enriched or depleted between your conditions? This requires compositional-aware methods because microbiome data is inherently relative (proportions that sum to 1).
# ANCOM-BC2 (recommended for microbiome differential abundance)
BiocManager::install("ANCOMBC")
library(ANCOMBC)
library(phyloseq)
# Build a phyloseq object from your data
# (OTU/ASV table, taxonomy table, sample metadata)
ps <- phyloseq(
otu_table(count_matrix, taxa_are_rows = TRUE),
tax_table(taxonomy_matrix),
sample_data(sample_metadata)
)
# Run ANCOM-BC2
result <- ancombc2(
data = ps,
fix_formula = "treatment",
p_adj_method = "BH",
alpha = 0.05,
group = "treatment",
struc_zero = TRUE, # handle structural zeros
neg_lb = TRUE
)
# View significant taxa
sig_taxa <- result$res %>%
filter(diff_treatment == TRUE)
print(sig_taxa)