Amplicon Metagenomics with QIIME2

Overview

QIIME2 is the leading platform for amplicon microbiome analysis. Unlike QIIME1 (OTUs), QIIME2 uses Amplicon Sequence Variants (ASVs) via DADA2 — exact sequences with single-nucleotide resolution.

OTUs vs ASVs: OTUs cluster sequences at 97% similarity. ASVs are exact sequences inferred by DADA2's error model — they can distinguish strains differing by a single base, are reproducible across studies, and do not require re-clustering when new samples are added. Tutorial original author: Adam Rivers (USDA-ARS).

Ideal Directory Structure

QIIME2 produces .qza (artifact) and .qzv (visualization) files at every step. Keep them organized by step so you can re-run from any point and track the provenance chain.

qiime2_arabidopsis_microbiome/

qiime2_arabidopsis_microbiome/ ├── data/ # raw FASTQ files (read-only!) │ ├── sample1_R1.fastq.gz │ └── ... (146 samples × 2) ├── manifest.csv # sample-id, fwd path, rev path ├── metadata.tsv # sample metadata (required for diversity) ├── 1_import/ │ ├── paired_end_demux.qza # imported reads artifact │ └── paired_end_demux.qzv # quality summary → view at view.qiime2.org ├── 2_dada2/ │ ├── table.qza # ASV feature table (samples × ASVs) │ ├── rep-seqs.qza # representative ASV sequences │ └── denoising-stats.qza # reads lost at each DADA2 step ├── 3_taxonomy/ │ └── taxonomy.qza # taxonomic assignments for each ASV ├── 4_phylogeny/ │ └── rooted-tree.qza # for UniFrac distances └── 5_diversity/ ├── alpha-diversity.qzv └── beta-diversity.qzv

🔍

What are .qza and .qzv files? They're just ZIP archives! Rename any .qza to .zip and unzip it to see the underlying data (BIOM tables, FASTA, etc.) and the complete provenance log of every command that produced it. .qzv files are visualizations — drag them to view.qiime2.org to see interactive charts.

Dataset

Data from Castrillo et al. (2017, Nature): Arabidopsis phosphate stress mutants grown in natural soil, 16S V4 sequencing of root microbiome. 146 samples, subsampled to 10,000 reads each.

Property	Value
BioProject	PRJEB15671
Amplicon region	16S rRNA V4 (primers 515F/806R)
Samples	146 (WT + pho1, pho2, phf1, nla mutants)
Platform	Illumina MiSeq paired-end

Step 1 — Install QIIME2

Bash — QIIME2 Conda install

wget https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2024.10-py310-linux-conda.yml
conda env create -n qiime2-2024.10 \
    --file qiime2-amplicon-2024.10-py310-linux-conda.yml
conda activate qiime2-2024.10
qiime --version

Download:

Step 2 — Import Data

Bash — Manifest file creation and import

conda activate qiime2-2024.10

# Create manifest.csv (paths to your paired FASTQ files)
# Format: sample-id, forward path, reverse path
# Header line required
echo "sample-id,absolute-filepath-forward,absolute-filepath-reverse" > manifest.csv
for r1 in data/*_R1.fastq.gz; do
    r2="${r1/_R1.fastq.gz/_R2.fastq.gz}"
    sample=$(basename $r1 _R1.fastq.gz)
    echo "${sample},$(realpath $r1),$(realpath $r2)" >> manifest.csv
done

# Import into QIIME2 artifact format
qiime tools import \
    --type "SampleData[PairedEndSequencesWithQuality]" \
    --input-path manifest.csv \
    --output-path paired_end_demux.qza \
    --input-format PairedEndFastqManifestPhred33V2

# Summarize quality (open .qzv at https://view.qiime2.org)
qiime demux summarize \
    --i-data paired_end_demux.qza \
    --o-visualization paired_end_demux.qzv

Download:

QIIME2 import commands decoded

📝manifest.csv	A CSV telling QIIME2 where each sample's FASTQ files live. The `sample-id` column becomes the sample name used in all downstream artifacts. `realpath` converts relative paths to absolute paths — QIIME2 requires absolute paths in the manifest.
📦qiime tools import --type	The `--type` must exactly match a QIIME2 semantic type. `SampleData[PairedEndSequencesWithQuality]` = paired-end FASTQ with quality scores. QIIME2 validates this type at import time — wrong type = immediate error.
📱--input-format PairedEndFastqManifestPhred33V2	Specifies the manifest file format AND the quality encoding. Phred33 is the standard Illumina encoding since ~2012. If you have older Illumina data, use Phred64.
🔍qiime demux summarize	Creates a `.qzv` visualization of per-sample read counts and per-position quality scores. Always inspect this before DADA2 to decide your truncation lengths.

🌐

view.qiime2.org is your best friend. Every .qzv file can be drag-and-dropped there for an interactive visualization — no software needed. Bookmark it now!

Step 3 — DADA2 Denoising

Choose truncation lengths from the quality visualization. Aim to truncate where quality drops below Q25. Forward and reverse reads must still overlap by ≥20 bp.

Bash — DADA2 paired-end denoising

qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paired_end_demux.qza \
    --p-trim-left-f 0 \
    --p-trim-left-r 0 \
    --p-trunc-len-f 220 \
    --p-trunc-len-r 200 \
    --p-n-threads 16 \
    --o-table feature_table.qza \
    --o-representative-sequences rep_seqs.qza \
    --o-denoising-stats denoising_stats.qza

# Summarize results
qiime metadata tabulate \
    --m-input-file denoising_stats.qza \
    --o-visualization denoising_stats.qzv

qiime feature-table summarize \
    --i-table feature_table.qza \
    --m-sample-metadata-file metadata.txt \
    --o-visualization feature_table.qzv

Download:

DADA2 parameters decoded

✂️--p-trim-left-f/r 0	Trim this many bases from the 5' (left) end of forward/reverse reads. Set to the primer length if primers weren't removed upstream. Here 0 because the dataset is already primer-trimmed.
✂️--p-trunc-len-f 220	Truncate forward reads to 220 bp. DADA2 discards reads shorter than this. Set based on where quality drops in your `.qzv` inspection — aim to keep Q>25. F reads are usually high quality longer than R reads.
✂️--p-trunc-len-r 200	Truncate reverse reads to 200 bp. R reads degrade faster on Illumina. These truncated reads must still overlap the forward reads by ≥20 bp to merge correctly.
🧵--p-n-threads 16	DADA2 runs on 16 threads. It's the slowest step in the pipeline (error model learning) — on 146 samples expect 2–6 hours. Use all available cores.
📊denoising-stats.qza	Reports reads passing each DADA2 step: input → filtered → denoised → merged → non-chimeric. If you lose >50% at any step, your truncation lengths are wrong or primer trimming was missed.

Truncation rule for paired-end V4: Forward reads are typically good quality for 230–250 bp. Reverse reads degrade faster — truncate at 200 bp or where quality drops. The combined (merged) read must cover the V4 amplicon (~253 bp) with 20 bp overlap. For 2x250 MiSeq runs, trunc-len-f 230 trunc-len-r 200 is a common starting point.

Step 4 — Taxonomy Classification

Bash — SILVA-based taxonomy + barplots

# Download pre-trained SILVA 138 V4 classifier
wget https://data.qiime2.org/2024.10/common/silva-138-99-seqs-515-806.qza

# Classify ASVs (Naive Bayes, trained on V4 region)
qiime feature-classifier classify-sklearn \
    --i-classifier silva-138-99-seqs-515-806.qza \
    --i-reads rep_seqs.qza \
    --o-classification taxonomy.qza \
    --p-n-jobs -1

# Remove chloroplasts and mitochondria (plant study contaminants)
qiime taxa filter-table \
    --i-table feature_table.qza \
    --i-taxonomy taxonomy.qza \
    --p-exclude "Chloroplast,Mitochondria,Unassigned" \
    --o-filtered-table feature_table_filtered.qza

# Taxonomy barplot (interactive, open at view.qiime2.org)
qiime taxa barplot \
    --i-table feature_table_filtered.qza \
    --i-taxonomy taxonomy.qza \
    --m-metadata-file metadata.txt \
    --o-visualization taxa_barplot.qzv

Download:

Taxonomy classification decoded

🧠classify-sklearn	Naive Bayes classifier trained on the SILVA 138 database trimmed to V4 region (515F–806R). It classifies each ASV by comparing its k-mer profile to trained taxon profiles. Region-specific training is critical — a full-length 16S classifier on V4 reads gives worse results.
🚫--p-exclude Chloroplast,Mitochondria	Filters out ASVs classified as plant chloroplast or mitochondrial 16S (both have 16S-like rRNA). In plant root/leaf studies, these can be 10–40% of reads! Always remove them for a true bacterial community profile.
📊taxa barplot	Creates a stacked bar chart showing taxonomic composition per sample. Color = taxonomy level, height = relative abundance. Interactive in the browser — you can switch between phylum, class, order, family, genus levels.

🌱

Why remove Unassigned? ASVs with no classification could be: (1) novel bacteria genuinely not in SILVA, (2) chimeric sequences DADA2 missed, or (3) non-bacterial sequences. For a cleaned analysis, filter them out. For discovery-focused work, keep them — they might be interesting!

Step 5 — Diversity Analysis

Bash — Alpha and Beta diversity

# First build a phylogenetic tree (needed for UniFrac distances)
qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences rep_seqs.qza \
    --o-alignment aligned_rep_seqs.qza \
    --o-masked-alignment masked_aligned_rep_seqs.qza \
    --o-tree unrooted_tree.qza \
    --o-rooted-tree rooted_tree.qza \
    --p-n-threads 16

# Core diversity metrics (rarefied to equal depth)
# Choose sampling depth by inspecting feature_table.qzv
# Rule: keep depth where most samples are retained (e.g., 5000 reads)
SAMPLING_DEPTH=5000

qiime diversity core-metrics-phylogenetic \
    --i-phylogeny rooted_tree.qza \
    --i-table feature_table_filtered.qza \
    --p-sampling-depth ${SAMPLING_DEPTH} \
    --m-metadata-file metadata.txt \
    --output-dir diversity_core/

# This generates:
# Alpha diversity: observed_features, shannon, evenness, faith_pd
# Beta diversity: unweighted_unifrac, weighted_unifrac, bray_curtis, jaccard
# PCoA ordination plots for each beta diversity metric

ls diversity_core/

Download:

Diversity analysis decoded

🌳align-to-tree-mafft-fasttree	One-command pipeline: aligns ASV sequences with MAFFT, masks poorly aligned columns, builds an unrooted tree with FastTree2, then roots it at the midpoint. The rooted tree is required for Faith's PD and UniFrac distances.
💰--p-sampling-depth 5000	Rarefies all samples to 5000 reads before diversity calculation. Rarefaction equalizes sequencing effort across samples. Samples with fewer reads are dropped — choose a depth that keeps most samples. Check `feature_table.qzv` first.
🌍core-metrics-phylogenetic	Calculates 8 diversity metrics in one command: 4 alpha (Shannon, Observed Features, Evenness, Faith PD) and 4 beta (Bray-Curtis, Jaccard, Unweighted UniFrac, Weighted UniFrac) plus PCoA plots for each beta metric.

📌

Alpha vs Beta diversity: Alpha = diversity within a sample (how many species? how even?). Beta = diversity between samples (how different are the communities?). UniFrac is phylogeny-aware — it weights differences by evolutionary distance. Weighted UniFrac also considers abundance; unweighted treats all taxa equally present/absent.

Step 6 — Statistical Testing (PERMANOVA)

Bash — PERMANOVA beta diversity + alpha group significance

# Alpha diversity: test if Shannon diversity differs between genotypes
qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity_core/shannon_vector.qza \
    --m-metadata-file metadata.txt \
    --o-visualization alpha_shannon_significance.qzv

# Beta diversity: PERMANOVA test
# Does genotype explain variation in community composition?
qiime diversity beta-group-significance \
    --i-distance-matrix diversity_core/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file metadata.txt \
    --m-metadata-column genotype \
    --o-visualization uw_unifrac_permanova.qzv \
    --p-method permanova \
    --p-pairwise

# Weighted UniFrac (accounts for abundance)
qiime diversity beta-group-significance \
    --i-distance-matrix diversity_core/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file metadata.txt \
    --m-metadata-column genotype \
    --o-visualization w_unifrac_permanova.qzv \
    --p-method permanova

# Expected result (Castrillo et al.):
# pho1 and pho2 mutants have significantly altered root microbiome (p < 0.001)
# phf1 mutant shows intermediate effect

Download:

Interpreting PERMANOVA: A significant p-value (<0.05) means your grouping variable (e.g., genotype) explains a significant portion of community variation. The R² value tells you how much variation is explained (R²=0.15 means genotype explains 15% of beta diversity). Always check the homogeneity of dispersion test (PERMDISP) — significant PERMDISP means groups differ in variability, not just composition.

Troubleshooting

DADA2: "Not enough reads passed the filters" for many samples

Your truncation length is too aggressive, or reads don't overlap after truncation. Check denoising_stats.qzv to see where reads are lost. Try shorter truncation lengths or remove the strictest filter. For V4 amplicons, merged reads must be ~253 bp so forward+reverse-truncation must exceed ~273 bp.

Taxonomy classification: many ASVs unassigned (>30%)

Check: (1) Did you use the correct region-trained classifier? A full-length SILVA classifier on V4 data performs poorly. Always use a primer-specific trained classifier. (2) If studying environmental or non-16S targets, retrain the classifier on the appropriate reference database.

core-metrics fails: "sampling depth exceeds sample size"

Some samples have fewer reads than your chosen sampling depth. Either lower the depth or filter out low-read-count samples first: qiime feature-table filter-samples --p-min-frequency 5000. View the read count distribution in feature_table.qzv to pick an appropriate rarefaction depth.

Amplicon Metagenomics with QIIME2

On this page

Overview

Ideal Directory Structure

Dataset

Step 1 — Install QIIME2

Step 2 — Import Data

Step 3 — DADA2 Denoising

Step 4 — Taxonomy Classification

Step 5 — Diversity Analysis

Step 6 — Statistical Testing (PERMANOVA)

Troubleshooting

DADA2: "Not enough reads passed the filters" for many samples

Taxonomy classification: many ASVs unassigned (>30%)

core-metrics fails: "sampling depth exceeds sample size"