Amplicon Metagenomics with QIIME2

16S rRNA V4 amplicon analysis of the Arabidopsis thaliana root microbiome (Castrillo et al., Nature 2017) — DADA2 ASVs, taxonomy, diversity metrics, and PERMANOVA.

~120 min PRJEB15671 Intermediate Bash
Byte

Overview

QIIME2 is the leading platform for amplicon microbiome analysis. Unlike QIIME1 (OTUs), QIIME2 uses Amplicon Sequence Variants (ASVs) via DADA2 — exact sequences with single-nucleotide resolution.

Byte
OTUs vs ASVs: OTUs cluster sequences at 97% similarity. ASVs are exact sequences inferred by DADA2's error model — they can distinguish strains differing by a single base, are reproducible across studies, and do not require re-clustering when new samples are added. Tutorial original author: Adam Rivers (USDA-ARS).

Ideal Directory Structure

QIIME2 produces .qza (artifact) and .qzv (visualization) files at every step. Keep them organized by step so you can re-run from any point and track the provenance chain.

qiime2_arabidopsis_microbiome/
qiime2_arabidopsis_microbiome/ ├── data/ # raw FASTQ files (read-only!) │ ├── sample1_R1.fastq.gz │ └── ... (146 samples × 2) ├── manifest.csv # sample-id, fwd path, rev path ├── metadata.tsv # sample metadata (required for diversity) ├── 1_import/ │ ├── paired_end_demux.qza # imported reads artifact │ └── paired_end_demux.qzv # quality summary → view at view.qiime2.org ├── 2_dada2/ │ ├── table.qza # ASV feature table (samples × ASVs) │ ├── rep-seqs.qza # representative ASV sequences │ └── denoising-stats.qza # reads lost at each DADA2 step ├── 3_taxonomy/ │ └── taxonomy.qza # taxonomic assignments for each ASV ├── 4_phylogeny/ │ └── rooted-tree.qza # for UniFrac distances └── 5_diversity/ ├── alpha-diversity.qzv └── beta-diversity.qzv
🔍
What are .qza and .qzv files? They're just ZIP archives! Rename any .qza to .zip and unzip it to see the underlying data (BIOM tables, FASTA, etc.) and the complete provenance log of every command that produced it. .qzv files are visualizations — drag them to view.qiime2.org to see interactive charts.

Dataset

Data from Castrillo et al. (2017, Nature): Arabidopsis phosphate stress mutants grown in natural soil, 16S V4 sequencing of root microbiome. 146 samples, subsampled to 10,000 reads each.

PropertyValue
BioProjectPRJEB15671
Amplicon region16S rRNA V4 (primers 515F/806R)
Samples146 (WT + pho1, pho2, phf1, nla mutants)
PlatformIllumina MiSeq paired-end

Step 1 — Install QIIME2

Bash — QIIME2 Conda install
wget https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2024.10-py310-linux-conda.yml
conda env create -n qiime2-2024.10 \
    --file qiime2-amplicon-2024.10-py310-linux-conda.yml
conda activate qiime2-2024.10
qiime --version
Download:

Step 2 — Import Data

Bash — Manifest file creation and import
conda activate qiime2-2024.10

# Create manifest.csv (paths to your paired FASTQ files)
# Format: sample-id, forward path, reverse path
# Header line required
echo "sample-id,absolute-filepath-forward,absolute-filepath-reverse" > manifest.csv
for r1 in data/*_R1.fastq.gz; do
    r2="${r1/_R1.fastq.gz/_R2.fastq.gz}"
    sample=$(basename $r1 _R1.fastq.gz)
    echo "${sample},$(realpath $r1),$(realpath $r2)" >> manifest.csv
done

# Import into QIIME2 artifact format
qiime tools import \
    --type "SampleData[PairedEndSequencesWithQuality]" \
    --input-path manifest.csv \
    --output-path paired_end_demux.qza \
    --input-format PairedEndFastqManifestPhred33V2

# Summarize quality (open .qzv at https://view.qiime2.org)
qiime demux summarize \
    --i-data paired_end_demux.qza \
    --o-visualization paired_end_demux.qzv
Download:
QIIME2 import commands decoded
📝manifest.csvA CSV telling QIIME2 where each sample's FASTQ files live. The sample-id column becomes the sample name used in all downstream artifacts. realpath converts relative paths to absolute paths — QIIME2 requires absolute paths in the manifest.
📦qiime tools import --typeThe --type must exactly match a QIIME2 semantic type. SampleData[PairedEndSequencesWithQuality] = paired-end FASTQ with quality scores. QIIME2 validates this type at import time — wrong type = immediate error.
📱--input-format PairedEndFastqManifestPhred33V2Specifies the manifest file format AND the quality encoding. Phred33 is the standard Illumina encoding since ~2012. If you have older Illumina data, use Phred64.
🔍qiime demux summarizeCreates a .qzv visualization of per-sample read counts and per-position quality scores. Always inspect this before DADA2 to decide your truncation lengths.
🌐
view.qiime2.org is your best friend. Every .qzv file can be drag-and-dropped there for an interactive visualization — no software needed. Bookmark it now!

Step 3 — DADA2 Denoising

Choose truncation lengths from the quality visualization. Aim to truncate where quality drops below Q25. Forward and reverse reads must still overlap by ≥20 bp.

Bash — DADA2 paired-end denoising
qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paired_end_demux.qza \
    --p-trim-left-f 0 \
    --p-trim-left-r 0 \
    --p-trunc-len-f 220 \
    --p-trunc-len-r 200 \
    --p-n-threads 16 \
    --o-table feature_table.qza \
    --o-representative-sequences rep_seqs.qza \
    --o-denoising-stats denoising_stats.qza

# Summarize results
qiime metadata tabulate \
    --m-input-file denoising_stats.qza \
    --o-visualization denoising_stats.qzv

qiime feature-table summarize \
    --i-table feature_table.qza \
    --m-sample-metadata-file metadata.txt \
    --o-visualization feature_table.qzv
Download:
DADA2 parameters decoded
✂️--p-trim-left-f/r 0Trim this many bases from the 5' (left) end of forward/reverse reads. Set to the primer length if primers weren't removed upstream. Here 0 because the dataset is already primer-trimmed.
✂️--p-trunc-len-f 220Truncate forward reads to 220 bp. DADA2 discards reads shorter than this. Set based on where quality drops in your .qzv inspection — aim to keep Q>25. F reads are usually high quality longer than R reads.
✂️--p-trunc-len-r 200Truncate reverse reads to 200 bp. R reads degrade faster on Illumina. These truncated reads must still overlap the forward reads by ≥20 bp to merge correctly.
🧵--p-n-threads 16DADA2 runs on 16 threads. It's the slowest step in the pipeline (error model learning) — on 146 samples expect 2–6 hours. Use all available cores.
📊denoising-stats.qzaReports reads passing each DADA2 step: input → filtered → denoised → merged → non-chimeric. If you lose >50% at any step, your truncation lengths are wrong or primer trimming was missed.
Truncation rule for paired-end V4: Forward reads are typically good quality for 230–250 bp. Reverse reads degrade faster — truncate at 200 bp or where quality drops. The combined (merged) read must cover the V4 amplicon (~253 bp) with 20 bp overlap. For 2x250 MiSeq runs, trunc-len-f 230 trunc-len-r 200 is a common starting point.

Step 4 — Taxonomy Classification

Bash — SILVA-based taxonomy + barplots
# Download pre-trained SILVA 138 V4 classifier
wget https://data.qiime2.org/2024.10/common/silva-138-99-seqs-515-806.qza

# Classify ASVs (Naive Bayes, trained on V4 region)
qiime feature-classifier classify-sklearn \
    --i-classifier silva-138-99-seqs-515-806.qza \
    --i-reads rep_seqs.qza \
    --o-classification taxonomy.qza \
    --p-n-jobs -1

# Remove chloroplasts and mitochondria (plant study contaminants)
qiime taxa filter-table \
    --i-table feature_table.qza \
    --i-taxonomy taxonomy.qza \
    --p-exclude "Chloroplast,Mitochondria,Unassigned" \
    --o-filtered-table feature_table_filtered.qza

# Taxonomy barplot (interactive, open at view.qiime2.org)
qiime taxa barplot \
    --i-table feature_table_filtered.qza \
    --i-taxonomy taxonomy.qza \
    --m-metadata-file metadata.txt \
    --o-visualization taxa_barplot.qzv
Download:
Taxonomy classification decoded
🧠classify-sklearnNaive Bayes classifier trained on the SILVA 138 database trimmed to V4 region (515F–806R). It classifies each ASV by comparing its k-mer profile to trained taxon profiles. Region-specific training is critical — a full-length 16S classifier on V4 reads gives worse results.
🚫--p-exclude Chloroplast,MitochondriaFilters out ASVs classified as plant chloroplast or mitochondrial 16S (both have 16S-like rRNA). In plant root/leaf studies, these can be 10–40% of reads! Always remove them for a true bacterial community profile.
📊taxa barplotCreates a stacked bar chart showing taxonomic composition per sample. Color = taxonomy level, height = relative abundance. Interactive in the browser — you can switch between phylum, class, order, family, genus levels.
🌱
Why remove Unassigned? ASVs with no classification could be: (1) novel bacteria genuinely not in SILVA, (2) chimeric sequences DADA2 missed, or (3) non-bacterial sequences. For a cleaned analysis, filter them out. For discovery-focused work, keep them — they might be interesting!

Step 5 — Diversity Analysis

Bash — Alpha and Beta diversity
# First build a phylogenetic tree (needed for UniFrac distances)
qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences rep_seqs.qza \
    --o-alignment aligned_rep_seqs.qza \
    --o-masked-alignment masked_aligned_rep_seqs.qza \
    --o-tree unrooted_tree.qza \
    --o-rooted-tree rooted_tree.qza \
    --p-n-threads 16

# Core diversity metrics (rarefied to equal depth)
# Choose sampling depth by inspecting feature_table.qzv
# Rule: keep depth where most samples are retained (e.g., 5000 reads)
SAMPLING_DEPTH=5000

qiime diversity core-metrics-phylogenetic \
    --i-phylogeny rooted_tree.qza \
    --i-table feature_table_filtered.qza \
    --p-sampling-depth ${SAMPLING_DEPTH} \
    --m-metadata-file metadata.txt \
    --output-dir diversity_core/

# This generates:
# Alpha diversity: observed_features, shannon, evenness, faith_pd
# Beta diversity: unweighted_unifrac, weighted_unifrac, bray_curtis, jaccard
# PCoA ordination plots for each beta diversity metric

ls diversity_core/
Download:
Diversity analysis decoded
🌳align-to-tree-mafft-fasttreeOne-command pipeline: aligns ASV sequences with MAFFT, masks poorly aligned columns, builds an unrooted tree with FastTree2, then roots it at the midpoint. The rooted tree is required for Faith's PD and UniFrac distances.
💰--p-sampling-depth 5000Rarefies all samples to 5000 reads before diversity calculation. Rarefaction equalizes sequencing effort across samples. Samples with fewer reads are dropped — choose a depth that keeps most samples. Check feature_table.qzv first.
🌍core-metrics-phylogeneticCalculates 8 diversity metrics in one command: 4 alpha (Shannon, Observed Features, Evenness, Faith PD) and 4 beta (Bray-Curtis, Jaccard, Unweighted UniFrac, Weighted UniFrac) plus PCoA plots for each beta metric.
📌
Alpha vs Beta diversity: Alpha = diversity within a sample (how many species? how even?). Beta = diversity between samples (how different are the communities?). UniFrac is phylogeny-aware — it weights differences by evolutionary distance. Weighted UniFrac also considers abundance; unweighted treats all taxa equally present/absent.

Step 6 — Statistical Testing (PERMANOVA)

Bash — PERMANOVA beta diversity + alpha group significance
# Alpha diversity: test if Shannon diversity differs between genotypes
qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity_core/shannon_vector.qza \
    --m-metadata-file metadata.txt \
    --o-visualization alpha_shannon_significance.qzv

# Beta diversity: PERMANOVA test
# Does genotype explain variation in community composition?
qiime diversity beta-group-significance \
    --i-distance-matrix diversity_core/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file metadata.txt \
    --m-metadata-column genotype \
    --o-visualization uw_unifrac_permanova.qzv \
    --p-method permanova \
    --p-pairwise

# Weighted UniFrac (accounts for abundance)
qiime diversity beta-group-significance \
    --i-distance-matrix diversity_core/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file metadata.txt \
    --m-metadata-column genotype \
    --o-visualization w_unifrac_permanova.qzv \
    --p-method permanova

# Expected result (Castrillo et al.):
# pho1 and pho2 mutants have significantly altered root microbiome (p < 0.001)
# phf1 mutant shows intermediate effect
Download:
Interpreting PERMANOVA: A significant p-value (<0.05) means your grouping variable (e.g., genotype) explains a significant portion of community variation. The R² value tells you how much variation is explained (R²=0.15 means genotype explains 15% of beta diversity). Always check the homogeneity of dispersion test (PERMDISP) — significant PERMDISP means groups differ in variability, not just composition.

Troubleshooting

DADA2: "Not enough reads passed the filters" for many samples

Your truncation length is too aggressive, or reads don't overlap after truncation. Check denoising_stats.qzv to see where reads are lost. Try shorter truncation lengths or remove the strictest filter. For V4 amplicons, merged reads must be ~253 bp so forward+reverse-truncation must exceed ~273 bp.

Taxonomy classification: many ASVs unassigned (>30%)

Check: (1) Did you use the correct region-trained classifier? A full-length SILVA classifier on V4 data performs poorly. Always use a primer-specific trained classifier. (2) If studying environmental or non-16S targets, retrain the classifier on the appropriate reference database.

core-metrics fails: "sampling depth exceeds sample size"

Some samples have fewer reads than your chosen sampling depth. Either lower the depth or filter out low-read-count samples first: qiime feature-table filter-samples --p-min-frequency 5000. View the read count distribution in feature_table.qzv to pick an appropriate rarefaction depth.