16S rRNA V4 amplicon analysis of the Arabidopsis thaliana root microbiome (Castrillo et al., Nature 2017) — DADA2 ASVs, taxonomy, diversity metrics, and PERMANOVA.
QIIME2 is the leading platform for amplicon microbiome analysis. Unlike QIIME1 (OTUs), QIIME2 uses Amplicon Sequence Variants (ASVs) via DADA2 — exact sequences with single-nucleotide resolution.
QIIME2 produces .qza (artifact) and .qzv (visualization) files at every step. Keep them organized by step so you can re-run from any point and track the provenance chain.
.qza to .zip and unzip it to see the underlying data (BIOM tables, FASTA, etc.) and the complete provenance log of every command that produced it. .qzv files are visualizations — drag them to view.qiime2.org to see interactive charts.Data from Castrillo et al. (2017, Nature): Arabidopsis phosphate stress mutants grown in natural soil, 16S V4 sequencing of root microbiome. 146 samples, subsampled to 10,000 reads each.
| Property | Value |
|---|---|
| BioProject | PRJEB15671 |
| Amplicon region | 16S rRNA V4 (primers 515F/806R) |
| Samples | 146 (WT + pho1, pho2, phf1, nla mutants) |
| Platform | Illumina MiSeq paired-end |
wget https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2024.10-py310-linux-conda.yml
conda env create -n qiime2-2024.10 \
--file qiime2-amplicon-2024.10-py310-linux-conda.yml
conda activate qiime2-2024.10
qiime --version
conda activate qiime2-2024.10
# Create manifest.csv (paths to your paired FASTQ files)
# Format: sample-id, forward path, reverse path
# Header line required
echo "sample-id,absolute-filepath-forward,absolute-filepath-reverse" > manifest.csv
for r1 in data/*_R1.fastq.gz; do
r2="${r1/_R1.fastq.gz/_R2.fastq.gz}"
sample=$(basename $r1 _R1.fastq.gz)
echo "${sample},$(realpath $r1),$(realpath $r2)" >> manifest.csv
done
# Import into QIIME2 artifact format
qiime tools import \
--type "SampleData[PairedEndSequencesWithQuality]" \
--input-path manifest.csv \
--output-path paired_end_demux.qza \
--input-format PairedEndFastqManifestPhred33V2
# Summarize quality (open .qzv at https://view.qiime2.org)
qiime demux summarize \
--i-data paired_end_demux.qza \
--o-visualization paired_end_demux.qzv
| 📝manifest.csv | A CSV telling QIIME2 where each sample's FASTQ files live. The sample-id column becomes the sample name used in all downstream artifacts. realpath converts relative paths to absolute paths — QIIME2 requires absolute paths in the manifest. |
| 📦qiime tools import --type | The --type must exactly match a QIIME2 semantic type. SampleData[PairedEndSequencesWithQuality] = paired-end FASTQ with quality scores. QIIME2 validates this type at import time — wrong type = immediate error. |
| 📱--input-format PairedEndFastqManifestPhred33V2 | Specifies the manifest file format AND the quality encoding. Phred33 is the standard Illumina encoding since ~2012. If you have older Illumina data, use Phred64. |
| 🔍qiime demux summarize | Creates a .qzv visualization of per-sample read counts and per-position quality scores. Always inspect this before DADA2 to decide your truncation lengths. |
.qzv file can be drag-and-dropped there for an interactive visualization — no software needed. Bookmark it now!Choose truncation lengths from the quality visualization. Aim to truncate where quality drops below Q25. Forward and reverse reads must still overlap by ≥20 bp.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired_end_demux.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 220 \
--p-trunc-len-r 200 \
--p-n-threads 16 \
--o-table feature_table.qza \
--o-representative-sequences rep_seqs.qza \
--o-denoising-stats denoising_stats.qza
# Summarize results
qiime metadata tabulate \
--m-input-file denoising_stats.qza \
--o-visualization denoising_stats.qzv
qiime feature-table summarize \
--i-table feature_table.qza \
--m-sample-metadata-file metadata.txt \
--o-visualization feature_table.qzv
| ✂️--p-trim-left-f/r 0 | Trim this many bases from the 5' (left) end of forward/reverse reads. Set to the primer length if primers weren't removed upstream. Here 0 because the dataset is already primer-trimmed. |
| ✂️--p-trunc-len-f 220 | Truncate forward reads to 220 bp. DADA2 discards reads shorter than this. Set based on where quality drops in your .qzv inspection — aim to keep Q>25. F reads are usually high quality longer than R reads. |
| ✂️--p-trunc-len-r 200 | Truncate reverse reads to 200 bp. R reads degrade faster on Illumina. These truncated reads must still overlap the forward reads by ≥20 bp to merge correctly. |
| 🧵--p-n-threads 16 | DADA2 runs on 16 threads. It's the slowest step in the pipeline (error model learning) — on 146 samples expect 2–6 hours. Use all available cores. |
| 📊denoising-stats.qza | Reports reads passing each DADA2 step: input → filtered → denoised → merged → non-chimeric. If you lose >50% at any step, your truncation lengths are wrong or primer trimming was missed. |
trunc-len-f 230 trunc-len-r 200 is a common starting point.# Download pre-trained SILVA 138 V4 classifier
wget https://data.qiime2.org/2024.10/common/silva-138-99-seqs-515-806.qza
# Classify ASVs (Naive Bayes, trained on V4 region)
qiime feature-classifier classify-sklearn \
--i-classifier silva-138-99-seqs-515-806.qza \
--i-reads rep_seqs.qza \
--o-classification taxonomy.qza \
--p-n-jobs -1
# Remove chloroplasts and mitochondria (plant study contaminants)
qiime taxa filter-table \
--i-table feature_table.qza \
--i-taxonomy taxonomy.qza \
--p-exclude "Chloroplast,Mitochondria,Unassigned" \
--o-filtered-table feature_table_filtered.qza
# Taxonomy barplot (interactive, open at view.qiime2.org)
qiime taxa barplot \
--i-table feature_table_filtered.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file metadata.txt \
--o-visualization taxa_barplot.qzv
| 🧠classify-sklearn | Naive Bayes classifier trained on the SILVA 138 database trimmed to V4 region (515F–806R). It classifies each ASV by comparing its k-mer profile to trained taxon profiles. Region-specific training is critical — a full-length 16S classifier on V4 reads gives worse results. |
| 🚫--p-exclude Chloroplast,Mitochondria | Filters out ASVs classified as plant chloroplast or mitochondrial 16S (both have 16S-like rRNA). In plant root/leaf studies, these can be 10–40% of reads! Always remove them for a true bacterial community profile. |
| 📊taxa barplot | Creates a stacked bar chart showing taxonomic composition per sample. Color = taxonomy level, height = relative abundance. Interactive in the browser — you can switch between phylum, class, order, family, genus levels. |
# First build a phylogenetic tree (needed for UniFrac distances)
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences rep_seqs.qza \
--o-alignment aligned_rep_seqs.qza \
--o-masked-alignment masked_aligned_rep_seqs.qza \
--o-tree unrooted_tree.qza \
--o-rooted-tree rooted_tree.qza \
--p-n-threads 16
# Core diversity metrics (rarefied to equal depth)
# Choose sampling depth by inspecting feature_table.qzv
# Rule: keep depth where most samples are retained (e.g., 5000 reads)
SAMPLING_DEPTH=5000
qiime diversity core-metrics-phylogenetic \
--i-phylogeny rooted_tree.qza \
--i-table feature_table_filtered.qza \
--p-sampling-depth ${SAMPLING_DEPTH} \
--m-metadata-file metadata.txt \
--output-dir diversity_core/
# This generates:
# Alpha diversity: observed_features, shannon, evenness, faith_pd
# Beta diversity: unweighted_unifrac, weighted_unifrac, bray_curtis, jaccard
# PCoA ordination plots for each beta diversity metric
ls diversity_core/
| 🌳align-to-tree-mafft-fasttree | One-command pipeline: aligns ASV sequences with MAFFT, masks poorly aligned columns, builds an unrooted tree with FastTree2, then roots it at the midpoint. The rooted tree is required for Faith's PD and UniFrac distances. |
| 💰--p-sampling-depth 5000 | Rarefies all samples to 5000 reads before diversity calculation. Rarefaction equalizes sequencing effort across samples. Samples with fewer reads are dropped — choose a depth that keeps most samples. Check feature_table.qzv first. |
| 🌍core-metrics-phylogenetic | Calculates 8 diversity metrics in one command: 4 alpha (Shannon, Observed Features, Evenness, Faith PD) and 4 beta (Bray-Curtis, Jaccard, Unweighted UniFrac, Weighted UniFrac) plus PCoA plots for each beta metric. |
# Alpha diversity: test if Shannon diversity differs between genotypes
qiime diversity alpha-group-significance \
--i-alpha-diversity diversity_core/shannon_vector.qza \
--m-metadata-file metadata.txt \
--o-visualization alpha_shannon_significance.qzv
# Beta diversity: PERMANOVA test
# Does genotype explain variation in community composition?
qiime diversity beta-group-significance \
--i-distance-matrix diversity_core/unweighted_unifrac_distance_matrix.qza \
--m-metadata-file metadata.txt \
--m-metadata-column genotype \
--o-visualization uw_unifrac_permanova.qzv \
--p-method permanova \
--p-pairwise
# Weighted UniFrac (accounts for abundance)
qiime diversity beta-group-significance \
--i-distance-matrix diversity_core/weighted_unifrac_distance_matrix.qza \
--m-metadata-file metadata.txt \
--m-metadata-column genotype \
--o-visualization w_unifrac_permanova.qzv \
--p-method permanova
# Expected result (Castrillo et al.):
# pho1 and pho2 mutants have significantly altered root microbiome (p < 0.001)
# phf1 mutant shows intermediate effect
Your truncation length is too aggressive, or reads don't overlap after truncation. Check denoising_stats.qzv to see where reads are lost. Try shorter truncation lengths or remove the strictest filter. For V4 amplicons, merged reads must be ~253 bp so forward+reverse-truncation must exceed ~273 bp.
Check: (1) Did you use the correct region-trained classifier? A full-length SILVA classifier on V4 data performs poorly. Always use a primer-specific trained classifier. (2) If studying environmental or non-16S targets, retrain the classifier on the appropriate reference database.
Some samples have fewer reads than your chosen sampling depth. Either lower the depth or filter out low-read-count samples first: qiime feature-table filter-samples --p-min-frequency 5000. View the read count distribution in feature_table.qzv to pick an appropriate rarefaction depth.