Complete end-to-end pipeline using the Arabidopsis thaliana atrx-1 mutant dataset — from raw FASTQ to volcano plots and differentially expressed gene lists.
RNA-seq experiments measure transcriptome-wide gene expression changes. This tutorial covers the complete pipeline for RNA-seq data from a genome-guided mapping approach:
Set up your project folder like this before running any commands. Keeping data, indices, alignments, and results in separate numbered directories makes it easy to find outputs and re-run individual steps without overwriting anything.
ls always shows them in pipeline order. Put raw data in 1_data/ and never write to it — treat it as read-only. This way you can always re-run the whole pipeline from scratch by deleting folders 2 onwards.This tutorial uses a published Arabidopsis thaliana dataset comparing wild-type (WT) plants to atrx-1 loss-of-function mutants. ATRX is a histone chaperone that regulates gene expression epigenetically.
| Condition | Replicate | SRA ID (R1) | SRA ID (R2) |
|---|---|---|---|
| Wild-type (WT) | Rep 1 | SRR4420293_1.fastq.gz | SRR4420293_2.fastq.gz |
| Wild-type (WT) | Rep 2 | SRR4420294_1.fastq.gz | SRR4420294_2.fastq.gz |
| Wild-type (WT) | Rep 3 | SRR4420295_1.fastq.gz | SRR4420295_2.fastq.gz |
| atrx-1 mutant | Rep 1 | SRR4420296_1.fastq.gz | SRR4420296_2.fastq.gz |
| atrx-1 mutant | Rep 2 | SRR4420297_1.fastq.gz | SRR4420297_2.fastq.gz |
| atrx-1 mutant | Rep 3 | SRR4420298_1.fastq.gz | SRR4420298_2.fastq.gz |
Paired-end Illumina HiSeq 2500 reads. BioProject: PRJNA348194
Download reads directly from ENA (fastest option — no conversion needed):
mkdir -p 1_data && cd 1_data # Download all 6 paired-end samples (12 files total) wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/004/SRR4420294/SRR4420294_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/004/SRR4420294/SRR4420294_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/005/SRR4420295/SRR4420295_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/005/SRR4420295/SRR4420295_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/007/SRR4420297/SRR4420297_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/007/SRR4420297/SRR4420297_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/008/SRR4420298/SRR4420298_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/008/SRR4420298/SRR4420298_2.fastq.gz cd .. # Download Arabidopsis TAIR10 genome and annotation from NCBI mkdir -p 0_index && cd 0_index wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gff.gz gzip -d *.gz cd ..
| 📁mkdir -p 1_data | Creates the 1_data folder. The -p flag means "make all parent directories too and don't complain if it already exists" — safe to run multiple times. |
| ⬇️wget <url> | Downloads a file from a URL. Think of it as a command-line browser that saves files directly to disk. We use ENA (European Nucleotide Archive) because it provides direct FTP/HTTPS links — faster than NCBI's SRA download tools. |
| 🗜️gzip -d *.gz | Decompresses all .gz files in the current directory. The -d flag means "decompress" (opposite of the default compress). The * wildcard matches every .gz file so you don't have to name them one by one. |
fastq-dump or fasterq-dump tools which are slower and need additional setup. ENA = instant gratification!#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=2:00:00
#SBATCH --job-name=fastqc_multiqc
#SBATCH --output=fastqc.%j.out
#SBATCH --error=fastqc.%j.err
module load fastqc
module load parallel
module load py-multiqc # or: pip install multiqc
mkdir -p 2_fastqc
OUTDIR=2_fastqc
# Run FastQC on all files in parallel
parallel "fastqc {} -o ${OUTDIR}" ::: 1_data/*.fastq.gz
# Aggregate all reports into one HTML summary
cd ${OUTDIR}
multiqc .
# Output: multiqc_report.html & multiqc_data/
| ⚡parallel "fastqc {}" | Runs FastQC on multiple files simultaneously using GNU Parallel. The {} is a placeholder — parallel replaces it with each filename in turn. Without parallel you'd have to run FastQC once per file in a loop (12× slower). |
| 🔬fastqc {} -o $OUTDIR | Generates a quality report HTML file for each FASTQ. Checks: per-base quality scores, GC content, adapter contamination, duplication rates. Think of it as a full health checkup for your sequencing data. |
| 📊multiqc . | Scans the current directory (.) for any FastQC output files and combines them into a single interactive HTML report. The dot just means "look in the current folder". One report beats twelve! |
A real MultiQC report for this dataset is available as an embedded example:
Real MultiQC output from the Atrx dataset. Adapter content and quality scores visible.
HISAT2 requires a genome index before alignment. Build once, reuse for all samples.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=1:00:00
#SBATCH --job-name=hisat2_index
#SBATCH --output=hisat2_index.%j.out
#SBATCH --error=hisat2_index.%j.err
module load hisat2
GENOME_FNA="0_index/GCF_000001735.4_TAIR10.1_genomic.fna"
GENOME_PREFIX="0_index/GCF_000001735.4_TAIR10.1_genomic"
hisat2-build ${GENOME_FNA} ${GENOME_PREFIX}
# Expected output: 8 index files (.1.ht2 through .8.ht2)
ls 0_index/*.ht2
| 🗺️hisat2-build | Builds a compressed index of the genome so HISAT2 can align reads to it lightning-fast. Think of it like building a book's index — you do it once, then finding any word (read) is instant. Without an index, HISAT2 would have to scan the entire 135 Mb genome for every single read. |
| 📂GENOME_FNA | The input: your genome assembly in FASTA format. One sequence per chromosome. |
| 🏷️GENOME_PREFIX | The output prefix: HISAT2 will create 8 files named prefix.1.ht2 through prefix.8.ht2. Keep these in the same folder — HISAT2 needs all 8 to align. |
After indexing, align each sample. The script below loops over all pairs:
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --job-name=hisat2_align
#SBATCH --output=hisat2_align.%j.out
#SBATCH --error=hisat2_align.%j.err
module load hisat2
module load samtools
GENOME="0_index/GCF_000001735.4_TAIR10.1_genomic"
OUTDIR="3_bam"
mkdir -p ${OUTDIR}
p=8 # threads per alignment
for fq1 in 1_data/*_1.fastq.gz; do
fq2="${fq1/_1.fastq.gz/_2.fastq.gz}"
SAMPLE=$(basename ${fq1} _1.fastq.gz)
echo "Aligning: ${SAMPLE}"
hisat2 -p ${p} \
-x ${GENOME} \
-1 ${fq1} \
-2 ${fq2} \
-S ${OUTDIR}/${SAMPLE}.sam \
2> ${OUTDIR}/${SAMPLE}.log
# Convert SAM to sorted BAM and remove SAM
samtools view --threads ${p} -bS ${OUTDIR}/${SAMPLE}.sam \
| samtools sort --threads ${p} -o ${OUTDIR}/${SAMPLE}.bam
samtools index ${OUTDIR}/${SAMPLE}.bam
rm ${OUTDIR}/${SAMPLE}.sam
echo " Done: ${SAMPLE}.bam"
done
# Check alignment rates
for log in ${OUTDIR}/*.log; do
echo "=== $(basename ${log} .log) ===" && grep "overall alignment" ${log}
done
| 🧵-p 8 | parallel threads — use 8 CPU cores simultaneously. Doubles speed compared to -p 4. Match this to the cores you requested in your SLURM header (--ntasks-per-node). |
| 🗺️-x $GENOME | The genome index prefix — point to the prefix you used in hisat2-build (without the .1.ht2 extension). |
| 📖-1 $fq1 -2 $fq2 | Read 1 (forward) and read 2 (reverse) of your paired-end data. HISAT2 uses the pairing information to improve alignment accuracy — it knows both reads come from the same DNA fragment. |
| 💾-S sample.sam | Output SAM file. SAM is human-readable text alignment format. We convert it to BAM immediately — SAM files are enormous (10–20 GB per sample!) so we never keep them. |
| 🔄samtools view -bS | Converts SAM (S) to BAM (b) — a compressed binary version. Same information, 5–10× smaller. |
| 📑samtools sort | Sorts alignments by genomic coordinate (chromosome + position). This is required by featureCounts and genome browsers like IGV. Unsorted BAMs are like a filing cabinet with folders in random order. |
| 🔍samtools index | Creates a .bai index file alongside your BAM. Like a book index — lets tools jump directly to any chromosomal region without reading the whole BAM. Required by IGV and many GATK tools. |
rm the SAM file? A SAM file for 25M reads is ~15 GB of plain text. Your BAM is the same data compressed to ~2 GB. Always pipe SAM directly to samtools sort and delete the SAM — never let it linger on disk!| Sample | Mapped Pairs | Overall Rate |
|---|---|---|
| SRR4420293 (WT rep1) | ~23M | ~96.5% |
| SRR4420294 (WT rep2) | ~25M | ~96.8% |
| SRR4420295 (WT rep3) | ~22M | ~96.3% |
| SRR4420296 (mut rep1) | ~27M | ~97.1% |
| SRR4420297 (mut rep2) | ~24M | ~96.7% |
| SRR4420298 (mut rep3) | ~26M | ~96.9% |
Expected alignment rates for this high-quality Arabidopsis dataset. Values ≥90% indicate good library quality.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=4:00:00
#SBATCH --job-name=featureCounts
#SBATCH --output=featureCounts.%j.out
#SBATCH --error=featureCounts.%j.err
module load subread # featureCounts is part of the subread package
module load parallel
ANNOT_GFF="0_index/GCF_000001735.4_TAIR10.1_genomic.gff"
INDIR="3_bam"
OUTDIR="4_counts"
mkdir -p ${OUTDIR}
# Run featureCounts on each BAM in parallel (4 jobs x 4 threads each)
# -s 2 = reversely stranded (RiboMinus protocol used in this dataset)
# -p = paired-end
# -t gene, -g ID = count at gene level using the ID attribute
parallel -j 4 \
"featureCounts -T 4 -s 2 -p -t gene -g ID \
-a ${ANNOT_GFF} \
-o ${OUTDIR}/{/.}.gene.txt {}" \
::: ${INDIR}/*.bam
# Merge all sample count files into one matrix
paste \
<(awk 'BEGIN{OFS="\t"} {print $1,$7}' ${OUTDIR}/SRR4420293.gene.txt) \
<(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420294.gene.txt) \
<(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420295.gene.txt) \
<(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420296.gene.txt) \
<(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420297.gene.txt) \
<(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420298.gene.txt) \
| grep -v '^\#' > ${OUTDIR}/At_count.txt
echo "Count matrix created: ${OUTDIR}/At_count.txt"
head -5 ${OUTDIR}/At_count.txt
| 🧵-T 4 | Use 4 CPU Threads per featureCounts job. We run 4 jobs in parallel (via GNU parallel -j 4), so total = 16 cores — matching the SLURM request. |
| ↩️-s 2 | Strandedness: 0=unstranded, 1=forward, 2=reverse. This dataset used the RiboMinus protocol which generates reverse-stranded libraries. Wrong strandedness = ~50% of reads miscounted! |
| 👫-p | Tells featureCounts the data is paired-end — count a fragment only once (not twice, once per read). Essential for paired-end libraries. |
| 🔖-t gene | Count at the type "gene" in the GFF. Could also use "exon" to count individual exons instead of the whole gene locus. |
| 🏷️-g ID | The GFF attribute to use as the gene identifier in the output matrix. For NCBI GFFs, the ID field holds the gene name (e.g., gene0, AT1G01010). |
| 📄-a $ANNOT | The annotation file (GFF/GTF) that tells featureCounts where each gene lives on the genome. |
| 🗂️paste + awk | featureCounts outputs one file per sample. paste merges them side-by-side into a single matrix, and awk extracts only column 7 (the count column) from each file. |
Expected count matrix header (first 8 rows):
Geneid S293 S294 S295 S296 S297 S298 gene0 11 1 10 28 11 13 gene1 37 3 26 88 22 17 gene2 0 0 0 0 0 0 gene3 6 2 12 40 13 13 gene4 35 6 22 170 53 32 gene7 49 15 67 258 83 45
-s 2 (reversely stranded) is critical. Using the wrong strandedness setting can cut detected genes in half. Always check with RSeQC infer_experiment.py or the Identify_Strandedness_Information.md protocol in the workbook.
## RNA-seq Differential Expression Analysis with DESeq2
## Dataset: Arabidopsis atrx-1 mutant vs WT (PRJNA348194)
## Original analysis: Siva Chudalayandi
# Install if needed:
# BiocManager::install("DESeq2")
# install.packages(c("RColorBrewer", "gplots", "ggplot2", "ggrepel"))
library(DESeq2)
library(RColorBrewer)
library(gplots)
library(ggplot2)
library(ggrepel)
setwd("4_counts/")
# 1. Load count matrix
dat <- read.table("At_count.txt", header=TRUE, quote="", row.names=1)
dat <- as.matrix(dat)
head(dat)
# 2. Define experimental design
condition <- factor(c(rep("WT",3), rep("Mut",3)))
condition <- relevel(condition, ref="WT") # WT is the reference
coldata <- data.frame(row.names=colnames(dat), condition)
# 3. Build DESeq2 object and run pipeline
dds <- DESeqDataSetFromMatrix(countData=dat, colData=coldata, design=~condition)
dds <- DESeq(dds)
# 4. Dispersion plot (QC)
png("qc-dispersions.png", 1000, 1000, pointsize=20)
plotDispEsts(dds, main="Dispersion plot")
dev.off()
# 5. rlog transformation for PCA/clustering
rld <- rlogTransformation(dds)
# 6. PCA plot
pca_data <- plotPCA(rld, intgroup="condition", returnData=TRUE)
pct_var <- round(100 * attr(pca_data, "percentVar"))
ggplot(pca_data, aes(PC1, PC2, color=condition)) +
geom_point(size=4) +
xlab(paste0("PC1: ", pct_var[1], "% variance")) +
ylab(paste0("PC2: ", pct_var[2], "% variance")) +
theme_bw() +
ggtitle("PCA - WT vs atrx-1 mutant")
ggsave("PCA.png", width=6, height=5)
# 7. Sample distance heatmap
mycols <- brewer.pal(8, "Dark2")[1:length(unique(condition))]
sampleDists <- as.matrix(dist(t(assay(rld))))
png("qc-heatmap-samples.png", 1000, 1000, pointsize=20)
heatmap.2(sampleDists, key=FALSE, trace="none",
col=colorpanel(100, "black", "white"),
ColSideColors=mycols[condition],
RowSideColors=mycols[condition],
margin=c(10,10), main="Sample Distance Matrix")
dev.off()
# 8. Get DE results
res <- results(dds)
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res),
as.data.frame(counts(dds, normalized=TRUE)),
by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
write.csv(resdata, file="diffexpr-results.csv", quote=FALSE, row.names=FALSE)
# Summary
cat("Total genes tested:", nrow(res), "\n")
cat("DE genes (padj < 0.05):", sum(res$padj < 0.05, na.rm=TRUE), "\n")
# FALSE TRUE
# 6712 204
| 🏗️DESeqDataSetFromMatrix() | Packages your count matrix + sample metadata into a DESeq2 object. The design=~condition formula tells DESeq2 which column in your metadata to use as the grouping variable for testing. |
| 🚀DESeq(dds) | The workhorse — runs three steps in one: (1) estimates size factors (normalizes library sizes), (2) estimates gene-wise dispersion, (3) fits a negative binomial model and runs the Wald test for each gene. Your ~6 min coffee break! |
| 📐relevel(condition, ref="WT") | Sets WT as the reference level so fold changes are calculated as Mut÷WT. Without this, DESeq2 uses alphabetical order ("Mut" < "WT" alphabetically, so it would flip the comparison). |
| 🌡️rlogTransformation() | Regularized log transformation — shrinks counts of lowly-expressed genes towards the mean to avoid them dominating PCA and clustering. Better than simple log2 for visualization (not for DE testing). |
| 📊plotDispEsts() | QC plot showing per-gene dispersion estimates before and after shrinkage. The red trend line should curve smoothly downward. Wild outliers = potential quality issues in that gene. |
| 📋results(dds) | Extracts a table with: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj (BH-corrected). Sort by padj to find your top hits. |
padj (adjusted p-value), never raw pvalue! When testing 27,000 genes simultaneously, ~1,350 genes will appear significant by chance at p<0.05 even if nothing is truly different. Benjamini-Hochberg correction controls the false discovery rate so your hit list is trustworthy.## Volcano Plot
resdata <- read.csv("diffexpr-results.csv")
png("diffexpr-volcanoplot.png", 1200, 1000, pointsize=20)
with(resdata, plot(log2FoldChange, -log10(pvalue), pch=20,
main="Volcano Plot: WT vs atrx-1", cex=0.5, col="grey"))
with(subset(resdata, padj < 0.05),
points(log2FoldChange, -log10(pvalue), pch=20, col="red", cex=0.8))
with(subset(resdata, abs(log2FoldChange) > 2),
points(log2FoldChange, -log10(pvalue), pch=20, col="orange", cex=0.8))
with(subset(resdata, padj < 0.05 & abs(log2FoldChange) > 2),
points(log2FoldChange, -log10(pvalue), pch=20, col="green", cex=1.2))
legend("topright", legend=c("FDR<0.05", "|Log2FC|>2", "Both"),
pch=20, col=c("red","orange","green"), bty="n")
dev.off()
## MA Plot (log2 fold change vs mean expression)
png("diffexpr-maplot.png", 1500, 1000, pointsize=20)
with(resdata, plot(log(baseMean), log2FoldChange, pch=20,
cex=0.5, col="grey50", main="MA Plot"))
with(subset(resdata, padj < 0.05),
points(log(baseMean), log2FoldChange, col="red", pch=20, cex=1))
abline(h=c(-1,1), col="blue", lty=2)
dev.off()
RNA-seq workflow overview
| Output File | Content | Key Value |
|---|---|---|
multiqc_report.html | Aggregated QC for all 12 FASTQ files | All samples pass; Q30 > 90% |
*.bam | Sorted, indexed alignments | ~96% overall alignment rate |
At_count.txt | Gene count matrix (6917 x 6) | ~6900 expressed genes |
diffexpr-results.csv | DESeq2 DE results | 204 DE genes (padj < 0.05) |
PCA.png | PCA of rlog-transformed counts | WT and mutant clusters separate on PC1 |
diffexpr-volcanoplot.png | Volcano plot | Downregulated genes in mutant visible |
The top DE gene (gene32459) has log2FC = -2.06, padj = 5.9e-9. The ATRX loss-of-function primarily downregulates genes involved in chromatin organization.
Most likely a strandedness mismatch. Try all three options: -s 0 (unstranded), -s 1 (stranded), -s 2 (reverse-stranded). The one producing the most assigned reads is correct. Also check the GFF -t and -g attributes match your annotation format.
The column names in the count matrix must match the row names in coldata exactly. Check with all(rownames(coldata) == colnames(dat)) — must return TRUE.
Check if reads are still adapter-contaminated (run FastQC first), whether the correct genome species was used, or if reads are from a different library type (e.g., total RNA with rRNA). Run hisat2 --summary-file for detailed stats per sample.