RNA-seq Analysis with HISAT2, featureCounts & DESeq2

Complete end-to-end pipeline using the Arabidopsis thaliana atrx-1 mutant dataset — from raw FASTQ to volcano plots and differentially expressed gene lists.

Time: ~120 min · Dataset: PRJNA348194 · Level: Intermediate · Tools: Bash + R

Overview

RNA-seq experiments measure transcriptome-wide gene expression changes. This tutorial walks through the complete genome-guided pipeline, from raw reads to differentially expressed genes:

1. QC — FastQC per-sample quality reports aggregated with MultiQC
2. Alignment — HISAT2 splice-aware alignment to the Arabidopsis TAIR10 genome
3. Quantification — featureCounts assigns mapped reads to annotated genes
4. DE Analysis — DESeq2 identifies differentially expressed genes (WT vs atrx-1 mutant)
Why HISAT2 and not BWA or Bowtie2? RNA-seq reads come from processed mRNA, so they span intron-exon boundaries. Generic DNA aligners cannot place these spliced reads correctly. HISAT2 is splice-aware and was designed specifically for this purpose, making it the gold standard for genome-guided RNA-seq alignment.

Ideal Directory Structure

Set up your project folder like this before running any commands. Keeping data, indices, alignments, and results in separate numbered directories makes it easy to find outputs and re-run individual steps without overwriting anything.

rnaseq_atrx_project/
├── 0_index/              # genome FASTA + GFF + HISAT2 index files (.ht2)
├── 1_data/               # raw FASTQ files (input — never modify!)
│   ├── SRR4420293_1.fastq.gz
│   ├── SRR4420293_2.fastq.gz
│   └── ... (12 files total)
├── 2_fastqc/             # FastQC HTML reports + MultiQC summary
├── 3_bam/                # sorted BAM files + .bai index files
│   ├── SRR4420293.bam
│   ├── SRR4420293.bam.bai
│   └── ... (6 BAM + 6 BAI)
├── 4_counts/             # featureCounts output matrix
│   └── At_count.txt      # gene × sample count table
├── 5_deseq2/             # R analysis outputs
│   ├── DESeq2_results.csv
│   ├── volcano_plot.png
│   └── MA_plot.png
└── scripts/              # your .sh and .R script files
💡 Pro tip: Number your folders (0_, 1_, 2_...) so ls always shows them in pipeline order. Put raw data in 1_data/ and never write to it — treat it as read-only. This way you can always re-run the whole pipeline from scratch by deleting folders 2 onwards.
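The layout above can be scaffolded in one command; a minimal sketch (the brace expansion requires bash):

```shell
# Create the full numbered project layout in one go (bash brace expansion)
mkdir -p rnaseq_atrx_project/{0_index,1_data,2_fastqc,3_bam,4_counts,5_deseq2,scripts}
ls rnaseq_atrx_project
```

Running `mkdir -p` again later is harmless, so the same line can sit at the top of every script.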

Dataset

This tutorial uses a published Arabidopsis thaliana dataset comparing wild-type (WT) plants to atrx-1 loss-of-function mutants. ATRX is a histone chaperone that regulates gene expression epigenetically.

Condition        Replicate  SRA ID (R1)             SRA ID (R2)
Wild-type (WT)   Rep 1      SRR4420293_1.fastq.gz   SRR4420293_2.fastq.gz
Wild-type (WT)   Rep 2      SRR4420294_1.fastq.gz   SRR4420294_2.fastq.gz
Wild-type (WT)   Rep 3      SRR4420295_1.fastq.gz   SRR4420295_2.fastq.gz
atrx-1 mutant    Rep 1      SRR4420296_1.fastq.gz   SRR4420296_2.fastq.gz
atrx-1 mutant    Rep 2      SRR4420297_1.fastq.gz   SRR4420297_2.fastq.gz
atrx-1 mutant    Rep 3      SRR4420298_1.fastq.gz   SRR4420298_2.fastq.gz

Paired-end Illumina HiSeq 2500 reads. BioProject: PRJNA348194

Step 1 — Download Data

Download reads directly from ENA (fastest option — no conversion needed):

Bash — Download reads from ENA
mkdir -p 1_data && cd 1_data

# Download all 6 paired-end samples (12 files total)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/004/SRR4420294/SRR4420294_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/004/SRR4420294/SRR4420294_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/005/SRR4420295/SRR4420295_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/005/SRR4420295/SRR4420295_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/007/SRR4420297/SRR4420297_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/007/SRR4420297/SRR4420297_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/008/SRR4420298/SRR4420298_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/008/SRR4420298/SRR4420298_2.fastq.gz
cd ..

# Download Arabidopsis TAIR10 genome and annotation from NCBI
mkdir -p 0_index && cd 0_index
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gff.gz
gzip -d *.gz
cd ..
What does each command do?
📁 mkdir -p 1_data: Creates the 1_data folder. The -p flag means "make all parent directories too and don't complain if it already exists" — safe to run multiple times.
⬇️ wget <url>: Downloads a file from a URL. Think of it as a command-line browser that saves files directly to disk. We use ENA (European Nucleotide Archive) because it provides direct FTP/HTTPS links — faster than NCBI's SRA download tools.
🗜️ gzip -d *.gz: Decompresses all .gz files in the current directory. The -d flag means "decompress" (opposite of the default compress). The * wildcard matches every .gz file so you don't have to name them one by one.
🎯 Why ENA and not NCBI SRA? Both host the same data, but ENA provides plain FASTQ files directly accessible via wget/curl. NCBI SRA requires the fastq-dump or fasterq-dump tools which are slower and need additional setup. ENA = instant gratification!

Step 2 — Quality Control with FastQC & MultiQC

Bash — FastQC + MultiQC (SLURM)
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=2:00:00
#SBATCH --job-name=fastqc_multiqc
#SBATCH --output=fastqc.%j.out
#SBATCH --error=fastqc.%j.err

module load fastqc
module load parallel
module load py-multiqc   # or: pip install multiqc

mkdir -p 2_fastqc
OUTDIR=2_fastqc

# Run FastQC on all files in parallel
parallel "fastqc {} -o ${OUTDIR}" ::: 1_data/*.fastq.gz

# Aggregate all reports into one HTML summary
cd ${OUTDIR}
multiqc .
# Output: multiqc_report.html  &  multiqc_data/
What does each command do?
⚡ parallel "fastqc {}": Runs FastQC on multiple files simultaneously using GNU Parallel. The {} is a placeholder — parallel replaces it with each filename in turn. Without parallel you'd have to run FastQC once per file in a loop (12× slower).
🔬 fastqc {} -o $OUTDIR: Generates a quality report HTML file for each FASTQ. Checks: per-base quality scores, GC content, adapter contamination, duplication rates. Think of it as a full health checkup for your sequencing data.
📊 multiqc .: Scans the current directory (.) for any FastQC output files and combines them into a single interactive HTML report. The dot just means "look in the current folder". One report beats twelve!
What to check in FastQC/MultiQC: Per-base quality should stay in the green zone (Q ≥ 28). Adapter content should be minimal. A high percentage of duplicate reads in RNA-seq is normal (high-expression genes produce many identical reads). Do not over-trim — modern aligners handle low-quality tails gracefully.
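A cheap sanity check that complements FastQC: a FASTQ record is exactly 4 lines, so read counts are lines divided by 4, and the _1 and _2 files of a pair must contain the same number of reads. A self-contained sketch on a fabricated file (demo.fastq.gz is created here just for illustration):

```shell
# A FASTQ record is 4 lines: header, sequence, '+', qualities.
# Fabricate a 2-read file so the snippet runs anywhere:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' | gzip > demo.fastq.gz

# reads = lines / 4
reads=$(( $(gzip -dc demo.fastq.gz | wc -l) / 4 ))
echo "demo.fastq.gz: ${reads} reads"   # prints: demo.fastq.gz: 2 reads
```

On the real data, run the same count on each `*_1.fastq.gz` / `*_2.fastq.gz` pair and confirm the numbers match before aligning.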

A real MultiQC report for this dataset is available as an embedded example:

Real MultiQC output from the Atrx dataset. Adapter content and quality scores visible.

Step 3 — HISAT2 Splice-Aware Alignment

HISAT2 requires a genome index before alignment. Build once, reuse for all samples.

Bash — HISAT2 Genome Indexing (SLURM)
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=1:00:00
#SBATCH --job-name=hisat2_index
#SBATCH --output=hisat2_index.%j.out
#SBATCH --error=hisat2_index.%j.err

module load hisat2

GENOME_FNA="0_index/GCF_000001735.4_TAIR10.1_genomic.fna"
GENOME_PREFIX="0_index/GCF_000001735.4_TAIR10.1_genomic"

hisat2-build -p 16 ${GENOME_FNA} ${GENOME_PREFIX}   # -p 16 matches the cores requested above

# Expected output: 8 index files (.1.ht2 through .8.ht2)
ls 0_index/*.ht2
What does each command do?
🗺️ hisat2-build: Builds a compressed index of the genome so HISAT2 can align reads to it lightning-fast. Think of it like building a book's index — you do it once, then finding any word (read) is instant. Without an index, HISAT2 would have to scan the entire 135 Mb genome for every single read.
📂 GENOME_FNA: The input: your genome assembly in FASTA format. One sequence per chromosome.
🏷️ GENOME_PREFIX: The output prefix: HISAT2 will create 8 files named prefix.1.ht2 through prefix.8.ht2. Keep these in the same folder — HISAT2 needs all 8 to align.
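Because HISAT2 needs all 8 files, it is worth confirming the index is complete before queueing a 24-hour alignment job. A toy check against fabricated files (substitute your real prefix for demo_index/genome):

```shell
# Fabricate an 8-file index so this check is runnable as-is
mkdir -p demo_index
for i in 1 2 3 4 5 6 7 8; do touch "demo_index/genome.${i}.ht2"; done

# The actual check: count .ht2 files for the prefix
n=$(ls demo_index/genome.*.ht2 | wc -l)
if [ "$n" -eq 8 ]; then
    echo "Index complete (${n}/8)"
else
    echo "Index incomplete (${n}/8)"
fi
```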

After indexing, align each sample. The script below loops over all pairs:

Bash — HISAT2 Alignment Loop (SLURM)
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --job-name=hisat2_align
#SBATCH --output=hisat2_align.%j.out
#SBATCH --error=hisat2_align.%j.err

module load hisat2
module load samtools

GENOME="0_index/GCF_000001735.4_TAIR10.1_genomic"
OUTDIR="3_bam"
mkdir -p ${OUTDIR}
p=8   # threads per alignment

for fq1 in 1_data/*_1.fastq.gz; do
    fq2="${fq1/_1.fastq.gz/_2.fastq.gz}"
    SAMPLE=$(basename ${fq1} _1.fastq.gz)

    echo "Aligning: ${SAMPLE}"

    hisat2 -p ${p} \
        -x ${GENOME} \
        -1 ${fq1} \
        -2 ${fq2} \
        -S ${OUTDIR}/${SAMPLE}.sam \
        2> ${OUTDIR}/${SAMPLE}.log

    # Convert SAM to sorted BAM and remove SAM
    samtools view --threads ${p} -bS ${OUTDIR}/${SAMPLE}.sam \
        | samtools sort --threads ${p} -o ${OUTDIR}/${SAMPLE}.bam

    samtools index ${OUTDIR}/${SAMPLE}.bam
    rm ${OUTDIR}/${SAMPLE}.sam
    echo "  Done: ${SAMPLE}.bam"
done

# Check alignment rates
for log in ${OUTDIR}/*.log; do
    echo "=== $(basename ${log} .log) ===" && grep "overall alignment" ${log}
done
HISAT2 flags decoded
🧵 -p 8: Parallel threads — use 8 CPU cores simultaneously. Doubles speed compared to -p 4. Match this to the cores you requested in your SLURM header (--ntasks-per-node).
🗺️ -x $GENOME: The genome index prefix — point to the prefix you used in hisat2-build (without the .1.ht2 extension).
📖 -1 $fq1 -2 $fq2: Read 1 (forward) and read 2 (reverse) of your paired-end data. HISAT2 uses the pairing information to improve alignment accuracy — it knows both reads come from the same DNA fragment.
💾 -S sample.sam: Output SAM file. SAM is human-readable text alignment format. We convert it to BAM immediately — SAM files are enormous (10–20 GB per sample!) so we never keep them.
🔄 samtools view -bS: Converts SAM (S) to BAM (b) — a compressed binary version. Same information, 5–10× smaller.
📑 samtools sort: Sorts alignments by genomic coordinate (chromosome + position). This is required by featureCounts and genome browsers like IGV. Unsorted BAMs are like a filing cabinet with folders in random order.
🔍 samtools index: Creates a .bai index file alongside your BAM. Like a book index — lets tools jump directly to any chromosomal region without reading the whole BAM. Required by IGV and many GATK tools.
🧹 Why do we immediately rm the SAM file? A SAM file for 25M reads is ~15 GB of plain text. Your BAM is the same data compressed to ~2 GB. Always pipe SAM directly to samtools sort and delete the SAM — never let it linger on disk!
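The loop's filename bookkeeping uses two bash idioms: ${var/pattern/replacement} substitution to derive the mate file, and basename with a suffix argument to extract the sample name. A standalone illustration (no aligner needed):

```shell
fq1="1_data/SRR4420293_1.fastq.gz"
fq2="${fq1/_1.fastq.gz/_2.fastq.gz}"      # swap _1 for _2: the mate file
SAMPLE=$(basename "${fq1}" _1.fastq.gz)   # strip directory and suffix

echo "${fq2}"      # prints: 1_data/SRR4420293_2.fastq.gz
echo "${SAMPLE}"   # prints: SRR4420293
```

This is why consistent `_1.fastq.gz` / `_2.fastq.gz` naming in 1_data/ matters — the whole loop hinges on it.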
Sample                 Mapped Pairs   Overall Rate
SRR4420293 (WT rep1)   ~23M           ~96.5%
SRR4420294 (WT rep2)   ~25M           ~96.8%
SRR4420295 (WT rep3)   ~22M           ~96.3%
SRR4420296 (mut rep1)  ~27M           ~97.1%
SRR4420297 (mut rep2)  ~24M           ~96.7%
SRR4420298 (mut rep3)  ~26M           ~96.9%

Expected alignment rates for this high-quality Arabidopsis dataset. Values ≥90% indicate good library quality.

Step 4 — Read Counting with featureCounts

Bash — featureCounts (SLURM)
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=4:00:00
#SBATCH --job-name=featureCounts
#SBATCH --output=featureCounts.%j.out
#SBATCH --error=featureCounts.%j.err

module load subread   # featureCounts is part of the subread package
module load parallel

ANNOT_GFF="0_index/GCF_000001735.4_TAIR10.1_genomic.gff"
INDIR="3_bam"
OUTDIR="4_counts"
mkdir -p ${OUTDIR}

# Run featureCounts on each BAM in parallel (4 jobs x 4 threads each)
# -s 2 = reversely stranded (RiboMinus protocol used in this dataset)
# -p    = paired-end
# -t gene, -g ID = count at gene level using the ID attribute
parallel -j 4 \
    "featureCounts -T 4 -s 2 -p -t gene -g ID \
     -a ${ANNOT_GFF} \
     -o ${OUTDIR}/{/.}.gene.txt {}" \
    ::: ${INDIR}/*.bam

# Merge all sample count files into one matrix
paste \
    <(awk 'BEGIN{OFS="\t"} {print $1,$7}' ${OUTDIR}/SRR4420293.gene.txt) \
    <(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420294.gene.txt) \
    <(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420295.gene.txt) \
    <(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420296.gene.txt) \
    <(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420297.gene.txt) \
    <(awk 'BEGIN{OFS="\t"} {print $7}' ${OUTDIR}/SRR4420298.gene.txt) \
    | grep -v '^\#' > ${OUTDIR}/At_count.txt

echo "Count matrix created: ${OUTDIR}/At_count.txt"
head -5 ${OUTDIR}/At_count.txt
featureCounts flags decoded
🧵 -T 4: Use 4 CPU threads per featureCounts job. We run 4 jobs in parallel (via GNU parallel -j 4), so total = 16 cores — matching the SLURM request.
↩️ -s 2: Strandedness: 0=unstranded, 1=forward, 2=reverse. This dataset used the RiboMinus protocol which generates reverse-stranded libraries. Wrong strandedness = ~50% of reads miscounted!
👫 -p: Tells featureCounts the data is paired-end — count a fragment only once (not twice, once per read). Essential for paired-end libraries.
🔖 -t gene: Count features of type "gene" in the GFF. Could also use "exon" to count individual exons instead of the whole gene locus.
🏷️ -g ID: The GFF attribute to use as the gene identifier in the output matrix. For NCBI GFFs, the ID field holds the gene name (e.g., gene0, AT1G01010).
📄 -a $ANNOT: The annotation file (GFF/GTF) that tells featureCounts where each gene lives on the genome.
🗂️ paste + awk: featureCounts outputs one file per sample. paste merges them side-by-side into a single matrix, and awk extracts only column 7 (the count column) from each file.
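The paste + awk merge can be rehearsed on fabricated two-sample mini files before touching real output. The layout below assumes featureCounts' standard seven columns (Geneid, Chr, Start, End, Strand, Length, counts); the demo files and directory are invented for illustration:

```shell
# Fabricate two tiny featureCounts-style outputs
mkdir -p demo_counts
printf '# featureCounts demo\nGeneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA\ngene0\t1\t1\t100\t+\t100\t11\n' > demo_counts/A.gene.txt
printf '# featureCounts demo\nGeneid\tChr\tStart\tEnd\tStrand\tLength\tsampleB\ngene0\t1\t1\t100\t+\t100\t28\n' > demo_counts/B.gene.txt

# Same merge as the real script: keep Geneid + count column, drop '#' comments
paste \
    <(awk 'BEGIN{OFS="\t"} {print $1,$7}' demo_counts/A.gene.txt) \
    <(awk 'BEGIN{OFS="\t"} {print $7}' demo_counts/B.gene.txt) \
    | grep -v '^#' > demo_counts/merged.txt
cat demo_counts/merged.txt
```

The result is a two-column-per-sample-free matrix: one Geneid column followed by one count column per sample, exactly the shape DESeq2 expects.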

Expected count matrix (first rows shown; sample columns shortened to S29x for readability):

Expected Output — At_count.txt
Geneid  S293    S294    S295    S296    S297    S298
gene0   11      1       10      28      11      13
gene1   37      3       26      88      22      17
gene2   0       0       0       0       0       0
gene3   6       2       12      40      13      13
gene4   35      6       22      170     53      32
gene7   49      15      67      258     83      45
Strandedness matters. This dataset used the plant RiboMinus kit, which produces a reverse-strand library. Using -s 2 (reversely stranded) is critical. Using the wrong strandedness setting can cut detected genes in half. Always check with RSeQC infer_experiment.py or the Identify_Strandedness_Information.md protocol in the workbook.
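featureCounts also writes a .summary file next to each output, listing read counts per assignment status, so comparing the Assigned line across -s 0/1/2 runs makes the choice mechanical. A sketch on fabricated summary files (the two-column Status/count layout mirrors the real summary format; the numbers are invented):

```shell
# Fabricated summaries for three strandedness settings
printf 'Assigned\t1000000\nUnassigned_NoFeatures\t9000000\n' > s0.summary
printf 'Assigned\t400000\nUnassigned_NoFeatures\t9600000\n'  > s1.summary
printf 'Assigned\t9200000\nUnassigned_NoFeatures\t800000\n'  > s2.summary

# Print assigned reads per setting; the largest wins
for s in 0 1 2; do
    n=$(awk -F'\t' '$1 == "Assigned" {print $2}' "s${s}.summary")
    echo "-s ${s}: ${n} assigned"
done
```

Here -s 2 clearly dominates, matching the reverse-stranded RiboMinus library used in this dataset.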

Step 5 — Differential Expression with DESeq2

R — DESeq2 Differential Expression
## RNA-seq Differential Expression Analysis with DESeq2
## Dataset: Arabidopsis atrx-1 mutant vs WT (PRJNA348194)
## Original analysis: Siva Chudalayandi

# Install if needed:
# BiocManager::install("DESeq2")
# install.packages(c("RColorBrewer", "gplots", "ggplot2", "ggrepel"))

library(DESeq2)
library(RColorBrewer)
library(gplots)
library(ggplot2)
library(ggrepel)

setwd("4_counts/")

# 1. Load count matrix
dat <- read.table("At_count.txt", header=TRUE, quote="", row.names=1)
dat <- as.matrix(dat)
head(dat)

# 2. Define experimental design
condition <- factor(c(rep("WT",3), rep("Mut",3)))
condition <- relevel(condition, ref="WT")   # WT is the reference
coldata   <- data.frame(row.names=colnames(dat), condition)

# 3. Build DESeq2 object and run pipeline
dds <- DESeqDataSetFromMatrix(countData=dat, colData=coldata, design=~condition)
dds <- DESeq(dds)

# 4. Dispersion plot (QC)
png("qc-dispersions.png", 1000, 1000, pointsize=20)
plotDispEsts(dds, main="Dispersion plot")
dev.off()

# 5. rlog transformation for PCA/clustering
rld <- rlogTransformation(dds)

# 6. PCA plot
pca_data <- plotPCA(rld, intgroup="condition", returnData=TRUE)
pct_var  <- round(100 * attr(pca_data, "percentVar"))
ggplot(pca_data, aes(PC1, PC2, color=condition)) +
    geom_point(size=4) +
    xlab(paste0("PC1: ", pct_var[1], "% variance")) +
    ylab(paste0("PC2: ", pct_var[2], "% variance")) +
    theme_bw() +
    ggtitle("PCA - WT vs atrx-1 mutant")
ggsave("PCA.png", width=6, height=5)

# 7. Sample distance heatmap
mycols      <- brewer.pal(8, "Dark2")[1:length(unique(condition))]
sampleDists <- as.matrix(dist(t(assay(rld))))
png("qc-heatmap-samples.png", 1000, 1000, pointsize=20)
heatmap.2(sampleDists, key=FALSE, trace="none",
          col=colorpanel(100, "black", "white"),
          ColSideColors=mycols[condition],
          RowSideColors=mycols[condition],
          margin=c(10,10), main="Sample Distance Matrix")
dev.off()

# 8. Get DE results
res     <- results(dds)
res     <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res),
                  as.data.frame(counts(dds, normalized=TRUE)),
                  by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"

write.csv(resdata, file="diffexpr-results.csv", quote=FALSE, row.names=FALSE)

# Summary
cat("Total genes tested:", nrow(res), "\n")
cat("DE genes (padj < 0.05):", sum(res$padj < 0.05, na.rm=TRUE), "\n")

# Expected output of table(res$padj < 0.05):
#   FALSE  TRUE
#   6712   204
DESeq2 pipeline decoded
🏗️ DESeqDataSetFromMatrix(): Packages your count matrix + sample metadata into a DESeq2 object. The design=~condition formula tells DESeq2 which column in your metadata to use as the grouping variable for testing.
🚀 DESeq(dds): The workhorse — runs three steps in one: (1) estimates size factors (normalizes library sizes), (2) estimates gene-wise dispersion, (3) fits a negative binomial model and runs the Wald test for each gene. Your ~6 min coffee break!
📐 relevel(condition, ref="WT"): Sets WT as the reference level so fold changes are calculated as Mut÷WT. Without this, DESeq2 uses alphabetical order ("Mut" < "WT" alphabetically, so it would flip the comparison).
🌡️ rlogTransformation(): Regularized log transformation — shrinks counts of lowly-expressed genes towards the mean to avoid them dominating PCA and clustering. Better than simple log2 for visualization (not for DE testing).
📊 plotDispEsts(): QC plot showing per-gene dispersion estimates before and after shrinkage. The red trend line should curve smoothly downward. Wild outliers = potential quality issues in that gene.
📋 results(dds): Extracts a table with: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj (BH-corrected). Sort by padj to find your top hits.
🎯 padj vs pvalue: Always use padj (adjusted p-value), never raw pvalue! When testing 27,000 genes simultaneously, ~1,350 genes will appear significant by chance at p<0.05 even if nothing is truly different. Benjamini-Hochberg correction controls the false discovery rate so your hit list is trustworthy.

Step 6 — Volcano Plot & MA Plot

R — Volcano Plot and MA Plot
## Volcano Plot
resdata <- read.csv("diffexpr-results.csv")

png("diffexpr-volcanoplot.png", 1200, 1000, pointsize=20)
with(resdata, plot(log2FoldChange, -log10(pvalue), pch=20,
                   main="Volcano Plot: WT vs atrx-1", cex=0.5, col="grey"))
with(subset(resdata, padj < 0.05),
     points(log2FoldChange, -log10(pvalue), pch=20, col="red", cex=0.8))
with(subset(resdata, abs(log2FoldChange) > 2),
     points(log2FoldChange, -log10(pvalue), pch=20, col="orange", cex=0.8))
with(subset(resdata, padj < 0.05 & abs(log2FoldChange) > 2),
     points(log2FoldChange, -log10(pvalue), pch=20, col="green", cex=1.2))
legend("topright", legend=c("FDR<0.05", "|Log2FC|>2", "Both"),
       pch=20, col=c("red","orange","green"), bty="n")
dev.off()

## MA Plot (log2 fold change vs mean expression)
png("diffexpr-maplot.png", 1500, 1000, pointsize=20)
with(resdata, plot(log(baseMean), log2FoldChange, pch=20,
                   cex=0.5, col="grey50", main="MA Plot"))
with(subset(resdata, padj < 0.05),
     points(log(baseMean), log2FoldChange, col="red", pch=20, cex=1))
abline(h=c(-1,1), col="blue", lty=2)
dev.off()
RNA-seq workflow overview (figure)

Expected Results

Output File               Content                                 Key Value
multiqc_report.html       Aggregated QC for all 12 FASTQ files    All samples pass; Q30 > 90%
*.bam                     Sorted, indexed alignments              ~96% overall alignment rate
At_count.txt              Gene count matrix (6917 x 6)            ~6900 expressed genes
diffexpr-results.csv      DESeq2 DE results                       204 DE genes (padj < 0.05)
PCA.png                   PCA of rlog-transformed counts          WT and mutant clusters separate on PC1
diffexpr-volcanoplot.png  Volcano plot                            Downregulated genes in mutant visible

The top DE gene (gene32459) has log2FC = -2.06, padj = 5.9e-9. Loss of ATRX function primarily downregulates genes involved in chromatin organization.

Troubleshooting

featureCounts: very few reads assigned (<50%)

Most likely a strandedness mismatch. Try all three options: -s 0 (unstranded), -s 1 (stranded), -s 2 (reverse-stranded); the one producing the most assigned reads is correct. Also check that the -t feature type and -g attribute actually exist in your annotation file.

DESeq2: "all samples belong to the same group"

The column names in the count matrix must match the row names in coldata exactly. Check with all(rownames(coldata) == colnames(dat)) — must return TRUE.
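A quick shell-level version of the same check can run before even opening R; a toy sketch on fabricated files (counts.txt and samples.txt are hypothetical names):

```shell
# Fabricate a count-matrix header and a sample sheet
printf 'Geneid\tS293\tS294\tS295\n' > counts.txt   # header row of a count matrix
printf 'S293\nS294\nS295\n' > samples.txt          # expected samples, one per line

# Compare the matrix's sample columns against the sample sheet
head -1 counts.txt | cut -f2- | tr '\t' '\n' | diff - samples.txt \
    && echo "Sample names match"
```

diff stays silent and exits 0 when the names (and their order) agree, so the echo only fires on a clean match.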

HISAT2: <70% alignment rate

Check whether reads are still adapter-contaminated (run FastQC first), whether the correct genome species was used, or whether the reads are from a different library type (e.g., total RNA with rRNA). Add --summary-file <file> to the hisat2 command for detailed per-sample alignment stats.