QC & trimming · Trinity assembly · BUSCO completeness · Salmon quantification · tximport · DESeq2 DE · TransDecoder ORF prediction · functional annotation
Most organisms do not have a reference genome. De novo transcriptome assembly lets you reconstruct expressed transcripts directly from RNA-seq reads, with no reference needed. Trinity is the most widely used assembler for this task. After assembly, you use Salmon for fast, lightweight-mapping quantification, then DESeq2 for differential expression — the same statistical framework as reference-based RNA-seq.
We use the Acropora millepora coral bleaching RNA-seq dataset — a classic de novo transcriptomics study of gene expression during thermal stress in a non-model coral species with no chromosome-level genome at the time of publication.
| Property | Value |
|---|---|
| Organism | Acropora millepora (reef-building coral) |
| Experiment | Control vs thermal bleaching stress (3 replicates each) |
| Accession | PRJNA189030 |
| Read type | Paired-end Illumina 100 bp |
| Expected assembly | ~85,000 Trinity transcripts (grouped into Trinity “genes”) |
| Publication | Matz et al. 2013 PLOS ONE |
######################################################################
## STEP 1: Quality control and adapter trimming
## De novo assembly is sensitive to low-quality reads and adapter
## contamination — dirty reads inflate transcript count and reduce N50.
## Always trim BEFORE running Trinity.
######################################################################
mkdir -p 0_raw 1_trimmed 2_assembly 3_busco 4_quant 5_deseq2 6_annotation
module load fastqc/0.11.9
module load trimmomatic/0.39
module load sra-tools/3.0
## --- Download coral bleaching dataset (6 samples: 3 control, 3 bleach) ---
## Sample accessions from PRJNA189030
SAMPLES=(SRR534005 SRR534006 SRR534007 SRR534008 SRR534009 SRR534010)
for SRR in "${SAMPLES[@]}"; do
## fasterq-dump: download and split paired-end reads from SRA
## --split-files : create _1.fastq and _2.fastq for paired reads
## --outdir : where to save output files
## --threads 8 : use 8 threads for faster download
fasterq-dump \
--split-files \
--outdir 0_raw/ \
--threads 8 \
${SRR}
## Compress to save disk space (FASTQ files are large; gzip reduces by ~4x)
gzip 0_raw/${SRR}_1.fastq 0_raw/${SRR}_2.fastq
echo "Downloaded: ${SRR}"
done
## --- Run FastQC on all raw files ---
## FastQC: visual QC report — shows adapter content, quality scores per cycle,
## GC content, overrepresented sequences, duplication levels
## -t 12 : use 12 threads (one per file)
fastqc -t 12 0_raw/*.fastq.gz -o 1_trimmed/
## --- Adapter trimming with Trimmomatic ---
## ILLUMINACLIP : remove Illumina adapter sequences
## TruSeq3-PE.fa = adapter file (check Trimmomatic's adapters/ folder)
## 2 = seed mismatches allowed
## 30 = palindrome clip threshold (for PE reads)
## 10 = simple clip threshold
## LEADING:3 : remove leading bases below quality 3
## TRAILING:3 : remove trailing bases below quality 3
## SLIDINGWINDOW:4:15 : slide a 4-base window; cut when average quality < 15
## MINLEN:36 : discard reads shorter than 36 bases after trimming
for SRR in "${SAMPLES[@]}"; do
trimmomatic PE \
-threads 8 \
0_raw/${SRR}_1.fastq.gz \
0_raw/${SRR}_2.fastq.gz \
1_trimmed/${SRR}_R1_paired.fq.gz \
1_trimmed/${SRR}_R1_unpaired.fq.gz \
1_trimmed/${SRR}_R2_paired.fq.gz \
1_trimmed/${SRR}_R2_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36 \
2> 1_trimmed/${SRR}_trim.log
echo "Trimmed: ${SRR}"
done
## --- Run FastQC again on trimmed reads to verify improvement ---
mkdir -p 1_trimmed/qc_post
fastqc -t 12 1_trimmed/*_paired.fq.gz -o 1_trimmed/qc_post/
echo "QC complete. Check FastQC HTML reports in 1_trimmed/."
| ✂SLIDINGWINDOW:4:15 | Scans along the read in windows of 4 bases. When the average quality score inside that window drops below 15 (Phred scale: ~97% accuracy), the read is cut from that point onward. This is gentler than a hard quality cutoff at the 3′ end and handles gradual quality decline better. |
| 👥PE output files | Trimmomatic PE outputs 4 files per sample: two “paired” files (both mates survived trimming) and two “unpaired” files (only one mate of the pair survived). Trinity takes either paired input (--left/--right) or single-end input (--single) in one run, not both; a common workaround is to append the unpaired reads to the R1 (left) file. Most users simply discard the unpaired reads, at the cost of a small amount of assembly completeness. |
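The SLIDINGWINDOW logic described above can be sketched in a few lines of Python (an illustrative sketch only — Trimmomatic's actual implementation differs in details such as how it trims back after the cut):

```python
def sliding_window_trim(qualities, window=4, threshold=15):
    """Trimmomatic-style sliding-window cut: scan fixed-size windows
    left to right; cut the read where the mean Phred quality inside
    a window first drops below the threshold."""
    for start in range(0, len(qualities) - window + 1):
        win = qualities[start:start + window]
        if sum(win) / window < threshold:
            return start          # keep bases [0, start)
    return len(qualities)         # no low-quality window: keep whole read

# A read whose 3' end decays in quality (typical Illumina pattern):
quals = [35, 36, 34, 33, 30, 28, 20, 12, 10, 8, 6, 2]
print(sliding_window_trim(quals))  # → 6 (first window with mean < 15 starts here)
```

SLIDINGWINDOW:4:15 is gentler than a hard cutoff precisely because a single bad base cannot trigger the cut — the 4-base mean has to drop.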
######################################################################
## STEP 2: Run Trinity to assemble a de novo transcriptome
## Trinity works in 3 phases:
## Phase 1 (Inchworm): builds linear contigs from k-mers
## Phase 2 (Chrysalis): clusters contigs into components (= genes)
## Phase 3 (Butterfly): resolves alternative splicing within components
## Result: FASTA file of all assembled transcript sequences
######################################################################
module load trinity/2.15.1
## Create a comma-separated list of R1 and R2 files for Trinity
## Trinity accepts all samples simultaneously to improve splice graph assembly
R1_FILES=$(ls 1_trimmed/*_R1_paired.fq.gz | tr '\n' ',' | sed 's/,$//')
R2_FILES=$(ls 1_trimmed/*_R2_paired.fq.gz | tr '\n' ',' | sed 's/,$//')
echo "R1 files: ${R1_FILES}"
echo "R2 files: ${R2_FILES}"
## --- Run Trinity ---
## --seqType fq : input is FASTQ format (not FASTA)
## --left : comma-separated list of R1 (left) read files
## --right : comma-separated list of R2 (right) read files
## --SS_lib_type RF : strand-specific library type
## RF = typical dUTP/Illumina stranded protocol (read 1 is reverse complement)
## Use FR for older protocols; omit if non-stranded (reduces false transcripts)
## --max_memory 50G : maximum RAM Trinity can use
## Trinity is memory-intensive: needs ~1 GB per million read pairs
## --CPU 16 : number of CPUs to use
## --output : output directory name
## --min_contig_length 200 : discard transcripts shorter than 200 bp
## (very short transcripts are usually artifacts or degradation fragments)
## --normalize_max_read_cov 200 : normalize read coverage to max 200x
## (prevents assembly from being dominated by very highly expressed genes)
Trinity \
--seqType fq \
--left ${R1_FILES} \
--right ${R2_FILES} \
--SS_lib_type RF \
--max_memory 50G \
--CPU 16 \
--output 2_assembly/trinity_out \
--min_contig_length 200 \
--normalize_max_read_cov 200
## The assembled transcriptome is in:
## 2_assembly/trinity_out/Trinity.fasta
## --- Basic assembly statistics ---
## TrinityStats.pl counts transcripts, genes, N50, median length
## Trinity gene = component = a set of overlapping transcripts (isoforms)
## Trinity transcript = individual isoform within a gene
${TRINITY_HOME}/util/TrinityStats.pl 2_assembly/trinity_out/Trinity.fasta
echo "Assembly complete. FASTA: 2_assembly/trinity_out/Trinity.fasta"
| 📈--SS_lib_type RF | Tells Trinity the strand orientation of the library. RF means read 2 is sense (forward) and read 1 is antisense (reverse) — this is the output of most dUTP-based stranded Illumina protocols (TruSeq Stranded). Using the wrong strand type will cause Trinity to merge sense and antisense transcripts, artifactually inflating transcript count. If unsure, run Salmon with -l A on a subset of reads and check the detected library type in lib_format_counts.json. |
| 📌--normalize_max_read_cov 200 | Trinity down-samples reads from very highly expressed genes (e.g. actin, tubulin) to a maximum of 200x coverage before assembly. Without normalisation, high-abundance transcripts dominate memory and runtime, and low-abundance transcripts are missed. This is on by default in Trinity ≥v2.6 but worth understanding. |
| 🕰Runtime expectations | Trinity on 6 samples × 20M read pairs typically takes 12–48 hours on 16 CPUs and 50 GB RAM. Submit as an HPC job requesting at least 60 GB RAM. Trinity creates thousands of temporary files — run on a local scratch disk if possible, not on a network filesystem. |
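TrinityStats.pl reports N50 among its summary metrics; as a reminder of what that number actually means, here is a minimal N50 calculation over a list of contig lengths (a sketch of the definition, not the script's own code):

```python
def n50(lengths):
    """Smallest contig length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for L in sorted(lengths, reverse=True):
        running += L
        if running * 2 >= total:
            return L

# Toy assembly: 12,000 bp total; the 5 kb + 3 kb contigs cover half
print(n50([5000, 3000, 2000, 1000, 500, 500]))  # → 3000
```

For transcriptomes, treat N50 with caution: unlike genomes, a higher transcript N50 is not automatically better (long chimeric transcripts inflate it), which is why BUSCO in the next step is the more meaningful completeness check.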
Use --normalize_max_read_cov 50 to be more aggressive about normalisation.
######################################################################
## STEP 3: Assess transcriptome completeness using BUSCO
## BUSCO (Benchmarking Universal Single-Copy Orthologs) tests whether
## your assembly contains the expected set of highly conserved genes.
## A complete, non-fragmented transcriptome should recover >70-80% of
## BUSCOs as complete. Low BUSCO scores = incomplete assembly or
## wrong lineage database.
######################################################################
module load busco/5.7.0
TRANSCRIPTOME="2_assembly/trinity_out/Trinity.fasta"
## --- Run BUSCO in transcriptome mode ---
## -i : input FASTA file (the Trinity output)
## --out_path : parent directory for results
## -o : run name (results land in --out_path/<name>; -o itself should
##      be a plain name, not a path)
## -m transcriptome : mode = transcriptome (vs genome or protein)
## -l metazoa_odb10 : BUSCO lineage database to use
## Choose based on your organism:
## metazoa_odb10 = animals (coral = Cnidaria = Metazoa)
## viridiplantae_odb10 = plants
## fungi_odb10 = fungi
## bacteria_odb10 = bacteria
## List available databases: busco --list-datasets
## --cpu 16 : number of parallel threads for the underlying search tools
## -f : force overwrite existing output directory
busco \
-i ${TRANSCRIPTOME} \
--out_path 3_busco \
-o coral_busco \
-m transcriptome \
-l metazoa_odb10 \
--cpu 16 \
-f
## --- View the summary ---
cat 3_busco/coral_busco/short_summary.specific.metazoa_odb10.coral_busco.txt
## --- Expected output format ---
## C:78.5%[S:42.1%,D:36.4%],F:6.3%,M:15.2%,n:954
## C = Complete BUSCOs (S = Single copy, D = Duplicated)
## F = Fragmented BUSCOs
## M = Missing BUSCOs
## n = total BUSCOs in the database
## Duplicated (D) > 20% is normal for transcriptomes because
## alternative splicing isoforms of the same gene each match the BUSCO.
| 🌟High Duplicated % is OK in transcriptomes | For genome BUSCO, high duplication is a red flag (suggests assembly chimeras). For transcriptome BUSCO, high duplication is expected and normal — each isoform of a gene independently matches the same BUSCO marker, so genes with many isoforms appear “duplicated.” Focus on C+F > 80% as the completeness indicator. |
| 📋Missing % interpretation | Missing BUSCOs can mean: (1) the gene is not expressed in your samples (not a problem), (2) the assembly failed to reconstruct that gene (quality issue), or (3) the lineage database is too distant from your organism. For coral, use metazoa_odb10 rather than a more specific database — more BUSCO markers = better assessment. |
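When you run BUSCO on several assemblies (e.g. to compare trimming or normalisation settings), it helps to parse the one-line summary programmatically. A small sketch using the format shown above:

```python
import re

def parse_busco(line):
    """Extract percentages and marker count from a BUSCO short-summary
    line like: C:78.5%[S:42.1%,D:36.4%],F:6.3%,M:15.2%,n:954"""
    scores = {k: float(v) for k, v in re.findall(r"([CSDFM]):([\d.]+)%", line)}
    n = int(re.search(r"n:(\d+)", line).group(1))
    return scores, n

scores, n = parse_busco("C:78.5%[S:42.1%,D:36.4%],F:6.3%,M:15.2%,n:954")
# Completeness indicator for transcriptomes: Complete + Fragmented
print(scores["C"] + scores["F"], "% of", n, "markers recovered")
```

This makes it easy to tabulate C+F across assemblies and pick the best one before committing to downstream quantification.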
######################################################################
## STEP 4: Quantify expression using Salmon
## Salmon uses lightweight mapping (selective alignment, not full base-level
## alignment) to estimate how many reads came from each transcript — much
## faster than STAR/HISAT2 + featureCounts, and well suited to
## isoform-level quantification.
## Output: estimated counts and TPM per transcript, per sample.
######################################################################
module load salmon/1.10.0
TRANSCRIPTOME="2_assembly/trinity_out/Trinity.fasta"
## --- Step 4a: Build Salmon index from the Trinity transcriptome ---
## salmon index: builds Salmon's mapping index from a transcriptome FASTA
## -t : transcriptome FASTA file (Trinity.fasta)
## -i : where to save the index files
## --gencode : NOT used here (that flag is for GENCODE reference transcriptomes)
## -k 31 : k-mer length for indexing
## Default 31 works for most data; use 25 for reads < 75 bp
salmon index \
-t ${TRANSCRIPTOME} \
-i 2_assembly/salmon_index \
-k 31
echo "Salmon index built."
## --- Step 4b: Quantify each sample ---
## salmon quant: estimate reads per transcript using selective alignment
## -i : path to the Salmon index built above
## -l A : library type = Automatic detection (Salmon detects strand orientation)
## You can also specify: IU=inward unstranded, ISR=inward stranded reverse, etc.
## 'A' lets Salmon auto-detect from the first 100k reads — recommended
## -1 : R1 (left) reads
## -2 : R2 (right) reads
## -p 8 : 8 threads
## --validateMappings : selective alignment — score candidate mappings
##    before accepting them. This is the default in Salmon ≥1.0; the flag
##    is kept here for clarity and for older Salmon versions
## --gcBias : correct for GC content bias in quantification
## Recommended when GC content varies between samples
## -o : output directory for this sample's results
SAMPLES=(SRR534005 SRR534006 SRR534007 SRR534008 SRR534009 SRR534010)
for SRR in "${SAMPLES[@]}"; do
salmon quant \
-i 2_assembly/salmon_index \
-l A \
-1 1_trimmed/${SRR}_R1_paired.fq.gz \
-2 1_trimmed/${SRR}_R2_paired.fq.gz \
-p 8 \
--validateMappings \
--gcBias \
-o 4_quant/${SRR}
echo "Quantified: ${SRR}"
done
## Each output directory contains:
## quant.sf ← the main output: Name, Length, EffectiveLength, TPM, NumReads
## lib_format_counts.json ← shows which library type was auto-detected
## aux_info/ ← GC bias correction curves, mapping rate stats
echo "Salmon quantification complete."
echo "Check mapping rates in each aux_info/alevin_meta_info.json file."
| 📊TPM vs raw counts | Salmon outputs both TPM (Transcripts Per Million — normalised for length and sequencing depth) and NumReads (estimated raw counts). Use TPM for visualisation and sample-level comparison. Use NumReads as input to DESeq2 — DESeq2 does its own internal normalisation and requires raw counts, not pre-normalised values. |
| 🎯--validateMappings | Plain quasi-mapping (older Salmon releases) was extremely fast but could mis-assign reads carrying many mismatches. Selective alignment — enabled by --validateMappings, and the default since Salmon 1.0 — scores each candidate mapping before accepting it, which improves accuracy around paralogues and repetitive sequence at a modest runtime cost. |
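The TPM column in quant.sf is derived from the other two columns; the relationship can be reproduced with a toy sketch (three made-up transcripts, Salmon-style NumReads and EffectiveLength values):

```python
def tpm(num_reads, eff_lengths):
    """Recompute TPM from NumReads and EffectiveLength:
    reads-per-effective-base, rescaled so the column sums to one million."""
    rpb = [r / l for r, l in zip(num_reads, eff_lengths)]
    denom = sum(rpb)
    return [1e6 * x / denom for x in rpb]

# Same read count on transcripts of different length -> different TPM:
vals = tpm(num_reads=[100, 100, 800], eff_lengths=[1000, 2000, 4000])
print([round(v, 1) for v in vals])
```

This is why TPM is comparable across transcripts and samples for visualisation, while the un-normalised NumReads column is what DESeq2 needs.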
######################################################################
## STEP 5: Import Salmon counts and run DESeq2 differential expression
## tximport: aggregates Salmon transcript-level estimates to gene level
## using the Trinity gene-to-transcript mapping table
## DESeq2: negative binomial model for count-based differential expression
######################################################################
library(tximport) # import Salmon/kallisto/STAR quantifications
library(DESeq2) # differential expression analysis
library(tidyverse) # data manipulation and ggplot2 plotting
library(ggrepel) # non-overlapping labels on volcano plots
## --- Step 5a: Build the gene-to-transcript mapping table ---
## Trinity names transcripts like: TRINITY_DN1000_c0_g1_i1
## DN1000 = component number
## c0 = sub-component
## g1 = gene within component
## i1 = isoform number (transcript)
## We need: transcript_ID → gene_ID mapping for tximport aggregation
## Trinity writes this table (Trinity.fasta.gene_trans_map) alongside the
## assembly; if it is missing, regenerate it with
## $TRINITY_HOME/util/support_scripts/get_Trinity_gene_to_trans_map.pl
tx2gene <- read.table(
"2_assembly/trinity_out/Trinity.fasta.gene_trans_map",
header = FALSE,
col.names = c("gene_id", "transcript_id")
)
## tximport expects: data.frame with columns TXNAME, GENEID
tx2gene <- tx2gene[, c("transcript_id", "gene_id")]
cat("Transcript-to-gene mapping: ", nrow(tx2gene), "transcripts\n")
## --- Step 5b: Import Salmon quantifications with tximport ---
## List all quant.sf files
samples <- c("SRR534005","SRR534006","SRR534007",
"SRR534008","SRR534009","SRR534010")
files <- file.path("4_quant", samples, "quant.sf")
names(files) <- samples # name each file by sample ID
## tximport aggregates transcript-level counts to gene level
## type = "salmon" : tell tximport the input format is Salmon quant.sf
## tx2gene : the mapping table from above
## ignoreTxVersion = TRUE : strip '.'-suffixed version numbers from IDs
##   (a no-op for Trinity IDs, which carry no version suffix — harmless here)
txi <- tximport(
files,
type = "salmon",
tx2gene = tx2gene,
ignoreTxVersion = TRUE
)
cat("Genes after aggregation:", nrow(txi$counts), "\n")
## --- Step 5c: Sample metadata table ---
## This tells DESeq2 which samples belong to which condition
col_data <- data.frame(
sample = samples,
condition = factor(c("control","control","control",
"bleach","bleach","bleach")),
row.names = samples
)
## --- Step 5d: Build the DESeq2 object ---
## DESeqDataSetFromTximport: builds a DESeqDataSet from tximport output
## design = ~ condition : the statistical model
## DESeq2 will test for differential expression between conditions
dds <- DESeqDataSetFromTximport(
txi = txi,
colData = col_data,
design = ~ condition
)
## --- Step 5e: Pre-filter low-count genes ---
## Remove genes with fewer than 10 total counts across all samples
## Low-count genes add noise and inflate multiple testing correction
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
cat("Genes after low-count filter:", nrow(dds), "\n")
## Set the reference level: "control" is baseline
## (so fold-change = bleach / control)
dds$condition <- relevel(dds$condition, ref = "control")
## --- Step 5f: Run DESeq2 ---
## DESeq() fits the negative binomial GLM, estimates size factors,
## dispersions, and performs Wald test for each gene
dds <- DESeq(dds)
## --- Step 5g: Extract results ---
## results(): extract the comparison bleach vs control
## alpha = 0.05 : significance threshold for independent filtering
## lfcThreshold = 0 : test against null hypothesis of log2FC = 0
## contrast: which comparison to extract
res <- results(
dds,
contrast = c("condition", "bleach", "control"),
alpha = 0.05,
lfcThreshold = 0
)
## --- Shrink log2FC estimates (recommended for visualisation) ---
## lfcShrink: applies Bayesian shrinkage to log2FC estimates
## This reduces noise for low-count genes without affecting significance
## type = "apeglm" is the most accurate shrinkage method
res_shrunk <- lfcShrink(
dds,
coef = "condition_bleach_vs_control",
type = "apeglm"
)
## --- Summary of results ---
summary(res)
## "out of X with nonzero total read count"
## "LFC > 0 (up)" = upregulated in bleach
## "LFC < 0 (down)" = downregulated in bleach
## --- Save results ---
res_df <- as.data.frame(res_shrunk) %>%
rownames_to_column("gene_id") %>%
arrange(padj)
write.csv(res_df, "5_deseq2/DESeq2_bleach_vs_control.csv", row.names=FALSE)
## --- Volcano plot ---
## Highlights significant genes (padj < 0.05 AND |log2FC| > 1)
res_plot <- res_df %>%
filter(!is.na(padj)) %>%
mutate(
significant = padj < 0.05 & abs(log2FoldChange) > 1,
direction = case_when(
padj < 0.05 & log2FoldChange > 1 ~ "Up in bleach",
padj < 0.05 & log2FoldChange < -1 ~ "Down in bleach",
TRUE ~ "NS"
)
)
ggplot(res_plot, aes(x=log2FoldChange, y=-log10(padj), color=direction)) +
geom_point(size=0.8, alpha=0.6) +
geom_hline(yintercept=-log10(0.05), linetype="dashed", color="grey50") +
geom_vline(xintercept=c(-1, 1), linetype="dashed", color="grey50") +
scale_color_manual(values=c("Up in bleach"="firebrick","Down in bleach"="steelblue","NS"="grey70")) +
geom_text_repel(
data = filter(res_plot, significant) %>% head(20),
aes(label=gene_id), size=2.5, max.overlaps=20
) +
labs(title="Coral bleaching: DE genes (bleach vs control)",
x="log2 Fold Change", y="-log10 adjusted p-value",
color="Direction") +
theme_bw(base_size=12)
ggsave("5_deseq2/volcano_bleach_vs_control.png", width=10, height=7, dpi=150)
| 📊tximport gene aggregation | Salmon quantifies at the transcript (isoform) level. By default tximport sums estimated counts across all isoforms of each Trinity gene (component) and also returns the average transcript length per gene per sample; DESeqDataSetFromTximport uses those lengths as model offsets, so shifts in isoform usage between samples do not bias the gene-level test. (Setting countsFromAbundance = "scaledTPM" is an alternative that bakes the length correction directly into the counts.) |
| 🎯lfcShrink(type="apeglm") | Genes with very low counts have wildly unstable fold-change estimates (e.g. 1 read → 3 reads = log2FC = 1.58, but meaningless). apeglm shrinkage borrows information across all genes to pull noisy fold-changes toward zero, making MA plots and volcano plots much cleaner. Use shrunken log2FC for visualisation and ranking; the p-values and padj are computed before shrinkage and are unaffected by it. |
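What tximport's default count aggregation does can be illustrated with Trinity-style IDs (a toy sketch — real tximport additionally tracks effective transcript lengths, which this omits):

```python
from collections import defaultdict

def gene_id(transcript_id):
    """TRINITY_DN1000_c0_g1_i2 -> TRINITY_DN1000_c0_g1 (drop isoform suffix)."""
    return transcript_id.rsplit("_", 1)[0]

def aggregate(tx_counts):
    """Sum transcript-level estimated counts to gene level,
    as tximport does with a tx2gene mapping."""
    genes = defaultdict(float)
    for tx, count in tx_counts.items():
        genes[gene_id(tx)] += count
    return dict(genes)

counts = {
    "TRINITY_DN1000_c0_g1_i1": 120.0,
    "TRINITY_DN1000_c0_g1_i2": 30.5,   # second isoform of the same gene
    "TRINITY_DN2000_c0_g1_i1": 10.0,
}
print(aggregate(counts))  # DN1000_c0_g1 pools both isoforms (150.5)
```

In the real pipeline the gene IDs come from Trinity.fasta.gene_trans_map rather than string surgery, but the pooling logic is the same.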
######################################################################
## STEP 6: Predict protein-coding ORFs and annotate transcripts
## TransDecoder: predicts coding regions (ORFs) in Trinity transcripts
## BLAST: annotates ORFs against UniProt/SwissProt
## HMMER: annotates with Pfam protein family domains
## Combined: prioritise ORFs with homology support
######################################################################
module load transdecoder/5.7.0
module load blast+/2.14.0
module load hmmer/3.3.2
TRANSCRIPTOME="2_assembly/trinity_out/Trinity.fasta"
## --- Step 6a: Extract long ORFs ---
## TransDecoder.LongOrfs: scans each transcript for all ORFs
## (sets of codons in frame between a start and stop codon)
## -t : input transcriptome FASTA
## -S : strand-specific (only predict ORFs on the sense strand)
## Use -S if you know your library is stranded (RF or FR)
TransDecoder.LongOrfs \
-t ${TRANSCRIPTOME} \
-S
## Creates: Trinity.fasta.transdecoder_dir/longest_orfs.pep
## This file contains all candidate ORF protein sequences
## --- Step 6b: BLAST ORFs against SwissProt ---
## blastp: protein-protein BLAST (ORF vs protein database)
## -query : predicted ORF protein sequences
## -db : SwissProt database (build once: makeblastdb -in uniprot_sprot.fasta -dbtype prot)
## -outfmt 6 : tabular output (easier to parse than default)
## -max_target_seqs 1 : report at most one subject per query
##   (caution: this is the first hit meeting the E-value cutoff, not
##    guaranteed to be the single best-scoring hit)
## -evalue 1e-5 : maximum E-value for reporting (stringent)
## -num_threads 16 : parallel threads
blastp \
-query Trinity.fasta.transdecoder_dir/longest_orfs.pep \
-db /path/to/uniprot_sprot.fasta \
-out 6_annotation/blastp_swissprot.outfmt6 \
-outfmt 6 \
-max_target_seqs 1 \
-evalue 1e-5 \
-num_threads 16
## --- Step 6c: HMMER Pfam domain search ---
## hmmscan: search ORF sequences against the Pfam database of protein families
## --cpu 16 : use 16 threads
## --domtblout : domain table output (one hit per domain per sequence)
## Pfam-A.hmm : download from https://www.ebi.ac.uk/interpro/download/Pfam/
##   then run `hmmpress Pfam-A.hmm` once to build the binary index
hmmscan \
--cpu 16 \
--domtblout 6_annotation/pfam_domains.domtblout \
/path/to/Pfam-A.hmm \
Trinity.fasta.transdecoder_dir/longest_orfs.pep
## --- Step 6d: TransDecoder.Predict with homology support ---
## Uses BLAST and HMMER evidence to preferentially retain ORFs with
## functional annotation, even if they don't meet the length threshold alone
TransDecoder.Predict \
-t ${TRANSCRIPTOME} \
--retain_blastp_hits 6_annotation/blastp_swissprot.outfmt6 \
--retain_pfam_hits 6_annotation/pfam_domains.domtblout \
--single_best_only
## Output: Trinity.fasta.transdecoder.pep ← predicted protein sequences
## Output: Trinity.fasta.transdecoder.gff3 ← coordinates of coding regions
echo "TransDecoder complete. Protein file: Trinity.fasta.transdecoder.pep"
echo "Annotated ORFs: $(grep -c '>' Trinity.fasta.transdecoder.pep)"
######################################################################
## STEP 7: Heatmap of top DE genes and GO enrichment analysis
## clusterProfiler: over-representation analysis using GO annotations
## ComplexHeatmap: publication-quality clustered heatmaps
######################################################################
library(DESeq2); library(clusterProfiler); library(tidyverse)
library(pheatmap); library(RColorBrewer)
## --- Heatmap of top 50 most variable DE genes ---
## Extract variance-stabilised expression values (better for visualisation than raw counts)
vsd <- vst(dds, blind=FALSE) # variance stabilising transformation
## Get top 50 DE genes by adjusted p-value
top50 <- res_df %>%
filter(!is.na(padj), padj < 0.05) %>%
arrange(padj) %>%
head(50) %>%
pull(gene_id)
## Extract expression matrix for top 50 genes
mat <- assay(vsd)[top50, ]
## Center and scale each gene (subtract mean, divide by SD)
## This makes rows comparable regardless of absolute expression level
mat_scaled <- t(scale(t(mat)))
## Annotation for column colors
ann_col <- data.frame(
Condition = col_data$condition,
row.names = colnames(mat_scaled)
)
pheatmap(
mat_scaled,
annotation_col = ann_col,
show_rownames = TRUE,
show_colnames = TRUE,
cluster_cols = TRUE, # cluster samples by expression similarity
cluster_rows = TRUE, # cluster genes by expression pattern
color = colorRampPalette(rev(brewer.pal(9, "RdBu")))(100),
main = "Top 50 DE genes: coral bleaching",
filename = "5_deseq2/heatmap_top50.png",
width = 10, height = 12
)
## --- Gene Ontology enrichment (requires annotation) ---
## For de novo transcriptomes, GO terms come from BLAST/Trinotate annotation
## Load a simple gene → GO mapping table (created from Trinotate output)
## Format: two columns: gene_id, GO_term (one row per term per gene)
## This file is created by running Trinotate after TransDecoder annotation
gene2go <- read.table("6_annotation/gene2go.txt",
sep="\t", col.names=c("gene_id","GO"))
## DE genes (adjusted p-value < 0.05)
de_genes <- res_df %>% filter(!is.na(padj), padj < 0.05) %>% pull(gene_id)
## All tested genes (background universe)
all_genes <- res_df$gene_id
## clusterProfiler::enricher runs a hypergeometric over-representation test
## against a custom TERM2GENE table — the right choice for non-model
## organisms (enrichGO requires an OrgDb annotation package, which coral
## lacks). Restrict gene2go to Biological Process terms beforehand if you
## want BP-only results.
## pAdjustMethod = "BH" : Benjamini-Hochberg FDR correction
ego <- enricher(
gene = de_genes,
universe = all_genes,
TERM2GENE = gene2go,
pAdjustMethod = "BH",
pvalueCutoff = 0.05
)
dotplot(ego, showCategory=20, title="GO Biological Process: bleach vs control")
ggsave("5_deseq2/GO_enrichment_dotplot.png", width=10, height=8, dpi=150)