From raw 10X Genomics output to annotated cell clusters — a complete walkthrough of the Seurat v5 pipeline with every QC threshold rationalized.
Bulk RNA-Seq measures the average gene expression across millions of cells. Single-cell RNA-Seq (scRNA-Seq) measures expression in individual cells, revealing cell-type-specific programs, rare populations, and cellular heterogeneity that bulk methods completely miss.
The most common platform is 10X Genomics Chromium, which produces a sparse gene × cell count matrix via the Cell Ranger pipeline. Seurat is the most widely used R package for downstream analysis of this data.
install.packages("Seurat")
install.packages(c("tidyverse", "patchwork", "ggrepel"))
# For cell type annotation
install.packages("remotes")
remotes::install_github("immunogenomics/presto") # fast marker detection
BiocManager::install("SingleR")
BiocManager::install("celldex")
library(Seurat)
library(tidyverse)
library(patchwork)
Cell Ranger produces a folder containing three files: barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz. Seurat reads these directly.
# Point to the Cell Ranger output directory counts <- Read10X(data.dir = "path/to/filtered_feature_bc_matrix/") # Create the Seurat object # min.cells = 3: keep genes expressed in at least 3 cells # min.features = 200: keep cells with at least 200 detected genes sobj <- CreateSeuratObject( counts = counts, project = "MyExperiment", min.cells = 3, min.features = 200 ) sobj # Quick summary: genes x cells
This is the most impactful step. Poor QC leads to garbage clusters. We filter on three metrics:
# Calculate mitochondrial gene percentage (marker of dying cells)
sobj[["percent.mt"]] <- PercentageFeatureSet(sobj, pattern = "^MT-")
# For mouse, use pattern = "^mt-"
# Calculate ribosomal gene percentage
sobj[["percent.ribo"]] <- PercentageFeatureSet(sobj, pattern = "^RP[SL]")
# Visualize QC metrics
VlnPlot(sobj,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3, pt.size = 0)
# Scatter plots to identify relationships
plot1 <- FeatureScatter(sobj, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(sobj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
nFeature_RNAGenes per cell
nCount_RNATotal UMIs per cell
percent.mtMitochondrial %
# Apply filters — ADJUST THESE BASED ON YOUR VIOLIN PLOTS
sobj <- subset(sobj,
subset = nFeature_RNA > 200 &
nFeature_RNA < 5000 &
percent.mt < 10
)
cat("Cells after QC:", ncol(sobj), "\n")
# Log-normalization (default, fast, works well for most datasets)
sobj <- NormalizeData(sobj,
normalization.method = "LogNormalize",
scale.factor = 10000)
scale.factor, then log1p transforms. Fast and sufficient for most analyses.sobj <- SCTransform(sobj) instead.# Find the top 2000 most variable genes
sobj <- FindVariableFeatures(sobj,
selection.method = "vst",
nfeatures = 2000)
# Plot variable features
top10 <- head(VariableFeatures(sobj), 10)
plot1 <- VariableFeaturePlot(sobj)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2
# Scale data (centers and scales each gene across cells) sobj <- ScaleData(sobj) # PCA on the variable features sobj <- RunPCA(sobj, features = VariableFeatures(sobj)) # Elbow plot to decide how many PCs to use ElbowPlot(sobj, ndims = 40)
JackStraw analysis, though it's slow for large datasets.n_pcs <- 20 # Based on elbow plot # Build the shared nearest neighbor graph sobj <- FindNeighbors(sobj, dims = 1:n_pcs) # UMAP for visualization sobj <- RunUMAP(sobj, dims = 1:n_pcs)
FindNeighbors + FindClusters — that's what matters.
# Cluster cells — resolution controls granularity
sobj <- FindClusters(sobj, resolution = 0.5)
# Visualize clusters on UMAP
DimPlot(sobj, reduction = "umap", label = TRUE, label.size = 5) +
ggtitle("Cell Clusters (resolution = 0.5)") +
theme_minimal()
Fewer, larger clusters. Good for identifying major cell lineages (e.g., immune vs. epithelial vs. stromal).
Balanced. Standard starting point. Usually gives biologically meaningful cell types.
Many small clusters. Use for sub-type discovery within known populations. Risk of over-splitting.
Practical tip: Try multiple resolutions and use the clustree R package to visualize how clusters split as resolution increases. Choose the resolution where clusters are stable and biologically interpretable.
# Find markers for ALL clusters (one-vs-rest) all_markers <- FindAllMarkers(sobj, only.pos = TRUE, # only upregulated markers min.pct = 0.25, # gene must be in ≥25% of cluster cells logfc.threshold = 0.25 # minimum log2FC ) # Top 5 markers per cluster top5 <- all_markers %>% group_by(cluster) %>% slice_max(n = 5, order_by = avg_log2FC) print(top5) # Dot plot of top markers DoHeatmap(sobj, features = top5$gene) + scale_fill_viridis_c() + theme(axis.text.y = element_text(size = 6))
| Parameter | Purpose | How to Tune |
|---|---|---|
min.pct | Minimum fraction of cells expressing the gene in either group | 0.25 is standard. Decrease to 0.1 for rare markers; increase to 0.5 to focus on broadly expressed genes. |
logfc.threshold | Minimum average log2FC between cluster and rest | 0.25 is default. Increase to 0.5 or 1.0 for stronger, more specific markers. |
test.use | Statistical test | "wilcox" (default, fast). "MAST" for publication-quality DE (accounts for cellular detection rate). |
# Feature plot — expression on UMAP
FeaturePlot(sobj,
features = c("CD3D", "MS4A1", "LYZ", "NKG7"), # example markers
cols = c("lightgrey", "firebrick3"),
ncol = 2
)
# Dot plot — compact multi-gene view
DotPlot(sobj,
features = unique(top5$gene),
cols = c("lightgrey", "navy")
) + RotatedAxis()
Assigning biological identity to clusters. Two approaches: manual (using known markers) and automated (reference-based).
# Rename clusters based on marker gene knowledge
new_labels <- c(
"0" = "CD4+ T cells",
"1" = "CD14+ Monocytes",
"2" = "B cells",
"3" = "CD8+ T cells",
"4" = "NK cells",
"5" = "Dendritic cells",
"6" = "Platelets"
# ... add more as needed
)
sobj <- RenameIdents(sobj, new_labels)
DimPlot(sobj, reduction = "umap", label = TRUE, repel = TRUE) +
ggtitle("Annotated Cell Types") +
theme_minimal()
library(SingleR)
library(celldex)
# Load a reference dataset
ref <- celldex::HumanPrimaryCellAtlasData()
# Run SingleR on the normalized expression matrix
pred <- SingleR(
test = GetAssayData(sobj, layer = "data"),
ref = ref,
labels = ref$label.main
)
# Add predictions to Seurat object
sobj$singler_labels <- pred$labels
DimPlot(sobj, group.by = "singler_labels", label = TRUE, repel = TRUE) +
ggtitle("SingleR Automated Annotation")
FeaturePlot() and domain expertise.
If your experiment has multiple conditions (e.g., disease vs. healthy), you can perform DE within each cell type.
# Pseudobulk approach (recommended for multi-sample experiments)
# Aggregate counts per cell type per sample
bulk <- AggregateExpression(sobj,
group.by = c("cell_type", "sample_id"),
return.seurat = TRUE
)
# Then run DESeq2 on pseudobulk counts (see RNA-Seq tutorial)
# --- OR: Cell-level DE (simpler, but inflated p-values) ---
# Compare conditions within a specific cell type
cd4_cells <- subset(sobj, idents = "CD4+ T cells")
Idents(cd4_cells) <- "condition"
de_results <- FindMarkers(cd4_cells,
ident.1 = "Disease",
ident.2 = "Healthy",
test.use = "MAST", # recommended for scRNA-Seq DE
min.pct = 0.1,
logfc.threshold = 0.25
)
head(de_results, 20)