Single-Cell RNA-Seq Analysis with Seurat v5 — Complete Tutorial

? What Is Single-Cell RNA-Seq?

Bulk RNA-Seq measures the average gene expression across millions of cells. Single-cell RNA-Seq (scRNA-Seq) measures expression in individual cells, revealing cell-type-specific programs, rare populations, and cellular heterogeneity that bulk methods completely miss.

The most common platform is 10X Genomics Chromium, which produces a sparse gene × cell count matrix via the Cell Ranger pipeline. Seurat is the most widely used R package for downstream analysis of this data.

The Standard Seurat Workflow

Load 10X data → Create Seurat object
Quality control (filter dead cells, doublets)
Normalize expression values
Find highly variable features
Scale data & run PCA
Find neighbors & cluster cells (UMAP)
Identify cluster marker genes
Annotate cell types

0 Installation

R

install.packages("Seurat")
install.packages(c("tidyverse", "patchwork", "ggrepel"))

# For cell type annotation
install.packages("remotes")
remotes::install_github("immunogenomics/presto")  # fast marker detection
BiocManager::install("SingleR")
BiocManager::install("celldex")

library(Seurat)
library(tidyverse)
library(patchwork)

1 Load 10X Genomics Data

Cell Ranger produces a folder containing three files: barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz. Seurat reads these directly.

R

# Point to the Cell Ranger output directory
counts <- Read10X(data.dir = "path/to/filtered_feature_bc_matrix/")

# Create the Seurat object
# min.cells = 3: keep genes expressed in at least 3 cells
# min.features = 200: keep cells with at least 200 detected genes
sobj <- CreateSeuratObject(
  counts       = counts,
  project      = "MyExperiment",
  min.cells    = 3,
  min.features = 200
)

sobj   # Quick summary: genes x cells

Why min.cells = 3? Genes detected in fewer than 3 cells provide almost no statistical information and are likely noise or ambient RNA. Setting this to 3 removes them upfront, reducing object size. Similarly, min.features = 200 removes empty droplets or debris that passed Cell Ranger's initial filtering.

2 Quality Control & Cell Filtering

This is the most impactful step. Poor QC leads to garbage clusters. We filter on three metrics:

R

# Calculate mitochondrial gene percentage (marker of dying cells)
sobj[["percent.mt"]] <- PercentageFeatureSet(sobj, pattern = "^MT-")
# For mouse, use pattern = "^mt-"

# Calculate ribosomal gene percentage
sobj[["percent.ribo"]] <- PercentageFeatureSet(sobj, pattern = "^RP[SL]")

# Visualize QC metrics
VlnPlot(sobj,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3, pt.size = 0)

# Scatter plots to identify relationships
plot1 <- FeatureScatter(sobj, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(sobj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

How to Choose QC Thresholds

`nFeature_RNA`

Genes per cell

Low cutoff (200–500): removes empty droplets
High cutoff (2500–6000): removes likely doublets
Look at the violin plot — set thresholds at the natural breakpoints

`nCount_RNA`

Total UMIs per cell

Correlates strongly with nFeature
Extreme outliers (top 1%) are likely doublets
Often don't need a separate filter if nFeature is filtered

`percent.mt`

Mitochondrial %

< 5%: stringent (most tissues)
< 10%: lenient (muscle, kidney, metabolically active tissues)
< 20%: only for highly metabolic cells or FFPE samples
High mt% = dying/stressed cells leaking cytoplasmic RNA

R

# Apply filters — ADJUST THESE BASED ON YOUR VIOLIN PLOTS
sobj <- subset(sobj,
  subset = nFeature_RNA > 200 &
           nFeature_RNA < 5000 &
           percent.mt < 10
)

cat("Cells after QC:", ncol(sobj), "\n")

Never blindly copy thresholds! The right cutoffs depend on your tissue type, dissociation protocol, and sequencing depth. Always inspect the violin and scatter plots first and set thresholds at the natural inflection points in your specific data. A brain sample and a tumor sample will have very different QC distributions.

3 Normalization

R

# Log-normalization (default, fast, works well for most datasets)
sobj <- NormalizeData(sobj,
                      normalization.method = "LogNormalize",
                      scale.factor = 10000)

Normalization Methods:

LogNormalize (default): divides each cell's counts by total, multiplies by scale.factor, then log1p transforms. Fast and sufficient for most analyses.
SCTransform: uses regularized negative binomial regression. Better variance stabilization, especially for complex datasets. Use sobj <- SCTransform(sobj) instead.
When to use SCTransform: if you're integrating multiple samples, have strong batch effects, or want the most statistically rigorous approach. It's slower but often gives cleaner clusters.

4 Identify Highly Variable Features

R

# Find the top 2000 most variable genes
sobj <- FindVariableFeatures(sobj,
                              selection.method = "vst",
                              nfeatures = 2000)

# Plot variable features
top10 <- head(VariableFeatures(sobj), 10)
plot1 <- VariableFeaturePlot(sobj)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2

How Many Variable Features?

2000 — standard default, works well for most datasets
3000–5000 — use for complex tissues with many cell types (e.g., whole brain, tumor microenvironment)
1000 — use for simpler datasets (e.g., sorted cell populations) or when you want tighter clusters
More features = more information but also more noise; diminishing returns past ~3000

5 Scaling, PCA & UMAP

R

# Scale data (centers and scales each gene across cells)
sobj <- ScaleData(sobj)

# PCA on the variable features
sobj <- RunPCA(sobj, features = VariableFeatures(sobj))

# Elbow plot to decide how many PCs to use
ElbowPlot(sobj, ndims = 40)

Choosing the Number of PCs:

Look at the elbow plot — where the curve flattens is your cutoff
Typical range: 10–30 PCs
Too few PCs: you lose real biological signal, clusters merge
Too many PCs: you include noise dimensions, clusters fragment
When in doubt, use 20. It's rarely wrong and is robust for most datasets.
For a more data-driven approach, use JackStraw analysis, though it's slow for large datasets.

R

n_pcs <- 20  # Based on elbow plot

# Build the shared nearest neighbor graph
sobj <- FindNeighbors(sobj, dims = 1:n_pcs)

# UMAP for visualization
sobj <- RunUMAP(sobj, dims = 1:n_pcs)

UMAP is for Visualization Only! UMAP preserves local structure but distorts global distances. Two clusters that appear far apart on UMAP may actually be quite similar, and vice versa. Never draw biological conclusions from UMAP distances alone. Clustering happens in PCA space via FindNeighbors + FindClusters — that's what matters.

6 Clustering

R

# Cluster cells — resolution controls granularity
sobj <- FindClusters(sobj, resolution = 0.5)

# Visualize clusters on UMAP
DimPlot(sobj, reduction = "umap", label = TRUE, label.size = 5) +
  ggtitle("Cell Clusters (resolution = 0.5)") +
  theme_minimal()

Choosing Clustering Resolution

Low (0.1–0.3)

Fewer, larger clusters. Good for identifying major cell lineages (e.g., immune vs. epithelial vs. stromal).

Medium (0.4–0.8)

Balanced. Standard starting point. Usually gives biologically meaningful cell types.

High (1.0–2.0)

Many small clusters. Use for sub-type discovery within known populations. Risk of over-splitting.

Practical tip: Try multiple resolutions and use the clustree R package to visualize how clusters split as resolution increases. Choose the resolution where clusters are stable and biologically interpretable.

7 Find Cluster Marker Genes

R

# Find markers for ALL clusters (one-vs-rest)
all_markers <- FindAllMarkers(sobj,
  only.pos    = TRUE,     # only upregulated markers
  min.pct     = 0.25,     # gene must be in ≥25% of cluster cells
  logfc.threshold = 0.25  # minimum log2FC
)

# Top 5 markers per cluster
top5 <- all_markers %>%
  group_by(cluster) %>%
  slice_max(n = 5, order_by = avg_log2FC)

print(top5)

# Dot plot of top markers
DoHeatmap(sobj, features = top5$gene) +
  scale_fill_viridis_c() +
  theme(axis.text.y = element_text(size = 6))

Parameter	Purpose	How to Tune
`min.pct`	Minimum fraction of cells expressing the gene in either group	0.25 is standard. Decrease to 0.1 for rare markers; increase to 0.5 to focus on broadly expressed genes.
`logfc.threshold`	Minimum average log2FC between cluster and rest	0.25 is default. Increase to 0.5 or 1.0 for stronger, more specific markers.
`test.use`	Statistical test	`"wilcox"` (default, fast). `"MAST"` for publication-quality DE (accounts for cellular detection rate).

Visualize Key Markers

R

# Feature plot — expression on UMAP
FeaturePlot(sobj,
  features = c("CD3D", "MS4A1", "LYZ", "NKG7"),  # example markers
  cols = c("lightgrey", "firebrick3"),
  ncol = 2
)

# Dot plot — compact multi-gene view
DotPlot(sobj,
  features = unique(top5$gene),
  cols = c("lightgrey", "navy")
) + RotatedAxis()

8 Cell Type Annotation

Assigning biological identity to clusters. Two approaches: manual (using known markers) and automated (reference-based).

Manual Annotation

R

# Rename clusters based on marker gene knowledge
new_labels <- c(
  "0" = "CD4+ T cells",
  "1" = "CD14+ Monocytes",
  "2" = "B cells",
  "3" = "CD8+ T cells",
  "4" = "NK cells",
  "5" = "Dendritic cells",
  "6" = "Platelets"
  # ... add more as needed
)

sobj <- RenameIdents(sobj, new_labels)
DimPlot(sobj, reduction = "umap", label = TRUE, repel = TRUE) +
  ggtitle("Annotated Cell Types") +
  theme_minimal()

Automated Annotation with SingleR

R

library(SingleR)
library(celldex)

# Load a reference dataset
ref <- celldex::HumanPrimaryCellAtlasData()

# Run SingleR on the normalized expression matrix
pred <- SingleR(
  test   = GetAssayData(sobj, layer = "data"),
  ref    = ref,
  labels = ref$label.main
)

# Add predictions to Seurat object
sobj$singler_labels <- pred$labels

DimPlot(sobj, group.by = "singler_labels", label = TRUE, repel = TRUE) +
  ggtitle("SingleR Automated Annotation")

Best Practice: Use automated annotation as a starting point, then refine manually using known marker genes from the literature. No automated tool is perfect — always validate the top markers for each assigned cell type with FeaturePlot() and domain expertise.

9 Differential Expression Between Conditions

If your experiment has multiple conditions (e.g., disease vs. healthy), you can perform DE within each cell type.

R

# Pseudobulk approach (recommended for multi-sample experiments)
# Aggregate counts per cell type per sample
bulk <- AggregateExpression(sobj,
  group.by = c("cell_type", "sample_id"),
  return.seurat = TRUE
)

# Then run DESeq2 on pseudobulk counts (see RNA-Seq tutorial)

# --- OR: Cell-level DE (simpler, but inflated p-values) ---
# Compare conditions within a specific cell type
cd4_cells <- subset(sobj, idents = "CD4+ T cells")
Idents(cd4_cells) <- "condition"

de_results <- FindMarkers(cd4_cells,
  ident.1      = "Disease",
  ident.2      = "Healthy",
  test.use     = "MAST",    # recommended for scRNA-Seq DE
  min.pct      = 0.1,
  logfc.threshold = 0.25
)

head(de_results, 20)

Pseudobulk vs. Cell-Level DE: Cell-level DE (treating each cell as an independent observation) inflates sample sizes and produces overly optimistic p-values. For multi-sample experiments, always use pseudobulk — aggregate counts per cell type per biological replicate, then run DESeq2 or edgeR. This is now the gold standard per Squair et al. (2021) and the Seurat team's own recommendations.

Summary

What You Learned:

Load 10X Genomics Cell Ranger output into Seurat
Perform QC filtering based on genes/cell, UMIs/cell, and mitochondrial %
Normalize with LogNormalize or SCTransform
Select highly variable features and run PCA
Choose the right number of PCs using the elbow plot
Cluster cells with appropriate resolution and visualize with UMAP
Identify marker genes and understand parameter tuning
Annotate cell types manually and with SingleR
Perform DE between conditions using pseudobulk (recommended) or cell-level approaches

Prev: Visualization Next: Variant Calling

Single-Cell RNA-Seq Analysis with Seurat

On this page