Single-Cell RNA-Seq Analysis with Seurat

From raw 10X Genomics output to annotated cell clusters — a complete walkthrough of the Seurat v5 pipeline with every QC threshold rationalized.

~75 min Advanced R / Seurat v5
Byte explains DNA

? What Is Single-Cell RNA-Seq?

Bulk RNA-Seq measures the average gene expression across millions of cells. Single-cell RNA-Seq (scRNA-Seq) measures expression in individual cells, revealing cell-type-specific programs, rare populations, and cellular heterogeneity that bulk methods completely miss.

The most common platform is 10X Genomics Chromium, which produces a sparse gene × cell count matrix via the Cell Ranger pipeline. Seurat is the most widely used R package for downstream analysis of this data.

The Standard Seurat Workflow
  1. Load 10X data → Create Seurat object
  2. Quality control (filter dead cells, doublets)
  3. Normalize expression values
  4. Find highly variable features
  5. Scale data & run PCA
  6. Find neighbors & cluster cells (UMAP)
  7. Identify cluster marker genes
  8. Annotate cell types

0 Installation

R
install.packages("Seurat")
install.packages(c("tidyverse", "patchwork", "ggrepel"))

# For cell type annotation
install.packages("remotes")
remotes::install_github("immunogenomics/presto")  # fast marker detection
BiocManager::install("SingleR")
BiocManager::install("celldex")

library(Seurat)
library(tidyverse)
library(patchwork)

1 Load 10X Genomics Data

Cell Ranger produces a folder containing three files: barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz. Seurat reads these directly.

R
# Point to the Cell Ranger output directory
counts <- Read10X(data.dir = "path/to/filtered_feature_bc_matrix/")

# Create the Seurat object
# min.cells = 3: keep genes expressed in at least 3 cells
# min.features = 200: keep cells with at least 200 detected genes
sobj <- CreateSeuratObject(
  counts       = counts,
  project      = "MyExperiment",
  min.cells    = 3,
  min.features = 200
)

sobj   # Quick summary: genes x cells
Byte key point
Why min.cells = 3? Genes detected in fewer than 3 cells provide almost no statistical information and are likely noise or ambient RNA. Setting this to 3 removes them upfront, reducing object size. Similarly, min.features = 200 removes empty droplets or debris that passed Cell Ranger's initial filtering.

2 Quality Control & Cell Filtering

This is the most impactful step. Poor QC leads to garbage clusters. We filter on three metrics:

R
# Calculate mitochondrial gene percentage (marker of dying cells)
sobj[["percent.mt"]] <- PercentageFeatureSet(sobj, pattern = "^MT-")
# For mouse, use pattern = "^mt-"

# Calculate ribosomal gene percentage
sobj[["percent.ribo"]] <- PercentageFeatureSet(sobj, pattern = "^RP[SL]")

# Visualize QC metrics
VlnPlot(sobj,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3, pt.size = 0)

# Scatter plots to identify relationships
plot1 <- FeatureScatter(sobj, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(sobj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
How to Choose QC Thresholds
nFeature_RNA

Genes per cell

  • Low cutoff (200–500): removes empty droplets
  • High cutoff (2500–6000): removes likely doublets
  • Look at the violin plot — set thresholds at the natural breakpoints
nCount_RNA

Total UMIs per cell

  • Correlates strongly with nFeature
  • Extreme outliers (top 1%) are likely doublets
  • Often don't need a separate filter if nFeature is filtered
percent.mt

Mitochondrial %

  • < 5%: stringent (most tissues)
  • < 10%: lenient (muscle, kidney, metabolically active tissues)
  • < 20%: only for highly metabolic cells or FFPE samples
  • High mt% = dying/stressed cells leaking cytoplasmic RNA
R
# Apply filters — ADJUST THESE BASED ON YOUR VIOLIN PLOTS
sobj <- subset(sobj,
  subset = nFeature_RNA > 200 &
           nFeature_RNA < 5000 &
           percent.mt < 10
)

cat("Cells after QC:", ncol(sobj), "\n")
Byte warning
Never blindly copy thresholds! The right cutoffs depend on your tissue type, dissociation protocol, and sequencing depth. Always inspect the violin and scatter plots first and set thresholds at the natural inflection points in your specific data. A brain sample and a tumor sample will have very different QC distributions.

3 Normalization

R
# Log-normalization (default, fast, works well for most datasets)
sobj <- NormalizeData(sobj,
                      normalization.method = "LogNormalize",
                      scale.factor = 10000)
Byte explains
Normalization Methods:
  • LogNormalize (default): divides each cell's counts by total, multiplies by scale.factor, then log1p transforms. Fast and sufficient for most analyses.
  • SCTransform: uses regularized negative binomial regression. Better variance stabilization, especially for complex datasets. Use sobj <- SCTransform(sobj) instead.
  • When to use SCTransform: if you're integrating multiple samples, have strong batch effects, or want the most statistically rigorous approach. It's slower but often gives cleaner clusters.

4 Identify Highly Variable Features

R
# Find the top 2000 most variable genes
sobj <- FindVariableFeatures(sobj,
                              selection.method = "vst",
                              nfeatures = 2000)

# Plot variable features
top10 <- head(VariableFeatures(sobj), 10)
plot1 <- VariableFeaturePlot(sobj)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2
How Many Variable Features?
  • 2000 — standard default, works well for most datasets
  • 3000–5000 — use for complex tissues with many cell types (e.g., whole brain, tumor microenvironment)
  • 1000 — use for simpler datasets (e.g., sorted cell populations) or when you want tighter clusters
  • More features = more information but also more noise; diminishing returns past ~3000

5 Scaling, PCA & UMAP

R
# Scale data (centers and scales each gene across cells)
sobj <- ScaleData(sobj)

# PCA on the variable features
sobj <- RunPCA(sobj, features = VariableFeatures(sobj))

# Elbow plot to decide how many PCs to use
ElbowPlot(sobj, ndims = 40)
Byte key point
Choosing the Number of PCs:
  • Look at the elbow plot — where the curve flattens is your cutoff
  • Typical range: 10–30 PCs
  • Too few PCs: you lose real biological signal, clusters merge
  • Too many PCs: you include noise dimensions, clusters fragment
  • When in doubt, use 20. It's rarely wrong and is robust for most datasets.
  • For a more data-driven approach, use JackStraw analysis, though it's slow for large datasets.
R
n_pcs <- 20  # Based on elbow plot

# Build the shared nearest neighbor graph
sobj <- FindNeighbors(sobj, dims = 1:n_pcs)

# UMAP for visualization
sobj <- RunUMAP(sobj, dims = 1:n_pcs)
Byte warns
UMAP is for Visualization Only! UMAP preserves local structure but distorts global distances. Two clusters that appear far apart on UMAP may actually be quite similar, and vice versa. Never draw biological conclusions from UMAP distances alone. Clustering happens in PCA space via FindNeighbors + FindClusters — that's what matters.

6 Clustering

R
# Cluster cells — resolution controls granularity
sobj <- FindClusters(sobj, resolution = 0.5)

# Visualize clusters on UMAP
DimPlot(sobj, reduction = "umap", label = TRUE, label.size = 5) +
  ggtitle("Cell Clusters (resolution = 0.5)") +
  theme_minimal()
Choosing Clustering Resolution
Low (0.1–0.3)

Fewer, larger clusters. Good for identifying major cell lineages (e.g., immune vs. epithelial vs. stromal).

Medium (0.4–0.8)

Balanced. Standard starting point. Usually gives biologically meaningful cell types.

High (1.0–2.0)

Many small clusters. Use for sub-type discovery within known populations. Risk of over-splitting.

Practical tip: Try multiple resolutions and use the clustree R package to visualize how clusters split as resolution increases. Choose the resolution where clusters are stable and biologically interpretable.

7 Find Cluster Marker Genes

R
# Find markers for ALL clusters (one-vs-rest)
all_markers <- FindAllMarkers(sobj,
  only.pos    = TRUE,     # only upregulated markers
  min.pct     = 0.25,     # gene must be in ≥25% of cluster cells
  logfc.threshold = 0.25  # minimum log2FC
)

# Top 5 markers per cluster
top5 <- all_markers %>%
  group_by(cluster) %>%
  slice_max(n = 5, order_by = avg_log2FC)

print(top5)

# Dot plot of top markers
DoHeatmap(sobj, features = top5$gene) +
  scale_fill_viridis_c() +
  theme(axis.text.y = element_text(size = 6))
ParameterPurposeHow to Tune
min.pctMinimum fraction of cells expressing the gene in either group0.25 is standard. Decrease to 0.1 for rare markers; increase to 0.5 to focus on broadly expressed genes.
logfc.thresholdMinimum average log2FC between cluster and rest0.25 is default. Increase to 0.5 or 1.0 for stronger, more specific markers.
test.useStatistical test"wilcox" (default, fast). "MAST" for publication-quality DE (accounts for cellular detection rate).
Visualize Key Markers
R
# Feature plot — expression on UMAP
FeaturePlot(sobj,
  features = c("CD3D", "MS4A1", "LYZ", "NKG7"),  # example markers
  cols = c("lightgrey", "firebrick3"),
  ncol = 2
)

# Dot plot — compact multi-gene view
DotPlot(sobj,
  features = unique(top5$gene),
  cols = c("lightgrey", "navy")
) + RotatedAxis()

8 Cell Type Annotation

Assigning biological identity to clusters. Two approaches: manual (using known markers) and automated (reference-based).

Manual Annotation
R
# Rename clusters based on marker gene knowledge
new_labels <- c(
  "0" = "CD4+ T cells",
  "1" = "CD14+ Monocytes",
  "2" = "B cells",
  "3" = "CD8+ T cells",
  "4" = "NK cells",
  "5" = "Dendritic cells",
  "6" = "Platelets"
  # ... add more as needed
)

sobj <- RenameIdents(sobj, new_labels)
DimPlot(sobj, reduction = "umap", label = TRUE, repel = TRUE) +
  ggtitle("Annotated Cell Types") +
  theme_minimal()
Automated Annotation with SingleR
R
library(SingleR)
library(celldex)

# Load a reference dataset
ref <- celldex::HumanPrimaryCellAtlasData()

# Run SingleR on the normalized expression matrix
pred <- SingleR(
  test   = GetAssayData(sobj, layer = "data"),
  ref    = ref,
  labels = ref$label.main
)

# Add predictions to Seurat object
sobj$singler_labels <- pred$labels

DimPlot(sobj, group.by = "singler_labels", label = TRUE, repel = TRUE) +
  ggtitle("SingleR Automated Annotation")
Byte tip
Best Practice: Use automated annotation as a starting point, then refine manually using known marker genes from the literature. No automated tool is perfect — always validate the top markers for each assigned cell type with FeaturePlot() and domain expertise.

9 Differential Expression Between Conditions

If your experiment has multiple conditions (e.g., disease vs. healthy), you can perform DE within each cell type.

R
# Pseudobulk approach (recommended for multi-sample experiments)
# Aggregate counts per cell type per sample
bulk <- AggregateExpression(sobj,
  group.by = c("cell_type", "sample_id"),
  return.seurat = TRUE
)

# Then run DESeq2 on pseudobulk counts (see RNA-Seq tutorial)

# --- OR: Cell-level DE (simpler, but inflated p-values) ---
# Compare conditions within a specific cell type
cd4_cells <- subset(sobj, idents = "CD4+ T cells")
Idents(cd4_cells) <- "condition"

de_results <- FindMarkers(cd4_cells,
  ident.1      = "Disease",
  ident.2      = "Healthy",
  test.use     = "MAST",    # recommended for scRNA-Seq DE
  min.pct      = 0.1,
  logfc.threshold = 0.25
)

head(de_results, 20)
Byte warns
Pseudobulk vs. Cell-Level DE: Cell-level DE (treating each cell as an independent observation) inflates sample sizes and produces overly optimistic p-values. For multi-sample experiments, always use pseudobulk — aggregate counts per cell type per biological replicate, then run DESeq2 or edgeR. This is now the gold standard per Squair et al. (2021) and the Seurat team's own recommendations.

Summary

Byte
What You Learned:
  1. Load 10X Genomics Cell Ranger output into Seurat
  2. Perform QC filtering based on genes/cell, UMIs/cell, and mitochondrial %
  3. Normalize with LogNormalize or SCTransform
  4. Select highly variable features and run PCA
  5. Choose the right number of PCs using the elbow plot
  6. Cluster cells with appropriate resolution and visualize with UMAP
  7. Identify marker genes and understand parameter tuning
  8. Annotate cell types manually and with SingleR
  9. Perform DE between conditions using pseudobulk (recommended) or cell-level approaches