Technologies & Raw Processing · QC & Filtering · Normalization · Dimensionality Reduction · Clustering · Annotation · Differential Expression
| Platform | Type | Cells/Run | Sensitivity | Doublet Rate | Best For |
|---|---|---|---|---|---|
| 10x Chromium (3') | Droplet | 500–10,000 | Moderate (~2000 genes/cell) | ~4% at 5k cells | Large-scale surveys, atlas building |
| 10x Chromium (5') | Droplet | 500–10,000 | Moderate | ~4% | TCR/BCR + transcriptome |
| Smart-seq2/3 | Plate | 96–384 | High (~5000 genes/cell) | Near 0 | Deep profiling, rare cell types, full-length |
| Drop-seq | Droplet | 5,000+ | Lower | ~5% | Cost-effective, large screens |
| sci-RNA-seq3 | Combinatorial | 100,000+ | Low | Low | Ultra-high throughput, organism-scale |
| SPLiT-seq | Combinatorial | 100,000+ | Low | Low | No special equipment needed |
# 10x Chromium v3 chemistry:
# Read 1 (28 bp): [Cell Barcode: 16 bp][UMI: 12 bp]
# Read 2 (91 bp): [cDNA insert → transcript sequence]
# Index Read: [Sample Index: 8 bp]
#
# Cell barcode: identifies which cell (from whitelist of ~6.7M barcodes)
# UMI: deduplicates PCR amplification artifacts
# Each unique CB+UMI+gene combination = 1 original mRNA molecule
#
# Expected doublet rate ≈ 0.008 × (cells_loaded / 1000)
# Loading 10,000 cells → ~8% doublets
# Loading 5,000 cells → ~4% doublets (recommended default)
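The loading formula above is easy to wrap in a helper for planning an experiment (a sketch; the 0.8%-per-1,000-cells slope is the commonly quoted 10x approximation, and the function names are ours):

```python
def expected_doublet_rate(cells_loaded, slope_per_1000=0.008):
    """Approximate multiplet rate for droplet loading (10x rule of thumb)."""
    return slope_per_1000 * (cells_loaded / 1000)

def expected_doublets(cells_loaded):
    """Approximate absolute number of doublets at a given loading."""
    return cells_loaded * expected_doublet_rate(cells_loaded)

for n in (2000, 5000, 10000):
    print(f"{n:>6} cells loaded -> {expected_doublet_rate(n):.1%} doublets "
          f"(~{expected_doublets(n):.0f} cells)")
```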
# STARsolo: open-source Cell Ranger equivalent (faster, more flexible)
STAR --runMode genomeGenerate \
    --genomeDir star_index/ \
    --genomeFastaFiles genome.fa \
    --sjdbGTFfile genes.gtf \
    --runThreadN 16

STAR --soloType CB_UMI_Simple \
    --soloCBwhitelist 3M-february-2018.txt \
    --soloCBstart 1 --soloCBlen 16 \
    --soloUMIstart 17 --soloUMIlen 12 \
    --genomeDir star_index/ \
    --readFilesIn R2.fq.gz R1.fq.gz \
    --readFilesCommand zcat \
    --runThreadN 16 \
    --outSAMtype BAM SortedByCoordinate \
    --soloFeatures Gene GeneFull Velocyto \
    --soloCellFilter EmptyDrops_CR \
    --outFileNamePrefix starsolo_output/
# GeneFull: counts intronic reads (needed for RNA velocity)
# Velocyto: outputs spliced/unspliced/ambiguous matrices

# Cell Ranger (10x official pipeline):
cellranger count --id=sample1 \
    --transcriptome=/path/to/refdata \
    --fastqs=/path/to/fastqs/ \
    --sample=Sample1 \
    --localcores=16 --localmem=64

# alevin-fry (salmon-based, very fast):
salmon alevin -l ISR \
    -i salmon_index \
    -1 R1.fq.gz -2 R2.fq.gz \
    --chromiumV3 -p 16 -o alevin_output/
# Then process with alevin-fry for USA (unspliced/spliced/ambiguous)
alevin-fry generate-permit-list -d fw -i alevin_output/ -o quant/
alevin-fry collate -i quant/ -r alevin_output/
alevin-fry quant -i quant/ -m quant/map.rad -t 16 \
    --resolution cr-like -o quant_result/
| Tool | Speed | Velocity Support | License | Best For |
|---|---|---|---|---|
| Cell Ranger | Slow | No (external velocyto) | Proprietary (free) | 10x default, regulatory |
| STARsolo | Fast | Yes (built-in) | Open source | Most flexible, recommended |
| alevin-fry | Very fast | Yes (USA mode) | Open source | Speed-critical, non-model organisms |
| kallisto-bustools | Very fast | Yes | Open source | Pseudoalignment, lightweight |
"STARsolo Supersedes Cell Ranger"
STARsolo produces nearly identical results to Cell Ranger but is faster, open-source, and directly outputs velocity matrices. Unless your collaborator or journal requires Cell Ranger, use STARsolo.
library(DropletUtils)
library(SoupX)
# --- EmptyDrops: distinguish cells from empty droplets ---
raw_counts <- read10xCounts("starsolo_output/raw/")
# The knee plot: total UMI per barcode, sorted descending
br.out <- barcodeRanks(counts(raw_counts))
plot(br.out$rank, br.out$total, log="xy",
xlab="Rank", ylab="Total UMIs",
main="Knee Plot")
abline(h=metadata(br.out)$inflection, col="red", lty=2) # inflection
abline(h=metadata(br.out)$knee, col="blue", lty=2) # knee
# Statistical test: is this barcode a real cell?
e.out <- emptyDrops(counts(raw_counts), lower=100)
is.cell <- e.out$FDR <= 0.01
table(is.cell)
# TRUE = real cell, FALSE = empty droplet
# NAs = below lower threshold (definitely empty)
# Keep only real cells
cell_counts <- counts(raw_counts)[, which(is.cell)]
# --- SoupX: remove ambient RNA contamination ---
# Ambient RNA = mRNA from lysed cells floating in the droplet solution
# Contaminates ALL droplets with a background expression profile
# Load raw (unfiltered) and filtered matrices
sc <- load10X("cellranger_output/")
# OR manually:
# sc <- SoupChannel(tod=raw_matrix, toc=filtered_matrix)
# Estimate contamination fraction (rho)
sc <- autoEstCont(sc)
# rho typically 0.01-0.20 (1-20% contamination)
# Print estimate:
cat("Estimated contamination:", sc$metaData$rho[1], "\n")
# Correct the count matrix
corrected <- adjustCounts(sc)
# CellBender (alternative, deep learning-based):
# cellbender remove-background \
# --input raw_feature_bc_matrix.h5 \
# --output cellbender_output.h5 \
# --expected-cells 5000 \
# --total-droplets-included 25000 \
# --epochs 150
If 10% of counts in every cell come from ambient RNA, highly-expressed genes from abundant cell types will appear in ALL clusters — creating false marker genes and phantom cell states. SoupX or CellBender should be run before any downstream analysis. This is especially critical for tissues with high cell lysis (e.g., brain, tumor).
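A toy simulation makes the false-marker effect concrete (numpy only; the gene name, cell counts, and rates are invented for illustration): gene HBB is truly expressed only in cell type A, yet after 10% ambient contamination nearly every type B cell shows nonzero counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, rho = 200, 0.10          # 10% ambient contamination

# True expression: "HBB" only in cell type A (first 100 cells)
true_counts = np.zeros(n_cells)
true_counts[:100] = rng.poisson(50, 100)   # type A: high HBB
# type B (last 100 cells): zero HBB

# Ambient profile = average over all (lysed) cells; every droplet absorbs some
ambient_hbb = true_counts.mean()
observed = true_counts + rng.poisson(rho * ambient_hbb, n_cells)

frac_b_positive = (observed[100:] > 0).mean()
print(f"Type B cells 'expressing' HBB after contamination: {frac_b_positive:.0%}")
```

Without correction, a marker test would report HBB in the type B cluster — exactly the artifact SoupX/CellBender removes.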
import scrublet as scr
import scanpy as sc
import numpy as np
# Load count matrix
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# --- Scrublet: simulation-based doublet detection ---
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
# Visualize score distribution
scrub.plot_histogram()
# Bimodal distribution: low scores = singlets, high = doublets
# Threshold is set automatically; check it makes sense
adata.obs['doublet_score'] = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets
# Remove doublets
adata = adata[~adata.obs['predicted_doublet']].copy()
print(f"Removed {predicted_doublets.sum()} doublets, "
f"{adata.n_obs} cells remaining")
# scDblFinder: among the fastest and most accurate doublet detectors in benchmarks
library(scDblFinder)
library(SingleCellExperiment)

sce <- SingleCellExperiment(assays=list(counts=count_matrix))
sce <- scDblFinder(sce, dbr=0.06)  # expected doublet rate
# Results in colData
table(sce$scDblFinder.class)  # singlet / doublet

# For multiplexed samples (e.g., pooled donors):
# Use Demuxlet/Vireo for genotype-based demultiplexing
# This identifies doublets from DIFFERENT donors (heterotypic)
# but misses same-donor doublets (homotypic)
import scanpy as sc
import numpy as np
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
adata.var_names_make_unique()
# Calculate QC metrics
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['ribo'] = adata.var_names.str.startswith(('RPS','RPL'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','ribo'],
percent_top=None, log1p=False, inplace=True)
# Visualize QC distributions
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts',
'pct_counts_mt', 'pct_counts_ribo'],
jitter=0.4, multi_panel=True)
# --- Adaptive thresholds (MAD-based) ---
# Much better than hard cutoffs — adapts to each dataset
from scipy.stats import median_abs_deviation
def mad_outlier(data, nmads=5):
    median = np.median(data)
    mad = median_abs_deviation(data)
    return (data < median - nmads * mad) | (data > median + nmads * mad)
# Identify outlier cells
adata.obs['outlier_genes'] = mad_outlier(np.log1p(adata.obs['n_genes_by_counts']), nmads=5)
adata.obs['outlier_counts'] = mad_outlier(np.log1p(adata.obs['total_counts']), nmads=5)
adata.obs['outlier_mt'] = adata.obs['pct_counts_mt'] > 20 # hard cap for mt
# Combined filter
adata.obs['is_outlier'] = (adata.obs['outlier_genes'] |
adata.obs['outlier_counts'] |
adata.obs['outlier_mt'])
print(f"Filtering {adata.obs['is_outlier'].sum()} outlier cells "
f"out of {adata.n_obs}")
adata = adata[~adata.obs['is_outlier']].copy()
# Also filter genes
sc.pp.filter_genes(adata, min_cells=10) # gene in ≥10 cells
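To see what the MAD filter actually flags, run it on synthetic sequencing depths (self-contained sketch; the log-normal parameters are invented — a main population of healthy cells plus a low-depth tail of dying/empty-ish barcodes):

```python
import numpy as np
from scipy.stats import median_abs_deviation

def mad_outlier(data, nmads=5):
    median = np.median(data)
    mad = median_abs_deviation(data)
    return (data < median - nmads * mad) | (data > median + nmads * mad)

rng = np.random.default_rng(0)
depths = rng.lognormal(mean=8, sigma=0.4, size=5000)      # typical cells
depths[:50] = rng.lognormal(mean=4, sigma=0.3, size=50)   # low-depth tail
outliers = mad_outlier(np.log1p(depths), nmads=5)
print(f"Flagged {outliers.sum()} of {len(depths)} barcodes")
```

The low-depth tail is caught while the main population is barely touched — the adaptive behavior that fixed cutoffs lack.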
| Method | Approach | Tool | When to Use |
|---|---|---|---|
| Library size + log | Divide by total counts, multiply by 10k, log1p | Scanpy/Seurat default | Quick exploration, simple analyses |
| scran pooling | Pool cells to estimate size factors, deconvolve | scran (R) | Better for heterogeneous datasets |
| SCTransform | Variance-stabilizing transformation (GLM per gene) | Seurat v5 | Best for Seurat workflows, handles zero-inflation |
| Analytic Pearson residuals | Residuals from null model of technical noise | scanpy/scry | Scanpy workflows, theoretically grounded |
# Standard: library size normalization + log transform
adata.layers['counts'] = adata.X.copy()  # save raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers['log1p'] = adata.X.copy()

# Analytic Pearson residuals (preferred for feature selection)
# From scanpy >=1.9:
sc.experimental.pp.normalize_pearson_residuals(adata)
# Or the one-step recipe (HVG selection + normalization + PCA):
# sc.experimental.pp.recipe_pearson_residuals(adata)
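Under the hood, the analytic Pearson residual for gene g in cell c is r = (x − μ) / sqrt(μ + μ²/θ), where μ = (cell total × gene total) / grand total and θ is a fixed overdispersion (Lause et al. use θ = 100), with residuals clipped at ±sqrt(n_cells). A minimal dense numpy sketch (our own helper, not the scanpy implementation):

```python
import numpy as np

def pearson_residuals(X, theta=100.0):
    """Analytic Pearson residuals for a dense cells x genes count matrix."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    total = X.sum()
    mu = np.outer(X.sum(axis=1), X.sum(axis=0)) / total  # expected counts
    resid = (X - mu) / np.sqrt(mu + mu**2 / theta)
    return np.clip(resid, -np.sqrt(n), np.sqrt(n))       # clip per Lause et al.

# Tiny example: 3 cells x 2 genes
X = np.array([[10, 0], [8, 2], [2, 8]])
print(pearson_residuals(X).round(2))
```

A matrix with no biological variation (every cell identical) gives residuals of exactly zero, which is the point: only departures from the technical null survive.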
# SCTransform: regularized negative binomial regression per gene
library(Seurat)
obj <- CreateSeuratObject(counts=count_matrix)
obj <- SCTransform(obj, vst.flavor="v2",
vars.to.regress="percent.mt",
verbose=FALSE)
# v2 flavor: improved, uses glmGamPoi for speed
# vars.to.regress: removes unwanted variation (MT%, cell cycle, etc.)
# Output: SCT assay with corrected counts, normalized data, scale.data
"SCT for Seurat, Pearson for Python"
If using Seurat (R), SCTransform v2 is the best normalization. If using Scanpy (Python), analytic Pearson residuals are the preferred modern approach. Both outperform simple library-size normalization for heterogeneous datasets.
# Scanpy: identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000,
flavor='seurat_v3',
layer='counts') # use raw counts for seurat_v3
sc.pl.highly_variable_genes(adata)
# Restrict to HVGs for downstream analysis
adata = adata[:, adata.var['highly_variable']].copy()
# Alternative: deviance-based feature selection (scry, an R package)
# Selects genes with most biological variation relative to a null model
# In R: sce <- scry::devianceFeatureSelection(sce)
# then rank genes by rowData(sce)$binomial_deviance
# How many genes to select?
# 1000-3000: typical range
# Too few: miss subtle cell states
# Too many: include noise, batch-driven genes
# Rule of thumb: start with 2000, increase if rare types are missed
# PCA: the foundation — all downstream analysis uses PC space
sc.pp.scale(adata, max_value=10)  # z-score (only if not using Pearson residuals)
sc.tl.pca(adata, n_comps=50, svd_solver='arpack')

# How many PCs to use? Elbow plot:
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
# Pick the "elbow" where variance explained levels off
# Typical: 15-30 PCs for most datasets
# Molecular cross-validation (more rigorous):
# Split genes randomly, run PCA on each half, compare
# Number of PCs where halves agree = optimal number

# --- UMAP ---
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata, min_dist=0.3, spread=1.0)
sc.pl.umap(adata, color=['n_genes_by_counts', 'pct_counts_mt'])
# (add 'leiden' to the color list once clustering has been run)

# UMAP parameters that matter:
# n_neighbors: 5-15 = local structure (tight clusters)
#              30-50 = global structure (preserves distances)
# min_dist: 0.0-0.1 = dense, packed clusters
#           0.3-0.5 = more spread out (default 0.5 in scanpy)

# --- t-SNE ---
sc.tl.tsne(adata, n_pcs=30, perplexity=30)
# perplexity: ~sqrt(n_cells) or 30 for typical datasets
# t-SNE is NOT suitable for trajectory inference
# Use openTSNE for large datasets (faster, more features)

# CRITICAL: UMAP/t-SNE distances are NOT meaningful!
# Cells close together MAY be similar; cells far apart may or may not be
# Never interpret inter-cluster distances on UMAP
# All quantitative comparisons should use PCA space or graph-based methods
"UMAP Lies about Distances"
UMAP preserves local neighborhood structure but distorts global distances. Two clusters far apart on UMAP may actually be very similar in gene expression. Never say "cluster A is more similar to cluster B than C" based on UMAP position alone. Use dendrogram on PCA space instead.
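Quantitative cluster comparisons belong in PC space. A sketch of centroid distances there (our own helper; `X_pca` is the cells × PCs matrix, `labels` the cluster assignment — the toy data put clusters A and B close and C far away, which the PC-space distances recover regardless of where an embedding might place them):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def centroid_distances(X_pca, labels):
    """Pairwise Euclidean distances between cluster centroids in PC space."""
    groups = sorted(set(labels))
    centroids = np.vstack([X_pca[np.asarray(labels) == g].mean(axis=0)
                           for g in groups])
    return groups, squareform(pdist(centroids))

# Toy data: A and B nearby, C distant
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)),
               rng.normal(1, 0.1, (50, 2)),
               rng.normal(10, 0.1, (50, 2))])
labels = ['A'] * 50 + ['B'] * 50 + ['C'] * 50
groups, D = centroid_distances(X, labels)
print(groups, D.round(1))
```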
# Leiden algorithm (preferred over Louvain — guarantees well-connected communities)
# First: build k-nearest-neighbor graph in PCA space
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30, metric='cosine')
# Cluster at multiple resolutions
for res in [0.2, 0.4, 0.6, 0.8, 1.0, 1.5]:
    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_{res}')
# Visualize each
sc.pl.umap(adata, color=[f'leiden_{r}' for r in [0.2, 0.6, 1.0, 1.5]],
           ncols=2)
# Choosing resolution:
# Low (0.1-0.3): broad cell types (T cells, B cells, myeloid)
# Medium (0.5-0.8): refined types (CD4 naive, CD4 memory, CD8 naive)
# High (1.0-2.0): subtypes/states (activated, resting, cycling)
# Strategy: overcluster (high resolution) then merge
# based on marker genes and biological knowledge
# Cluster stability assessment (optional but recommended):
# Subsample 80% of cells 100 times, re-cluster, check consistency
# Stable clusters appear in >90% of subsamples
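The subsample-and-recluster idea can be sketched with scikit-learn (KMeans stands in for Leiden here, and the blobs are synthetic with chosen centers; swap in your own clustering call and PCA matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Well-separated synthetic clusters
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
                  cluster_std=0.5, random_state=0)
ref_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
aris = []
for _ in range(20):                      # use ~100 repeats in practice
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub_labels = KMeans(n_clusters=4, n_init=10).fit_predict(X[idx])
    # ARI compares partitions up to label permutation
    aris.append(adjusted_rand_score(ref_labels[idx], sub_labels))

print(f"Mean ARI over subsamples: {np.mean(aris):.2f}")
```

A mean ARI near 1.0 indicates stable clusters; clusters that dissolve under subsampling are candidates for merging.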
# Hierarchical relationship between clusters:
sc.tl.dendrogram(adata, groupby='leiden_0.8')
sc.pl.dendrogram(adata, groupby='leiden_0.8')
# --- Manual annotation: find marker genes per cluster ---
sc.tl.rank_genes_groups(adata, groupby='leiden_0.8', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)
# Check known markers:
marker_genes = {
'T cells': ['CD3D','CD3E','CD2'],
'CD4 T': ['CD4','IL7R','CCR7'],
'CD8 T': ['CD8A','CD8B','GZMK'],
'B cells': ['CD19','MS4A1','CD79A'],
'NK cells': ['NKG7','GNLY','KLRB1'],
'Monocytes': ['CD14','LYZ','S100A9'],
'DC': ['FCER1A','CST3','CLEC10A'],
}
sc.pl.dotplot(adata, marker_genes, groupby='leiden_0.8')
# --- CellTypist: automated annotation ---
import celltypist
from celltypist import models
# Download model
models.download_models(model='Immune_All_Low.pkl')
model = models.Model.load(model='Immune_All_Low.pkl')
# Predict
predictions = celltypist.annotate(adata, model=model,
majority_voting=True)
adata.obs['celltypist'] = predictions.predicted_labels['majority_voting'].values
sc.pl.umap(adata, color='celltypist')
# --- SingleR (R, reference-based) ---
# library(SingleR)
# library(celldex)
# ref <- HumanPrimaryCellAtlasData()
# results <- SingleR(test=sce, ref=ref, labels=ref$label.main)
# sce$singler_labels <- results$labels
# --- Azimuth (Seurat, reference mapping) ---
# library(Seurat)
# library(Azimuth)
# obj <- RunAzimuth(obj, reference="pbmcref")
library(Seurat)
library(DESeq2)
# CRITICAL: Single-cell DE treating each cell as independent
# inflates false positives CATASTROPHICALLY
# Reason: cells from the same donor are correlated (pseudoreplication)
# Solution: PSEUDOBULK — aggregate counts per donor per cell type
# Step 1: Aggregate counts per sample per cell type
obj$sample_celltype <- paste(obj$sample_id, obj$celltype, sep="_")
pseudo <- AggregateExpression(obj, assays="RNA",
group.by="sample_celltype",
return.seurat=FALSE)$RNA
# Step 2: Create DESeq2 dataset
# Metadata for each pseudobulk sample (condition order must match colnames(pseudo))
meta <- data.frame(
sample = colnames(pseudo),
condition = c("Control","Control","Control",
"Treated","Treated","Treated"),
row.names = colnames(pseudo)
)
dds <- DESeqDataSetFromMatrix(countData=pseudo,
colData=meta,
design=~condition)
dds <- DESeq(dds)
res <- results(dds, contrast=c("condition","Treated","Control"))
res <- res[order(res$padj),]
# Step 3: Interpret
sig_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 0.5)
cat(nrow(sig_genes), "significant DE genes\n")
# Alternative: MAST (mixed model at single-cell level)
# Includes donor as random effect
# library(MAST)
# zlm(~condition + (1|donor), sca=sca_object, method="glmer", ebayes=FALSE)
Treating 5,000 cells from 3 donors as 5,000 independent observations (Wilcoxon per cell) gives thousands of false positive DE genes. The statistical unit is the donor/sample, not the cell. Always use pseudobulk (DESeq2/edgeR on aggregated counts) or mixed models (MAST with donor random effect). Minimum: 3 biological replicates per condition.
"Pseudobulk is the Proper way" — PP
For differential expression between conditions (case vs control), always use pseudobulk. Aggregate counts per sample per cell type → run DESeq2/edgeR. The Wilcoxon test per cell is only appropriate for finding marker genes between clusters within one sample, not for comparing conditions across biological replicates.
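The aggregation step is just a group-by sum of raw counts per sample × cell type. A pandas sketch with synthetic counts (all column and gene names are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells = 300
obs = pd.DataFrame({
    'sample_id': rng.choice(['donor1', 'donor2', 'donor3'], n_cells),
    'celltype':  rng.choice(['Tcell', 'Bcell'], n_cells),
})
counts = pd.DataFrame(rng.poisson(2, (n_cells, 4)),
                      columns=['GeneA', 'GeneB', 'GeneC', 'GeneD'])

# Pseudobulk: sum raw counts over cells within each sample x cell type
pseudobulk = counts.groupby([obs['sample_id'], obs['celltype']]).sum()
print(pseudobulk)
# Each row is now one biological replicate -> feed to DESeq2/edgeR
```

Note the sum is over raw counts, not normalized values — DESeq2/edgeR model the counts directly.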
Problem: clustering yields one undifferentiated blob. Causes: (1) Too few HVGs — increase to 3000+, (2) Too few PCs — check elbow plot, (3) Batch effect dominating — integrate before clustering (Module 2), (4) Over-correction — SCTransform regression removed real biology, (5) All cells are actually the same type (e.g., sorted population).
Problem: automated annotation returns implausible labels. Usually a reference mismatch: (1) Using a PBMC reference for brain tissue, (2) Species mismatch (human reference for mouse data), (3) Low-quality cells getting random labels. Always validate automated labels with known markers; never trust automated annotation blindly.
Problem: no significant DE genes despite an apparent effect. Usually low statistical power: (1) Too few replicates (need ≥3, ideally ≥5), (2) High donor-to-donor variability, (3) Effect size is real but small. Solutions: increase replicates, use MAST with a donor random effect for more power (at the cost of more assumptions), or relax thresholds (padj < 0.1).
Problem: STARsolo detects almost no valid cells. Check: (1) Correct whitelist file for your chemistry version, (2) Read structure is correct (R1/R2 order), (3) Reference genome matches your species, (4) Use --soloCellFilter EmptyDrops_CR instead of the default knee filter for low-quality samples.
"Empty → Doublet → QC → Norm → Cluster → Annotate" — EDQNCA
The sacred order of scRNA-seq preprocessing. Skipping any step contaminates all downstream results. Empty droplet removal → Doublet detection → QC filtering → Normalization → Clustering → Annotation.
"3 replicates, 3000 cells, 3 conditions max"
Minimum experimental design for condition-comparison scRNA-seq: 3 biological replicates per condition, ~3000 cells per sample, and start with ≤3 conditions. This gives you enough statistical power for pseudobulk DE while keeping costs manageable.