Batch Effect Correction - Integration Methods - Trajectory Inference - Pseudotime - PAGA - RNA Velocity - Cell Fate
Batch effects are systematic technical differences between samples processed at different times, operators, or lanes. They can dominate biological signal on UMAP and in clustering.
| Source | Effect | Diagnosis |
|---|---|---|
| Sample preparation | Dissociation protocol, enzyme lot | Sample-specific clusters on UMAP |
| Sequencing | Lane effects, depth differences | Samples cluster by lane |
| Chemistry version | 10x v2 vs v3 sensitivity | Systematic gene count shifts |
| Donor/individual | Genetic background | Confounded with condition if not designed well |
import scanpy as sc
# Load multiple samples and concatenate
adatas = []
for sample in ['sample1','sample2','sample3','sample4']:
a = sc.read_10x_h5(f'{sample}/filtered_feature_bc_matrix.h5')
a.obs['sample'] = sample
a.obs['condition'] = 'ctrl' if 'ctrl' in sample else 'treat'
adatas.append(a)
adata = sc.concat(adatas, label='sample', keys=['s1','s2','s3','s4'])
adata.obs_names_make_unique()
# Standard preprocessing (without integration)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key='sample')
adata = adata[:, adata.var['highly_variable']].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.umap(adata)
# Diagnosis: color by sample vs cell type
sc.pl.umap(adata, color=['sample','condition'], ncols=2)
# If cells cluster by sample rather than cell type = batch effect present
# LISI score (Local Inverse Simpson Index):
# Measures mixing of batches in local neighborhoods
# iLISI near n_batches = well mixed, iLISI near 1 = poorly mixed
| Method | Approach | Speed | Best For |
|---|---|---|---|
| Harmony | Iterative PCA correction | Very fast | Simple batch effects, large datasets |
| scVI | Variational autoencoder (deep learning) | Moderate | Complex batches, preserves biology best |
| scANVI | scVI + cell type labels | Moderate | When reference labels available |
| BBKNN | Batch-balanced kNN graph | Fast | Lightweight, graph-level only |
| Scanorama | Panoramic stitching analog | Fast | Many batches, memory efficient |
| Seurat CCA/RPCA | Canonical correlation + anchors | Moderate | R/Seurat ecosystems |
# Harmony: fast, works directly on PCA embeddings
import harmonypy as hm
# Run PCA first (on log-normalized, scaled data)
sc.tl.pca(adata, n_comps=50)
# Harmony correction
ho = hm.run_harmony(adata.obsm['X_pca'], adata.obs, 'sample',
max_iter_harmony=20)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
# Use corrected PCA for neighbors/UMAP/clustering
sc.pp.neighbors(adata, use_rep='X_pca_harmony', n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.8)
sc.pl.umap(adata, color=['sample','leiden'])
import scvi
# scVI: deep generative model for single-cell data
# Learns a latent space that separates biology from batch effects
# Does NOT modify the count matrix -- produces a latent embedding
# Setup: use RAW counts (not normalized)
adata_raw = adata.copy()
adata_raw.X = adata_raw.layers['counts']
# Register the data
scvi.model.SCVI.setup_anndata(adata_raw, layer='counts',
batch_key='sample',
continuous_covariate_keys=['pct_counts_mt'])
# Train model
model = scvi.model.SCVI(adata_raw, n_latent=30, n_layers=2)
model.train(max_epochs=400, early_stopping=True)
# Get latent representation
adata.obsm['X_scVI'] = model.get_latent_representation()
# Use for clustering/UMAP
sc.pp.neighbors(adata, use_rep='X_scVI')
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.8)
# scVI also provides:
# - Differential expression (proper Bayesian framework)
# - Imputed/denoised expression values
de_results = model.differential_expression(
groupby='celltype', group1='T_cells', group2='B_cells'
)
# scANVI: semi-supervised -- use known labels to guide integration
scvi.model.SCANVI.setup_anndata(adata_raw, layer='counts',
batch_key='sample',
labels_key='celltypist')
scanvi_model = scvi.model.SCANVI.from_scvi_model(model,
unlabeled_category='Unknown')
scanvi_model.train(max_epochs=200)
adata.obsm['X_scANVI'] = scanvi_model.get_latent_representation()
adata.obs['scANVI_pred'] = scanvi_model.predict()
# Seurat v5: CCA + RPCA integration
library(Seurat)
obj[["RNA"]] <- split(obj[["RNA"]], f=obj$sample)
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)
# Integrate with RPCA (faster, recommended for large datasets)
obj <- IntegrateLayers(object=obj, method=RPCAIntegration,
orig.reduction="pca",
new.reduction="integrated.rpca")
# Or CCA (classic, better for very different batches)
obj <- IntegrateLayers(object=obj, method=CCAIntegration,
orig.reduction="pca",
new.reduction="integrated.cca")
# Continue with integrated reduction
obj <- FindNeighbors(obj, reduction="integrated.rpca", dims=1:30)
obj <- FindClusters(obj, resolution=0.8)
obj <- RunUMAP(obj, reduction="integrated.rpca", dims=1:30)
"Harmony for Haste, scVI for Value"
Harmony is fastest and works well for simple batches. scVI is slower but gives the best biological preservation and includes built-in DE testing. Start with Harmony; upgrade to scVI if results are unsatisfactory or for final publication analysis.
import scib
# scIB: single-cell Integration Benchmarking
# Measures both batch correction AND biological conservation
# Batch removal metrics:
# - iLISI: batch mixing in local neighborhoods
# - kBET: k-nearest-neighbor batch effect test
# - Graph connectivity: are same cell types connected across batches?
# Bio conservation metrics:
# - cLISI: cell type separation preserved
# - NMI/ARI: clustering agrees with known labels
# - Silhouette score: cluster compactness
# Run comprehensive benchmark
results = scib.metrics.metrics(
adata,
adata_int=adata, # integrated version
batch_key='sample',
label_key='celltype',
embed='X_scVI',
cluster_key='leiden',
type_='embed',
organism='human'
)
print(results)
# Compare multiple methods:
# Run Harmony, scVI, BBKNN, Scanorama on same data
# Compute metrics for each
# Choose method with best balance of batch removal + bio conservation
Trajectory inference recovers the dynamic processes (differentiation, activation, cell cycle) from the static snapshot of scRNA-seq data. Cells are ordered along a continuous path from progenitor to mature states.
| Method | Topology | Input | Strengths |
|---|---|---|---|
| Diffusion pseudotime | Linear/branching | kNN graph | Robust, foundation for PAGA |
| PAGA | Graph abstraction | kNN graph | Reveals global connectivity of clusters |
| Monocle3 | Tree/graph | Reduced space | Principled, R-native, DE along trajectory |
| Palantir | Branching | Diffusion maps | Fate probabilities, entropy |
| CytoTRACE | N/A (stemness score) | Gene counts | Predicts differentiation potential |
| Slingshot | Smooth curves | Reduced space | R-native, clear branching |
# PAGA: Partition-based Graph Abstraction
# Reveals which clusters are connected (trajectory backbone)
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, threshold=0.05, edge_width_scale=0.5)
# Edges = significant connectivity between clusters
# Thickness = strength of connection
# Use PAGA to initialize UMAP (better global structure)
sc.tl.umap(adata, init_pos='paga')
# Diffusion pseudotime
# Step 1: compute diffusion map
sc.tl.diffmap(adata, n_comps=15)
# Step 2: set root cell (must know biology!)
# Find the root: the most stem-like cell in your progenitor cluster
root_cluster = 'HSC' # example: hematopoietic stem cells
root_mask = adata.obs['celltype'] == root_cluster
adata.uns['iroot'] = int(
adata[root_mask].obsm['X_diffmap'][:,0].argmax()
)
# Step 3: compute pseudotime
sc.tl.dpt(adata)
sc.pl.umap(adata, color='dpt_pseudotime', cmap='viridis')
# Genes that change along pseudotime
sc.pl.paga_path(adata, nodes=['HSC','CMP','GMP','Mono'],
keys=['CD34','MPO','CSF1R','CD14'])
library(monocle3)
# Convert Seurat to Monocle3
cds <- as.cell_data_set(seurat_obj)
cds <- preprocess_cds(cds, num_dim=30)
cds <- reduce_dimension(cds, reduction_method="UMAP")
# Cluster and learn trajectory graph
cds <- cluster_cells(cds, resolution=1e-3)
cds <- learn_graph(cds, use_partition=TRUE)
# Order cells (set root interactively or programmatically)
cds <- order_cells(cds, root_cells=root_cell_ids)
# Or interactively: order_cells(cds) will open a GUI
# Plot pseudotime
plot_cells(cds, color_cells_by="pseudotime",
cell_size=0.5, label_branch_points=TRUE)
# Find genes that change along the trajectory
trajectory_genes <- graph_test(cds, neighbor_graph="principal_graph",
cores=8)
sig_genes <- subset(trajectory_genes, q_value < 0.05)
sig_genes <- sig_genes[order(sig_genes$morans_I, decreasing=TRUE),]
# Plot top trajectory genes
top_genes <- head(rownames(sig_genes), 6)
plot_genes_in_pseudotime(cds[top_genes,], min_expr=0.5)
# Palantir: computes fate probabilities at branch points
import palantir
# Run diffusion maps
dm_res = palantir.utils.run_diffusion_maps(adata, n_components=10)
# Set start cell
start_cell = 'ATCGATCG-1' # barcode of a stem/progenitor cell
# Run Palantir
pr_res = palantir.core.run_palantir(
adata, start_cell,
num_waypoints=500,
terminal_states=['Mono_terminal','Ery_terminal','Lymph_terminal']
)
# Outputs:
# pr_res.pseudotime: ordering along differentiation
# pr_res.entropy: high = multipotent, low = committed
# pr_res.branch_probs: probability of reaching each terminal state
# Visualize
palantir.plot.plot_palantir_results(adata, pr_res)
# Shows pseudotime, entropy, and fate probabilities on UMAP
RNA velocity infers the future state of a cell by comparing spliced (mature) and unspliced (nascent) mRNA. If unspliced > expected: gene is being upregulated. If unspliced < expected: gene is being downregulated.
import scvelo as scv
# Load data with spliced/unspliced counts
# If using STARsolo with --soloFeatures Velocyto:
adata = scv.read('starsolo_output/Velocyto/raw/', cache=True)
# Or merge with loom file from velocyto:
loom = scv.read('sample.loom', cache=True)
adata = scv.utils.merge(adata, loom)
# Preprocessing
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
# --- Stochastic model (fast, default) ---
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)
# Visualize velocity streamlines on UMAP
scv.pl.velocity_embedding_stream(adata, basis='umap',
color='celltype', figsize=(10,8))
# --- Dynamical model (recommended, more accurate) ---
scv.tl.recover_dynamics(adata, n_jobs=8)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
# Latent time (better pseudotime from velocity)
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', cmap='viridis')
# Velocity confidence / coherence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(adata, color='velocity_confidence')
# High confidence = velocity is consistent in neighborhood
# Low confidence = noisy or transitional region
# Top velocity genes (drivers of state transitions)
scv.tl.rank_velocity_genes(adata, groupby='celltype')
df = scv.get_df(adata.uns['rank_velocity_genes']['names'])
print(df.head(10))
"Unspliced up = Upregulating"
RNA velocity logic: if a gene has more unspliced (pre-mRNA) than expected from its spliced level, the gene is being actively transcribed (upregulating). The cell is moving toward a state where that gene is more highly expressed.
# CytoTRACE: predicts differentiation state from gene counts
# Principle: less differentiated cells express more genes
# No trajectory needed -- just a per-cell score
# Install: pip install cytotrace
import cytotrace
results = cytotrace.cytotrace(adata.X.toarray(),
gene_names=adata.var_names.tolist())
adata.obs['CytoTRACE'] = results['CytoTRACE']
# Score: 1 = most stem-like (highest gene count diversity)
# 0 = most differentiated
sc.pl.umap(adata, color='CytoTRACE', cmap='RdYlBu_r')
# Use CytoTRACE to validate pseudotime direction:
# Pseudotime should correlate negatively with CytoTRACE
# (higher pseudotime = more differentiated = lower CytoTRACE)
from scipy.stats import spearmanr
r, p = spearmanr(adata.obs['dpt_pseudotime'], adata.obs['CytoTRACE'])
print(f"Spearman r={r:.3f}, p={p:.2e}")
# Expect r < -0.5 if pseudotime direction is correct
Increase iterations (max_iter_harmony=30) or increase theta (batch diversity penalty). If still failing: the batch effect may be too severe for Harmony. Try scVI or check if the batches truly contain the same cell types.
Check: (1) Input must be RAW counts (not normalized), (2) Reduce learning rate, (3) Increase n_latent to 30-50, (4) Filter out very low-quality cells first. Also ensure no NaN/Inf values in the count matrix.
Low-quality velocity signal. Causes: (1) Very few unspliced reads (check with scv.pl.proportions(adata) -- need >10% unspliced), (2) Cells are in steady state (no dynamic process), (3) 3' chemistry misses intronic reads. Try: dynamical mode, increase n_top_genes, or accept that velocity is uninformative for this dataset.
Resolution too low (everything is one big connected component). Increase Leiden resolution before running PAGA, or increase the PAGA threshold for plotting. Alternatively, the biology is genuinely continuous (no discrete boundaries).
"PAGA before PseudoTime"
Always run PAGA first to understand the global topology of your data (which clusters connect to which). Then compute pseudotime along the path PAGA reveals. Pseudotime without knowing the topology is meaningless.
"Integrate to Mix batches, not to Merge biology"
The goal of integration is to align the same cell types across batches, NOT to force different cell types together. Over-integration is as bad as no integration. Always check that distinct cell types remain distinct after integration.