Batch Integration, Trajectory and RNA Velocity - SC Module 2

3.1 - Understanding Batch Effects

Batch effects are systematic technical differences between samples processed at different times, operators, or lanes. They can dominate biological signal on UMAP and in clustering.

Source	Effect	Diagnosis
Sample preparation	Dissociation protocol, enzyme lot	Sample-specific clusters on UMAP
Sequencing	Lane effects, depth differences	Samples cluster by lane
Chemistry version	10x v2 vs v3 sensitivity	Systematic gene count shifts
Donor/individual	Genetic background	Confounded with condition if not designed well

Python - Diagnosing Batch Effects

import scanpy as sc

# Load multiple samples and concatenate
adatas = []
for sample in ['sample1','sample2','sample3','sample4']:
    a = sc.read_10x_h5(f'{sample}/filtered_feature_bc_matrix.h5')
    a.obs['sample'] = sample
    a.obs['condition'] = 'ctrl' if 'ctrl' in sample else 'treat'
    adatas.append(a)

adata = sc.concat(adatas, label='sample', keys=['s1','s2','s3','s4'])
adata.obs_names_make_unique()

# Standard preprocessing (without integration)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key='sample')
adata = adata[:, adata.var['highly_variable']].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.umap(adata)

# Diagnosis: color by sample vs cell type
sc.pl.umap(adata, color=['sample','condition'], ncols=2)
# If cells cluster by sample rather than cell type = batch effect present

# LISI score (Local Inverse Simpson Index):
# Measures mixing of batches in local neighborhoods
# iLISI near n_batches = well mixed, iLISI near 1 = poorly mixed

Download:

3.2 - Integration Methods

Method	Approach	Speed	Best For
Harmony	Iterative PCA correction	Very fast	Simple batch effects, large datasets
scVI	Variational autoencoder (deep learning)	Moderate	Complex batches, preserves biology best
scANVI	scVI + cell type labels	Moderate	When reference labels available
BBKNN	Batch-balanced kNN graph	Fast	Lightweight, graph-level only
Scanorama	Panoramic stitching analog	Fast	Many batches, memory efficient
Seurat CCA/RPCA	Canonical correlation + anchors	Moderate	R/Seurat ecosystems

Python - Harmony Integration

# Harmony: fast, works directly on PCA embeddings
import harmonypy as hm

# Run PCA first (on log-normalized, scaled data)
sc.tl.pca(adata, n_comps=50)

# Harmony correction
ho = hm.run_harmony(adata.obsm['X_pca'], adata.obs, 'sample',
                     max_iter_harmony=20)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T

# Use corrected PCA for neighbors/UMAP/clustering
sc.pp.neighbors(adata, use_rep='X_pca_harmony', n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.8)
sc.pl.umap(adata, color=['sample','leiden'])

Download:

Python - scVI Integration (Recommended)

import scvi

# scVI: deep generative model for single-cell data
# Learns a latent space that separates biology from batch effects
# Does NOT modify the count matrix -- produces a latent embedding

# Setup: use RAW counts (not normalized)
adata_raw = adata.copy()
adata_raw.X = adata_raw.layers['counts']

# Register the data
scvi.model.SCVI.setup_anndata(adata_raw, layer='counts',
                                batch_key='sample',
                                continuous_covariate_keys=['pct_counts_mt'])

# Train model
model = scvi.model.SCVI(adata_raw, n_latent=30, n_layers=2)
model.train(max_epochs=400, early_stopping=True)

# Get latent representation
adata.obsm['X_scVI'] = model.get_latent_representation()

# Use for clustering/UMAP
sc.pp.neighbors(adata, use_rep='X_scVI')
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.8)

# scVI also provides:
# - Differential expression (proper Bayesian framework)
# - Imputed/denoised expression values
de_results = model.differential_expression(
    groupby='celltype', group1='T_cells', group2='B_cells'
)

# scANVI: semi-supervised -- use known labels to guide integration
scvi.model.SCANVI.setup_anndata(adata_raw, layer='counts',
                                  batch_key='sample',
                                  labels_key='celltypist')
scanvi_model = scvi.model.SCANVI.from_scvi_model(model,
                                                    unlabeled_category='Unknown')
scanvi_model.train(max_epochs=200)
adata.obsm['X_scANVI'] = scanvi_model.get_latent_representation()
adata.obs['scANVI_pred'] = scanvi_model.predict()

Download:

R - Seurat v5 Integration

# Seurat v5: CCA + RPCA integration
library(Seurat)

obj[["RNA"]] <- split(obj[["RNA"]], f=obj$sample)

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

# Integrate with RPCA (faster, recommended for large datasets)
obj <- IntegrateLayers(object=obj, method=RPCAIntegration,
                        orig.reduction="pca",
                        new.reduction="integrated.rpca")

# Or CCA (classic, better for very different batches)
obj <- IntegrateLayers(object=obj, method=CCAIntegration,
                        orig.reduction="pca",
                        new.reduction="integrated.cca")

# Continue with integrated reduction
obj <- FindNeighbors(obj, reduction="integrated.rpca", dims=1:30)
obj <- FindClusters(obj, resolution=0.8)
obj <- RunUMAP(obj, reduction="integrated.rpca", dims=1:30)

Mnemonic

"Harmony for Haste, scVI for Value"

Harmony is fastest and works well for simple batches. scVI is slower but gives the best biological preservation and includes built-in DE testing. Start with Harmony; upgrade to scVI if results are unsatisfactory or for final publication analysis.

3.3 - Benchmarking Integration Quality

Python - scIB Benchmarking

import scib

# scIB: single-cell Integration Benchmarking
# Measures both batch correction AND biological conservation

# Batch removal metrics:
# - iLISI: batch mixing in local neighborhoods
# - kBET: k-nearest-neighbor batch effect test
# - Graph connectivity: are same cell types connected across batches?

# Bio conservation metrics:
# - cLISI: cell type separation preserved
# - NMI/ARI: clustering agrees with known labels
# - Silhouette score: cluster compactness

# Run comprehensive benchmark
results = scib.metrics.metrics(
    adata,
    adata_int=adata,  # integrated version
    batch_key='sample',
    label_key='celltype',
    embed='X_scVI',
    cluster_key='leiden',
    type_='embed',
    organism='human'
)
print(results)

# Compare multiple methods:
# Run Harmony, scVI, BBKNN, Scanorama on same data
# Compute metrics for each
# Choose method with best balance of batch removal + bio conservation

Download:

Thumb Rule - Integration Assessment: A good integration (1) mixes batches within each cell type, (2) does NOT merge distinct cell types, (3) preserves known marker gene expression. Always check: can you still identify expected cell types with known markers after integration? If markers are lost, the method over-corrected.

4.1 - Trajectory Inference Concepts

Trajectory inference recovers the dynamic processes (differentiation, activation, cell cycle) from the static snapshot of scRNA-seq data. Cells are ordered along a continuous path from progenitor to mature states.

Method	Topology	Input	Strengths
Diffusion pseudotime	Linear/branching	kNN graph	Robust, foundation for PAGA
PAGA	Graph abstraction	kNN graph	Reveals global connectivity of clusters
Monocle3	Tree/graph	Reduced space	Principled, R-native, DE along trajectory
Palantir	Branching	Diffusion maps	Fate probabilities, entropy
CytoTRACE	N/A (stemness score)	Gene counts	Predicts differentiation potential
Slingshot	Smooth curves	Reduced space	R-native, clear branching

Choosing a Trajectory Method

Simple linear differentiation: Diffusion pseudotime
Complex topology (unknown): PAGA first to determine structure, then pseudotime
Branching with fate probabilities: Palantir
DE genes along trajectory: Monocle3 (graph_test)
Direction of differentiation: CytoTRACE (no trajectory needed)
Direction with velocity: RNA velocity (Section 4.4)

4.2 - Diffusion Pseudotime and PAGA

Python - Scanpy DPT and PAGA

# PAGA: Partition-based Graph Abstraction
# Reveals which clusters are connected (trajectory backbone)
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, threshold=0.05, edge_width_scale=0.5)
# Edges = significant connectivity between clusters
# Thickness = strength of connection

# Use PAGA to initialize UMAP (better global structure)
sc.tl.umap(adata, init_pos='paga')

# Diffusion pseudotime
# Step 1: compute diffusion map
sc.tl.diffmap(adata, n_comps=15)

# Step 2: set root cell (must know biology!)
# Find the root: the most stem-like cell in your progenitor cluster
root_cluster = 'HSC'  # example: hematopoietic stem cells
root_mask = adata.obs['celltype'] == root_cluster
adata.uns['iroot'] = int(
    adata[root_mask].obsm['X_diffmap'][:,0].argmax()
)

# Step 3: compute pseudotime
sc.tl.dpt(adata)
sc.pl.umap(adata, color='dpt_pseudotime', cmap='viridis')

# Genes that change along pseudotime
sc.pl.paga_path(adata, nodes=['HSC','CMP','GMP','Mono'],
                 keys=['CD34','MPO','CSF1R','CD14'])

Download:

Thumb Rule - Root Cell: You MUST specify the root cell for pseudotime. This requires biological knowledge. Look for: (1) known stem/progenitor markers, (2) CytoTRACE score (highest = most stem-like), (3) RNA velocity direction (converge toward root). Wrong root = meaningless pseudotime.

4.3 - Monocle3 and Palantir

R - Monocle3

library(monocle3)

# Convert Seurat to Monocle3
cds <- as.cell_data_set(seurat_obj)
cds <- preprocess_cds(cds, num_dim=30)
cds <- reduce_dimension(cds, reduction_method="UMAP")

# Cluster and learn trajectory graph
cds <- cluster_cells(cds, resolution=1e-3)
cds <- learn_graph(cds, use_partition=TRUE)

# Order cells (set root interactively or programmatically)
cds <- order_cells(cds, root_cells=root_cell_ids)
# Or interactively: order_cells(cds) will open a GUI

# Plot pseudotime
plot_cells(cds, color_cells_by="pseudotime",
           cell_size=0.5, label_branch_points=TRUE)

# Find genes that change along the trajectory
trajectory_genes <- graph_test(cds, neighbor_graph="principal_graph",
                                cores=8)
sig_genes <- subset(trajectory_genes, q_value < 0.05)
sig_genes <- sig_genes[order(sig_genes$morans_I, decreasing=TRUE),]

# Plot top trajectory genes
top_genes <- head(rownames(sig_genes), 6)
plot_genes_in_pseudotime(cds[top_genes,], min_expr=0.5)

Download:

Python - Palantir (Fate Probabilities)

# Palantir: computes fate probabilities at branch points
import palantir

# Run diffusion maps
dm_res = palantir.utils.run_diffusion_maps(adata, n_components=10)

# Set start cell
start_cell = 'ATCGATCG-1'  # barcode of a stem/progenitor cell

# Run Palantir
pr_res = palantir.core.run_palantir(
    adata, start_cell,
    num_waypoints=500,
    terminal_states=['Mono_terminal','Ery_terminal','Lymph_terminal']
)

# Outputs:
# pr_res.pseudotime: ordering along differentiation
# pr_res.entropy: high = multipotent, low = committed
# pr_res.branch_probs: probability of reaching each terminal state

# Visualize
palantir.plot.plot_palantir_results(adata, pr_res)
# Shows pseudotime, entropy, and fate probabilities on UMAP

4.4 - RNA Velocity

RNA velocity infers the future state of a cell by comparing spliced (mature) and unspliced (nascent) mRNA. If unspliced > expected: gene is being upregulated. If unspliced < expected: gene is being downregulated.

Python - scVelo

import scvelo as scv

# Load data with spliced/unspliced counts
# If using STARsolo with --soloFeatures Velocyto:
adata = scv.read('starsolo_output/Velocyto/raw/', cache=True)
# Or merge with loom file from velocyto:
loom = scv.read('sample.loom', cache=True)
adata = scv.utils.merge(adata, loom)

# Preprocessing
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# --- Stochastic model (fast, default) ---
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)

# Visualize velocity streamlines on UMAP
scv.pl.velocity_embedding_stream(adata, basis='umap',
                                   color='celltype', figsize=(10,8))

# --- Dynamical model (recommended, more accurate) ---
scv.tl.recover_dynamics(adata, n_jobs=8)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

# Latent time (better pseudotime from velocity)
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', cmap='viridis')

# Velocity confidence / coherence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(adata, color='velocity_confidence')
# High confidence = velocity is consistent in neighborhood
# Low confidence = noisy or transitional region

# Top velocity genes (drivers of state transitions)
scv.tl.rank_velocity_genes(adata, groupby='celltype')
df = scv.get_df(adata.uns['rank_velocity_genes']['names'])
print(df.head(10))

Download:

RNA Velocity Caveats (Important!):

Velocity assumes a simple splicing model (constant rates). This is often violated.
3' scRNA-seq has limited unspliced read coverage (intronic reads). Smart-seq2 is better.
Velocity arrows on UMAP can be misleading because UMAP distorts distances.
Low velocity confidence means the direction is unreliable -- do not over-interpret.
UniTVelo and VeloVAE are newer methods that may be more robust.

Mnemonic

"Unspliced up = Upregulating"

RNA velocity logic: if a gene has more unspliced (pre-mRNA) than expected from its spliced level, the gene is being actively transcribed (upregulating). The cell is moving toward a state where that gene is more highly expressed.

4.5 - Cell Fate Prediction and CytoTRACE

Python - CytoTRACE

# CytoTRACE: predicts differentiation state from gene counts
# Principle: less differentiated cells express more genes
# No trajectory needed -- just a per-cell score

# Install: pip install cytotrace
import cytotrace

results = cytotrace.cytotrace(adata.X.toarray(), 
                               gene_names=adata.var_names.tolist())

adata.obs['CytoTRACE'] = results['CytoTRACE']
# Score: 1 = most stem-like (highest gene count diversity)
#        0 = most differentiated

sc.pl.umap(adata, color='CytoTRACE', cmap='RdYlBu_r')

# Use CytoTRACE to validate pseudotime direction:
# Pseudotime should correlate negatively with CytoTRACE
# (higher pseudotime = more differentiated = lower CytoTRACE)
from scipy.stats import spearmanr
r, p = spearmanr(adata.obs['dpt_pseudotime'], adata.obs['CytoTRACE'])
print(f"Spearman r={r:.3f}, p={p:.2e}")
# Expect r < -0.5 if pseudotime direction is correct

Download:

Thumb Rule - Trajectory Validation: Never trust a single trajectory method. Use at least 2 of: (1) Diffusion pseudotime, (2) RNA velocity direction, (3) CytoTRACE stemness score, (4) Known biology (marker gene ordering). If all agree, you have a robust trajectory.

AI Prompt Guide

Integration and Trajectory Pipeline

I have [NUMBER] 10x scRNA-seq samples across [NUMBER] batches/conditions. Expected biology: [DESCRIBE differentiation system, e.g., hematopoiesis]. Known progenitor markers: [GENE LIST]. Known terminal state markers: [GENE LIST]. Please write a complete Python (Scanpy) pipeline that: 1. Integrates samples using both Harmony and scVI (compare) 2. Benchmarks integration quality with scIB metrics 3. Runs PAGA to determine trajectory topology 4. Computes diffusion pseudotime with the correct root 5. Runs RNA velocity (scVelo dynamical mode) with velocity confidence 6. Computes CytoTRACE scores and validates pseudotime direction 7. Identifies genes that change along the trajectory (Moran's I test) 8. Computes fate probabilities at branch points (Palantir) 9. Produces publication-quality figures for each analysis 10. Includes interpretation guidelines

⚠ Troubleshooting

Harmony: Cells still cluster by batch after integration

Increase iterations (max_iter_harmony=30) or increase theta (batch diversity penalty). If still failing: the batch effect may be too severe for Harmony. Try scVI or check if the batches truly contain the same cell types.

scVI: Training loss not decreasing

Check: (1) Input must be RAW counts (not normalized), (2) Reduce learning rate, (3) Increase n_latent to 30-50, (4) Filter out very low-quality cells first. Also ensure no NaN/Inf values in the count matrix.

scVelo: Velocity arrows point in random directions

Low-quality velocity signal. Causes: (1) Very few unspliced reads (check with scv.pl.proportions(adata) -- need >10% unspliced), (2) Cells are in steady state (no dynamic process), (3) 3' chemistry misses intronic reads. Try: dynamical mode, increase n_top_genes, or accept that velocity is uninformative for this dataset.

PAGA: All clusters connected with strong edges

Resolution too low (everything is one big connected component). Increase Leiden resolution before running PAGA, or increase the PAGA threshold for plotting. Alternatively, the biology is genuinely continuous (no discrete boundaries).

Mnemonics and Thumb Rules

Mnemonic

"PAGA before PseudoTime"

Always run PAGA first to understand the global topology of your data (which clusters connect to which). Then compute pseudotime along the path PAGA reveals. Pseudotime without knowing the topology is meaningless.

Mnemonic

"Integrate to Mix batches, not to Merge biology"

The goal of integration is to align the same cell types across batches, NOT to force different cell types together. Over-integration is as bad as no integration. Always check that distinct cell types remain distinct after integration.

Thumb Rule - Velocity Confidence: Only interpret velocity arrows in regions with confidence > 0.5. Low confidence regions are inherently noisy. Focus velocity interpretation on the strongest, most coherent streams.

Thumb Rule - Dynamical vs Stochastic scVelo: Always use dynamical mode for publication. Stochastic mode is fine for quick exploration but assumes a steady-state that is often violated. Dynamical mode learns gene-specific kinetic parameters and provides latent time.

SC Module 1: Core Analysis SC Module 3: GRN and Multi-modal

SC Module 2 - Integration, Trajectory and Velocity

On this page

3.1 - Understanding Batch Effects

3.2 - Integration Methods

3.3 - Benchmarking Integration Quality

4.1 - Trajectory Inference Concepts

Choosing a Trajectory Method

4.2 - Diffusion Pseudotime and PAGA

4.3 - Monocle3 and Palantir

4.4 - RNA Velocity

4.5 - Cell Fate Prediction and CytoTRACE

AI Prompt Guide

Integration and Trajectory Pipeline

⚠ Troubleshooting

Harmony: Cells still cluster by batch after integration

scVI: Training loss not decreasing

scVelo: Velocity arrows point in random directions

PAGA: All clusters connected with strong edges

Mnemonics and Thumb Rules