SC Module 2 - Integration, Trajectory and Velocity

Batch Effect Correction - Integration Methods - Trajectory Inference - Pseudotime - PAGA - RNA Velocity - Cell Fate

~180 min Advanced Python / R
Byte

3.1 - Understanding Batch Effects

Batch effects are systematic technical differences between samples processed at different times, operators, or lanes. They can dominate biological signal on UMAP and in clustering.

SourceEffectDiagnosis
Sample preparationDissociation protocol, enzyme lotSample-specific clusters on UMAP
SequencingLane effects, depth differencesSamples cluster by lane
Chemistry version10x v2 vs v3 sensitivitySystematic gene count shifts
Donor/individualGenetic backgroundConfounded with condition if not designed well
Python - Diagnosing Batch Effects
import scanpy as sc

# Load multiple samples and concatenate
adatas = []
for sample in ['sample1','sample2','sample3','sample4']:
    a = sc.read_10x_h5(f'{sample}/filtered_feature_bc_matrix.h5')
    a.obs['sample'] = sample
    a.obs['condition'] = 'ctrl' if 'ctrl' in sample else 'treat'
    adatas.append(a)

adata = sc.concat(adatas, label='sample', keys=['s1','s2','s3','s4'])
adata.obs_names_make_unique()

# Standard preprocessing (without integration)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key='sample')
adata = adata[:, adata.var['highly_variable']].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.umap(adata)

# Diagnosis: color by sample vs cell type
sc.pl.umap(adata, color=['sample','condition'], ncols=2)
# If cells cluster by sample rather than cell type = batch effect present

# LISI score (Local Inverse Simpson Index):
# Measures mixing of batches in local neighborhoods
# iLISI near n_batches = well mixed, iLISI near 1 = poorly mixed
Download:

3.2 - Integration Methods

MethodApproachSpeedBest For
HarmonyIterative PCA correctionVery fastSimple batch effects, large datasets
scVIVariational autoencoder (deep learning)ModerateComplex batches, preserves biology best
scANVIscVI + cell type labelsModerateWhen reference labels available
BBKNNBatch-balanced kNN graphFastLightweight, graph-level only
ScanoramaPanoramic stitching analogFastMany batches, memory efficient
Seurat CCA/RPCACanonical correlation + anchorsModerateR/Seurat ecosystems
Python - Harmony Integration
# Harmony: fast, works directly on PCA embeddings
import harmonypy as hm

# Run PCA first (on log-normalized, scaled data)
sc.tl.pca(adata, n_comps=50)

# Harmony correction
ho = hm.run_harmony(adata.obsm['X_pca'], adata.obs, 'sample',
                     max_iter_harmony=20)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T

# Use corrected PCA for neighbors/UMAP/clustering
sc.pp.neighbors(adata, use_rep='X_pca_harmony', n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.8)
sc.pl.umap(adata, color=['sample','leiden'])
Download:
Python - scVI Integration (Recommended)
import scvi

# scVI: deep generative model for single-cell data
# Learns a latent space that separates biology from batch effects
# Does NOT modify the count matrix -- produces a latent embedding

# Setup: use RAW counts (not normalized)
adata_raw = adata.copy()
adata_raw.X = adata_raw.layers['counts']

# Register the data
scvi.model.SCVI.setup_anndata(adata_raw, layer='counts',
                                batch_key='sample',
                                continuous_covariate_keys=['pct_counts_mt'])

# Train model
model = scvi.model.SCVI(adata_raw, n_latent=30, n_layers=2)
model.train(max_epochs=400, early_stopping=True)

# Get latent representation
adata.obsm['X_scVI'] = model.get_latent_representation()

# Use for clustering/UMAP
sc.pp.neighbors(adata, use_rep='X_scVI')
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.8)

# scVI also provides:
# - Differential expression (proper Bayesian framework)
# - Imputed/denoised expression values
de_results = model.differential_expression(
    groupby='celltype', group1='T_cells', group2='B_cells'
)

# scANVI: semi-supervised -- use known labels to guide integration
scvi.model.SCANVI.setup_anndata(adata_raw, layer='counts',
                                  batch_key='sample',
                                  labels_key='celltypist')
scanvi_model = scvi.model.SCANVI.from_scvi_model(model,
                                                    unlabeled_category='Unknown')
scanvi_model.train(max_epochs=200)
adata.obsm['X_scANVI'] = scanvi_model.get_latent_representation()
adata.obs['scANVI_pred'] = scanvi_model.predict()
Download:
R - Seurat v5 Integration
# Seurat v5: CCA + RPCA integration
library(Seurat)

obj[["RNA"]] <- split(obj[["RNA"]], f=obj$sample)

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

# Integrate with RPCA (faster, recommended for large datasets)
obj <- IntegrateLayers(object=obj, method=RPCAIntegration,
                        orig.reduction="pca",
                        new.reduction="integrated.rpca")

# Or CCA (classic, better for very different batches)
obj <- IntegrateLayers(object=obj, method=CCAIntegration,
                        orig.reduction="pca",
                        new.reduction="integrated.cca")

# Continue with integrated reduction
obj <- FindNeighbors(obj, reduction="integrated.rpca", dims=1:30)
obj <- FindClusters(obj, resolution=0.8)
obj <- RunUMAP(obj, reduction="integrated.rpca", dims=1:30)
Mnemonic

"Harmony for Haste, scVI for Value"

Harmony is fastest and works well for simple batches. scVI is slower but gives the best biological preservation and includes built-in DE testing. Start with Harmony; upgrade to scVI if results are unsatisfactory or for final publication analysis.

3.3 - Benchmarking Integration Quality

Python - scIB Benchmarking
import scib

# scIB: single-cell Integration Benchmarking
# Measures both batch correction AND biological conservation

# Batch removal metrics:
# - iLISI: batch mixing in local neighborhoods
# - kBET: k-nearest-neighbor batch effect test
# - Graph connectivity: are same cell types connected across batches?

# Bio conservation metrics:
# - cLISI: cell type separation preserved
# - NMI/ARI: clustering agrees with known labels
# - Silhouette score: cluster compactness

# Run comprehensive benchmark
results = scib.metrics.metrics(
    adata,
    adata_int=adata,  # integrated version
    batch_key='sample',
    label_key='celltype',
    embed='X_scVI',
    cluster_key='leiden',
    type_='embed',
    organism='human'
)
print(results)

# Compare multiple methods:
# Run Harmony, scVI, BBKNN, Scanorama on same data
# Compute metrics for each
# Choose method with best balance of batch removal + bio conservation
Download:
Thumb Rule - Integration Assessment: A good integration (1) mixes batches within each cell type, (2) does NOT merge distinct cell types, (3) preserves known marker gene expression. Always check: can you still identify expected cell types with known markers after integration? If markers are lost, the method over-corrected.

4.1 - Trajectory Inference Concepts

Trajectory inference recovers the dynamic processes (differentiation, activation, cell cycle) from the static snapshot of scRNA-seq data. Cells are ordered along a continuous path from progenitor to mature states.

MethodTopologyInputStrengths
Diffusion pseudotimeLinear/branchingkNN graphRobust, foundation for PAGA
PAGAGraph abstractionkNN graphReveals global connectivity of clusters
Monocle3Tree/graphReduced spacePrincipled, R-native, DE along trajectory
PalantirBranchingDiffusion mapsFate probabilities, entropy
CytoTRACEN/A (stemness score)Gene countsPredicts differentiation potential
SlingshotSmooth curvesReduced spaceR-native, clear branching
Choosing a Trajectory Method
  • Simple linear differentiation: Diffusion pseudotime
  • Complex topology (unknown): PAGA first to determine structure, then pseudotime
  • Branching with fate probabilities: Palantir
  • DE genes along trajectory: Monocle3 (graph_test)
  • Direction of differentiation: CytoTRACE (no trajectory needed)
  • Direction with velocity: RNA velocity (Section 4.4)

4.2 - Diffusion Pseudotime and PAGA

Python - Scanpy DPT and PAGA
# PAGA: Partition-based Graph Abstraction
# Reveals which clusters are connected (trajectory backbone)
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, threshold=0.05, edge_width_scale=0.5)
# Edges = significant connectivity between clusters
# Thickness = strength of connection

# Use PAGA to initialize UMAP (better global structure)
sc.tl.umap(adata, init_pos='paga')

# Diffusion pseudotime
# Step 1: compute diffusion map
sc.tl.diffmap(adata, n_comps=15)

# Step 2: set root cell (must know biology!)
# Find the root: the most stem-like cell in your progenitor cluster
root_cluster = 'HSC'  # example: hematopoietic stem cells
root_mask = adata.obs['celltype'] == root_cluster
adata.uns['iroot'] = int(
    adata[root_mask].obsm['X_diffmap'][:,0].argmax()
)

# Step 3: compute pseudotime
sc.tl.dpt(adata)
sc.pl.umap(adata, color='dpt_pseudotime', cmap='viridis')

# Genes that change along pseudotime
sc.pl.paga_path(adata, nodes=['HSC','CMP','GMP','Mono'],
                 keys=['CD34','MPO','CSF1R','CD14'])
Download:
Thumb Rule - Root Cell: You MUST specify the root cell for pseudotime. This requires biological knowledge. Look for: (1) known stem/progenitor markers, (2) CytoTRACE score (highest = most stem-like), (3) RNA velocity direction (converge toward root). Wrong root = meaningless pseudotime.

4.3 - Monocle3 and Palantir

R - Monocle3
library(monocle3)

# Convert Seurat to Monocle3
cds <- as.cell_data_set(seurat_obj)
cds <- preprocess_cds(cds, num_dim=30)
cds <- reduce_dimension(cds, reduction_method="UMAP")

# Cluster and learn trajectory graph
cds <- cluster_cells(cds, resolution=1e-3)
cds <- learn_graph(cds, use_partition=TRUE)

# Order cells (set root interactively or programmatically)
cds <- order_cells(cds, root_cells=root_cell_ids)
# Or interactively: order_cells(cds) will open a GUI

# Plot pseudotime
plot_cells(cds, color_cells_by="pseudotime",
           cell_size=0.5, label_branch_points=TRUE)

# Find genes that change along the trajectory
trajectory_genes <- graph_test(cds, neighbor_graph="principal_graph",
                                cores=8)
sig_genes <- subset(trajectory_genes, q_value < 0.05)
sig_genes <- sig_genes[order(sig_genes$morans_I, decreasing=TRUE),]

# Plot top trajectory genes
top_genes <- head(rownames(sig_genes), 6)
plot_genes_in_pseudotime(cds[top_genes,], min_expr=0.5)
Download:
Python - Palantir (Fate Probabilities)
# Palantir: computes fate probabilities at branch points
import palantir

# Run diffusion maps
dm_res = palantir.utils.run_diffusion_maps(adata, n_components=10)

# Set start cell
start_cell = 'ATCGATCG-1'  # barcode of a stem/progenitor cell

# Run Palantir
pr_res = palantir.core.run_palantir(
    adata, start_cell,
    num_waypoints=500,
    terminal_states=['Mono_terminal','Ery_terminal','Lymph_terminal']
)

# Outputs:
# pr_res.pseudotime: ordering along differentiation
# pr_res.entropy: high = multipotent, low = committed
# pr_res.branch_probs: probability of reaching each terminal state

# Visualize
palantir.plot.plot_palantir_results(adata, pr_res)
# Shows pseudotime, entropy, and fate probabilities on UMAP

4.4 - RNA Velocity

RNA velocity infers the future state of a cell by comparing spliced (mature) and unspliced (nascent) mRNA. If unspliced > expected: gene is being upregulated. If unspliced < expected: gene is being downregulated.

Python - scVelo
import scvelo as scv

# Load data with spliced/unspliced counts
# If using STARsolo with --soloFeatures Velocyto:
adata = scv.read('starsolo_output/Velocyto/raw/', cache=True)
# Or merge with loom file from velocyto:
loom = scv.read('sample.loom', cache=True)
adata = scv.utils.merge(adata, loom)

# Preprocessing
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# --- Stochastic model (fast, default) ---
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)

# Visualize velocity streamlines on UMAP
scv.pl.velocity_embedding_stream(adata, basis='umap',
                                   color='celltype', figsize=(10,8))

# --- Dynamical model (recommended, more accurate) ---
scv.tl.recover_dynamics(adata, n_jobs=8)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

# Latent time (better pseudotime from velocity)
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', cmap='viridis')

# Velocity confidence / coherence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(adata, color='velocity_confidence')
# High confidence = velocity is consistent in neighborhood
# Low confidence = noisy or transitional region

# Top velocity genes (drivers of state transitions)
scv.tl.rank_velocity_genes(adata, groupby='celltype')
df = scv.get_df(adata.uns['rank_velocity_genes']['names'])
print(df.head(10))
Download:
Byte warns
RNA Velocity Caveats (Important!):
  • Velocity assumes a simple splicing model (constant rates). This is often violated.
  • 3' scRNA-seq has limited unspliced read coverage (intronic reads). Smart-seq2 is better.
  • Velocity arrows on UMAP can be misleading because UMAP distorts distances.
  • Low velocity confidence means the direction is unreliable -- do not over-interpret.
  • UniTVelo and VeloVAE are newer methods that may be more robust.
Mnemonic

"Unspliced up = Upregulating"

RNA velocity logic: if a gene has more unspliced (pre-mRNA) than expected from its spliced level, the gene is being actively transcribed (upregulating). The cell is moving toward a state where that gene is more highly expressed.

4.5 - Cell Fate Prediction and CytoTRACE

Python - CytoTRACE
# CytoTRACE: predicts differentiation state from gene counts
# Principle: less differentiated cells express more genes
# No trajectory needed -- just a per-cell score

# Install: pip install cytotrace
import cytotrace

results = cytotrace.cytotrace(adata.X.toarray(), 
                               gene_names=adata.var_names.tolist())

adata.obs['CytoTRACE'] = results['CytoTRACE']
# Score: 1 = most stem-like (highest gene count diversity)
#        0 = most differentiated

sc.pl.umap(adata, color='CytoTRACE', cmap='RdYlBu_r')

# Use CytoTRACE to validate pseudotime direction:
# Pseudotime should correlate negatively with CytoTRACE
# (higher pseudotime = more differentiated = lower CytoTRACE)
from scipy.stats import spearmanr
r, p = spearmanr(adata.obs['dpt_pseudotime'], adata.obs['CytoTRACE'])
print(f"Spearman r={r:.3f}, p={p:.2e}")
# Expect r < -0.5 if pseudotime direction is correct
Download:
Thumb Rule - Trajectory Validation: Never trust a single trajectory method. Use at least 2 of: (1) Diffusion pseudotime, (2) RNA velocity direction, (3) CytoTRACE stemness score, (4) Known biology (marker gene ordering). If all agree, you have a robust trajectory.

AI Prompt Guide

Integration and Trajectory Pipeline
I have [NUMBER] 10x scRNA-seq samples across [NUMBER] batches/conditions. Expected biology: [DESCRIBE differentiation system, e.g., hematopoiesis]. Known progenitor markers: [GENE LIST]. Known terminal state markers: [GENE LIST]. Please write a complete Python (Scanpy) pipeline that: 1. Integrates samples using both Harmony and scVI (compare) 2. Benchmarks integration quality with scIB metrics 3. Runs PAGA to determine trajectory topology 4. Computes diffusion pseudotime with the correct root 5. Runs RNA velocity (scVelo dynamical mode) with velocity confidence 6. Computes CytoTRACE scores and validates pseudotime direction 7. Identifies genes that change along the trajectory (Moran's I test) 8. Computes fate probabilities at branch points (Palantir) 9. Produces publication-quality figures for each analysis 10. Includes interpretation guidelines

Troubleshooting

Harmony: Cells still cluster by batch after integration

Increase iterations (max_iter_harmony=30) or increase theta (batch diversity penalty). If still failing: the batch effect may be too severe for Harmony. Try scVI or check if the batches truly contain the same cell types.

scVI: Training loss not decreasing

Check: (1) Input must be RAW counts (not normalized), (2) Reduce learning rate, (3) Increase n_latent to 30-50, (4) Filter out very low-quality cells first. Also ensure no NaN/Inf values in the count matrix.

scVelo: Velocity arrows point in random directions

Low-quality velocity signal. Causes: (1) Very few unspliced reads (check with scv.pl.proportions(adata) -- need >10% unspliced), (2) Cells are in steady state (no dynamic process), (3) 3' chemistry misses intronic reads. Try: dynamical mode, increase n_top_genes, or accept that velocity is uninformative for this dataset.

PAGA: All clusters connected with strong edges

Resolution too low (everything is one big connected component). Increase Leiden resolution before running PAGA, or increase the PAGA threshold for plotting. Alternatively, the biology is genuinely continuous (no discrete boundaries).

Mnemonics and Thumb Rules

Mnemonic

"PAGA before PseudoTime"

Always run PAGA first to understand the global topology of your data (which clusters connect to which). Then compute pseudotime along the path PAGA reveals. Pseudotime without knowing the topology is meaningless.

Mnemonic

"Integrate to Mix batches, not to Merge biology"

The goal of integration is to align the same cell types across batches, NOT to force different cell types together. Over-integration is as bad as no integration. Always check that distinct cell types remain distinct after integration.

Thumb Rule - Velocity Confidence: Only interpret velocity arrows in regions with confidence > 0.5. Low confidence regions are inherently noisy. Focus velocity interpretation on the strongest, most coherent streams.
Thumb Rule - Dynamical vs Stochastic scVelo: Always use dynamical mode for publication. Stochastic mode is fine for quick exploration but assumes a steady-state that is often violated. Dynamical mode learns gene-specific kinetic parameters and provides latent time.