SC Module 3 - GRN, Multi-modal and Perturbation

Gene Regulatory Networks - CITE-seq Protein - scATAC-seq Chromatin - Multiome - Perturbation Screens - Spatial Transcriptomics

~200 min Advanced Python / R
Byte

5.1 - Gene Regulatory Network Inference with SCENIC

SCENIC infers transcription factor (TF) activity and their target gene regulons from scRNA-seq data. It combines co-expression analysis with TF motif enrichment to identify genuine regulatory relationships.

Bash - pySCENIC Command Line
# pySCENIC three-step pipeline (command line, recommended for HPC)

# Step 1: GRN inference with GRNBoost2
pyscenic grn expr_matrix.loom allTFs_hg38.txt \
  -o adjacencies.tsv --num_workers 16

# Step 2: cisTarget - prune by TF motif enrichment
pyscenic ctx adjacencies.tsv \
  hg38_10kbp_up_10kbp_down_full_tx_v10.genes_vs_motifs.rankings.feather \
  hg38_500bp_up_100bp_down_full_tx_v10.genes_vs_motifs.rankings.feather \
  --annotations_fname motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl \
  --expression_mtx_fname expr_matrix.loom \
  --output regulons.csv --num_workers 16

# Step 3: AUCell - score regulon activity per cell
pyscenic aucell expr_matrix.loom regulons.csv \
  --output scenic_output.loom --num_workers 16

# Download cisTarget databases:
# https://resources.aertslab.org/cistarget/
# Human: hg38, Mouse: mm10
Download:
Python - Loading SCENIC Results into Scanpy
import loompy, pandas as pd, scanpy as sc

# Load regulon AUC scores
lf = loompy.connect("scenic_output.loom")
auc_mtx = pd.DataFrame(lf.ca["RegulonsAUC"],
                         index=lf.ca["CellID"])
lf.close()

# Add to existing adata
adata.obsm['X_regulon_auc'] = auc_mtx.loc[adata.obs_names].values

# Visualize regulon activity per cell type
sc.pl.matrixplot(adata, var_names=['GATA1(+)','SPI1(+)','PAX5(+)',
                  'CEBPA(+)','TCF7(+)','FOXP3(+)'],
                  groupby='celltype', standard_scale='var',
                  cmap='RdBu_r')

# Regulon specificity score (RSS)
# Identifies which regulons are most cell-type-specific
from pyscenic.plotting import plot_rss
rss = pd.DataFrame(columns=adata.obs['celltype'].unique())
for ct in rss.columns:
    mask = adata.obs['celltype'] == ct
    rss[ct] = auc_mtx[mask].mean(axis=0)
# Top regulons per cell type = cell identity TFs
Download:
Thumb Rule - SCENIC Validation: A regulon is trustworthy if: (1) The TF is expressed in the expected cell type, (2) Known target genes are in the regulon, (3) Activity matches published biology. SCENIC finds real biology but also many spurious regulons - always validate top hits experimentally.

5.2 - CellOracle: In Silico Perturbation of GRNs

Python - CellOracle
import celloracle as co
import scanpy as sc

# CellOracle: build GRN + simulate TF knockouts in silico
# Step 1: Load base GRN from scATAC or motif scanning
base_GRN = co.data.load_mouse_scATAC_atlas_base_GRN()

# Step 2: Set up Oracle object
oracle = co.Oracle()
oracle.import_anndata_as_raw_count(
    adata=adata, cluster_column_name="celltype",
    embedding_name="X_umap"
)
oracle.import_TF_data(TF_info_matrix=base_GRN)
oracle.perform_PCA()
oracle.knn_imputation(k=30)

# Step 3: Fit GRN per cluster
links = oracle.get_links(
    cluster_name_for_GRN_unit="celltype",
    alpha=10, verbose_level=10
)
links.filter_links(p=0.001, weight="coef_abs", threshold=0.1)

# Step 4: Simulate TF knockout
oracle.simulate_shift(
    perturb_condition={"Gata1": 0.0},  # set GATA1 expression to 0
    n_propagation=3
)
# Visualize predicted cell state shift
oracle.estimate_transition_prob(n_neighbors=200, knn_random=True)
oracle.calculate_embedding_shift(sigma_corr=0.05)

fig, ax = oracle.plot_simulation_result(
    scale=0.5, show_background=True
)
# Arrows show predicted direction cells would move after KO
# Validates: does GATA1 KO push HSCs away from erythroid fate?
Download:
Mnemonic

"SCENIC finds who regulates, CellOracle predicts what happens if you perturb"

SCENIC identifies TF regulons (the network). CellOracle uses that network to simulate what happens when you knock out or overexpress a TF. Together they provide discovery + prediction.

6.1 - CITE-seq and Weighted Nearest Neighbor (WNN)

R - Seurat WNN for CITE-seq
library(Seurat)

# CITE-seq: RNA + surface protein (ADT) from same cell
# Load 10x multi-modal data
obj <- Read10X("filtered_feature_bc_matrix/")
# Creates: obj$`Gene Expression` and obj$`Antibody Capture`

obj <- CreateSeuratObject(counts=obj$`Gene Expression`)
obj[["ADT"]] <- CreateAssay5Object(counts=obj$`Antibody Capture`)

# Process RNA
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

# Process protein (ADT) - CLR normalization
obj <- NormalizeData(obj, assay="ADT", normalization.method="CLR",
                      margin=2)
obj <- ScaleData(obj, assay="ADT")
obj <- RunPCA(obj, assay="ADT", reduction.name="apca",
               features=rownames(obj[["ADT"]]))

# Weighted Nearest Neighbor (WNN) - combines both modalities
obj <- FindMultiModalNeighbors(obj,
  reduction.list=list("pca","apca"),
  dims.list=list(1:30, 1:18),
  modality.weight.name="RNA.weight"
)

# Cluster and visualize using WNN graph
obj <- RunUMAP(obj, nn.name="weighted.nn",
                reduction.name="wnn.umap")
obj <- FindClusters(obj, graph.name="wsnn", resolution=0.8)

# Compare modalities
p1 <- DimPlot(obj, reduction="umap", label=TRUE) + ggtitle("RNA")
p2 <- DimPlot(obj, reduction="wnn.umap", label=TRUE) + ggtitle("WNN")
p1 + p2

# Protein markers often resolve cell types that RNA cannot
# e.g., CD4 vs CD8 T cells, naive vs memory
FeaturePlot(obj, features=c("adt_CD4","adt_CD8","adt_CD45RA"),
            reduction="wnn.umap")
Download:

6.2 - scATAC-seq Analysis

R - ArchR (Recommended for scATAC)
library(ArchR)
addArchRGenome("hg38")
addArchRThreads(threads=16)

# Create Arrow files from fragments
ArrowFiles <- createArrowFiles(
  inputFiles="fragments.tsv.gz",
  sampleNames="sample1",
  minTSS=4, minFrags=1000, addTileMat=TRUE, addGeneScoreMat=TRUE
)

# Create ArchR project
proj <- ArchRProject(ArrowFiles=ArrowFiles, outputDirectory="ArchR_out")

# Doublet removal
proj <- addDoubletScores(proj)
proj <- filterDoublets(proj)

# Dimensionality reduction (LSI - Latent Semantic Indexing)
proj <- addIterativeLSI(proj, useMatrix="TileMatrix",
                          name="IterativeLSI", iterations=2,
                          clusterParams=list(resolution=0.2))

# Clustering
proj <- addClusters(proj, reducedDims="IterativeLSI", resolution=0.8)
proj <- addUMAP(proj, reducedDims="IterativeLSI")
plotEmbedding(proj, colorBy="cellColData", name="Clusters")

# Gene scores (infer gene expression from chromatin accessibility)
proj <- addGeneScoreMatrix(proj)

# Peak calling per cluster (MACS2)
proj <- addGroupCoverages(proj, groupBy="Clusters")
proj <- addReproduciblePeakSet(proj, groupBy="Clusters",
                                 pathToMacs2="/path/to/macs2")
proj <- addPeakMatrix(proj)

# Motif enrichment in differential peaks
proj <- addMotifAnnotations(proj, motifSet="cisbp")
enrichMotifs <- peakAnnoEnrichment(
  seMarker=markersPeaks, ArchRProj=proj,
  peakAnnotation="Motif", cutOff="FDR <= 0.01"
)

# ChromVAR: TF motif deviation scores per cell
proj <- addBgdPeaks(proj)
proj <- addDeviationsMatrix(proj, peakAnnotation="Motif")
plotVarDev <- getVarDeviations(proj, name="MotifMatrix")
plotVarDev  # top variable TF motifs
Download:
scATAC Tools Comparison
  • ArchR (R): Most complete, handles large datasets, includes peak calling, motifs, integration. Recommended.
  • Signac (R, Seurat ecosystem): Good if already using Seurat. Requires Cell Ranger ATAC output.
  • SnapATAC2 (Python): Fast, scalable, integrates with Scanpy. Best for Python users.
  • EpiScanpy (Python): Lightweight, extends Scanpy for epigenomics.

6.3 - 10x Multiome (RNA + ATAC from Same Cell)

R - Seurat/Signac Multiome
# Process multiome data: joint RNA + ATAC per cell
library(Seurat)
library(Signac)

# Load Cell Ranger ARC output
counts <- Read10X_h5("filtered_feature_bc_matrix.h5")
fragpath <- "atac_fragments.tsv.gz"

# Create multimodal Seurat object
obj <- CreateSeuratObject(counts=counts$`Gene Expression`, assay="RNA")
obj[["ATAC"]] <- CreateChromatinAssay(
  counts=counts$Peaks, sep=c(":","-"),
  fragments=fragpath, annotation=annotation
)

# Process RNA
obj <- SCTransform(obj, assay="RNA")
obj <- RunPCA(obj)

# Process ATAC
DefaultAssay(obj) <- "ATAC"
obj <- FindTopFeatures(obj, min.cutoff=5)
obj <- RunTFIDF(obj)
obj <- RunSVD(obj)

# WNN integration of both modalities
obj <- FindMultiModalNeighbors(obj,
  reduction.list=list("pca","lsi"),
  dims.list=list(1:30, 2:30)  # skip LSI component 1 (depth)
)
obj <- RunUMAP(obj, nn.name="weighted.nn", reduction.name="wnn.umap")

# Link peaks to genes (key multiome advantage!)
obj <- LinkPeaks(obj, peak.assay="ATAC", expression.assay="SCT")
# Identifies which ATAC peaks regulate which genes
# Based on correlation of peak accessibility with gene expression
Thumb Rule - Multiome LSI Component 1: Always exclude the first LSI component when analyzing scATAC data. It correlates with sequencing depth (technical), not biology. Use components 2-30 for neighbors/UMAP.

7.1 - Perturbation Screens (Perturb-seq / CROP-seq)

Python - Perturbation Analysis
import scanpy as sc
import pertpy as pt

# Perturb-seq: CRISPR perturbations + scRNA-seq readout
# Each cell has a guide RNA barcode identifying which gene was targeted

# Load data with guide assignments
adata = sc.read_h5ad("perturb_seq.h5ad")
# adata.obs['guide_identity'] = assigned guide RNA
# adata.obs['gene_target'] = target gene

# Mixscape (Seurat/pertpy): classify perturbed vs escaped cells
ms = pt.tl.Mixscape()
ms.perturbation_signature(adata, pert_key="gene_target",
                            control="non-targeting")
ms.mixscape(adata, pert_key="gene_target", control="non-targeting",
             labels="celltype")
# Cells classified as: perturbed, non-perturbed (KO escaped), NT control

# Differential expression: each KO vs non-targeting controls
de_results = {}
for gene in adata.obs['gene_target'].unique():
    if gene == 'non-targeting':
        continue
    mask_ko = adata.obs['gene_target'] == gene
    mask_nt = adata.obs['gene_target'] == 'non-targeting'
    sub = adata[mask_ko | mask_nt].copy()
    sc.tl.rank_genes_groups(sub, groupby='gene_target',
                             reference='non-targeting', method='wilcoxon')
    de_results[gene] = sc.get.rank_genes_groups_df(sub, group=gene)

# MILO: differential abundance testing
# Tests which cell states are enriched/depleted after perturbation
milo = pt.tl.Milo()
milo.load(adata)
milo.make_nhoods(adata)
milo.count_nhoods(adata, sample_col="replicate")
milo.da_nhoods(adata, design="~perturbation")
Download:
Mnemonic

"Mixscape separates Hits from Misses"

Not all cells with a guide RNA are successfully perturbed. Mixscape classifies cells as truly perturbed (changed transcriptome) vs escaped (guide present but no effect). Always run Mixscape before DE analysis to avoid diluting your signal with unperturbed cells.

7.2 - Spatial Transcriptomics Integration

Python - Squidpy and Cell2location
import squidpy as sq
import scanpy as sc

# Load Visium spatial data
adata_vis = sc.read_visium("visium_output/")
adata_vis.var_names_make_unique()

# Standard preprocessing
sc.pp.normalize_total(adata_vis, target_sum=1e4)
sc.pp.log1p(adata_vis)
sc.pp.highly_variable_genes(adata_vis, n_top_genes=2000)
sc.tl.pca(adata_vis)
sc.pp.neighbors(adata_vis)
sc.tl.umap(adata_vis)
sc.tl.leiden(adata_vis)

# Spatial visualization
sq.pl.spatial_scatter(adata_vis, color="leiden", shape=None, size=1.3)

# Spatial neighborhood analysis
sq.gr.spatial_neighbors(adata_vis)
sq.gr.nhood_enrichment(adata_vis, cluster_key="leiden")
sq.pl.nhood_enrichment(adata_vis, cluster_key="leiden")

# Spatially variable genes
sq.gr.spatial_autocorr(adata_vis, mode="moran")
# High Moran's I = spatially structured expression

# Cell2location: deconvolve spots using scRNA-seq reference
import cell2location as c2l

# Train reference model on scRNA-seq
c2l.models.RegressionModel.setup_anndata(
    adata_sc, batch_key="sample", labels_key="celltype"
)
mod_ref = c2l.models.RegressionModel(adata_sc)
mod_ref.train(max_epochs=250)

# Map to spatial data
c2l.models.Cell2location.setup_anndata(adata_vis)
mod_sp = c2l.models.Cell2location(adata_vis,
    cell_state_df=mod_ref.export_posterior_adata().X)
mod_sp.train(max_epochs=30000)
adata_vis = mod_sp.export_posterior(adata_vis)
# Now each spot has estimated cell type abundances
Download:
PlatformResolutionGenesTool
Visium (10x)55um spots (~1-10 cells)Whole transcriptomeSquidpy, Scanpy, Seurat
Visium HD2um bins (subcellular)Whole transcriptomeSquidpy, Seurat
MERFISHSingle-molecule100-10,000 genesSquidpy, Baysor
Slide-seq10um beadsWhole transcriptomeSquidpy, RCTD
Stereo-seqSubcellularWhole transcriptomeSTOmics, Squidpy

AI Prompt Guide

Multi-modal and GRN Analysis
I have [MODALITY: CITE-seq / Multiome RNA+ATAC / Perturb-seq / Visium spatial] data from [TISSUE/SYSTEM]. Species: [SPECIES]. Expected cell types: [LIST]. Please write a complete analysis pipeline that: 1. Processes each modality with appropriate methods 2. Integrates modalities (WNN for CITE-seq, joint for multiome) 3. Runs SCENIC/pySCENIC for GRN inference 4. [If Perturb-seq]: Classifies perturbed vs escaped cells with Mixscape 5. [If spatial]: Deconvolves spots with Cell2location using scRNA-seq reference 6. Identifies cell-type-specific regulatory programs 7. Performs in silico perturbation with CellOracle 8. Produces publication-quality figures 9. Includes interpretation guidelines for each analysis

Troubleshooting

pySCENIC: cisTarget step takes forever or crashes

The feather ranking databases are very large (5-10 GB). Solutions: (1) Use the command-line version (not Python API) for memory efficiency, (2) Ensure enough RAM (64 GB minimum), (3) Use the smaller 500bp database alone for initial exploration, (4) Subset to fewer cells (5000-10000) for GRNBoost2 step.

ArchR: "No peaks called for cluster X"

Too few cells in that cluster for MACS2 peak calling. Solutions: (1) Merge small clusters before calling peaks, (2) Lower MACS2 q-value threshold, (3) Increase coverage by combining replicates. Minimum ~200 cells per group for reliable peak calling.

Cell2location: Deconvolution gives uniform cell type proportions

The reference scRNA-seq may not match the spatial tissue well. Check: (1) Reference contains all cell types present in the tissue, (2) Reference and spatial data are from the same species/tissue, (3) Training converged (check ELBO loss). Try increasing max_epochs or filtering lowly-expressed genes.

Perturb-seq: Most cells classified as non-perturbed by Mixscape

Low perturbation efficiency. Causes: (1) Weak guide RNAs (low KO efficiency), (2) Gene is not expressed in this cell type (nothing to knock out), (3) Redundant pathways compensate. Check guide efficiency with bulk validation. Focus analysis on the subset of confirmed-perturbed cells.

Mnemonics and Thumb Rules

Mnemonic

"GRNBoost finds links, cisTarget Trims, AUCell scores"

The SCENIC pipeline in three words: Find (co-expression links), Trim (keep only those with TF motif support), Score (regulon activity per cell). Each step reduces false positives.

Mnemonic

"LSI component 1 = Library Size, Ignore it"

In scATAC-seq, the first LSI component always correlates with sequencing depth. Always use components 2-30 for downstream analysis (neighbors, UMAP, clustering). This is unique to ATAC - in RNA PCA, component 1 is usually biological.

Thumb Rule - Multi-modal Priority: When RNA and protein (CITE-seq) disagree on cell identity, trust protein for surface markers (direct measurement) but trust RNA for transcription factors and intracellular processes. WNN combines both optimally.
Thumb Rule - Spatial Resolution: Visium spots contain 1-10 cells - always deconvolve. MERFISH/Xenium are single-cell resolution - deconvolution not needed but cell segmentation is critical. For subcellular analysis (Visium HD, Stereo-seq), use bin sizes appropriate to your question.