Gene Regulatory Networks - CITE-seq Protein - scATAC-seq Chromatin - Multiome - Perturbation Screens - Spatial Transcriptomics
SCENIC infers transcription factor (TF) activity and their target gene regulons from scRNA-seq data. It combines co-expression analysis with TF motif enrichment to identify genuine regulatory relationships.
# pySCENIC three-step pipeline (command line, recommended for HPC) # Step 1: GRN inference with GRNBoost2 pyscenic grn expr_matrix.loom allTFs_hg38.txt \ -o adjacencies.tsv --num_workers 16 # Step 2: cisTarget - prune by TF motif enrichment pyscenic ctx adjacencies.tsv \ hg38_10kbp_up_10kbp_down_full_tx_v10.genes_vs_motifs.rankings.feather \ hg38_500bp_up_100bp_down_full_tx_v10.genes_vs_motifs.rankings.feather \ --annotations_fname motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl \ --expression_mtx_fname expr_matrix.loom \ --output regulons.csv --num_workers 16 # Step 3: AUCell - score regulon activity per cell pyscenic aucell expr_matrix.loom regulons.csv \ --output scenic_output.loom --num_workers 16 # Download cisTarget databases: # https://resources.aertslab.org/cistarget/ # Human: hg38, Mouse: mm10
import loompy, pandas as pd, scanpy as sc
# Load regulon AUC scores
lf = loompy.connect("scenic_output.loom")
auc_mtx = pd.DataFrame(lf.ca["RegulonsAUC"],
index=lf.ca["CellID"])
lf.close()
# Add to existing adata
adata.obsm['X_regulon_auc'] = auc_mtx.loc[adata.obs_names].values
# Visualize regulon activity per cell type
sc.pl.matrixplot(adata, var_names=['GATA1(+)','SPI1(+)','PAX5(+)',
'CEBPA(+)','TCF7(+)','FOXP3(+)'],
groupby='celltype', standard_scale='var',
cmap='RdBu_r')
# Regulon specificity score (RSS)
# Identifies which regulons are most cell-type-specific
from pyscenic.plotting import plot_rss
rss = pd.DataFrame(columns=adata.obs['celltype'].unique())
for ct in rss.columns:
mask = adata.obs['celltype'] == ct
rss[ct] = auc_mtx[mask].mean(axis=0)
# Top regulons per cell type = cell identity TFs
import celloracle as co
import scanpy as sc
# CellOracle: build GRN + simulate TF knockouts in silico
# Step 1: Load base GRN from scATAC or motif scanning
base_GRN = co.data.load_mouse_scATAC_atlas_base_GRN()
# Step 2: Set up Oracle object
oracle = co.Oracle()
oracle.import_anndata_as_raw_count(
adata=adata, cluster_column_name="celltype",
embedding_name="X_umap"
)
oracle.import_TF_data(TF_info_matrix=base_GRN)
oracle.perform_PCA()
oracle.knn_imputation(k=30)
# Step 3: Fit GRN per cluster
links = oracle.get_links(
cluster_name_for_GRN_unit="celltype",
alpha=10, verbose_level=10
)
links.filter_links(p=0.001, weight="coef_abs", threshold=0.1)
# Step 4: Simulate TF knockout
oracle.simulate_shift(
perturb_condition={"Gata1": 0.0}, # set GATA1 expression to 0
n_propagation=3
)
# Visualize predicted cell state shift
oracle.estimate_transition_prob(n_neighbors=200, knn_random=True)
oracle.calculate_embedding_shift(sigma_corr=0.05)
fig, ax = oracle.plot_simulation_result(
scale=0.5, show_background=True
)
# Arrows show predicted direction cells would move after KO
# Validates: does GATA1 KO push HSCs away from erythroid fate?
"SCENIC finds who regulates, CellOracle predicts what happens if you perturb"
SCENIC identifies TF regulons (the network). CellOracle uses that network to simulate what happens when you knock out or overexpress a TF. Together they provide discovery + prediction.
library(Seurat)
# CITE-seq: RNA + surface protein (ADT) from same cell
# Load 10x multi-modal data
obj <- Read10X("filtered_feature_bc_matrix/")
# Creates: obj$`Gene Expression` and obj$`Antibody Capture`
obj <- CreateSeuratObject(counts=obj$`Gene Expression`)
obj[["ADT"]] <- CreateAssay5Object(counts=obj$`Antibody Capture`)
# Process RNA
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)
# Process protein (ADT) - CLR normalization
obj <- NormalizeData(obj, assay="ADT", normalization.method="CLR",
margin=2)
obj <- ScaleData(obj, assay="ADT")
obj <- RunPCA(obj, assay="ADT", reduction.name="apca",
features=rownames(obj[["ADT"]]))
# Weighted Nearest Neighbor (WNN) - combines both modalities
obj <- FindMultiModalNeighbors(obj,
reduction.list=list("pca","apca"),
dims.list=list(1:30, 1:18),
modality.weight.name="RNA.weight"
)
# Cluster and visualize using WNN graph
obj <- RunUMAP(obj, nn.name="weighted.nn",
reduction.name="wnn.umap")
obj <- FindClusters(obj, graph.name="wsnn", resolution=0.8)
# Compare modalities
p1 <- DimPlot(obj, reduction="umap", label=TRUE) + ggtitle("RNA")
p2 <- DimPlot(obj, reduction="wnn.umap", label=TRUE) + ggtitle("WNN")
p1 + p2
# Protein markers often resolve cell types that RNA cannot
# e.g., CD4 vs CD8 T cells, naive vs memory
FeaturePlot(obj, features=c("adt_CD4","adt_CD8","adt_CD45RA"),
reduction="wnn.umap")
library(ArchR)
addArchRGenome("hg38")
addArchRThreads(threads=16)
# Create Arrow files from fragments
ArrowFiles <- createArrowFiles(
inputFiles="fragments.tsv.gz",
sampleNames="sample1",
minTSS=4, minFrags=1000, addTileMat=TRUE, addGeneScoreMat=TRUE
)
# Create ArchR project
proj <- ArchRProject(ArrowFiles=ArrowFiles, outputDirectory="ArchR_out")
# Doublet removal
proj <- addDoubletScores(proj)
proj <- filterDoublets(proj)
# Dimensionality reduction (LSI - Latent Semantic Indexing)
proj <- addIterativeLSI(proj, useMatrix="TileMatrix",
name="IterativeLSI", iterations=2,
clusterParams=list(resolution=0.2))
# Clustering
proj <- addClusters(proj, reducedDims="IterativeLSI", resolution=0.8)
proj <- addUMAP(proj, reducedDims="IterativeLSI")
plotEmbedding(proj, colorBy="cellColData", name="Clusters")
# Gene scores (infer gene expression from chromatin accessibility)
proj <- addGeneScoreMatrix(proj)
# Peak calling per cluster (MACS2)
proj <- addGroupCoverages(proj, groupBy="Clusters")
proj <- addReproduciblePeakSet(proj, groupBy="Clusters",
pathToMacs2="/path/to/macs2")
proj <- addPeakMatrix(proj)
# Motif enrichment in differential peaks
proj <- addMotifAnnotations(proj, motifSet="cisbp")
enrichMotifs <- peakAnnoEnrichment(
seMarker=markersPeaks, ArchRProj=proj,
peakAnnotation="Motif", cutOff="FDR <= 0.01"
)
# ChromVAR: TF motif deviation scores per cell
proj <- addBgdPeaks(proj)
proj <- addDeviationsMatrix(proj, peakAnnotation="Motif")
plotVarDev <- getVarDeviations(proj, name="MotifMatrix")
plotVarDev # top variable TF motifs
# Process multiome data: joint RNA + ATAC per cell
library(Seurat)
library(Signac)
# Load Cell Ranger ARC output
counts <- Read10X_h5("filtered_feature_bc_matrix.h5")
fragpath <- "atac_fragments.tsv.gz"
# Create multimodal Seurat object
obj <- CreateSeuratObject(counts=counts$`Gene Expression`, assay="RNA")
obj[["ATAC"]] <- CreateChromatinAssay(
counts=counts$Peaks, sep=c(":","-"),
fragments=fragpath, annotation=annotation
)
# Process RNA
obj <- SCTransform(obj, assay="RNA")
obj <- RunPCA(obj)
# Process ATAC
DefaultAssay(obj) <- "ATAC"
obj <- FindTopFeatures(obj, min.cutoff=5)
obj <- RunTFIDF(obj)
obj <- RunSVD(obj)
# WNN integration of both modalities
obj <- FindMultiModalNeighbors(obj,
reduction.list=list("pca","lsi"),
dims.list=list(1:30, 2:30) # skip LSI component 1 (depth)
)
obj <- RunUMAP(obj, nn.name="weighted.nn", reduction.name="wnn.umap")
# Link peaks to genes (key multiome advantage!)
obj <- LinkPeaks(obj, peak.assay="ATAC", expression.assay="SCT")
# Identifies which ATAC peaks regulate which genes
# Based on correlation of peak accessibility with gene expression
import scanpy as sc
import pertpy as pt
# Perturb-seq: CRISPR perturbations + scRNA-seq readout
# Each cell has a guide RNA barcode identifying which gene was targeted
# Load data with guide assignments
adata = sc.read_h5ad("perturb_seq.h5ad")
# adata.obs['guide_identity'] = assigned guide RNA
# adata.obs['gene_target'] = target gene
# Mixscape (Seurat/pertpy): classify perturbed vs escaped cells
ms = pt.tl.Mixscape()
ms.perturbation_signature(adata, pert_key="gene_target",
control="non-targeting")
ms.mixscape(adata, pert_key="gene_target", control="non-targeting",
labels="celltype")
# Cells classified as: perturbed, non-perturbed (KO escaped), NT control
# Differential expression: each KO vs non-targeting controls
de_results = {}
for gene in adata.obs['gene_target'].unique():
if gene == 'non-targeting':
continue
mask_ko = adata.obs['gene_target'] == gene
mask_nt = adata.obs['gene_target'] == 'non-targeting'
sub = adata[mask_ko | mask_nt].copy()
sc.tl.rank_genes_groups(sub, groupby='gene_target',
reference='non-targeting', method='wilcoxon')
de_results[gene] = sc.get.rank_genes_groups_df(sub, group=gene)
# MILO: differential abundance testing
# Tests which cell states are enriched/depleted after perturbation
milo = pt.tl.Milo()
milo.load(adata)
milo.make_nhoods(adata)
milo.count_nhoods(adata, sample_col="replicate")
milo.da_nhoods(adata, design="~perturbation")
"Mixscape separates Hits from Misses"
Not all cells with a guide RNA are successfully perturbed. Mixscape classifies cells as truly perturbed (changed transcriptome) vs escaped (guide present but no effect). Always run Mixscape before DE analysis to avoid diluting your signal with unperturbed cells.
import squidpy as sq
import scanpy as sc
# Load Visium spatial data
adata_vis = sc.read_visium("visium_output/")
adata_vis.var_names_make_unique()
# Standard preprocessing
sc.pp.normalize_total(adata_vis, target_sum=1e4)
sc.pp.log1p(adata_vis)
sc.pp.highly_variable_genes(adata_vis, n_top_genes=2000)
sc.tl.pca(adata_vis)
sc.pp.neighbors(adata_vis)
sc.tl.umap(adata_vis)
sc.tl.leiden(adata_vis)
# Spatial visualization
sq.pl.spatial_scatter(adata_vis, color="leiden", shape=None, size=1.3)
# Spatial neighborhood analysis
sq.gr.spatial_neighbors(adata_vis)
sq.gr.nhood_enrichment(adata_vis, cluster_key="leiden")
sq.pl.nhood_enrichment(adata_vis, cluster_key="leiden")
# Spatially variable genes
sq.gr.spatial_autocorr(adata_vis, mode="moran")
# High Moran's I = spatially structured expression
# Cell2location: deconvolve spots using scRNA-seq reference
import cell2location as c2l
# Train reference model on scRNA-seq
c2l.models.RegressionModel.setup_anndata(
adata_sc, batch_key="sample", labels_key="celltype"
)
mod_ref = c2l.models.RegressionModel(adata_sc)
mod_ref.train(max_epochs=250)
# Map to spatial data
c2l.models.Cell2location.setup_anndata(adata_vis)
mod_sp = c2l.models.Cell2location(adata_vis,
cell_state_df=mod_ref.export_posterior_adata().X)
mod_sp.train(max_epochs=30000)
adata_vis = mod_sp.export_posterior(adata_vis)
# Now each spot has estimated cell type abundances
| Platform | Resolution | Genes | Tool |
|---|---|---|---|
| Visium (10x) | 55um spots (~1-10 cells) | Whole transcriptome | Squidpy, Scanpy, Seurat |
| Visium HD | 2um bins (subcellular) | Whole transcriptome | Squidpy, Seurat |
| MERFISH | Single-molecule | 100-10,000 genes | Squidpy, Baysor |
| Slide-seq | 10um beads | Whole transcriptome | Squidpy, RCTD |
| Stereo-seq | Subcellular | Whole transcriptome | STOmics, Squidpy |
The feather ranking databases are very large (5-10 GB). Solutions: (1) Use the command-line version (not Python API) for memory efficiency, (2) Ensure enough RAM (64 GB minimum), (3) Use the smaller 500bp database alone for initial exploration, (4) Subset to fewer cells (5000-10000) for GRNBoost2 step.
Too few cells in that cluster for MACS2 peak calling. Solutions: (1) Merge small clusters before calling peaks, (2) Lower MACS2 q-value threshold, (3) Increase coverage by combining replicates. Minimum ~200 cells per group for reliable peak calling.
The reference scRNA-seq may not match the spatial tissue well. Check: (1) Reference contains all cell types present in the tissue, (2) Reference and spatial data are from the same species/tissue, (3) Training converged (check ELBO loss). Try increasing max_epochs or filtering lowly-expressed genes.
Low perturbation efficiency. Causes: (1) Weak guide RNAs (low KO efficiency), (2) Gene is not expressed in this cell type (nothing to knock out), (3) Redundant pathways compensate. Check guide efficiency with bulk validation. Focus analysis on the subset of confirmed-perturbed cells.
"GRNBoost finds links, cisTarget Trims, AUCell scores"
The SCENIC pipeline in three words: Find (co-expression links), Trim (keep only those with TF motif support), Score (regulon activity per cell). Each step reduces false positives.
"LSI component 1 = Library Size, Ignore it"
In scATAC-seq, the first LSI component always correlates with sequencing depth. Always use components 2-30 for downstream analysis (neighbors, UMAP, clustering). This is unique to ATAC - in RNA PCA, component 1 is usually biological.