Single-Cell RNA-seq with Scanpy (Python)

Cell Ranger output · AnnData object · QC filtering · doublet detection · normalization · HVGs · PCA · UMAP · Leiden clustering · marker genes · cell type annotation

~120 min · Intermediate · Python

Overview

Scanpy is the Python counterpart to R’s Seurat — a comprehensive single-cell RNA-seq analysis toolkit built around the AnnData object format. It is tightly integrated with the scientific Python ecosystem (NumPy, pandas, matplotlib) and is the preferred tool in many computational biology groups. This tutorial follows the standard Scanpy workflow: preprocessing → dimensionality reduction → clustering → annotation.

Scanpy vs Seurat — which should you use?
  • Scanpy (Python): Faster for large datasets (>100k cells). Better GPU support (RAPIDS/NVIDIA). Tight integration with deep learning frameworks (scVI, scANVI). Standard in many ML-heavy labs.
  • Seurat (R): Richer ecosystem for downstream analysis (Monocle3 pseudotime, Signac ATAC integration). Better SCTransform normalisation. Standard in many biology-focused labs.
  • Key: Both produce virtually identical biological results for standard workflows. Learn the one that matches your lab’s language preference.
  • Interoperability: Convert between them with sceasy or by saving/loading H5AD/RDS files.

Dataset

We use the 10x Genomics PBMC 3k dataset — 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced with 10x Chromium v2. This is the canonical scRNA-seq tutorial dataset used by both Seurat and Scanpy in their official documentation.

  • Organism: Homo sapiens — peripheral blood mononuclear cells (PBMCs)
  • Technology: 10x Chromium v2 (droplet-based, 3′ end)
  • Cells: ~2,700 after QC
  • Source: 10x Genomics website
  • Expected cell types: CD4 T, CD8 T, NK, B cells, monocytes, dendritic cells, platelets
  • Also available via: sc.datasets.pbmc3k() built into Scanpy

Step 1 — Load Data & Create AnnData Object

Python — Load 10x Cell Ranger output into Scanpy AnnData
######################################################################
## STEP 1: Load 10x Cell Ranger output into a Scanpy AnnData object
## AnnData (Annotated Data) is the central data structure in Scanpy:
##   adata.X       = count matrix (cells × genes)
##   adata.obs     = cell metadata (barcodes, QC metrics, cluster labels)
##   adata.var     = gene metadata (gene names, highly variable flag)
##   adata.uns     = unstructured data (UMAP params, colour maps, etc.)
##   adata.obsm    = dimensional reduction embeddings (PCA, UMAP coordinates)
##   adata.obsp    = pairwise distances/connectivities between cells (KNN graph)
######################################################################

## Install dependencies (run once):
## pip install scanpy python-igraph leidenalg scrublet

import scanpy as sc        # single-cell analysis
import numpy as np         # numerical operations
import pandas as pd        # data frames
import matplotlib.pyplot as plt  # plotting
import warnings
warnings.filterwarnings('ignore')

## Scanpy settings
sc.settings.verbosity = 3          # show detailed progress messages
sc.logging.print_header()          # print scanpy version info
sc.settings.set_figure_params(
    dpi     = 150,                  # figure resolution
    facecolor = 'white',            # white background for publication
    figsize = (6, 5)               # default figure size
)

## --- Option A: Load from 10x Cell Ranger filtered_feature_bc_matrix folder ---
## Cell Ranger outputs three files in this folder:
##   matrix.mtx.gz   = sparse count matrix (Market Exchange Format)
##   features.tsv.gz = gene names and IDs
##   barcodes.tsv.gz = cell barcode sequences
## sc.read_10x_mtx: reads all three files and creates an AnnData object
adata = sc.read_10x_mtx(
    "data/filtered_feature_bc_matrix/",  # path to Cell Ranger output folder
    var_names  = 'gene_symbols',          # use gene symbols (GAPDH etc.) not Ensembl IDs
    cache      = True                     # cache parsed data for faster re-loading
)
## After loading: adata.X is a sparse matrix (2,700 cells × 32,738 genes)

## --- Option B: Load directly from Scanpy's built-in datasets ---
## adata = sc.datasets.pbmc3k()

print(adata)
## AnnData object with n_obs × n_vars = 2700 × 32738
## Rows (obs) = cells; Columns (var) = genes

## Make gene names unique (some genes have duplicate symbols — mitochondrial, etc.)
adata.var_names_make_unique()

print(f"\nLoaded {adata.n_obs} cells and {adata.n_vars} genes")
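Under the hood, sc.read_10x_mtx parses the MatrixMarket triplet and transposes it: Cell Ranger stores the matrix as genes × cells, while AnnData expects cells × genes. A minimal sketch of that parse with scipy, on a toy in-memory matrix (hypothetical values, not the real PBMC files):

```python
from io import StringIO
import scipy.io as sio

## Toy matrix.mtx: 4 genes x 3 cells, 5 non-zero entries
## (the real PBMC file has 32,738 genes and is gzip-compressed)
mtx_text = """%%MatrixMarket matrix coordinate integer general
4 3 5
1 1 2
2 1 7
3 2 1
1 3 4
4 3 9
"""
genes_by_cells = sio.mmread(StringIO(mtx_text)).tocsc()  # as stored on disk
cells_by_genes = genes_by_cells.T.tocsr()                # orientation of adata.X
print(cells_by_genes.shape)   # (3, 4): 3 cells x 4 genes
```

The features.tsv.gz and barcodes.tsv.gz files then become adata.var_names and adata.obs_names respectively.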

Step 2 — QC & Cell Filtering

Python — Calculate QC metrics, filter low-quality cells, doublet detection
######################################################################
## STEP 2: Quality control — remove dead cells, empty droplets, doublets
## Three key QC metrics:
##   1. n_genes_by_counts : number of detected genes per cell
##      Low = empty droplet or dead cell; Very high = possible doublet
##   2. total_counts : total UMI counts per cell (library size)
##      Low = empty or dying cell
##   3. pct_counts_mt : % of counts from mitochondrial genes
##      High % = dying cell (cytoplasmic mRNA leaks; mt transcripts remain)
##      Typically filter cells with > 10-20% mitochondrial reads
######################################################################

## --- Step 2a: Identify mitochondrial genes ---
## Mitochondrial genes in human data start with "MT-"
## In mouse: "mt-" (lowercase)
## Mark mitochondrial genes in the gene metadata (adata.var)
adata.var['mt'] = adata.var_names.str.startswith('MT-')

## --- Step 2b: Calculate QC metrics for each cell ---
## sc.pp.calculate_qc_metrics:
##   qc_vars=['mt'] : also compute % counts from mitochondrial genes
##   percent_top=None : skip % in top-N genes (not needed here)
##   log1p=False : report raw values, not log-transformed
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars   = ['mt'],
    percent_top = None,
    log1p     = False,
    inplace   = True   # adds results directly to adata.obs
)
## adata.obs now contains:
##   n_genes_by_counts : number of genes with >= 1 count per cell
##   total_counts      : total UMI counts per cell
##   pct_counts_mt     : % of counts from mitochondrial genes

## --- Step 2c: Visualise QC distributions ---
## Violin plots of QC metrics — helps you choose cutoffs
sc.pl.violin(
    adata,
    ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
    jitter   = 0.4,           # show individual cell points
    multi_panel = True,        # separate panel per metric
    save    = '_qc_before_filter.png'
)

## Scatter: n_genes vs total_counts coloured by % mitochondrial
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts',
              color='pct_counts_mt', save='_scatter_qc.png')

## --- Step 2d: Apply QC filters ---
## These thresholds are data-specific — adjust after visualising QC plots
## Minimum thresholds:
##   n_genes >= 200 : require at least 200 detected genes (remove empty droplets)
##   n_genes <= 2500 : remove cells with unusually many genes (likely doublets)
##     (PBMC 3k is v2 chemistry; newer v3 data can go up to 5000)
## Mitochondrial:
##   pct_counts_mt < 5% : remove dying cells
print(f"Cells before QC filter: {adata.n_obs}")

qc_pass = (
    (adata.obs.n_genes_by_counts >= 200)  &   # too few genes = empty droplet
    (adata.obs.n_genes_by_counts <  2500) &   # too many genes = likely doublet
    (adata.obs.pct_counts_mt     <  5.0)      # high mt% = dying cell
)
adata = adata[qc_pass, :].copy()   # .copy() turns the view into a full object

print(f"Cells after QC filter:  {adata.n_obs}")

## --- Step 2e: Remove genes expressed in very few cells ---
## Keep only genes detected in >= 3 cells
## (very rare genes add noise without contributing to clustering)
sc.pp.filter_genes(adata, min_cells=3)
print(f"Genes after filtering: {adata.n_vars}")

## --- Step 2f: Doublet detection with Scrublet ---
## Scrublet simulates doublets by combining random pairs of real cells,
## then scores each observed cell for its probability of being a doublet
import scrublet as scr

## Create a Scrublet object
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.024)  # ~0.8% per 1,000 cells loaded → ~2.4% for 3k cells

## Run doublet simulation and scoring
doublet_scores, predicted_doublets = scrub.scrub_doublets(
    min_counts     = 2,
    min_cells      = 3,
    min_gene_variability_pctl = 85,
    n_prin_comps   = 30
)

## Add doublet scores to AnnData
adata.obs['doublet_score']     = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets

print(f"\nPredicted doublets: {predicted_doublets.sum()}")
print(f"Doublet rate: {predicted_doublets.mean():.1%}")

## Remove predicted doublets
adata = adata[~adata.obs['predicted_doublet'], :].copy()   # .copy() avoids working on a view
print(f"Cells after doublet removal: {adata.n_obs}")
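The expected_doublet_rate passed to Scrublet can be estimated from loading density. Using the ~0.8% per 1,000 recovered cells rule of thumb (a rough heuristic — check 10x's documentation for your chip version and loading concentration):

```python
def expected_doublet_rate(n_cells_recovered, rate_per_1000=0.008):
    """~0.8% technical doublets per 1,000 cells recovered on 10x (rough heuristic)."""
    return rate_per_1000 * (n_cells_recovered / 1000)

for n in (1_000, 3_000, 10_000):
    print(f"{n:>6} cells -> expected doublet rate ~{expected_doublet_rate(n):.1%}")
## For 3,000 cells this gives ~2.4% — technical doublets only;
## homotypic doublets (same cell type) add more and are hard to detect
```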
QC thresholds decoded
📌pct_counts_mt thresholdThere is no universal mitochondrial threshold. It depends on cell type (cardiomyocytes naturally have very high mt expression), tissue of origin, and dissociation protocol quality. For PBMCs: 5% is standard. For brain: 10-20%. For heart: up to 50%. Always plot the distribution and choose a threshold that separates a clear “dying cell” tail from the main cell population.
🎯Doublet rate ~6%10x Chromium generates doublets at ~0.8% per 1,000 cells loaded. For 3k cells, that is ~2.4% (technical). Plus homotypic doublets (same cell type) which are hard to detect. Scrublet’s expected_doublet_rate should match the 10x loading density — check the 10x website for the expected doublet rate for your chip version and cell loading concentration.
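Rather than a fixed cutoff, one common alternative is an adaptive threshold a few median absolute deviations (MADs) above the median — letting each dataset set its own boundary. A heuristic sketch on a synthetic mt% distribution (mad_upper_threshold is an illustrative helper, not a Scanpy function):

```python
import numpy as np

def mad_upper_threshold(values, n_mads=3.0):
    """Median + n_mads * scaled MAD — an adaptive upper cutoff (illustrative)."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return med + n_mads * 1.4826 * mad   # 1.4826 makes MAD comparable to an SD

## Synthetic mt% distribution: healthy bulk around 3%, dying-cell tail at 15-30%
rng = np.random.default_rng(0)
pct_mt = np.concatenate([rng.normal(3, 1, 950).clip(0), rng.uniform(15, 30, 50)])
cutoff  = mad_upper_threshold(pct_mt)
flagged = int((pct_mt > cutoff).sum())
print(f"adaptive cutoff: {cutoff:.1f}% mt -> {flagged} cells flagged")
```

The cutoff lands between the healthy bulk and the dying-cell tail, flagging the tail without a hard-coded 5%.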

Step 3 — Normalization & Highly Variable Genes

Python — Normalize to 10k counts, log1p transform, highly variable gene selection
######################################################################
## STEP 3: Normalization and highly variable gene (HVG) selection
## Normalization: correct for differences in sequencing depth between cells
## (cells with more total counts will appear to express all genes more)
## HVG selection: focus downstream analysis on genes that vary between cells
## (most genes are housekeeping genes with constant expression — not informative)
######################################################################

## --- Step 3a: Save raw counts for differential expression later ---
## Before normalization, save the current (raw count) matrix in adata.raw
## This is important: DESeq2-style tools need RAW counts, not normalized
## (Note: the official Scanpy tutorial sets adata.raw AFTER log-normalization;
##  here we store raw counts so Step 7's pseudo-bulk DE can retrieve them)
adata.raw = adata    # stores current state (raw counts) in adata.raw

## --- Step 3b: Total-count normalization ---
## sc.pp.normalize_total: scale each cell so that total counts = target_sum
## target_sum = 1e4 (10,000 counts per cell, "CP10K")
## After this: each cell has the same total expression (10k counts)
## A gene with 500 counts in a 10k-count cell = 5% of all expression
sc.pp.normalize_total(
    adata,
    target_sum = 1e4     # normalize to 10,000 counts per cell (CP10K)
)

## --- Step 3c: Log(1+x) transformation ---
## log1p: apply natural log(count + 1) to each normalized value
## +1 before log prevents log(0) = -infinity for zero counts
## After this: intensities are compressed and approximately normally distributed
## The +1 pseudocount ensures zeros remain as zeros (log(0+1) = 0)
sc.pp.log1p(adata)
## adata.X now contains log-normalized expression values

## --- Step 3d: Select highly variable genes ---
## sc.pp.highly_variable_genes:
##   Identifies genes with high variance relative to their mean expression
##   (Seurat-flavour selection: normalized dispersion = variance/mean, binned by mean expression)
## min_mean = 0.0125 : minimum mean expression (filter very lowly expressed genes)
## max_mean = 3      : maximum mean (filter very highly expressed housekeeping genes)
## min_disp = 0.5    : minimum dispersion (variance-to-mean ratio)
##   Higher min_disp = more stringent = fewer HVGs = faster but less complete
## n_top_genes = 2000 : alternatively specify the number of top HVGs directly
## batch_key : if you have multiple batches, compute HVGs per batch and take union
sc.pp.highly_variable_genes(
    adata,
    min_mean   = 0.0125,
    max_mean   = 3,
    min_disp   = 0.5
)
print(f"Highly variable genes selected: {adata.var.highly_variable.sum()}")

## --- Plot dispersion vs mean for HVG selection ---
sc.pl.highly_variable_genes(adata, save='_hvg_selection.png')

## --- Step 3e: Subset to HVGs for all downstream analysis ---
adata = adata[:, adata.var.highly_variable]
print(f"Working with {adata.n_vars} HVGs for dimensionality reduction and clustering")

## --- Step 3f: Scale genes to unit variance (zero mean, unit SD per gene) ---
## sc.pp.scale: z-score each gene across all cells
## max_value=10 : clip extreme outliers at 10 SDs from mean
##   (prevents highly expressed outlier cells from dominating PCA)
sc.pp.scale(adata, max_value=10)
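The two normalization steps above are simple enough to verify by hand on a toy matrix — each cell is scaled to 10,000 total counts, then log1p leaves zeros at zero:

```python
import numpy as np

counts = np.array([[10., 0., 30.],    # cell 1: 40 total counts
                   [ 2., 2.,  4.]])   # cell 2:  8 total counts

cp10k = counts / counts.sum(axis=1, keepdims=True) * 1e4   # what normalize_total does
print(cp10k.sum(axis=1))        # [10000. 10000.] — equal library sizes

lognorm = np.log1p(cp10k)       # what sc.pp.log1p does: natural log(x + 1)
print(lognorm[0, 1])            # 0.0 — zero counts stay zero
```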

Step 4 — PCA & Neighbor Graph

Python — PCA, KNN graph, and batch correction with Harmony
######################################################################
## STEP 4: PCA and KNN neighbor graph construction
## PCA: reduces 2000-gene HVG space to 50 principal components
## KNN graph: connects each cell to its K nearest neighbors in PC space
## This graph is the basis for both UMAP embedding and clustering
######################################################################

## --- Step 4a: PCA ---
## sc.tl.pca: computes PCA on the scaled HVG matrix
## n_comps = 50  : compute top 50 principal components
## use_highly_variable : only use HVGs (already subsetted, so redundant here)
## svd_solver = 'arpack' : fast sparse solver (better than 'auto' for large datasets)
sc.tl.pca(
    adata,
    n_comps        = 50,
    svd_solver     = 'arpack',
    random_state   = 42         # for reproducibility
)

## --- Plot variance explained by each PC ---
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True,
                          save='_pca_variance.png')
## Elbow at PC ~10-15 for PBMC data; PCs beyond the elbow add little signal
## but rarely hurt — we use 40 of the 50 computed PCs for the KNN graph below

## --- Step 4b: Build KNN neighbor graph ---
## sc.pp.neighbors: builds the K-nearest-neighbor graph in PCA space
## n_neighbors = 15 : each cell connects to its 15 nearest neighbors
##   More neighbors = smoother UMAP, less fine-grained clusters
##   Fewer neighbors = sharper clusters, noisier UMAP
## n_pcs = 40       : use top 40 PCs (ignore PCs 41-50 which are mostly noise)
## metric = 'cosine': use cosine distance (better than Euclidean for sparse data)
##   Cosine distance normalizes for expression magnitude — important because
##   cells with more total counts have larger Euclidean distances even if
##   their expression profiles are identical
sc.pp.neighbors(
    adata,
    n_neighbors = 15,
    n_pcs       = 40,
    metric      = 'cosine',
    random_state = 42
)

## --- Step 4c: (Optional) Batch correction with Harmony ---
## If you have data from multiple samples/batches, use Harmony before building neighbors
## Harmony adjusts the PCA embedding to remove batch effects
## scvi-tools or BBKNN are alternatives for more severe batch effects
## pip install harmonypy
import harmonypy as hm

## If you had a 'batch' column in adata.obs:
## harmony_out = hm.run_harmony(adata.obsm['X_pca'], adata.obs, 'batch', random_state=42)
## adata.obsm['X_pca_harmony'] = harmony_out.Z_corr.T
## Then use use_rep='X_pca_harmony' in sc.pp.neighbors() above

print("PCA and KNN graph complete.")
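One common way to turn the variance-ratio plot into a number is a cumulative-variance cutoff. A numpy-only sketch on synthetic data with a known latent dimensionality (the 90% cutoff is an illustrative choice, not a Scanpy default):

```python
import numpy as np

## Synthetic "cells x genes" matrix: 10 strong latent directions + a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 100)) \
    + 0.1 * rng.normal(size=(500, 100))

Xc = X - X.mean(axis=0)                    # centre genes, as PCA does
s  = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
var_ratio = s**2 / (s**2).sum()            # the quantity sc.pl.pca_variance_ratio plots

## Illustrative rule: smallest number of PCs explaining >= 90% of the variance
n_pcs = int(np.searchsorted(np.cumsum(var_ratio), 0.90) + 1)
print(f"PCs for 90% variance: {n_pcs}")    # close to the latent dimensionality (10)
```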

Step 5 — UMAP & Leiden Clustering

Python — UMAP embedding and Leiden community detection
######################################################################
## STEP 5: UMAP embedding and Leiden clustering
## UMAP: creates a 2D visualisation of the KNN graph
## Leiden: partitions the KNN graph into communities (= cell clusters)
## IMPORTANT: clustering happens on the KNN GRAPH (not on UMAP coordinates!)
## UMAP is only for visualisation — never cluster on UMAP coordinates
######################################################################

## --- Step 5a: Leiden clustering ---
## sc.tl.leiden: graph-based community detection
## resolution = 0.5 : controls cluster granularity
##   Higher resolution = more, smaller clusters
##   Lower resolution = fewer, larger clusters
##   For PBMC 3k: 0.5 gives ~8-9 clusters (major cell types)
##   For fine-grained subtypes: try 1.0-2.0
sc.tl.leiden(
    adata,
    resolution   = 0.5,
    random_state = 42     # random seed for reproducibility
)
print(f"Leiden clusters found: {adata.obs['leiden'].nunique()}")
print(adata.obs['leiden'].value_counts())

## --- Step 5b: UMAP embedding ---
## sc.tl.umap: compute 2D UMAP coordinates from the KNN graph
## min_dist = 0.3 : controls how tightly points cluster in UMAP space
##   Smaller min_dist = tighter clusters (good for highlighting structure)
##   Larger min_dist = more even spread (good for seeing relationships)
## spread = 1.0 : global scale of the UMAP embedding
sc.tl.umap(
    adata,
    min_dist     = 0.3,
    spread       = 1.0,
    random_state = 42
)

## --- Step 5c: Visualise UMAP ---
## Color by Leiden cluster
sc.pl.umap(adata, color='leiden',
           legend_loc='on data',     # show cluster numbers on the UMAP
           legend_fontsize=8,
           title='PBMC 3k — Leiden clusters (resolution=0.5)',
           save='_leiden_clusters.png')

## Color by QC metrics to check cluster quality
sc.pl.umap(adata, color=['n_genes_by_counts', 'pct_counts_mt', 'total_counts'],
           ncols=3, save='_qc_umap.png')
## Clusters dominated by high mt% or very low gene count = potential artifact clusters

## Color by known marker genes to preview cell types
known_markers = ['CD3D', 'CD4', 'CD8A', 'MS4A1', 'GNLY', 'NKG7',
                 'CD14', 'FCGR3A', 'FCER1A', 'PPBP']
sc.pl.umap(adata, color=known_markers, ncols=5,
           use_raw=True,   # use raw counts for marker gene display
           save='_known_markers.png')
🌎UMAP is NOT for clustering: UMAP is a non-linear dimensionality reduction that preserves local (but not global) structure. Two clusters that appear close on UMAP may not be biologically related — UMAP distances are not meaningful. Always perform clustering on the KNN graph, then visualise those clusters on UMAP. Never run k-means or DBSCAN on UMAP coordinates for final cluster assignments.
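A toy illustration of the graph-first principle: cluster labels come from the KNN graph itself, never from 2-D coordinates. Connected components of a small 3-NN graph (a crude stand-in for Leiden, which optimises modularity rather than mere connectivity) already separate two well-spaced groups:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

## Two well-separated groups of points in a 5-D "PC space"
rng = np.random.default_rng(0)
pcs = np.vstack([rng.normal(0, 0.3, (20, 5)),    # group A
                 rng.normal(5, 0.3, (20, 5))])   # group B, far away

## Build a symmetric 3-NN graph from pairwise Euclidean distances
d = np.linalg.norm(pcs[:, None, :] - pcs[None, :, :], axis=-1)
knn = np.zeros_like(d, dtype=bool)
for i in range(len(pcs)):
    knn[i, np.argsort(d[i])[1:4]] = True         # 3 nearest neighbours, skipping self
graph = sp.csr_matrix(knn | knn.T)

n_comp, labels = connected_components(graph, directed=False)
shared = set(labels[:20]) & set(labels[20:])
print(f"{n_comp} components; groups share {len(shared)} of them")   # shares 0
```

No edge ever crosses between the groups, so no component (and hence no cluster label) mixes them — all without any 2-D embedding.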

Step 6 — Marker Genes & Cell Type Annotation

Python — Rank genes per cluster and annotate cell types
######################################################################
## STEP 6: Find marker genes and annotate cell types
## sc.tl.rank_genes_groups: finds genes that are highly expressed in
## each cluster compared to all other clusters (one-vs-rest comparison)
## Method: Wilcoxon rank-sum test (non-parametric; recommended for scRNA-seq)
######################################################################

## --- Step 6a: Find marker genes for each cluster ---
## sc.tl.rank_genes_groups: differential expression, cluster vs rest
## method = 'wilcoxon' : Wilcoxon rank-sum test
##   More robust than t-test for scRNA-seq (handles zero inflation better)
##   'logreg' (logistic regression) is an alternative — better for rare populations
## use_raw = True : use adata.raw (raw counts saved in Step 3a, before
##   normalization and HVG subsetting — so all genes are tested, not just HVGs)
## n_genes = 25 : report top 25 markers per cluster
sc.tl.rank_genes_groups(
    adata,
    groupby  = 'leiden',     # compare each Leiden cluster to all others
    method   = 'wilcoxon',
    use_raw  = True,
    n_genes  = 25,
    pts      = True          # also report fraction of cells expressing each gene
)

## --- Visualise top 5 markers per cluster ---
sc.pl.rank_genes_groups(adata, n_genes=5, sharey=False,
                         save='_marker_genes.png')

## --- Dotplot: expression of top markers ---
sc.pl.rank_genes_groups_dotplot(
    adata,
    n_genes   = 5,
    groupby   = 'leiden',
    standard_scale = 'var',   # scale each gene to [0,1] range for visual comparison
    save      = '_dotplot_markers.png'
)

## --- Step 6b: Manual cell type annotation based on known markers ---
## Common PBMC markers:
##   CD4 T cells     : CD3D+, CD4+, IL7R+
##   CD8 T cells     : CD3D+, CD8A+, CD8B+
##   NK cells        : GNLY+, NKG7+, KLRD1+  (no CD3)
##   B cells         : MS4A1 (CD20)+, CD79A+
##   CD14 Monocytes  : CD14+, LYZ+, CST3+
##   FCGR3A Monocytes: FCGR3A (CD16)+, MS4A7+
##   Dendritic cells : FCER1A+, CST3+ (low CD14)
##   Megakaryocytes  : PPBP+, PF4+

## Assign cell type labels based on cluster marker gene review
cell_type_map = {
    '0': 'CD4 T cells',
    '1': 'CD14 Monocytes',
    '2': 'CD4 T cells',
    '3': 'B cells',
    '4': 'CD8 T cells',
    '5': 'NK cells',
    '6': 'FCGR3A Monocytes',
    '7': 'Dendritic cells',
    '8': 'Megakaryocytes'
}
## Note: exact mapping depends on your clustering result
## Always validate by checking expression of canonical markers in each cluster
adata.obs['cell_type'] = adata.obs['leiden'].map(cell_type_map)

## --- Visualise annotated UMAP ---
sc.pl.umap(adata, color='cell_type',
           palette   = 'tab10',           # 10-colour palette for cell types
           legend_loc = 'right margin',
           title     = 'PBMC 3k — Annotated cell types',
           save      = '_annotated_celltypes.png')
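The manual mapping above can be partially automated by scoring each cluster against marker sets. A pandas sketch on hypothetical mean expression values (score_clusters is illustrative only — real annotation reviews many markers per type, and tools such as CellTypist automate this properly):

```python
import pandas as pd

## Hypothetical mean log-expression of canonical markers per Leiden cluster
mean_expr = pd.DataFrame(
    {'CD3D': [2.1, 0.1, 0.2], 'MS4A1': [0.0, 2.5, 0.1], 'CD14': [0.1, 0.0, 2.8]},
    index=['0', '1', '2'])
marker_sets = {'T cells': ['CD3D'], 'B cells': ['MS4A1'], 'CD14 Monocytes': ['CD14']}

def score_clusters(mean_expr, marker_sets):
    """Assign each cluster the cell type whose markers score highest (sketch)."""
    scores = pd.DataFrame({ct: mean_expr[genes].mean(axis=1)
                           for ct, genes in marker_sets.items()})
    return scores.idxmax(axis=1)

assigned = score_clusters(mean_expr, marker_sets)
print(assigned.to_dict())   # {'0': 'T cells', '1': 'B cells', '2': 'CD14 Monocytes'}
```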

Step 7 — Differential Expression Between Conditions

Python — Pseudo-bulk DE analysis with pydeseq2
######################################################################
## STEP 7: Pseudo-bulk differential expression between conditions
## IMPORTANT: Do NOT run naive single-cell DE by treating each cell
## as an independent sample. This pseudoreplication inflates sample size
## and produces massive numbers of false positives.
## CORRECT approach: pseudo-bulk aggregation
##   1. Sum raw counts per cell type per donor (sample)
##   2. Run bulk DE tools (DESeq2/edgeR) on the aggregated counts
## This is equivalent to bulk RNA-seq per cell type per sample.
######################################################################

## pip install pydeseq2
import pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import anndata

## --- Example: stimulated vs resting CD4 T cells from PBMC data ---
## We use the Kang et al. 2018 IFN-β-stimulated vs control PBMC dataset
## Download: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96583
## adata has donor ID in adata.obs['donor'] and condition in adata.obs['condition']

## --- Step 7a: Subset to cell type of interest ---
cd4 = adata[adata.obs['cell_type'] == 'CD4 T cells'].copy()
cd4.X = cd4.raw[:, cd4.var_names].X  # use raw counts for pseudo-bulk DE

## --- Step 7b: Pseudo-bulk aggregation ---
## For each donor × condition, sum counts across all cells of that type
import scipy.sparse as sp

## Create pseudo-bulk count matrix
pb_counts  = {}
pb_meta    = []

for donor in cd4.obs['donor'].unique():
    for condition in cd4.obs['condition'].unique():
        ## Select cells for this donor + condition
        mask   = (cd4.obs['donor'] == donor) & (cd4.obs['condition'] == condition)
        subset = cd4[mask]
        if subset.n_obs < 10:  # skip if fewer than 10 cells (unreliable aggregation)
            continue
        ## Sum counts across all selected cells (axis=0 = sum over cells)
        if sp.issparse(subset.X):
            counts_sum = np.array(subset.X.sum(axis=0)).flatten()
        else:
            counts_sum = subset.X.sum(axis=0)
        sample_id  = f"{donor}_{condition}"
        pb_counts[sample_id] = counts_sum
        pb_meta.append({'sample': sample_id, 'donor': donor, 'condition': condition})

## Build pseudo-bulk AnnData
pb_df   = pd.DataFrame(pb_counts, index=cd4.var_names).T
pb_meta_df = pd.DataFrame(pb_meta).set_index('sample')
pb_adata = anndata.AnnData(X=pb_df.values, obs=pb_meta_df,
                            var=pd.DataFrame(index=cd4.var_names))

print(f"Pseudo-bulk samples: {pb_adata.n_obs}")
print(f"Genes: {pb_adata.n_vars}")

## --- Step 7c: DESeq2 via pydeseq2 ---
## pydeseq2 = Python reimplementation of DESeq2 with identical statistics
dds = DeseqDataSet(
    adata      = pb_adata,
    design_factors = ['condition'],   # test condition effect
    ref_level  = ['condition', 'control'],   # reference = unstimulated (control) cells
    refit_cooks = True
)
dds.deseq2()

## Extract results
stat_res = DeseqStats(dds, contrast=['condition', 'stimulated', 'control'])
stat_res.summary()
res_df   = stat_res.results_df.sort_values('padj')

print(f"\nTop DE genes in CD4 T cells (stimulated vs control):")
print(res_df.head(10))
res_df.to_csv("pseudobulk_CD4_stim_vs_ctrl.csv")
Pseudo-bulk vs cell-level DE decoded
📊Why pseudo-bulk?Running a Wilcoxon test between 500 stimulated T cells and 400 resting T cells treats each cell as an independent biological replicate. But all 500 stimulated cells came from the same donor — they are not independent. The biological variation between donors is what should drive the statistical test, not the variation between cells within a donor. Pseudo-bulk aggregation respects this: you compare 4 stimulated pseudo-samples (one per donor) vs 4 resting pseudo-samples.
🎯Minimum cells for pseudo-bulkPseudo-bulk requires at least 10–20 cells per cell type per sample to produce reliable aggregated counts. Rare cell types (<10 cells) produce noisy pseudo-bulk profiles. The standard advice is to require ≥10 cells and ≥3 donors per condition for pseudo-bulk DE to be statistically valid.
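The aggregation loop in Step 7b reduces to a groupby-sum. A minimal pandas version on toy counts (two donors × two conditions; gene values are hypothetical):

```python
import pandas as pd

## Toy cell-level raw counts: 6 CD4 T cells across 2 donors x 2 conditions
cells = pd.DataFrame({
    'donor':     ['d1', 'd1', 'd1', 'd2', 'd2', 'd2'],
    'condition': ['ctrl', 'ctrl', 'stim', 'ctrl', 'stim', 'stim'],
    'ISG15':     [0, 1, 9, 0, 8, 11],   # interferon-stimulated gene
    'ACTB':      [5, 4, 6, 5, 5, 4],    # housekeeping gene
})

## Pseudo-bulk = one summed profile per donor x condition
pb = cells.groupby(['donor', 'condition']).sum()
print(pb)
## 4 pseudo-samples — these rows, not individual cells, are the DE replicates
```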