Single-Cell Genomics — Deep Dive Module 1

Technologies & Raw Processing · QC & Filtering · Normalization · Dimensionality Reduction · Clustering · Annotation · Differential Expression

~200 min · Advanced · Python / R / Bash

1.1 — Single-Cell Sequencing Technologies & Chemistry

Platform | Type | Cells/Run | Sensitivity | Doublet Rate | Best For
10x Chromium (3') | Droplet | 500–10,000 | Moderate (~2,000 genes/cell) | ~4% at 5k cells | Large-scale surveys, atlas building
10x Chromium (5') | Droplet | 500–10,000 | Moderate | ~4% | TCR/BCR + transcriptome
Smart-seq2/3 | Plate | 96–384 | High (~5,000 genes/cell) | Near 0 | Deep profiling, rare cell types, full-length
Drop-seq | Droplet | 5,000+ | Lower | ~5% | Cost-effective, large screens
sci-RNA-seq3 | Combinatorial | 100,000+ | Low | Low | Ultra-high throughput, organism-scale
SPLiT-seq | Combinatorial | 100,000+ | Low | Low | No special equipment needed
Choosing Your Technology
  • Need many cells (>5000), standard profiling: 10x Chromium 3'
  • Need full-length transcripts, isoforms, splice variants: Smart-seq2/3
  • Need TCR/BCR repertoire + gene expression: 10x 5' + VDJ
  • Need >100,000 cells, organism-wide atlas: sci-RNA-seq3 or SPLiT-seq
  • Need protein + RNA simultaneously: CITE-seq (10x Feature Barcoding)
Thumb Rule — Sensitivity vs Throughput: Smart-seq2 detects ~2.5× more genes per cell than 10x 3', but captures ~100× fewer cells. You can't have both — choose based on whether you need depth (rare cell types, subtle states) or breadth (comprehensive atlas).

Read Structure & Barcoding

10x Chromium v3 Read Structure
# 10x Chromium v3 chemistry:
# Read 1 (28 bp): [Cell Barcode: 16 bp][UMI: 12 bp]
# Read 2 (91 bp): [cDNA insert → transcript sequence]
# Index Read: [Sample Index: 8 bp]
#
# Cell barcode: identifies which cell (from whitelist of ~6.7M barcodes)
# UMI: deduplicates PCR amplification artifacts
# Each unique CB+UMI+gene combination = 1 original mRNA molecule
#
# Expected doublet rate ≈ 0.008 × (cells_loaded / 1000)
# Loading 10,000 cells → ~8% doublets
# Loading 5,000 cells → ~4% doublets (recommended default)
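The linear loading model above can be wrapped in a tiny planning helper — a sketch of the rule of thumb only, since real multiplet rates vary with chemistry version and cell suspension quality:

```python
def expected_doublet_rate(cells_loaded: int) -> float:
    """Approximate 10x multiplet rate: ~0.8% per 1,000 cells loaded."""
    return 0.008 * (cells_loaded / 1000)

for n in (5_000, 10_000, 20_000):
    print(f"{n:>6} cells loaded -> ~{expected_doublet_rate(n):.1%} doublets")
```

Useful when deciding how many cells to load per lane: halving the loading halves the expected doublet rate, at the cost of sequencing fewer cells per run.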

1.2 — Raw Read Processing & Quantification

Bash — STARsolo & Cell Ranger
# STARsolo: open-source Cell Ranger equivalent (faster, more flexible)
STAR --runMode genomeGenerate \
  --genomeDir star_index/ \
  --genomeFastaFiles genome.fa \
  --sjdbGTFfile genes.gtf \
  --runThreadN 16

STAR --soloType CB_UMI_Simple \
  --soloCBwhitelist 3M-february-2018.txt \
  --soloCBstart 1 --soloCBlen 16 \
  --soloUMIstart 17 --soloUMIlen 12 \
  --genomeDir star_index/ \
  --readFilesIn R2.fq.gz R1.fq.gz \
  --readFilesCommand zcat \
  --runThreadN 16 \
  --outSAMtype BAM SortedByCoordinate \
  --soloFeatures Gene GeneFull Velocyto \
  --soloCellFilter EmptyDrops_CR \
  --outFileNamePrefix starsolo_output/

# GeneFull: counts intronic reads (needed for RNA velocity)
# Velocyto: outputs spliced/unspliced/ambiguous matrices

# Cell Ranger (10x official pipeline):
cellranger count --id=sample1 \
  --transcriptome=/path/to/refdata \
  --fastqs=/path/to/fastqs/ \
  --sample=Sample1 \
  --localcores=16 --localmem=64

# alevin-fry (salmon-based, very fast):
salmon alevin -l ISR \
  -i salmon_index \
  -1 R1.fq.gz -2 R2.fq.gz \
  --chromiumV3 --sketch -p 16 -o alevin_output/
# --sketch: emit the RAD file that alevin-fry consumes

# Then process with alevin-fry for USA mode (unspliced/spliced/ambiguous)
alevin-fry generate-permit-list -d fw -k -i alevin_output/ -o quant/
alevin-fry collate -i quant/ -r alevin_output/
alevin-fry quant -i quant/ -m t2g_3col.tsv -t 16 \
  --resolution cr-like -o quant_result/
# -m: transcript-to-gene map (3-column "splici" version for USA mode)
Tool | Speed | Velocity Support | License | Best For
Cell Ranger | Slow | No (external velocyto) | Proprietary (free) | 10x default, regulatory settings
STARsolo | Fast | Yes (built-in) | Open source | Most flexible, recommended
alevin-fry | Very fast | Yes (USA mode) | Open source | Speed-critical, non-model organisms
kallisto-bustools | Very fast | Yes | Open source | Pseudoalignment, lightweight
Mnemonic

"STARsolo Supersedes Cell Ranger"

STARsolo produces nearly identical results to Cell Ranger but is faster, open-source, and directly outputs velocity matrices. Unless your collaborator or journal requires Cell Ranger, use STARsolo.

1.3 — Empty Droplet Detection & Ambient RNA Removal

R — EmptyDrops & SoupX
library(DropletUtils)
library(SoupX)

# --- EmptyDrops: distinguish cells from empty droplets ---
raw_counts <- read10xCounts("starsolo_output/raw/")

# The knee plot: total UMI per barcode, sorted descending
br.out <- barcodeRanks(counts(raw_counts))
plot(br.out$rank, br.out$total, log="xy",
     xlab="Rank", ylab="Total UMIs",
     main="Knee Plot")
abline(h=metadata(br.out)$inflection, col="red", lty=2)  # inflection
abline(h=metadata(br.out)$knee, col="blue", lty=2)        # knee

# Statistical test: is this barcode a real cell?
e.out <- emptyDrops(counts(raw_counts), lower=100)
is.cell <- e.out$FDR <= 0.01
table(is.cell)
# TRUE = real cell, FALSE = empty droplet
# NAs = below lower threshold (definitely empty)

# Keep only real cells
cell_counts <- counts(raw_counts)[, which(is.cell)]

# --- SoupX: remove ambient RNA contamination ---
# Ambient RNA = mRNA from lysed cells floating in the droplet solution
# Contaminates ALL droplets with a background expression profile

# Load raw (unfiltered) and filtered matrices
sc <- load10X("cellranger_output/")
# OR manually:
# sc <- SoupChannel(tod=raw_matrix, toc=filtered_matrix)

# Estimate contamination fraction (rho)
sc <- autoEstCont(sc)
# rho typically 0.01-0.20 (1-20% contamination)
# Print estimate:
cat("Estimated contamination:", sc$metaData$rho[1], "\n")

# Correct the count matrix
corrected <- adjustCounts(sc)

# CellBender (alternative, deep learning-based):
# cellbender remove-background \
#   --input raw_feature_bc_matrix.h5 \
#   --output cellbender_output.h5 \
#   --expected-cells 5000 \
#   --total-droplets-included 25000 \
#   --epochs 150
Why Ambient RNA Matters

If 10% of counts in every cell come from ambient RNA, highly-expressed genes from abundant cell types will appear in ALL clusters — creating false marker genes and phantom cell states. SoupX or CellBender should be run before any downstream analysis. This is especially critical for tissues with high cell lysis (e.g., brain, tumor).
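The intuition can be sketched in a few lines of numpy: spike a known ambient profile into simulated cells, then subtract each cell's expected soup contribution. This is a toy illustration of the idea only — SoupX's actual algorithm estimates rho from marker genes and redistributes counts probabilistically:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells = 100, 50

# True (uncontaminated) counts, genes x cells
true_counts = rng.poisson(2.0, size=(n_genes, n_cells))

# Ambient "soup" profile: gene frequencies of free-floating mRNA
ambient_profile = rng.dirichlet(np.ones(n_genes))

rho = 0.10  # 10% of observed counts in every cell come from the soup
totals = true_counts.sum(axis=0)
soup = rng.poisson(rho / (1 - rho) * totals * ambient_profile[:, None])
observed = true_counts + soup

# Naive correction: subtract the expected soup contribution, clip at zero
expected_soup = rho * observed.sum(axis=0) * ambient_profile[:, None]
corrected = np.clip(observed - expected_soup, 0, None)
```

Even this crude subtraction shrinks the soup-driven background shared across all cells, which is exactly the signal that would otherwise look like a universal marker gene.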

1.4 — Doublet Detection

Python — Scrublet & scDblFinder
import scrublet as scr
import scanpy as sc
import numpy as np

# Load count matrix
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")

# --- Scrublet: simulation-based doublet detection ---
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()

# Visualize score distribution
scrub.plot_histogram()
# Bimodal distribution: low scores = singlets, high = doublets
# Threshold is set automatically; check it makes sense

adata.obs['doublet_score'] = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets

# Remove doublets
adata = adata[~adata.obs['predicted_doublet']].copy()
print(f"Removed {predicted_doublets.sum()} doublets, "
      f"{adata.n_obs} cells remaining")
R — scDblFinder (Recommended)
# scDblFinder: fastest and most accurate doublet detector
library(scDblFinder)
library(SingleCellExperiment)

sce <- SingleCellExperiment(assays=list(counts=count_matrix))
sce <- scDblFinder(sce, dbr=0.06)  # expected doublet rate

# Results in colData
table(sce$scDblFinder.class)
# singlet / doublet

# For multiplexed samples (e.g., pooled donors):
# Use Demuxlet/Vireo for genotype-based demultiplexing
# This identifies doublets from DIFFERENT donors (heterotypic)
# but misses same-donor doublets (homotypic)
Thumb Rule — Doublet Rate: Expected doublet rate ≈ 0.8% per 1,000 cells loaded. So: 5,000 cells → ~4%, 10,000 cells → ~8%, 20,000 cells → ~16%. Above 10,000 cells loaded, doublets become a serious problem. Always run Scrublet or scDblFinder.

2.1 — Quality Control & Filtering

Python — Scanpy QC
import scanpy as sc
import numpy as np

adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
adata.var_names_make_unique()

# Calculate QC metrics
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['ribo'] = adata.var_names.str.startswith(('RPS','RPL'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','ribo'],
                            percent_top=None, log1p=False, inplace=True)

# Visualize QC distributions
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts',
              'pct_counts_mt', 'pct_counts_ribo'],
              jitter=0.4, multi_panel=True)

# --- Adaptive thresholds (MAD-based) ---
# Much better than hard cutoffs — adapts to each dataset
from scipy.stats import median_abs_deviation

def mad_outlier(data, nmads=5):
    median = np.median(data)
    mad = median_abs_deviation(data)
    return (data < median - nmads * mad) | (data > median + nmads * mad)

# Identify outlier cells
adata.obs['outlier_genes'] = mad_outlier(np.log1p(adata.obs['n_genes_by_counts']), nmads=5)
adata.obs['outlier_counts'] = mad_outlier(np.log1p(adata.obs['total_counts']), nmads=5)
adata.obs['outlier_mt'] = adata.obs['pct_counts_mt'] > 20  # hard cap for mt

# Combined filter
adata.obs['is_outlier'] = (adata.obs['outlier_genes'] |
                            adata.obs['outlier_counts'] |
                            adata.obs['outlier_mt'])
print(f"Filtering {adata.obs['is_outlier'].sum()} outlier cells "
      f"out of {adata.n_obs}")
adata = adata[~adata.obs['is_outlier']].copy()

# Also filter genes
sc.pp.filter_genes(adata, min_cells=10)  # gene in ≥10 cells
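As a sanity check, the same mad_outlier logic runs standalone on synthetic log-scale QC values with a few planted bad cells — toy data, no scanpy required:

```python
import numpy as np
from scipy.stats import median_abs_deviation

def mad_outlier(data, nmads=5):
    median = np.median(data)
    mad = median_abs_deviation(data)
    return (data < median - nmads * mad) | (data > median + nmads * mad)

rng = np.random.default_rng(42)
log_counts = rng.normal(loc=8.0, scale=0.4, size=1000)  # typical cells
log_counts[:5] = 2.0    # planted low-quality / empty-like cells
log_counts[5:8] = 15.0  # planted doublet-like cells

flagged = mad_outlier(log_counts, nmads=5)
print(f"{flagged.sum()} / {log_counts.size} cells flagged")
```

The planted extremes are caught while the bulk of the distribution survives — the point of MAD thresholds is that the cutoffs move with each dataset instead of being hard-coded.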
Byte warns
The danger of over-filtering: Strict cutoffs (e.g., >200 genes, <5000 genes, <20% MT) can remove rare cell types. Hepatocytes have high gene counts. Red blood cells have very few genes. Cardiomyocytes have high mitochondrial fraction (biologically, not dying). Always visualize QC distributions and consider cell-type biology before setting thresholds.

2.2 — Normalization Strategies

Method | Approach | Tool | When to Use
Library size + log | Divide by total counts, multiply by 10k, log1p | Scanpy/Seurat default | Quick exploration, simple analyses
scran pooling | Pool cells to estimate size factors, deconvolve | scran (R) | Better for heterogeneous datasets
SCTransform | Variance-stabilizing transformation (GLM per gene) | Seurat v5 | Best for Seurat workflows, handles zero-inflation
Analytic Pearson residuals | Residuals from a null model of technical noise | scanpy/scry | Scanpy workflows, theoretically grounded
Python — Scanpy Normalization
# Standard: library size normalization + log transform
adata.layers['counts'] = adata.X.copy()  # save raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers['log1p'] = adata.X.copy()

# Analytic Pearson residuals (preferred for feature selection)
# From scanpy >=1.9:
sc.experimental.pp.normalize_pearson_residuals(adata)
# Note: this only normalizes; for HVG selection + normalization + PCA
# in one step, use sc.experimental.pp.recipe_pearson_residuals(adata)
R — SCTransform (Seurat v5)
# SCTransform: regularized negative binomial regression per gene
library(Seurat)
obj <- CreateSeuratObject(counts=count_matrix)
obj <- SCTransform(obj, vst.flavor="v2",
                    vars.to.regress="percent.mt",
                    verbose=FALSE)
# v2 flavor: improved, uses glmGamPoi for speed
# vars.to.regress: removes unwanted variation (MT%, cell cycle, etc.)
# Output: SCT assay with corrected counts, normalized data, scale.data
Mnemonic

"SCT for Seurat, Pearson for Python"

If using Seurat (R), SCTransform v2 is the best normalization. If using Scanpy (Python), analytic Pearson residuals are the preferred modern approach. Both outperform simple library-size normalization for heterogeneous datasets.
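For intuition, the analytic Pearson residual transform itself is only a few lines: under the null model each gene's expected count is proportional to the cell's depth, and residuals are clipped at sqrt(n_cells). A minimal numpy sketch with theta fixed at 100, following the published recipe (not the scanpy implementation):

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals for a fixed-overdispersion NB null model.

    mu[g, c] = (gene total * cell total) / grand total;
    residual = (x - mu) / sqrt(mu + mu^2 / theta), clipped to +/- sqrt(n_cells).
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    mu = (counts.sum(axis=1, keepdims=True) *
          counts.sum(axis=0, keepdims=True)) / total
    resid = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    n_cells = counts.shape[1]
    return np.clip(resid, -np.sqrt(n_cells), np.sqrt(n_cells))

# Toy genes x cells matrix
X = np.array([[10, 0, 5],
              [0, 20, 2],
              [3, 3, 3]], dtype=float)
R = pearson_residuals(X)
```

Genes whose counts deviate strongly from the depth-proportional null get large residuals — which is why the same quantity doubles as a variance-based feature-selection score.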

2.3 — Feature Selection & Highly Variable Genes

Python
# Scanpy: identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000,
                             flavor='seurat_v3',
                             layer='counts')  # use raw counts for seurat_v3
sc.pl.highly_variable_genes(adata)

# Restrict to HVGs for downstream analysis
adata = adata[:, adata.var['highly_variable']].copy()

# Alternative: deviance-based feature selection (scry)
# Selects genes with most biological variation relative to a null model
# from scry import deviance_residuals
# deviance_residuals(adata)  # adds 'highly_deviant' to adata.var

# How many genes to select?
# 1000-3000: typical range
# Too few: miss subtle cell states
# Too many: include noise, batch-driven genes
# Rule of thumb: start with 2000, increase if rare types are missed
Thumb Rule — HVG Selection: 2000 genes is the standard default. For very heterogeneous datasets (many cell types), try 3000–5000. For focused datasets (one cell type, subtle states), 1000–2000 is sufficient. Always check if your known marker genes are in the HVG set.

2.4 — Dimensionality Reduction (Deep)

Python — PCA, UMAP, t-SNE
# PCA: the foundation — all downstream analysis uses PC space
sc.pp.scale(adata, max_value=10)  # z-score (only if not using Pearson)
sc.tl.pca(adata, n_comps=50, svd_solver='arpack')

# How many PCs to use? Elbow plot:
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
# Pick the "elbow" where variance explained levels off
# Typical: 15-30 PCs for most datasets

# Molecular cross-validation (more rigorous):
# Split genes randomly, run PCA on each half, compare
# Number of PCs where halves agree = optimal number

# --- UMAP ---
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata, min_dist=0.3, spread=1.0)
sc.pl.umap(adata, color=['leiden', 'n_genes_by_counts', 'pct_counts_mt'])

# UMAP parameters that matter:
# n_neighbors: 5-15 = local structure (tight clusters)
#              30-50 = global structure (preserves distances)
# min_dist: 0.0-0.1 = dense, packed clusters
#           0.3-0.5 = more spread out (default 0.5 in scanpy)

# --- t-SNE ---
sc.tl.tsne(adata, n_pcs=30, perplexity=30)
# perplexity: ~sqrt(n_cells) or 30 for typical datasets
# t-SNE is NOT suitable for trajectory inference
# Use openTSNE for large datasets (faster, more features)

# CRITICAL: UMAP/t-SNE distances are NOT quantitatively meaningful!
# Cells close together are usually similar; cells far apart may or may not be
# Never interpret inter-cluster distances on UMAP
# All quantitative comparisons should use PCA space or graph-based methods
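The elbow choice can also be made explicit with a small helper that returns the first PC whose marginal explained variance falls below a cutoff — a crude stand-in for eyeballing the plot, shown here on a toy decaying spectrum (in practice you would pass adata.uns['pca']['variance_ratio']; the min_gain value is an illustrative assumption):

```python
import numpy as np

def pick_n_pcs(variance_ratio, min_gain=0.005):
    """Number of PCs to keep: stop once each extra PC adds < min_gain."""
    below = np.where(np.asarray(variance_ratio) < min_gain)[0]
    return int(below[0]) if below.size else len(variance_ratio)

# Toy exponentially decaying spectrum standing in for real PCA output
vr = 0.5 * np.exp(-0.25 * np.arange(50))
vr /= vr.sum()
n_pcs = pick_n_pcs(vr, min_gain=0.005)
print(f"Keeping {n_pcs} PCs")
```

Treat the result as a starting point, not a verdict: rare cell types often live in late PCs with small variance, so check marker genes before trimming aggressively.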
Mnemonic

"UMAP Lies about Distances"

UMAP preserves local neighborhood structure but distorts global distances. Two clusters far apart on UMAP may actually be very similar in gene expression. Never say "cluster A is more similar to cluster B than C" based on UMAP position alone. Use dendrogram on PCA space instead.

2.5 — Clustering

Python — Leiden Clustering
# Leiden algorithm (preferred over Louvain — provably better partitions)
# First: build k-nearest-neighbor graph in PCA space
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30, metric='cosine')

# Cluster at multiple resolutions
for res in [0.2, 0.4, 0.6, 0.8, 1.0, 1.5]:
    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_{res}')

# Visualize each
sc.pl.umap(adata, color=[f'leiden_{r}' for r in [0.2, 0.6, 1.0, 1.5]],
            ncols=2)

# Choosing resolution:
# Low (0.1-0.3): broad cell types (T cells, B cells, myeloid)
# Medium (0.5-0.8): refined types (CD4 naive, CD4 memory, CD8 naive)
# High (1.0-2.0): subtypes/states (activated, resting, cycling)

# Strategy: overcluster (high resolution) then merge
# based on marker genes and biological knowledge

# Cluster stability assessment (optional but recommended):
# Subsample 80% of cells 100 times, re-cluster, check consistency
# Stable clusters appear in >90% of subsamples

# Hierarchical relationship between clusters:
sc.tl.dendrogram(adata, groupby='leiden_0.8')
sc.pl.dendrogram(adata, groupby='leiden_0.8')
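The subsample-and-recluster idea above can be demonstrated end to end on toy data. This sketch uses KMeans and the adjusted Rand index as stand-ins for Leiden and the graph pipeline (assumed substitutes for illustration only), comparing each 80% subsample's labels against a reference clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Toy "PC space": three well-separated blobs, 300 cells x 10 PCs
centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(100, 10)) for c in centers])

reference = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Subsample 80% of cells, re-cluster, compare to the reference labels
scores = []
for seed in range(20):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X[idx])
    scores.append(adjusted_rand_score(reference[idx], labels))

mean_ari = float(np.mean(scores))
print(f"Mean ARI across 20 subsamples: {mean_ari:.3f}")
```

Well-separated clusters reappear in essentially every subsample (ARI near 1); clusters whose ARI collapses under subsampling are candidates for merging.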
Thumb Rule — Resolution: There is no single "correct" resolution. The right resolution depends on your biological question. For a broad atlas: use low resolution. For detailed characterization of one cell type: subcluster that population at higher resolution. Always validate clusters with known marker genes.

2.6 — Cell Type Annotation

Python — Automated Annotation
# --- Manual annotation: find marker genes per cluster ---
sc.tl.rank_genes_groups(adata, groupby='leiden_0.8', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)

# Check known markers:
marker_genes = {
    'T cells': ['CD3D','CD3E','CD2'],
    'CD4 T': ['CD4','IL7R','CCR7'],
    'CD8 T': ['CD8A','CD8B','GZMK'],
    'B cells': ['CD19','MS4A1','CD79A'],
    'NK cells': ['NKG7','GNLY','KLRB1'],
    'Monocytes': ['CD14','LYZ','S100A9'],
    'DC': ['FCER1A','CST3','CLEC10A'],
}
sc.pl.dotplot(adata, marker_genes, groupby='leiden_0.8')

# --- CellTypist: automated annotation ---
import celltypist
from celltypist import models

# Download model
models.download_models(model='Immune_All_Low.pkl')
model = models.Model.load(model='Immune_All_Low.pkl')

# Predict
predictions = celltypist.annotate(adata, model=model,
                                    majority_voting=True)
# With majority_voting=True, take the 'majority_voting' column
adata.obs['celltypist'] = predictions.predicted_labels['majority_voting'].values
sc.pl.umap(adata, color='celltypist')

# --- SingleR (R, reference-based) ---
# library(SingleR)
# library(celldex)
# ref <- HumanPrimaryCellAtlasData()
# results <- SingleR(test=sce, ref=ref, labels=ref$label.main)
# sce$singler_labels <- results$labels

# --- Azimuth (Seurat, reference mapping) ---
# library(Seurat)
# library(Azimuth)
# obj <- RunAzimuth(obj, reference="pbmcref")
Annotation Strategy
  1. Run automated annotation (CellTypist / SingleR / Azimuth) for initial labels
  2. Validate with known marker genes (dotplot, violin plot)
  3. Check for novel or transitional states that references miss
  4. Manually refine labels using domain expertise
  5. For non-model organisms: transfer labels from a closely related species atlas
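Step 2 of the strategy above can be reduced to its essence: score each cluster by the mean expression of each marker set and take the best-scoring label. A toy sketch with a made-up cluster-by-gene mean-expression matrix (in a real workflow you would score log-normalized data, e.g. with sc.tl.score_genes):

```python
import numpy as np

# Hypothetical mean expression per cluster: rows = clusters, cols = genes
genes = ['CD3D', 'CD3E', 'MS4A1', 'CD79A', 'LYZ', 'CD14']
expr = np.array([
    [5.0, 4.0, 0.1, 0.2, 0.3, 0.1],   # cluster 0
    [0.1, 0.2, 6.0, 5.0, 0.2, 0.1],   # cluster 1
    [0.2, 0.1, 0.1, 0.3, 7.0, 4.0],   # cluster 2
])
markers = {
    'T cells':   ['CD3D', 'CD3E'],
    'B cells':   ['MS4A1', 'CD79A'],
    'Monocytes': ['LYZ', 'CD14'],
}

gene_idx = {g: i for i, g in enumerate(genes)}
labels = []
for row in expr:
    scores = {ct: row[[gene_idx[g] for g in gs]].mean()
              for ct, gs in markers.items()}
    labels.append(max(scores, key=scores.get))
print(labels)
```

This argmax logic is also why validation matters: a cluster with no good match still gets assigned its least-bad label, so always inspect the scores, not just the winner.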

2.7 — Differential Expression Between Conditions

R — Pseudobulk DE (Gold Standard)
library(Seurat)
library(DESeq2)

# CRITICAL: Single-cell DE treating each cell as independent
# inflates false positives CATASTROPHICALLY
# Reason: cells from the same donor are correlated (pseudoreplication)
# Solution: PSEUDOBULK — aggregate counts per donor per cell type

# Step 1: Aggregate counts per sample per cell type
obj$sample_celltype <- paste(obj$sample_id, obj$celltype, sep="_")
pseudo <- AggregateExpression(obj, assays="RNA",
                               group.by="sample_celltype",
                               return.seurat=FALSE)$RNA

# Step 2: Create DESeq2 dataset
# Metadata for each pseudobulk sample
meta <- data.frame(
  sample = colnames(pseudo),
  condition = c("Control","Control","Control",
                "Treated","Treated","Treated"),
  row.names = colnames(pseudo)
)

dds <- DESeqDataSetFromMatrix(countData=pseudo,
                                colData=meta,
                                design=~condition)
dds <- DESeq(dds)
res <- results(dds, contrast=c("condition","Treated","Control"))
res <- res[order(res$padj),]

# Step 3: Interpret
sig_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 0.5)
cat(nrow(sig_genes), "significant DE genes\n")

# Alternative: MAST (mixed model at single-cell level)
# Includes donor as random effect
# library(MAST)
# zlm(~condition + (1|donor), sca=sca_object)
The Pseudoreplication Catastrophe

Treating 5,000 cells from 3 donors as 5,000 independent observations (Wilcoxon per cell) gives thousands of false positive DE genes. The statistical unit is the donor/sample, not the cell. Always use pseudobulk (DESeq2/edgeR on aggregated counts) or mixed models (MAST with donor random effect). Minimum: 3 biological replicates per condition.
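The warning above is easy to verify in simulation: give each donor a random offset but NO true treatment effect, then compare a per-cell test against a donor-level test. Toy Gaussian data with Mann-Whitney and a t-test standing in for the real Wilcoxon/DESeq2 pipeline:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_sims = 200
false_pos_cell, false_pos_donor = 0, 0
for _ in range(n_sims):
    # 3 donors per condition, each with its own random shift; no true effect
    donor_means_a = rng.normal(0.0, 1.0, size=3)
    donor_means_b = rng.normal(0.0, 1.0, size=3)
    cells_a = np.concatenate([rng.normal(m, 1.0, size=500) for m in donor_means_a])
    cells_b = np.concatenate([rng.normal(m, 1.0, size=500) for m in donor_means_b])

    # Per-cell test: treats 1,500 vs 1,500 correlated cells as independent
    if stats.mannwhitneyu(cells_a, cells_b).pvalue < 0.05:
        false_pos_cell += 1

    # Pseudobulk-style test: one mean per donor, n = 3 vs 3
    pb_a = cells_a.reshape(3, -1).mean(axis=1)
    pb_b = cells_b.reshape(3, -1).mean(axis=1)
    if stats.ttest_ind(pb_a, pb_b).pvalue < 0.05:
        false_pos_donor += 1

print(f"Per-cell false positive rate:  {false_pos_cell / n_sims:.2f}")
print(f"Pseudobulk false positive rate: {false_pos_donor / n_sims:.2f}")
```

The per-cell test rejects the (true) null in the vast majority of simulations because donor-to-donor variation masquerades as a treatment effect; the donor-level test stays near its nominal 5%.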

Mnemonic

"Pseudobulk is the Proper way" — PP

For differential expression between conditions (case vs control), always use pseudobulk. Aggregate counts per sample per cell type → run DESeq2/edgeR. The Wilcoxon test per cell is only appropriate for finding marker genes between clusters within one sample, not for comparing conditions across biological replicates.

AI Prompt Guide

Complete scRNA-seq Analysis Pipeline
I have [NUMBER] 10x Chromium 3' scRNA-seq samples:
- [NUMBER] conditions: [CONDITION LIST]
- [NUMBER] biological replicates per condition
- Species: [SPECIES]
- Expected cell types: [LIST]
- FASTQ files already generated

Please write a complete pipeline (Scanpy or Seurat) that:
1. Runs STARsolo for quantification (with velocity matrices)
2. Performs EmptyDrops cell calling and SoupX ambient RNA removal
3. Detects and removes doublets with scDblFinder
4. Applies adaptive MAD-based QC filtering
5. Normalizes (SCTransform for Seurat OR Pearson residuals for Scanpy)
6. Selects 2000 HVGs, runs PCA with elbow-based PC selection
7. Builds kNN graph, runs Leiden clustering at multiple resolutions
8. Annotates cell types with CellTypist + manual marker validation
9. Runs pseudobulk differential expression with DESeq2
10. Produces publication-quality UMAP, dotplot, volcano plot figures
11. Includes proper logging and QC report at each step

Troubleshooting

UMAP shows one big blob, no clusters

Causes: (1) Too few HVGs — increase to 3000+, (2) Too few PCs — check elbow plot, (3) Batch effect dominating — integrate before clustering (Module 2), (4) Over-correction — SCTransform regression removed real biology, (5) All cells are actually the same type (e.g., sorted population).

CellTypist/SingleR gives nonsensical labels

Reference mismatch: (1) Using a PBMC reference for brain tissue, (2) Species mismatch (human reference for mouse data), (3) Low-quality cells getting random labels. Always validate automated labels with known markers. Never trust automated annotation blindly.

Pseudobulk DE: 0 significant genes despite clear UMAP separation

Low statistical power: (1) Too few replicates (need ≥3, ideally ≥5), (2) High donor-to-donor variability, (3) Effect size is real but small. Solutions: increase replicates, use MAST with donor random effect for more power (at the cost of more assumptions), or relax thresholds (padj < 0.1).

STARsolo: Very few cells detected

Check: (1) Correct whitelist file for your chemistry version, (2) Read structure is correct (R1/R2 order), (3) Reference genome matches your species, (4) Use --soloCellFilter EmptyDrops_CR instead of the default knee filter for low-quality samples.

Mnemonics & Thumb Rules

Mnemonic

"Empty → Doublet → QC → Norm → Cluster → Annotate" — EDQNCA

The sacred order of scRNA-seq preprocessing. Skipping any step contaminates all downstream results. Empty droplet removal → Doublet detection → QC filtering → Normalization → Clustering → Annotation.

Mnemonic

"3 replicates, 3000 cells, 3 conditions max"

Minimum experimental design for condition-comparison scRNA-seq: 3 biological replicates per condition, ~3000 cells per sample, and start with ≤3 conditions. This gives you enough statistical power for pseudobulk DE while keeping costs manageable.

Thumb Rule — Leiden vs Louvain: Always use Leiden. Louvain can produce disconnected clusters (communities that are not internally connected). Leiden guarantees connected communities. The runtime is similar. There is no reason to use Louvain in 2024+.
Thumb Rule — n_neighbors: The k in kNN graph construction controls the granularity of all downstream results (clustering, UMAP). k=15 is the default. k=5-10 for small datasets (<1000 cells). k=30-50 for large datasets (>50,000 cells). Smaller k = more local, tighter clusters. Larger k = smoother, more global structure.