Single-Cell RNA-seq: Cell Ranger & Seurat

Process 10x Chromium scRNA-seq data from the Ciona robusta tunicate using Cell Ranger for alignment and UMI counting, then Seurat for QC, normalization, clustering, and UMAP visualization.

~150 min SRR8111691–4 Advanced Bash + R
Byte

Overview

10x Chromium scRNA-seq captures the transcriptome of thousands of individual cells in a single experiment. Each cell gets a unique barcode; each mRNA molecule gets a unique molecular identifier (UMI). The Cell Ranger pipeline handles the low-level processing, producing a sparse gene × cell count matrix that feeds into Seurat for downstream analysis.

Cell Ranger (HPC/Bash)
  • BCL → FASTQ demultiplexing
  • STAR alignment (internal)
  • Cell barcode detection (EmptyDrops)
  • UMI counting per cell per gene
  • Output: filtered_feature_bc_matrix/
Seurat (R)
  • QC metrics (nCount, nFeature, %mito)
  • Normalization (NormalizeData / SCTransform)
  • Highly variable gene selection
  • PCA → UMAP/t-SNE
  • Clustering + marker gene detection
Byte
About this dataset: Ciona robusta (also known as C. intestinalis type A) is a marine tunicate chordate — the closest invertebrate relative of vertebrates. This scRNA-seq experiment profiled whole-larval transcriptomes at single-cell resolution. Dataset from SRA (SRR8111691–SRR8111694), original tutorial by Rick Masonbrink (Iowa State University GIF).

Ideal Directory Structure

Cell Ranger creates a complex output directory tree. The only folder you need for downstream Seurat analysis is filtered_feature_bc_matrix/ — but keep the full output for QC and provenance.

scrna_ciona_project/
scrna_ciona_project/ ├── reads/ # renamed FASTQ files (Cell Ranger naming) │ ├── CRobusta_S1_L001_R1_001.fastq.gz # barcode+UMI (28bp) │ └── CRobusta_S1_L001_R2_001.fastq.gz # cDNA (91bp) ├── genome/ # reference FASTA + GTF ├── C_robusta/ # Cell Ranger reference index (cellranger mkref) ├── CRobusta_run/ # Cell Ranger count output │ └── outs/ │ ├── web_summary.html # interactive QC report (open in browser!) │ ├── metrics_summary.csv │ └── filtered_feature_bc_matrix/ # Seurat input │ ├── barcodes.tsv.gz │ ├── features.tsv.gz │ └── matrix.mtx.gz └── seurat_analysis/ # all R outputs ├── seurat_obj.rds # checkpoint: save after each major step ├── umap_clusters.png └── marker_genes.csv
📄
filtered vs raw matrix: Cell Ranger produces both filtered_feature_bc_matrix/ (cells only, EmptyDrops-filtered) and raw_feature_bc_matrix/ (all barcodes including empty droplets). Always use filtered for standard Seurat analysis. Use raw only if you want to run your own cell-calling algorithm (e.g. DropUtils).

Dataset

Run IDDescriptionRead type
SRR8111691C. robusta scRNA-seq rep 1R1 = barcode+UMI (28 bp), R2 = cDNA (91 bp)
SRR8111692C. robusta scRNA-seq rep 2R1 = barcode+UMI (28 bp), R2 = cDNA (91 bp)
SRR8111693C. robusta scRNA-seq rep 3R1 = barcode+UMI (28 bp), R2 = cDNA (91 bp)
SRR8111694C. robusta scRNA-seq rep 4R1 = barcode+UMI (28 bp), R2 = cDNA (91 bp)
10x read structure: Read 1 (R1) contains the 16 bp cell barcode + 12 bp UMI = 28 bp total. Read 2 (R2) is the actual cDNA sequence. Cell Ranger knows this structure automatically — do NOT trim R1.

Step 1 — Install Cell Ranger

Bash — Cell Ranger installation (no root required)
# Cell Ranger is a self-contained binary — no compilation needed
# Download from 10x Genomics (free registration required):
# https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest

wget "https://cf.10xgenomics.com/releases/cell-exp/cellranger-8.0.1.tar.gz"
tar -zxvf cellranger-8.0.1.tar.gz

# Add to PATH
export PATH="$(pwd)/cellranger-8.0.1:${PATH}"

# Verify
cellranger --version
# cellranger cellranger-8.0.1

# Note: Only dependency is bcl2fastq for BCL conversion
# (not needed if you already have FASTQ files)
# If on HPC: module load bcl2fastq2
Download:

Step 2 — Get Reads and Genome

Bash — Download SRA reads and C. robusta genome
mkdir -p reads genome

# Download reads from SRA
module load sra-toolkit

for SRR in SRR8111691 SRR8111692 SRR8111693 SRR8111694; do
    echo "Downloading: ${SRR}"
    fastq-dump \
        --split-files \
        --origfmt \
        --gzip \
        --outdir reads/ \
        ${SRR}
    echo "  Done: reads/${SRR}_1.fastq.gz (barcode+UMI)"
    echo "       reads/${SRR}_2.fastq.gz (cDNA)"
done

# Cell Ranger expects files named with a specific convention:
# {SAMPLE}_S{n}_L{lane}_R{1/2}_001.fastq.gz
# Rename files to match Cell Ranger naming convention
cd reads
for SRR in SRR8111691 SRR8111692 SRR8111693 SRR8111694; do
    N=$((${N:-0}+1))
    mv ${SRR}_1.fastq.gz CRobusta_S${N}_L001_R1_001.fastq.gz
    mv ${SRR}_2.fastq.gz CRobusta_S${N}_L001_R2_001.fastq.gz
done
cd ..

# Download C. robusta (KH assembly) genome from NCBI
cd genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/224/145/GCF_000224145.3_KH/GCF_000224145.3_KH_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/224/145/GCF_000224145.3_KH/GCF_000224145.3_KH_genomic.gff.gz
gunzip *.gz
cd ..
Download:
Data download and renaming decoded
📦fastq-dump --split-filesDownloads reads from SRA and splits into separate R1 and R2 files. --origfmt preserves original read names. --gzip compresses output. For newer SRA toolkit, prefer fasterq-dump --split-files which is ~10x faster.
🏷Cell Ranger naming conventionCell Ranger requires FASTQ files named as {SAMPLE}_S{n}_L{lane}_R{1/2}_001.fastq.gz. This is the Illumina bcl2fastq naming convention. S1 = sample number, L001 = lane. If your files don't match this pattern exactly, cellranger count will fail to find them.
📄R1 = barcode+UMI (28 bp)In 10x v3 chemistry: first 16 bp = cell barcode, next 12 bp = UMI = 28 bp total. Cell Ranger knows this structure automatically via the chemistry flag. Do NOT trim R1 — removing barcode/UMI bases will destroy cell-level information.

Step 3 — Build Cell Ranger Reference

Cell Ranger cannot use GFF3 directly — it needs GTF format. Convert with genometools, then build the reference index:

Bash — GFF3 → GTF conversion and Cell Ranger mkref
module load genometools
# or: conda install -c bioconda genometools

GFF="genome/GCF_000224145.3_KH_genomic.gff"
GENOME="genome/GCF_000224145.3_KH_genomic.fna"

# Convert GFF3 to GTF
# Note: gt skips non-standard feature types (pseudogenes, snoRNAs etc.)
# This is fine — Cell Ranger needs gene/exon features
gt gff3_to_gtf ${GFF} > genome/GCF_000224145.3_KH_genomic.gtf

echo "GTF lines:"
wc -l genome/GCF_000224145.3_KH_genomic.gtf

# Filter GTF to only well-supported gene types (optional but recommended)
cellranger mkgtf \
    genome/GCF_000224145.3_KH_genomic.gtf \
    genome/GCF_000224145.3_KH_genomic.filtered.gtf \
    --attribute=gene_biotype:protein_coding

# Build Cell Ranger reference genome index
# This takes ~30 min and uses ~16 GB RAM
cellranger mkref \
    --genome=C_robusta \
    --fasta=${GENOME} \
    --genes=genome/GCF_000224145.3_KH_genomic.filtered.gtf \
    --ref-version=3.0.0

echo "Reference built: C_robusta/"
ls C_robusta/
Download:
Cell Ranger mkref commands decoded
🔄gt gff3_to_gtfConverts GFF3 to GTF format. Cell Ranger requires GTF (Gene Transfer Format), not GFF3. The key difference: GTF uses gene_id and transcript_id attributes in a specific format that Cell Ranger parses to build the gene-to-exon map.
🔨cellranger mkgtf --attribute=gene_biotype:protein_codingFilters the GTF to keep only protein-coding genes. This is optional but recommended: non-coding RNAs, pseudogenes, and repetitive elements can inflate read mapping rates and contaminate cell type assignments.
🏠cellranger mkref --genome --fasta --genesBuilds a STAR genome index + Gene annotation index optimized for Cell Ranger. Takes ~30 min and ~16 GB RAM. The output directory (here: C_robusta/) is reused for all samples from the same species — build once, use many times.
GTF gene_id must be unique. Cell Ranger uses gene_id from the GTF as the gene name in the count matrix. Whatever is in the ID= field of your GFF column 9 becomes the gene name. For NCBI GFFs, gene IDs often look like gene1, gene2 — consider renaming chromosome-specific prefixes to make them descriptive.

Step 4 — Cell Ranger Count

Bash — Cell Ranger count (SLURM)
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --job-name=cellranger_count
#SBATCH --mem=128G
#SBATCH --output=cellranger.%j.out

export PATH="$(pwd)/cellranger-8.0.1:${PATH}"

cellranger count \
    --id=CRobusta_run \
    --transcriptome=C_robusta \
    --fastqs=reads/ \
    --sample=CRobusta \
    --localcores=16 \
    --localmem=120 \
    --expect-cells=3000

# Key output directory:
# CRobusta_run/outs/filtered_feature_bc_matrix/
#   - barcodes.tsv.gz    (cell barcodes)
#   - features.tsv.gz    (gene IDs)
#   - matrix.mtx.gz      (UMI count matrix in sparse format)
#
# CRobusta_run/outs/web_summary.html  -- interactive QC report

echo "Cell Ranger complete. Results:"
ls CRobusta_run/outs/

# Quick stats from the summary CSV
cat CRobusta_run/outs/metrics_summary.csv
Download:
cellranger count flags decoded
📁--id=CRobusta_runName of the output directory Cell Ranger creates. All outputs go in CRobusta_run/outs/. Choose a descriptive name including sample ID and run date for reproducibility.
📄--fastqs=reads/Path to the directory containing FASTQ files. Cell Ranger scans this directory for files matching the --sample prefix. Can be a comma-separated list of directories if reads are spread across multiple flowcell lanes.
🎯--expect-cells=3000Approximate number of cells you expect to recover. This guides Cell Ranger's EmptyDrops cell-calling algorithm. Set to your experimental target (e.g. you loaded 5,000 cells expecting ~3,000 to be captured). Being off by 2x is fine.
💾--localmem=120Maximum RAM in GB for Cell Ranger to use. Set to ~90% of available node memory. Cell Ranger needs ~40–64 GB for most mammalian genomes. The STAR alignment step is the memory bottleneck.
🌐
web_summary.html is your QC dashboard. Open it in any browser immediately after Cell Ranger finishes. Check: (1) estimated cell count matches expectation, (2) >70% reads mapped confidently to genome, (3) >70% reads in cells (not background). If mapping is low, check your reference genome/GTF compatibility.

Expected Cell Ranger web_summary.html metrics for this dataset:

MetricExpected value
Estimated number of cells~2,000–4,000
Median genes per cell~1,500–2,500
Reads mapped confidently to genome>70%
Sequencing saturation~60–80%

Step 5 — Load Data into Seurat

R — Load Cell Ranger output into Seurat
library(Seurat)
library(tidyverse)
library(patchwork)

# Load the filtered feature-barcode matrix
# Read10X automatically detects barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz
counts <- Read10X(
    data.dir = "CRobusta_run/outs/filtered_feature_bc_matrix/"
)

# Create Seurat object
# min.cells: keep genes detected in at least 3 cells
# min.features: keep cells with at least 200 genes detected
seurat_obj <- CreateSeuratObject(
    counts = counts,
    project = "CRobusta_scRNAseq",
    min.cells = 3,
    min.features = 200
)

cat("Cells:", ncol(seurat_obj), "\n")
cat("Genes:", nrow(seurat_obj), "\n")
cat("Dimensions:", dim(seurat_obj), "\n")

# Quick peek at the data
head(seurat_obj@meta.data)
Download:
Seurat object creation decoded
📄Read10X(data.dir=)Reads the three files from Cell Ranger output: barcodes.tsv.gz (cell IDs), features.tsv.gz (gene IDs), matrix.mtx.gz (sparse UMI count matrix). Returns a sparse dgCMatrix (rows=genes, cols=cells). Handles gzipped files automatically.
🏠CreateSeuratObject(min.cells=3)Creates the Seurat object and applies initial filters. min.cells=3 removes genes detected in fewer than 3 cells (likely noise). min.features=200 removes barcodes with fewer than 200 genes (likely empty droplets not caught by Cell Ranger).
📊seurat_obj@meta.dataThe per-cell metadata slot. After creation it contains: nCount_RNA (total UMIs per cell) and nFeature_RNA (unique genes per cell). You add more metadata columns here (percent.mt, sample ID, cluster labels, etc.).

Step 6 — QC Metrics & Cell Filtering

R — QC visualization and filtering
library(Seurat)
library(ggplot2)

# Calculate mitochondrial gene percentage
# For C. robusta: mitochondrial genes typically start with "MT-" or "mt-"
# Check your organism's gene naming convention
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(
    seurat_obj,
    pattern = "^MT-"   # regex for mitochondrial genes
)

# Visualize QC metrics before filtering
VlnPlot(seurat_obj,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3, pt.size = 0.1)
ggsave("qc_violin_before_filter.png", width=12, height=5)

# Scatter plots: detect doublets (unusually high counts)
plot1 <- FeatureScatter(seurat_obj, feature1="nCount_RNA", feature2="percent.mt")
plot2 <- FeatureScatter(seurat_obj, feature1="nCount_RNA", feature2="nFeature_RNA")
plot1 + plot2
ggsave("qc_scatter.png", width=12, height=5)

# Filter cells:
# Remove cells with:
# - Too few genes (likely empty droplets): nFeature_RNA < 200
# - Too many genes (likely doublets): nFeature_RNA > 5000
# - High mitochondrial percentage (dying/damaged cells): percent.mt > 15
seurat_obj <- subset(seurat_obj,
    subset = nFeature_RNA > 200 &
             nFeature_RNA < 5000 &
             percent.mt < 15
)

cat("Cells after QC filtering:", ncol(seurat_obj), "\n")

# Re-visualize after filtering
VlnPlot(seurat_obj,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3, pt.size = 0.1)
ggsave("qc_violin_after_filter.png", width=12, height=5)
Download:
Seurat QC parameters decoded
🧬PercentageFeatureSet(pattern="^MT-")Calculates the percentage of UMIs from mitochondrial genes for each cell. High %mito (>15–25%) indicates a dying or damaged cell where the cytoplasm has leaked but the mitochondria (membrane-bound) are retained. These cells have distorted transcriptomes.
🎧VlnPlot(features=c("nFeature_RNA","nCount_RNA","percent.mt"))Three violin plots showing per-cell distributions of gene count, UMI count, and % mitochondrial reads. The visual shapes tell you where to set cutoffs: a long upper tail on nFeature_RNA suggests doublets; a separate high-mito population indicates dead cells.
📊FeatureScatter(nCount vs nFeature)Scatter plot: real cells show a positive linear relationship between UMI count and gene count. Doublets appear as outliers above the line (two cells captured together have ~2x both metrics). Damaged cells appear below the line (few genes despite many UMIs = transcriptionally restricted).
subset(nFeature_RNA > 200 & < 5000 & percent.mt < 15)Applies all three filters simultaneously. These thresholds are starting points only — always adjust based on your violin plots. Overly strict filtering removes real cells; too lenient includes debris that skews clustering.
QC thresholds are dataset-specific. Always look at the distributions before setting cutoffs. For a typical 10x experiment: nFeature > 200 and < 5000 (adjust upper cutoff based on your median), percent.mt < 10–25% (varies by tissue type). Neurons and muscle cells have high mitochondrial content naturally — do not over-filter.

Step 7 — Normalization, HVG Selection, and Scaling

R — Normalization and feature selection
library(Seurat)

# Standard normalization: normalize to total counts per cell,
# multiply by scale factor (10,000), log1p transform
seurat_obj <- NormalizeData(seurat_obj,
    normalization.method = "LogNormalize",
    scale.factor = 10000)

# Find highly variable genes (HVGs) — genes that vary the most across cells
# These drive the biological signal; stable housekeeping genes add noise
seurat_obj <- FindVariableFeatures(seurat_obj,
    selection.method = "vst",
    nfeatures = 2000)   # top 2000 HVGs (standard)

# Visualize top variable genes
top20 <- head(VariableFeatures(seurat_obj), 20)
VariableFeaturePlot(seurat_obj) +
    LabelPoints(plot=VariableFeaturePlot(seurat_obj), points=top20, repel=TRUE)
ggsave("variable_features.png", width=10, height=6)

# Scale data (required for PCA)
# Regress out sources of technical variation
seurat_obj <- ScaleData(seurat_obj,
    features = rownames(seurat_obj),
    vars.to.regress = c("nCount_RNA", "percent.mt"))

cat("Data normalized, HVGs identified, and scaled.\n")
Download:
Normalization and HVG selection decoded
📈NormalizeData(scale.factor=10000)Divides each cell's UMI counts by that cell's total UMI count, multiplies by 10,000 (to get counts-per-ten-thousand, like CPM), then takes log1p. This corrects for differences in sequencing depth between cells. After normalization, each cell's total expression sums to the same value.
🌟FindVariableFeatures(nfeatures=2000)Identifies the 2,000 most variable genes using the VST (variance-stabilizing transformation) method. These genes drive biological differences between cell types. Using all ~20,000+ genes adds noise from stable housekeeping genes and dramatically increases compute time.
ScaleData(vars.to.regress=c("nCount_RNA","percent.mt"))Z-score normalizes each gene (mean=0, variance=1) across cells, and optionally regresses out confounders. Regressing percent.mt removes variation due to cell damage. Regressing nCount_RNA removes sequencing depth effects that survived log-normalization. Required input for PCA.
🍴
NormalizeData vs SCTransform: NormalizeData + FindVariableFeatures + ScaleData is the classic Seurat v2/v3 workflow. SCTransform() (newer) does all three steps in one using regularized negative binomial regression — generally gives better results for datasets with highly variable sequencing depth across cells. Both approaches are widely used.

Step 8 — Dimensionality Reduction, Clustering & UMAP

R — PCA, UMAP, and Louvain clustering
library(Seurat)

# PCA on HVGs
seurat_obj <- RunPCA(seurat_obj,
    features = VariableFeatures(seurat_obj),
    npcs = 50)

# Elbow plot to choose the number of PCs to use for downstream steps
ElbowPlot(seurat_obj, ndims=50)
ggsave("pca_elbow_plot.png", width=8, height=5)
# Choose PCs where the curve flattens (typically PC 10-20)
NUM_PCS <- 20

# Build k-nearest neighbor graph on selected PCs
seurat_obj <- FindNeighbors(seurat_obj,
    dims = 1:NUM_PCS,
    k.param = 20)

# Louvain clustering (resolution controls granularity)
# Higher resolution = more/smaller clusters
# Start with 0.5 and adjust based on biology
seurat_obj <- FindClusters(seurat_obj,
    resolution = 0.5)

cat("Number of clusters:", nlevels(Idents(seurat_obj)), "\n")
table(Idents(seurat_obj))

# UMAP for visualization
seurat_obj <- RunUMAP(seurat_obj,
    dims = 1:NUM_PCS,
    min.dist = 0.3,
    spread = 1.0)

# Plot UMAP
DimPlot(seurat_obj, reduction="umap", label=TRUE, pt.size=0.5) +
    NoLegend() +
    ggtitle("C. robusta scRNA-seq — Louvain clusters")
ggsave("umap_clusters.png", width=8, height=7)

# Plot QC metrics on UMAP
FeaturePlot(seurat_obj,
    features = c("nCount_RNA", "nFeature_RNA", "percent.mt"),
    ncol = 3)
ggsave("umap_qc_features.png", width=15, height=5)
Download:
PCA, clustering, and UMAP decoded
📊RunPCA(npcs=50)Runs Principal Component Analysis on the scaled HVG matrix. Each PC captures a dimension of transcriptional variation. The first few PCs usually represent major biological signals (cell type, cell cycle stage). PCs beyond ~20–30 typically capture noise.
📈ElbowPlot + NUM_PCS=20The elbow plot shows variance explained per PC. Choose the “elbow” — where the curve flattens. Including too few PCs misses biological variation; too many adds noise to clustering. Typical choices: 10–30 PCs depending on dataset complexity.
👥FindNeighbors(dims=1:20)Builds a k-nearest-neighbor (KNN) graph in PC space. Each cell is connected to its 20 most similar cells. This graph is the input for Louvain clustering. The dims here must match the PCs you selected from the elbow plot.
🎯FindClusters(resolution=0.5)Applies the Louvain algorithm to the KNN graph to find communities (clusters). Resolution controls granularity: 0.1–0.3 = fewer large clusters (broad cell types), 0.8–2.0 = many small clusters (fine subtypes). Always try multiple resolutions and compare with known markers.
🌍RunUMAP(min.dist=0.3)Reduces the PCA embedding to 2D for visualization. min.dist controls how tightly cells within a cluster are packed: lower = tighter clusters with more empty space between them. UMAP preserves local structure (cells within a cluster) but global distances between clusters are not meaningful.
🎯
UMAP is for visualization only! Distances between clusters on a UMAP plot are not quantitatively meaningful — two clusters far apart on UMAP might be transcriptionally more similar than two clusters that appear close. Use UMAP to show cluster structure and gene expression patterns, but base statistical conclusions on the actual cluster assignments and marker gene analysis.

Step 9 — Marker Gene Identification & Cell Type Annotation

R — FindAllMarkers + DotPlot annotation
library(Seurat)
library(ggplot2)
library(dplyr)

# Find marker genes for each cluster vs all other clusters
# test.use = "wilcox" (fast, recommended default)
# only.pos = TRUE: only return upregulated markers per cluster
all_markers <- FindAllMarkers(seurat_obj,
    only.pos = TRUE,
    min.pct = 0.25,         # gene must be detected in >= 25% of cluster cells
    logfc.threshold = 0.25, # log2FC threshold for initial screening
    test.use = "wilcox")

# Keep top 10 markers per cluster by avg_log2FC
top10 <- all_markers %>%
    group_by(cluster) %>%
    slice_max(order_by = avg_log2FC, n = 10)

write.csv(top10, "top10_markers_per_cluster.csv", row.names=FALSE)

# Heatmap of top 5 markers per cluster
top5 <- all_markers %>%
    group_by(cluster) %>%
    slice_max(order_by = avg_log2FC, n = 5)
DoHeatmap(seurat_obj,
    features = top5$gene,
    size = 3) +
    theme(axis.text.y = element_text(size=6))
ggsave("marker_heatmap.png", width=14, height=10)

# DotPlot of known C. robusta tissue markers
# (from Cao et al. 2019 and other published C. robusta cell atlases)
known_markers <- c(
    "Ci-Hox3",    # neural
    "Ci-Otx",     # anterior neural plate
    "Ci-Mesp",    # mesoderm
    "Ci-Epi1",    # epidermis
    "Ci-Myl",     # muscle
    "Ci-Noto"     # notochord
)
known_in_data <- known_markers[known_markers %in% rownames(seurat_obj)]

if (length(known_in_data) > 0) {
    DotPlot(seurat_obj, features = known_in_data) +
        RotatedAxis()
    ggsave("known_marker_dotplot.png", width=10, height=6)
}

# Rename clusters with biological identities (after literature review)
# Example based on top markers:
new.cluster.ids <- c("Epidermis", "Muscle", "Notochord", "Neural",
                      "Mesoderm", "Endoderm", "Pharynx", "Unknown")
names(new.cluster.ids) <- levels(seurat_obj)
seurat_obj <- RenameIdents(seurat_obj, new.cluster.ids)

DimPlot(seurat_obj, reduction="umap", label=TRUE, pt.size=0.5) +
    ggtitle("C. robusta — Annotated cell types")
ggsave("umap_annotated.png", width=9, height=7)

saveRDS(seurat_obj, "CRobusta_seurat_final.rds")
cat("Analysis complete. Final object saved.\n")
Download:

Troubleshooting

Cell Ranger: "No input FASTQs found"

Cell Ranger is strict about FASTQ naming. Files must match: {sample}_S{n}_L{lane}_R{1,2}_001.fastq.gz. The --sample flag must match the prefix before _S1_. Verify with ls reads/ | grep CRobusta. If files don't match this pattern, rename them before running.

Cell Ranger reports: 0 cells detected

Almost always a barcode mismatch. Cell Ranger uses a whitelist of valid 10x barcodes per chemistry version. Check the chemistry version used during library prep (e.g., v2 vs v3 barcodes are different). Set --chemistry=SC3Pv3 explicitly if auto-detection fails.

Seurat: too many or too few clusters

Adjust the resolution parameter in FindClusters. Higher resolution (e.g., 1.0) → more clusters. Lower (e.g., 0.1) → fewer clusters. For exploratory analysis, try 0.3, 0.5, and 1.0 and compare UMAP plots. Use clustree package to visualize cluster stability across resolutions.

ScaleData runs out of memory on large datasets

Only scale the HVGs (not all genes): ScaleData(seurat_obj, features = VariableFeatures(seurat_obj)). Scaling all genes is only needed for heatmaps. Alternatively use SCTransform instead of NormalizeData + ScaleData — it is faster and more memory-efficient.