Label-Free Proteomics: MaxQuant & Perseus

Overview

Label-free quantitative (LFQ) proteomics measures protein abundance across samples by comparing the intensity of peptide mass spectra without using isotope labels. MaxQuant performs the raw MS data processing (peptide identification and quantification), and Perseus provides a complete statistical analysis environment specifically designed for proteomics data.

Proteomics vs RNA-seq — key conceptual differences:

Missing values: Proteins not detected in a sample give a 0 or NA intensity — this is “missing not at random” (protein is likely absent, not just unmeasured). Must impute from the lower end of the distribution, not from the mean.
Dynamic range: LC-MS/MS covers ~4 orders of magnitude of protein abundance. Highly abundant proteins (albumin, ribosomal proteins) dominate spectra and suppress rare protein detection.
No count data: LFQ intensities are continuous, not counts. Use t-tests or limma, not DESeq2/edgeR (which require discrete count data).
Log2 intensities: Always work with log2-transformed intensities — raw intensities are log-normally distributed and cannot be compared with parametric tests.

Dataset

We use the PRIDE PXD000279 dataset — a benchmark LFQ proteomics dataset from the Cox/Mann lab (MaxQuant developers) comparing three UPS1 standard spike-in concentrations mixed into a yeast background proteome. This is the standard dataset used to benchmark LFQ proteomics methods.

Property	Value
Dataset	UPS1 standard proteins spiked into yeast at three concentrations
PRIDE accession	PXD000279
Groups	Low (0.25 fmol/μL), Medium (2.5 fmol/μL), High (25 fmol/μL)
Replicates	3 technical replicates per concentration group
Instrument	Orbitrap Velos (Q Exactive), DDA mode
Expected result	48 UPS1 proteins significantly up with increasing concentration; yeast background constant
Publication	Cox et al. 2014 Molecular & Cellular Proteomics

Step 1 — MaxQuant Database Search

MaxQuant is a GUI application for Windows. The settings below describe the key parameters to configure in the MaxQuant interface. A command-line version is also available for HPC: dotnet MaxQuant.dll mqpar.xml where mqpar.xml is the parameter file exported from the GUI.

MaxQuant Configuration — key parameter settings

######################################################################
## STEP 1: MaxQuant database search configuration
## MaxQuant (https://www.maxquant.org/) is a free Windows GUI tool.
## Below are the critical settings to configure in the MaxQuant GUI,
## with explanations for each parameter.
## Download MaxQuant: https://www.maxquant.org/maxquant/
######################################################################

## ====================================================================
## RAW FILE LOADING (MaxQuant GUI → Raw files tab)
## ====================================================================
## 1. Add all .raw files (9 files: 3 groups × 3 replicates)
## 2. Set experiment names: Low_1, Low_2, Low_3, Med_1, Med_2, Med_3, High_1, High_2, High_3
## 3. Group replicates: parameter groups (all same settings, no fractionation)

## ====================================================================
## SEQUENCE DATABASES (MaxQuant GUI → Group-specific parameters → Sequences)
## ====================================================================
## Add FASTA files:
##   1. Human UPS1 standard proteins FASTA (48 proteins)
##      Download: https://www.sigmaaldrich.com/deepweb/assets/sigmaaldrich/product/documents/168/214/ups1ups2datasheet.pdf
##   2. Saccharomyces cerevisiae UniProt reference proteome
##      Download from UniProt: https://www.uniprot.org/proteomes/UP000002311
##
## Why two FASTAs? The search engine identifies peptides from both databases.
## UPS1 proteins are your spike-in "ground truth"; yeast = background.

## ====================================================================
## DIGESTION (MaxQuant GUI → Group-specific parameters → Digestion)
## ====================================================================
## Enzyme: Trypsin/P
##   Trypsin cleaves after K (lysine) or R (arginine) residues
##   EXCEPT when followed by P (proline) — the /P means include KP/RP sites
## Max missed cleavages: 2
##   Allow up to 2 incomplete tryptic cleavages per peptide
##   (some K/R sites are not cleaved by trypsin)

## ====================================================================
## MODIFICATIONS (MaxQuant GUI → Group-specific parameters → Modifications)
## ====================================================================
## Fixed modifications:
##   Carbamidomethyl (C) — standard cysteine alkylation with iodoacetamide
##   This blocks free cysteine thiols to prevent disulfide scrambling
## Variable modifications:
##   Oxidation (M)       — methionine oxidation (common in-solution artifact)
##   Acetyl (Protein N-term) — N-terminal protein acetylation

## ====================================================================
## LABEL-FREE QUANTIFICATION (MaxQuant GUI → Group-specific parameters → Label-free quantification)
## ====================================================================
## LFQ: ON (checked)
## Min. ratio count: 2
##   A protein is quantified if at least 2 unique peptide pairs were used
##   (this filters proteins with only one unique peptide quantification)
## Fast LFQ: OFF (use standard LFQ for better accuracy; fast LFQ for large datasets > 50 samples)
## Match between runs: ON
##   MaxQuant transfers peptide identifications from one run to another
##   based on accurate mass and retention time alignment.
##   This dramatically reduces missing values between samples.
##   In our 9-sample experiment, this is CRITICAL — without it,
##   proteins detected in only 1 replicate appear as "missing" in the others.

## ====================================================================
## GLOBAL PARAMETERS (MaxQuant GUI → Global parameters)
## ====================================================================
## Peptide FDR:    0.01  (1% false discovery rate at peptide level)
## Protein FDR:    0.01  (1% FDR at protein level)
## Min. peptide length: 7 (discard very short peptides — unreliable)
## Max. charge: 7 (include charge states 1 to 7)
##
## Threads: set to available CPU cores (MaxQuant uses parallel processing)
## Temp folder: set to a fast local SSD (MaxQuant creates many temp files)

echo "======================================================="
echo "MaxQuant OUTPUT FILES (in /combined/txt/ directory):"
echo "  proteinGroups.txt  - main output: one row per protein group"
echo "  peptides.txt       - all detected peptides + intensities"
echo "  summary.txt        - run statistics (identification rates, etc.)"
echo "  parameters.txt     - all MaxQuant settings used"
echo "======================================================="
echo ""
echo "Key columns in proteinGroups.txt:"
echo "  Protein IDs      : UniProt accessions (semicolon-separated for protein groups)"
echo "  Gene names       : gene symbols"
echo "  LFQ intensity *  : LFQ-normalised intensity for each sample"
echo "  Razor + unique peptides : number of peptides used for quantification"
echo "  Potential contaminant  : + if contaminant (remove these!)"
echo "  Reverse               : + if decoy (remove these!)"
echo "  Only identified by site: + if only detected via modification (remove these!)"

Download:

MaxQuant key concepts decoded

📈LFQ normalization	MaxQuant’s LFQ algorithm normalizes intensities across runs by finding a small set of “normalization peptides” (peptides with similar intensity across all samples) and using them to compute sample-specific scaling factors. This is equivalent to median normalization but specifically designed for MS data where only a subset of proteins change between conditions.
🎯Match between runs	This feature identifies peptides in one run based on their mass AND their chromatographic retention time alignment to other runs where the peptide was confidently identified. It requires accurate mass (<5 ppm) and a tight retention time window (<0.7 min after calibration). Match between runs reduces missing values by ~30–50% in typical experiments.

Step 2 — Perseus: Load & Filter

R — Load MaxQuant proteinGroups.txt and apply standard filters

######################################################################
## STEP 2: Load MaxQuant output and filter contaminants/decoys
## Can be done in Perseus GUI OR in R (shown here for reproducibility)
## Standard filters to apply FIRST (before any analysis):
##   1. Remove contaminants (marked with "+" in "Potential contaminant" column)
##   2. Remove reverse decoys (marked with "+" in "Reverse" column)
##   3. Remove "only identified by site" proteins
##   4. Require detection in minimum N samples per group
######################################################################

library(tidyverse)

## --- Load proteinGroups.txt ---
pg <- read.delim("proteinGroups.txt", sep="\t", stringsAsFactors=FALSE)
cat("Total protein groups loaded:", nrow(pg), "\n")

## --- Step 2a: Remove contaminants, reverse, site-only ---
## These three filters are ALWAYS applied first in any MaxQuant analysis
## "+" in Potential.contaminant = common lab contaminants (keratin, trypsin, albumin)
## "+" in Reverse = decoy database hits (used to estimate FDR; not real proteins)
## "+" in Only.identified.by.site = protein only detected via modified peptide
pg_filtered <- pg %>%
    filter(
        Potential.contaminant != "+",   # remove contaminants
        Reverse               != "+",   # remove reverse decoys
        Only.identified.by.site != "+"  # remove site-only
    )
cat("After contaminant/decoy removal:", nrow(pg_filtered), "\n")

## --- Step 2b: Extract LFQ intensity columns ---
## LFQ intensity columns are named: "LFQ.intensity.SampleName"
lfq_cols <- grep("^LFQ\\.intensity\\.", colnames(pg_filtered), value=TRUE)
cat("LFQ intensity columns:", length(lfq_cols), "\n")
cat("Samples:", gsub("LFQ\\.intensity\\.", "", lfq_cols), "\n")

## --- Step 2c: Convert to log2 intensities ---
## Raw LFQ intensities are log-normally distributed (range: ~10^6 to 10^11)
## Log2 transformation: compresses range, makes distribution normal
## Replace 0 with NA: 0 in MaxQuant = not detected (missing value, not zero)
lfq_mat <- pg_filtered[, lfq_cols]
lfq_mat[lfq_mat == 0] <- NA           # 0 = missing, convert to NA
lfq_log2 <- log2(lfq_mat)            # log2 transform
rownames(lfq_log2) <- pg_filtered$Gene.names

## Check missing value pattern
missing_pct <- colMeans(is.na(lfq_log2)) * 100
cat("\nMissing values per sample (%):\n")
print(round(missing_pct, 1))
## Expect ~20-40% missing values in typical LFQ experiments

## --- Step 2d: Filter: keep proteins quantified in >= 2 of 3 replicates per group ---
## Define groups
groups <- list(
    Low  = grep("Low",  lfq_cols, value=TRUE),
    Med  = grep("Med",  lfq_cols, value=TRUE),
    High = grep("High", lfq_cols, value=TRUE)
)

## Keep protein if quantified in at least 2/3 replicates in AT LEAST ONE group
valid_low  <- rowSums(!is.na(lfq_log2[, groups$Low]))  >= 2
valid_med  <- rowSums(!is.na(lfq_log2[, groups$Med]))  >= 2
valid_high <- rowSums(!is.na(lfq_log2[, groups$High])) >= 2

keep <- valid_low | valid_med | valid_high
lfq_log2 <- lfq_log2[keep, ]
cat("\nProteins after valid-value filter:", nrow(lfq_log2), "\n")

Download:

Step 3 — Normalization & Missing Value Imputation

R — Median normalization and MNAR imputation from low-end Gaussian

######################################################################
## STEP 3: Normalize intensities and impute missing values
## LFQ intensities from MaxQuant are already normalised within MaxQuant.
## However, additional median centering per sample is recommended in R.
## Missing value imputation: replace NA with draws from the lower end
## of the intensity distribution — because missing = not detected =
## likely low abundance (Missing Not At Random, MNAR imputation)
######################################################################

## --- Step 3a: Median centering (additional normalisation) ---
## Subtract the median of each sample from its log2 intensities
## This ensures all samples have the same median (centered at 0 after centering)
## Plot before/after to confirm
lfq_norm <- sweep(lfq_log2,
                  MARGIN = 2,
                  FUN    = "-",
                  STATS  = apply(lfq_log2, 2, median, na.rm=TRUE))

## --- Boxplot before and after normalisation ---
par(mfrow=c(1,2))
boxplot(lfq_log2, las=2, main="Before normalisation", cex.axis=0.7,
        col=c(rep("lightblue",3), rep("lightgreen",3), rep("salmon",3)))
boxplot(lfq_norm, las=2, main="After median normalisation", cex.axis=0.7,
        col=c(rep("lightblue",3), rep("lightgreen",3), rep("salmon",3)))
## Boxes should align after normalisation

## --- Step 3b: MNAR imputation (Perseus-style: MinProb) ---
## Perseus default: impute missing values from a downshifted Gaussian
## Parameters:
##   width = 0.3 × sample SD    (how wide the imputation distribution is)
##   shift = 1.8 × sample SD below the sample mean
##   (imputed values represent proteins at the detection limit)
## This is called "MinProb" in Perseus
set.seed(42)
lfq_imputed <- lfq_norm

for (col in colnames(lfq_imputed)) {
    ## Find positions with missing values in this sample
    missing_idx <- which(is.na(lfq_imputed[[col]]))
    if (length(missing_idx) == 0) next

    ## Sample parameters
    col_mean <- mean(lfq_imputed[[col]], na.rm=TRUE)
    col_sd   <- sd(lfq_imputed[[col]],   na.rm=TRUE)

    ## Draw from a narrow Gaussian at the low end of the distribution
    ## shift: imputed values are centred 1.8 SDs below the observed mean
    ## width: imputed distribution is 0.3× narrower than the observed distribution
    lfq_imputed[[col]][missing_idx] <- rnorm(
        n    = length(missing_idx),
        mean = col_mean - 1.8 * col_sd,   # shifted downward
        sd   = 0.3 * col_sd               # narrowed
    )
}

cat("Imputation complete. Missing values remaining:",
    sum(is.na(lfq_imputed)), "\n")
## Should be 0 after imputation

Download:

Imputation decoded

📊Why MNAR imputation matters

Missing values in proteomics are NOT random. If a protein is absent from a sample, it’s likely because it’s at or below the detection limit — not because of a random technical failure. Standard mean/median imputation (MAR-based) would impute it as an average-intensity protein, which is biologically wrong. MinProb imputation places missing values at the low end of the distribution, where undetected proteins should be.

🎯Perseus vs R

Perseus (also free from the Mann lab) provides a GUI implementation of all these steps with drag-and-drop workflow building. The R implementation shown here is identical statistically but scriptable and reproducible. For routine use, Perseus GUI is faster to set up; for publication pipelines, R is more reproducible.

Step 4 — Statistical Testing with limma

R — limma moderated t-test for proteomics differential abundance

######################################################################
## STEP 4: Differential abundance testing with limma
## For proteomics: use limma (moderated t-test) NOT DESeq2/edgeR
## Reason: LFQ intensities are continuous (not counts)
## limma uses empirical Bayes to shrink variance estimates toward the
## global mean variance — crucial with small n (n=3 per group here)
######################################################################

library(limma)

## --- Set up design matrix ---
group   <- factor(c("Low","Low","Low","Med","Med","Med","High","High","High"))
design  <- model.matrix(~ 0 + group)   # 0+ = no intercept (group-coded design)
colnames(design) <- levels(group)
rownames(design) <- colnames(lfq_imputed)

## --- Contrast matrix: define all pairwise comparisons ---
## We want: High vs Low, High vs Med, Med vs Low
contrasts_mat <- makeContrasts(
    High_vs_Low  = High - Low,
    High_vs_Med  = High - Med,
    Med_vs_Low   = Med  - Low,
    levels       = design
)

## --- Fit limma model ---
## lmFit: fits a linear model for each protein (row)
## t(lfq_imputed) : transpose so proteins are rows, samples are columns
fit <- lmFit(t(lfq_imputed), design)

## --- Apply contrasts ---
## contrasts.fit: re-parameterises the model for the specified contrasts
fit2 <- contrasts.fit(fit, contrasts_mat)

## --- Empirical Bayes moderation ---
## eBayes: uses information across all proteins to moderate each protein's
## variance estimate. This stabilises the t-statistics for proteins with
## few valid quantifications. Critical for n=3 per group.
fit2 <- eBayes(fit2, trend=TRUE)   # trend=TRUE: model variance-mean trend

## --- Extract results for High vs Low ---
res_HvL <- topTable(
    fit2,
    coef       = "High_vs_Low",  # which contrast to extract
    n          = Inf,             # return all proteins
    adjust.method = "BH",        # Benjamini-Hochberg FDR
    sort.by    = "P"
)
cat("Significant proteins (High vs Low, FDR < 0.05):",
    sum(res_HvL$adj.P.Val < 0.05, na.rm=TRUE), "\n")
## Expected: ~48 UPS1 proteins (up in High); yeast background ~ no change

write.csv(res_HvL, "limma_High_vs_Low.csv")

Download:

🧬

Expected result: The 48 UPS1 spike-in proteins should all be significant with positive log2FC (Higher in High vs Low). The ~2,000 yeast background proteins should be near zero fold-change and mostly non-significant — this is the key validation of the pipeline. If yeast proteins are systematically changing, check your normalisation step.

Step 5 — Volcano Plot & Heatmap

R — Volcano plot and heatmap distinguishing UPS1 from yeast background

######################################################################
## STEP 5: Publication-quality volcano and heatmap for proteomics data
## Colour-code UPS1 spike-in proteins (should be UP) vs yeast (should be flat)
######################################################################

library(ggplot2); library(ggrepel); library(pheatmap)

## --- Volcano plot ---
vol <- res_HvL %>%
    rownames_to_column("protein") %>%
    mutate(
        ups1     = grepl("HUMAN", protein, ignore.case=TRUE),  # UPS1 proteins have _HUMAN suffix
        sig      = adj.P.Val < 0.05 & abs(logFC) > 1,
        category = case_when(
            ups1 & sig  ~ "UPS1 (significant)",
            ups1        ~ "UPS1 (not sig)",
            sig         ~ "Yeast (significant)",
            TRUE        ~ "Yeast (background)"
        )
    )

ggplot(vol, aes(logFC, -log10(adj.P.Val), color=category, size=category)) +
    geom_point(alpha=0.7) +
    geom_hline(yintercept=-log10(0.05), linetype=2, color="grey50") +
    geom_vline(xintercept=c(-1,1), linetype=2, color="grey50") +
    geom_text_repel(data=filter(vol, category=="UPS1 (significant)"),
                    aes(label=protein), size=2.5, max.overlaps=20) +
    scale_color_manual(values=c(
        "UPS1 (significant)"   = "firebrick",
        "UPS1 (not sig)"       = "tomato",
        "Yeast (significant)"  = "steelblue",
        "Yeast (background)"   = "grey80"
    )) +
    scale_size_manual(values=c(1.8, 1.2, 1.2, 0.6)) +
    labs(title="LFQ proteomics: High vs Low UPS1 concentration",
         subtitle="Red = UPS1 spike-in (should be significant), Grey = yeast background",
         x="log2 Fold Change", y="-log10 adjusted p-value") +
    theme_bw(base_size=12)
ggsave("volcano_proteomics.png", width=10, height=7, dpi=150)

## --- Heatmap of top 50 most significant proteins ---
top_prots <- res_HvL %>%
    rownames_to_column("protein") %>%
    filter(!is.na(adj.P.Val), adj.P.Val < 0.05) %>%
    arrange(adj.P.Val) %>%
    head(50) %>% pull(protein)

mat <- t(lfq_imputed)[top_prots, ]
mat_z <- t(scale(t(mat)))

ann_col <- data.frame(
    Group = group,
    row.names = colnames(mat_z)
)
ann_colors <- list(Group=c(Low="lightblue", Med="orange", High="firebrick"))

pheatmap(mat_z, annotation_col=ann_col, annotation_colors=ann_colors,
         show_rownames=TRUE, cluster_cols=FALSE,
         color=colorRampPalette(rev(RColorBrewer::brewer.pal(9,"RdBu")))(100),
         main="Top 50 significant proteins (z-scored LFQ intensities)",
         filename="heatmap_proteomics.png", width=9, height=11)

Download:

Step 6 — GO Enrichment & STRING Network

R — clusterProfiler GO enrichment + STRINGdb protein interaction network

######################################################################
## STEP 6: GO enrichment and STRING protein interaction network
## For proteomics, use gene names (not Ensembl IDs) since MaxQuant
## provides gene names directly in the proteinGroups.txt output.
######################################################################

library(clusterProfiler); library(org.Hs.eg.db); library(STRINGdb)

## --- GO enrichment on significant proteins ---
sig_genes <- res_HvL %>%
    rownames_to_column("gene") %>%
    filter(adj.P.Val < 0.05, logFC > 0) %>%   # upregulated proteins
    pull(gene) %>%
    ## Extract gene symbol (MaxQuant format: GENENAME_ORGANISM or just GENENAME)
    gsub(";.*", "", .) %>%    # take first gene name if multiple
    unique()

all_genes <- rownames(res_HvL) %>% gsub(";.*", "", .) %>% unique()

## enrichGO: over-representation analysis using GO Biological Process terms
ego <- enrichGO(
    gene          = sig_genes,
    universe      = all_genes,
    OrgDb         = org.Hs.eg.db,
    keyType       = "SYMBOL",      # gene symbols from MaxQuant
    ont           = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    minGSSize     = 10
)
dotplot(ego, showCategory=15,
        title="GO: Significantly upregulated proteins (High vs Low)")
ggsave("go_enrichment_proteomics.png", width=9, height=7, dpi=150)

## --- STRING protein interaction network ---
## STRINGdb: fetches protein-protein interactions from the STRING database
## score_threshold = 700 : high-confidence interactions only (0-1000 scale)
string_db <- STRINGdb$new(
    version         = "12",
    species         = 9606,              # 9606 = Homo sapiens
    score_threshold = 700,               # high confidence interactions only
    network_type    = "physical",        # physical interactions (not just co-expression)
    input_directory = ""
)

## Map gene names to STRING IDs
sig_df <- data.frame(gene = sig_genes)
sig_mapped <- string_db$map(sig_df, "gene", removeUnmappedRows=TRUE)

## Plot the interaction network
string_db$plot_network(sig_mapped$STRING_id,
                        add_summary_nodes=FALSE,
                        vertex_size_type="degree")
## Opens a STRING network plot — clusters = functionally related proteins
## UPS1 proteins will form one dense cluster; other proteins will be dispersed

Download:

Somatic Variants scRNA-seq with Scanpy

Label-Free Proteomics: MaxQuant & Perseus

On this page

Overview

Dataset

Step 1 — MaxQuant Database Search

Step 2 — Perseus: Load & Filter

Step 3 — Normalization & Missing Value Imputation

Step 4 — Statistical Testing with limma

Step 5 — Volcano Plot & Heatmap

Step 6 — GO Enrichment & STRING Network