Identify effector proteins secreted by plant-parasitic nematodes using signal peptide prediction, transmembrane domain filtering, and subcellular localization classification.
Many pathogens and parasites secrete effector proteins into host cells to manipulate host immunity and physiology. Identifying these secreted proteins from a proteome requires a multi-step filtering pipeline: signal peptide prediction (SignalP), transmembrane domain filtering (TMHMM), and subcellular localization classification (DeepLoc).
This pipeline produces a progressively filtered list of secreted proteins. Keep each filter step's output so you can trace back exactly how many proteins passed each gate.
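Conceptually, each step below emits a list of protein IDs, and the secretome is their intersection. A toy sketch of that funnel (the step names and IDs here are placeholders, not real output):

```python
# Track how many protein IDs survive each filter gate.
# Each gate is the set of IDs that passed that step.

def funnel_report(gates):
    """gates: ordered list of (step_name, set_of_ids).
    Each step is intersected with all previous ones, so counts
    can only decrease, mirroring the filtering pipeline."""
    surviving = None
    report = []
    for name, ids in gates:
        surviving = ids if surviving is None else surviving & ids
        report.append((name, len(surviving)))
    return report

# Toy example with made-up IDs:
signalp_pos = {"mRNA_42", "mRNA_7", "mRNA_13"}
no_tm = {"mRNA_42", "mRNA_13"}
extracellular = {"mRNA_42"}

for step, n in funnel_report([("SignalP SP+", signalp_pos),
                              ("TMHMM PredHel=0", no_tm),
                              ("DeepLoc extracellular", extracellular)]):
    print(f"{step}: {n}")
# SignalP SP+: 3
# TMHMM PredHel=0: 2
# DeepLoc extracellular: 1
```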
SignalP 6.0 uses deep learning (transformer architecture) and requires registration at DTU Health Tech. It supports all three domains of life.
# Register and download SignalP 6.0 from:
# https://services.healthtech.dtu.dk/services/SignalP-6.0/
# Create and activate a Python virtual environment
python3 -m venv signalp6_env
source signalp6_env/bin/activate
# Install the downloaded package
tar -zxvf signalp-6.0h.fast.tar.gz
cd signalp6_fast/
pip install signalp-6-package/
# Fix numpy compatibility (if needed)
pip install "numpy==1.24.3"
# Verify installation and find the model directory
SIGNALP_DIR=$(python3 -c "import signalp; import os; print(os.path.dirname(signalp.__file__))")
echo "SignalP installed at: ${SIGNALP_DIR}"
# Test run
echo ">test_protein" > test.fasta
echo "MKTLLLTLVVVTIVCLDLGAVLGSTREQPGPEASSGRPGAGMSASQSRTSSGLSPQSPETLPHSRQEDASPRNP" >> test.fasta
signalp6 --fastafile test.fasta --organism euk --output_dir test_out/ \
--format txt --mode fast
source signalp6_env/bin/activate
PROTEINS="YourSpecies_proteins.fasta"
signalp6 \
--fastafile ${PROTEINS} \
--organism euk \
--output_dir Signalp6_out/ \
--format txt \
--mode fast \
--model_dir signalp-6-package/models/
# Output (excerpt from prediction_results.txt):
# SignalP-6.0 Organism: Eukarya Timestamp: 20241007151909
# ID Prediction OTHER SP(Sec/SPI) CS Position
# mRNA_1 OTHER 1.000000 0.000000
# mRNA_2 OTHER 1.000000 0.000001
# mRNA_42 SP 0.000023 0.999977 25 # signal peptide at position 25!
# Extract proteins WITH signal peptides (SP prediction)
# Match the Prediction column exactly — a bare `grep "SP"` would also
# catch the header's "SP(Sec/SPI)" and any ID containing "SP"
awk '$2 == "SP" {print $1}' Signalp6_out/prediction_results.txt \
  > signalp_positive_ids.txt
echo "Proteins with signal peptide:"
wc -l signalp_positive_ids.txt
# Extract cleavage site positions for later use
awk '$2 == "SP" {print $1, $5}' Signalp6_out/prediction_results.txt \
  > cleavage_sites.txt
head -5 cleavage_sites.txt
# mRNA_42 25 <- signal peptide ends at position 25
| 🧪--organism euk | Organism type: euk (eukaryotes) or other (bacteria/archaea). SignalP treats each group differently because signal peptide features differ between them; the wrong organism setting noticeably inflates false positives. |
| ⚡--mode fast | Uses the lightweight SignalP model instead of the full ensemble. ~10× faster with minimal accuracy loss for standard signal peptides. Use --mode slow for maximum sensitivity on unusual sequences. |
| 📊Prediction == "SP" | Keeps only proteins whose Prediction column is exactly SP (classical Sec/SPI signal peptide). Match the column, not the substring — a bare grep for "SP" also hits the "SP(Sec/SPI)" header. Other possible calls are TAT (twin-arginine), LIPO (lipoprotein), and OTHER (no signal). For effector prediction, we want classical SP only. |
| 📤awk '{print $1, $5}' | Extracts protein ID and cleavage site position. Column 5 is the position of the signal peptide – mature peptide boundary. You'll need this to clip the signal peptide off before further analysis. |
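For scripting, the same filter can be written in Python with an exact match on the Prediction column, which avoids the substring pitfalls of grepping for "SP" (the field layout here is assumed from the output excerpt above):

```python
# Stricter alternative to grep: match the Prediction column exactly,
# so IDs containing "SP" and the "SP(Sec/SPI)" header are not caught.

def parse_signalp(lines):
    """Yield (protein_id, cleavage_pos_or_None) for SP-predicted proteins.
    Assumes the whitespace-separated layout shown in the excerpt:
    ID  Prediction  OTHER  SP(Sec/SPI)  [CS position]."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "SP":
            cs = int(fields[4]) if len(fields) > 4 and fields[4].isdigit() else None
            yield fields[0], cs

example = [
    "# SignalP-6.0\tOrganism: Eukarya",
    "mRNA_1\tOTHER\t1.000000\t0.000000",
    "mRNA_42\tSP\t0.000023\t0.999977\t25",
]
print(list(parse_signalp(example)))  # [('mRNA_42', 25)]
```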
Proteins with signal peptides AND transmembrane domains are membrane-anchored (type I/II transmembrane proteins) — they are NOT freely secreted effectors. Remove them.
module load tmhmm/2.0c
# or: conda install -c bioconda tmhmm
# First extract only SignalP-positive proteins
module load seqtk # or use any FASTA extraction tool
# Extract SignalP-positive sequences
# Using seqtk (fast FASTA subsetting)
seqtk subseq ${PROTEINS} signalp_positive_ids.txt > signalp_candidates.fasta
echo "Candidates from SignalP:"
grep -c ">" signalp_candidates.fasta
# Run TMHMM on SignalP-positive proteins only
tmhmm signalp_candidates.fasta > tmhmm_results.txt
# Parse TMHMM output: keep proteins with 0 TM helices
# (or with exactly 1 TM helix that overlaps the signal peptide region)
# Long-format summary lines look like: "# mRNA_1 Number of predicted TMHs:  0"
grep "Number of predicted TMHs:" tmhmm_results.txt \
  | awk '$NF == 0 {print $2}' \
  > no_tmhmm_ids.txt
# More robust: use the short summary format
# (save it to a file — the merge script at the end reads tmhmm_short.txt)
tmhmm --short signalp_candidates.fasta > tmhmm_short.txt
awk '$5 == "PredHel=0" {print $1}' tmhmm_short.txt \
  > no_tmhmm_ids.txt
echo "Secreted candidates (SP+ and no TM domains):"
wc -l no_tmhmm_ids.txt
# Extract final secreted protein candidates
seqtk subseq signalp_candidates.fasta no_tmhmm_ids.txt \
> secreted_candidates.fasta
| 🪡seqtk subseq | Extracts a subset of sequences from a FASTA file using a list of IDs. Faster than writing a Python/awk script. Takes your proteome FASTA + a list of IDs and outputs only those sequences. |
| 🧠tmhmm --short | Runs TMHMM in compact output mode. Each line gives: protein ID, length, expected TM residues, expected TM in first 60 aa, predicted TM helices count, topology. The PredHel=0 field is what we filter on. |
| 🚫awk '$5 == "PredHel=0"' | Keeps only proteins where column 5 is PredHel=0 — no predicted transmembrane helices. Proteins with ≥1 TM helix are membrane-anchored and excluded from the secretome. |
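The short-format filter can also be done in Python, including the optional rescue of PredHel=1 proteins whose single helix sits inside the signal peptide region (field layout and topology notation are assumed from TMHMM's short output; the 30 aa cutoff is an illustrative default):

```python
import re

def parse_tmhmm_short(line):
    """Parse one TMHMM short-output line into (id, pred_hel, topology).
    Assumes fields: ID  len=..  ExpAA=..  First60=..  PredHel=N  Topology=..."""
    fields = line.split()
    pred_hel = int(fields[4].split("=")[1])
    topology = fields[5].split("=")[1]
    return fields[0], pred_hel, topology

def is_secreted_candidate(line, sp_region=30):
    """Keep PredHel=0 proteins, and rescue PredHel=1 proteins whose
    single helix starts inside the signal peptide region — the 'helix'
    is then usually just the hydrophobic core of the signal peptide."""
    pid, pred_hel, topo = parse_tmhmm_short(line)
    if pred_hel == 0:
        return True
    if pred_hel == 1:
        m = re.search(r"(\d+)-\d+", topo)  # start of the only helix
        return m is not None and int(m.group(1)) <= sp_region
    return False

print(is_secreted_candidate("mRNA_7\tlen=300\tExpAA=0.1\tFirst60=0.0\tPredHel=0\tTopology=o"))          # True
print(is_secreted_candidate("mRNA_42\tlen=120\tExpAA=20.1\tFirst60=20.1\tPredHel=1\tTopology=o5-27i"))  # True
print(is_secreted_candidate("mRNA_9\tlen=400\tExpAA=21.8\tFirst60=0.2\tPredHel=1\tTopology=o200-222i")) # False
```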
Proteins with PredHel=1 whose single TM helix falls within the signal peptide region (roughly the first 30 aa, before the SignalP cleavage site) are likely secreted rather than membrane-anchored: the predicted "helix" is usually just the hydrophobic core of the signal peptide. The signal peptide itself is cleaved co-translationally, so the "mature" protein (after signal peptide removal) is what actually functions in the host. Extract mature sequences for downstream functional analysis:
#!/usr/bin/env python3
"""Extract mature protein sequences by removing the signal peptide."""
from Bio import SeqIO

def extract_mature_proteins(fasta_file, cleavage_file, output_file):
    # Load cleavage positions from SignalP output (one "ID position" per line)
    cleavage_pos = {}
    with open(cleavage_file) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                try:
                    cleavage_pos[parts[0]] = int(parts[1])
                except ValueError:
                    pass  # skip malformed lines
    # Write mature sequences with the signal peptide clipped off
    with open(output_file, 'w') as out:
        for record in SeqIO.parse(fasta_file, 'fasta'):
            if record.id in cleavage_pos:
                cut = cleavage_pos[record.id]
                mature_seq = record.seq[cut:]  # trim signal peptide
                out.write(f">{record.id}_mature\n{mature_seq}\n")
    print(f"Written mature sequences to {output_file}")

extract_mature_proteins(
    fasta_file="secreted_candidates.fasta",
    cleavage_file="cleavage_sites.txt",
    output_file="mature_secreted_proteins.fasta",
)
| 📖SeqIO.parse() | BioPython's FASTA reader. Returns an iterator of SeqRecord objects, each with .id and .seq attributes. Memory-efficient — reads one sequence at a time rather than loading the whole file. |
| ✂️record.seq[cut:] | Python slice syntax — takes everything from position cut to the end. This removes the signal peptide (positions 0 to cut-1) and keeps only the mature protein sequence. The cleavage position comes from SignalP's column 5. |
| 🏷️f">{record.id}_mature" | F-string formatting to rename the output sequence with _mature suffix. This makes it clear in downstream BLAST or functional annotation runs that you're working with the clipped mature form, not the full predicted protein. |
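A quick sanity check that the clipping worked: each mature sequence should be the full sequence minus its cleavage position. This sketch uses a minimal hand-rolled FASTA reader so it does not depend on BioPython:

```python
def read_fasta(lines):
    """Minimal FASTA parser: yields (id, sequence) tuples."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Toy check: mature length == full length - cleavage position
full = dict(read_fasta([">mRNA_42", "MKTLLLTLVVVTIVCLDLGAVLGST", "REQPGPEASSG"]))
cut = 25  # cleavage position reported by SignalP
mature = full["mRNA_42"][cut:]
assert len(mature) == len(full["mRNA_42"]) - cut
print(mature)  # REQPGPEASSG
```

In a real run you would iterate over `secreted_candidates.fasta` and `mature_secreted_proteins.fasta` and compare lengths pairwise.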
# Install DeepLoc 2.0 (requires registration at DTU Health Tech)
# https://services.healthtech.dtu.dk/services/DeepLoc-2.0/
# DeepLoc 2.0 is distributed via DTU registration rather than PyPI;
# pip install the downloaded package archive:
pip install deeploc-2.0.All.tar.gz   # adjust to the filename you downloaded
# Run DeepLoc on mature secreted candidate proteins
deeploc2 \
--fasta mature_secreted_proteins.fasta \
--output deeploc_output/ \
--model Accurate \
--device cpu # or --device gpu if available
# Localization categories predicted:
# Cytoplasm, Nucleus, Extracellular, Cell membrane,
# Mitochondrion, Plastid, Endoplasmic reticulum, Lysosome/Vacuole,
# Golgi apparatus, Peroxisome
head -10 deeploc_output/results.csv
# Filter for extracellular or cell membrane proteins
# (most likely effectors/secreted proteins)
grep -E "Extracellular|Cell membrane" deeploc_output/results.csv \
| cut -d',' -f1 > extracellular_ids.txt
echo "Extracellular/membrane-targeted secreted candidates:"
wc -l extracellular_ids.txt
| 🧠--model Accurate | Use the high-accuracy ensemble model instead of the fast single model. Slower but recommended for the final secretome — you've already filtered to a small set of candidates so runtime isn't a concern. |
| 💻--device cpu/gpu | DeepLoc uses deep learning (protein language model). gpu is ~50× faster if a CUDA GPU is available. On CPU, expect ~1–2 seconds per protein. |
| 🌍grep Extracellular | Filters results to proteins predicted as extracellular — these are the highest-confidence secreted effectors. Also keep "Cell membrane" if you're interested in surface-exposed proteins. |
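The same filter in Python with the csv module, which checks the localization column itself instead of grepping whole rows (the header names here are assumptions — check your actual results.csv):

```python
import csv
import io

# Assumed minimal results.csv layout: Protein_ID,Localization,Confidence
SAMPLE = """Protein_ID,Localization,Confidence
mRNA_42_mature,Extracellular,0.93
mRNA_13_mature,Cytoplasm,0.81
mRNA_7_mature,Cell membrane,0.77
"""

def secreted_ids(csv_text, keep=("Extracellular", "Cell membrane")):
    """Return IDs whose predicted localization is in `keep`,
    testing the Localization column, not the whole row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Protein_ID"] for row in reader if row["Localization"] in keep]

print(secreted_ids(SAMPLE))  # ['mRNA_42_mature', 'mRNA_7_mature']
```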
#!/usr/bin/env python3
"""Combine SignalP, TMHMM, and DeepLoc results into a final secretome table."""
import pandas as pd

# Load SignalP results
signalp = pd.read_csv("Signalp6_out/prediction_results.txt",
                      sep="\t", comment="#",
                      names=["ID", "Prediction", "OTHER", "SP_prob", "CS_pos"])
signalp_pos = signalp[signalp["Prediction"] == "SP"][["ID", "SP_prob", "CS_pos"]]

# Load TMHMM results (short format, saved earlier as tmhmm_short.txt)
tmhmm = pd.read_csv("tmhmm_short.txt", sep="\t", comment="#",
                    names=["ID", "len", "ExpAA", "First60", "PredHel", "Topology"])
tmhmm_neg = tmhmm[tmhmm["PredHel"] == "PredHel=0"]["ID"].tolist()

# Load DeepLoc results; strip the "_mature" suffix added when the signal
# peptides were clipped, so IDs match the SignalP/TMHMM tables again
deeploc = pd.read_csv("deeploc_output/results.csv")
deeploc.columns = ["ID", "Localization", "Confidence"] + list(deeploc.columns[3:])
deeploc["ID"] = deeploc["ID"].str.replace("_mature$", "", regex=True)

# Merge all three filters
secretome = signalp_pos[signalp_pos["ID"].isin(tmhmm_neg)]
secretome = secretome.merge(deeploc[["ID", "Localization", "Confidence"]],
                            on="ID", how="left")

# Sort by signal peptide probability
secretome = secretome.sort_values("SP_prob", ascending=False)

print(f"Total proteome: {len(signalp)} proteins")
print(f"Signal peptide positive: {len(signalp_pos)} proteins")
print(f"After TMHMM filter: {len(secretome)} proteins")
print("\nLocalization breakdown:")
print(secretome["Localization"].value_counts())

secretome.to_csv("final_secretome.csv", index=False)
print("\nSaved: final_secretome.csv")
SignalP 6.0 requires numpy < 2.0. Run pip install "numpy==1.24.3" after installing. If using a shared HPC environment, use a dedicated virtual environment to avoid conflicts with other tools.
TMHMM 2.0c is a legacy Perl-based tool. Ensure Perl is in your PATH. Some HPC systems require loading a Perl module first (module load perl). Alternatively, use the web server at DTU Health Tech for small datasets.
A long candidate list (often several hundred proteins) is expected. To prioritize candidates: (1) compare against known effectors, e.g. with EffectorP or BLAST against curated effector sets, (2) look for expression evidence from the parasitic interaction (ESTs/RNA-seq from infected host-plant tissue), (3) prioritize proteins with conserved domains (InterProScan), (4) compare expression in the parasitic stage vs the free-living stage.
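The prioritization criteria above can be combined into a simple evidence score for ranking; this is a toy sketch in which the boolean inputs stand in for the real BLAST/EST/domain/expression checks:

```python
# Hypothetical prioritization: rank candidates by a simple evidence score.

def priority_score(effector_hit, host_expression, conserved_domain, parasitic_upregulated):
    """Each of the four criteria contributes one point when satisfied."""
    return sum([effector_hit, host_expression, conserved_domain, parasitic_upregulated])

# Placeholder evidence for two made-up candidates:
candidates = {
    "mRNA_42": priority_score(True, False, True, True),   # 3 points
    "mRNA_7":  priority_score(False, False, True, False), # 1 point
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
print(ranked)  # ['mRNA_42', 'mRNA_7']
```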