Identify effector proteins secreted by plant-parasitic nematodes using signal peptide prediction, transmembrane domain filtering, and subcellular localization classification.
Many pathogens and parasites secrete effector proteins into host cells to manipulate host immunity and physiology. Identifying these secreted proteins from a proteome requires a multi-step filtering pipeline: signal peptide prediction (SignalP), transmembrane domain filtering (TMHMM), and subcellular localization classification (DeepLoc).
This pipeline produces a progressively filtered list of secreted proteins. Keep each filter step's output so you can trace back exactly how many proteins passed each gate.
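Conceptually, each step below emits a list of protein IDs, and the secretome is their intersection. A toy sketch of that funnel (the step names and IDs here are placeholders, not real output):

```python
# Track how many protein IDs survive each filter gate.
# Each gate is the set of IDs that passed that step.

def funnel_report(gates):
    """gates: ordered list of (step_name, set_of_ids).
    Each step is intersected with all previous ones, so counts
    can only decrease, mirroring the filtering pipeline."""
    surviving = None
    report = []
    for name, ids in gates:
        surviving = ids if surviving is None else surviving & ids
        report.append((name, len(surviving)))
    return report

# Toy example with made-up IDs:
signalp_pos = {"mRNA_42", "mRNA_7", "mRNA_13"}
no_tm = {"mRNA_42", "mRNA_13"}
extracellular = {"mRNA_42"}

for step, n in funnel_report([("SignalP SP+", signalp_pos),
                              ("TMHMM PredHel=0", no_tm),
                              ("DeepLoc extracellular", extracellular)]):
    print(f"{step}: {n}")
# SignalP SP+: 3
# TMHMM PredHel=0: 2
# DeepLoc extracellular: 1
```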
SignalP 6.0 uses deep learning (transformer architecture) and requires registration at DTU Health Tech. It supports all three domains of life.
# Register and download SignalP 6.0 from:
# https://services.healthtech.dtu.dk/services/SignalP-6.0/
# Create and activate a Python virtual environment
python3 -m venv signalp6_env
source signalp6_env/bin/activate
# Install the downloaded package
tar -zxvf signalp-6.0h.fast.tar.gz
cd signalp6_fast/
pip install signalp-6-package/
# Fix numpy compatibility (if needed)
pip install "numpy==1.24.3"
# Verify installation and find the model directory
SIGNALP_DIR=$(python3 -c "import signalp; import os; print(os.path.dirname(signalp.__file__))")
echo "SignalP installed at: ${SIGNALP_DIR}"
# Test run
echo ">test_protein" > test.fasta
echo "MKTLLLTLVVVTIVCLDLGAVLGSTREQPGPEASSGRPGAGMSASQSRTSSGLSPQSPETLPHSRQEDASPRNP" >> test.fasta
signalp6 --fastafile test.fasta --organism euk --output_dir test_out/ \
--format txt --mode fast
source signalp6_env/bin/activate
PROTEINS="YourSpecies_proteins.fasta"
signalp6 \
--fastafile ${PROTEINS} \
--organism euk \
--output_dir Signalp6_out/ \
--format txt \
--mode fast \
--model_dir signalp-6-package/models/
# Output (excerpt from prediction_results.txt):
# SignalP-6.0 Organism: Eukarya Timestamp: 20241007151909
# ID Prediction OTHER SP(Sec/SPI) CS Position
# mRNA_1 OTHER 1.000000 0.000000
# mRNA_2 OTHER 1.000000 0.000001
# mRNA_42 SP 0.000023 0.999977 25 # signal peptide at position 25!
# Extract proteins WITH signal peptides (SP prediction)
# Match the Prediction column exactly — a bare `grep "SP"` would also
# catch the header's "SP(Sec/SPI)" and any ID containing "SP"
awk '$2 == "SP" {print $1}' Signalp6_out/prediction_results.txt \
  > signalp_positive_ids.txt
echo "Proteins with signal peptide:"
wc -l signalp_positive_ids.txt
# Extract cleavage site positions for later use
awk '$2 == "SP" {print $1, $5}' Signalp6_out/prediction_results.txt \
  > cleavage_sites.txt
head -5 cleavage_sites.txt
# mRNA_42 25 <- signal peptide ends at position 25
| 🧪--organism euk | Organism type: euk (eukaryotes) or other (bacteria/archaea). SignalP treats each group differently because signal peptide features differ between them; the wrong organism setting noticeably inflates false positives. |
| ⚡--mode fast | Uses the lightweight SignalP model instead of the full ensemble. ~10× faster with minimal accuracy loss for standard signal peptides. Use --mode slow for maximum sensitivity on unusual sequences. |
| 📊Prediction == "SP" | Keeps only proteins whose Prediction column is exactly SP (classical Sec/SPI signal peptide). Match the column, not the substring — a bare grep for "SP" also hits the "SP(Sec/SPI)" header. Other possible calls are TAT (twin-arginine), LIPO (lipoprotein), and OTHER (no signal). For effector prediction, we want classical SP only. |
| 📤awk '{print $1, $5}' | Extracts protein ID and cleavage site position. Column 5 is the position of the signal peptide – mature peptide boundary. You'll need this to clip the signal peptide off before further analysis. |
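For scripting, the same filter can be written in Python with an exact match on the Prediction column, which avoids the substring pitfalls of grepping for "SP" (the field layout here is assumed from the output excerpt above):

```python
# Stricter alternative to grep: match the Prediction column exactly,
# so IDs containing "SP" and the "SP(Sec/SPI)" header are not caught.

def parse_signalp(lines):
    """Yield (protein_id, cleavage_pos_or_None) for SP-predicted proteins.
    Assumes the whitespace-separated layout shown in the excerpt:
    ID  Prediction  OTHER  SP(Sec/SPI)  [CS position]."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "SP":
            cs = int(fields[4]) if len(fields) > 4 and fields[4].isdigit() else None
            yield fields[0], cs

example = [
    "# SignalP-6.0\tOrganism: Eukarya",
    "mRNA_1\tOTHER\t1.000000\t0.000000",
    "mRNA_42\tSP\t0.000023\t0.999977\t25",
]
print(list(parse_signalp(example)))  # [('mRNA_42', 25)]
```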
Proteins with signal peptides AND transmembrane domains are membrane-anchored (type I/II transmembrane proteins) — they are NOT freely secreted effectors. Remove them.
module load tmhmm/2.0c
# or: conda install -c bioconda tmhmm
# First extract only SignalP-positive proteins
module load seqtk # or use any FASTA extraction tool
# Extract SignalP-positive sequences
# Using seqtk (fast FASTA subsetting)
seqtk subseq ${PROTEINS} signalp_positive_ids.txt > signalp_candidates.fasta
echo "Candidates from SignalP:"
grep -c ">" signalp_candidates.fasta
# Run TMHMM on SignalP-positive proteins only
tmhmm signalp_candidates.fasta > tmhmm_results.txt
# Parse TMHMM output: keep proteins with 0 TM helices
# (or with exactly 1 TM helix that overlaps the signal peptide region)
# Long-format summary lines look like: "# mRNA_1 Number of predicted TMHs:  0"
grep "Number of predicted TMHs:" tmhmm_results.txt \
  | awk '$NF == 0 {print $2}' \
  > no_tmhmm_ids.txt
# More robust: use the short summary format
# (save it to a file — the merge script at the end reads tmhmm_short.txt)
tmhmm --short signalp_candidates.fasta > tmhmm_short.txt
awk '$5 == "PredHel=0" {print $1}' tmhmm_short.txt \
  > no_tmhmm_ids.txt
echo "Secreted candidates (SP+ and no TM domains):"
wc -l no_tmhmm_ids.txt
# Extract final secreted protein candidates
seqtk subseq signalp_candidates.fasta no_tmhmm_ids.txt \
> secreted_candidates.fasta
| 🪡seqtk subseq | Extracts a subset of sequences from a FASTA file using a list of IDs. Faster than writing a Python/awk script. Takes your proteome FASTA + a list of IDs and outputs only those sequences. |
| 🧠tmhmm --short | Runs TMHMM in compact output mode. Each line gives: protein ID, length, expected TM residues, expected TM in first 60 aa, predicted TM helices count, topology. The PredHel=0 field is what we filter on. |
| 🚫awk '$5 == "PredHel=0"' | Keeps only proteins where column 5 is PredHel=0 — no predicted transmembrane helices. Proteins with ≥1 TM helix are membrane-anchored and excluded from the secretome. |
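The short-format filter can also be done in Python, including the optional rescue of PredHel=1 proteins whose single helix sits inside the signal peptide region (field layout and topology notation are assumed from TMHMM's short output; the 30 aa cutoff is an illustrative default):

```python
import re

def parse_tmhmm_short(line):
    """Parse one TMHMM short-output line into (id, pred_hel, topology).
    Assumes fields: ID  len=..  ExpAA=..  First60=..  PredHel=N  Topology=..."""
    fields = line.split()
    pred_hel = int(fields[4].split("=")[1])
    topology = fields[5].split("=")[1]
    return fields[0], pred_hel, topology

def is_secreted_candidate(line, sp_region=30):
    """Keep PredHel=0 proteins, and rescue PredHel=1 proteins whose
    single helix starts inside the signal peptide region — the 'helix'
    is then usually just the hydrophobic core of the signal peptide."""
    pid, pred_hel, topo = parse_tmhmm_short(line)
    if pred_hel == 0:
        return True
    if pred_hel == 1:
        m = re.search(r"(\d+)-\d+", topo)  # start of the only helix
        return m is not None and int(m.group(1)) <= sp_region
    return False

print(is_secreted_candidate("mRNA_7\tlen=300\tExpAA=0.1\tFirst60=0.0\tPredHel=0\tTopology=o"))          # True
print(is_secreted_candidate("mRNA_42\tlen=120\tExpAA=20.1\tFirst60=20.1\tPredHel=1\tTopology=o5-27i"))  # True
print(is_secreted_candidate("mRNA_9\tlen=400\tExpAA=21.8\tFirst60=0.2\tPredHel=1\tTopology=o200-222i")) # False
```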
Proteins with PredHel=1 whose single TM helix falls within the signal peptide region (roughly the first 30 aa, before the SignalP cleavage site) are likely secreted rather than membrane-anchored: the predicted "helix" is usually just the hydrophobic core of the signal peptide. The signal peptide itself is cleaved co-translationally, so the "mature" protein (after signal peptide removal) is what actually functions in the host. Extract mature sequences for downstream functional analysis:
#!/usr/bin/env python3
"""Extract mature protein sequences by removing the signal peptide."""
from Bio import SeqIO

def extract_mature_proteins(fasta_file, cleavage_file, output_file):
    # Load cleavage positions from SignalP output (one "ID position" per line)
    cleavage_pos = {}
    with open(cleavage_file) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                try:
                    cleavage_pos[parts[0]] = int(parts[1])
                except ValueError:
                    pass  # skip malformed lines
    # Write mature sequences with the signal peptide clipped off
    with open(output_file, 'w') as out:
        for record in SeqIO.parse(fasta_file, 'fasta'):
            if record.id in cleavage_pos:
                cut = cleavage_pos[record.id]
                mature_seq = record.seq[cut:]  # trim signal peptide
                out.write(f">{record.id}_mature\n{mature_seq}\n")
    print(f"Written mature sequences to {output_file}")

extract_mature_proteins(
    fasta_file="secreted_candidates.fasta",
    cleavage_file="cleavage_sites.txt",
    output_file="mature_secreted_proteins.fasta",
)
| 📖SeqIO.parse() | BioPython's FASTA reader. Returns an iterator of SeqRecord objects, each with .id and .seq attributes. Memory-efficient — reads one sequence at a time rather than loading the whole file. |
| ✂️record.seq[cut:] | Python slice syntax — takes everything from position cut to the end. This removes the signal peptide (positions 0 to cut-1) and keeps only the mature protein sequence. The cleavage position comes from SignalP's column 5. |
| 🏷️f">{record.id}_mature" | F-string formatting to rename the output sequence with _mature suffix. This makes it clear in downstream BLAST or functional annotation runs that you're working with the clipped mature form, not the full predicted protein. |
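A quick sanity check that the clipping worked: each mature sequence should be the full sequence minus its cleavage position. This sketch uses a minimal hand-rolled FASTA reader so it does not depend on BioPython:

```python
def read_fasta(lines):
    """Minimal FASTA parser: yields (id, sequence) tuples."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Toy check: mature length == full length - cleavage position
full = dict(read_fasta([">mRNA_42", "MKTLLLTLVVVTIVCLDLGAVLGST", "REQPGPEASSG"]))
cut = 25  # cleavage position reported by SignalP
mature = full["mRNA_42"][cut:]
assert len(mature) == len(full["mRNA_42"]) - cut
print(mature)  # REQPGPEASSG
```

In a real run you would iterate over `secreted_candidates.fasta` and `mature_secreted_proteins.fasta` and compare lengths pairwise.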
# Install DeepLoc 2.0 (requires registration at DTU Health Tech)
# https://services.healthtech.dtu.dk/services/DeepLoc-2.0/
# DeepLoc 2.0 is distributed via DTU registration rather than PyPI;
# pip install the downloaded package archive:
pip install deeploc-2.0.All.tar.gz   # adjust to the filename you downloaded
# Run DeepLoc on mature secreted candidate proteins
deeploc2 \
--fasta mature_secreted_proteins.fasta \
--output deeploc_output/ \
--model Accurate \
--device cpu # or --device gpu if available
# Localization categories predicted:
# Cytoplasm, Nucleus, Extracellular, Cell membrane,
# Mitochondrion, Plastid, Endoplasmic reticulum, Lysosome/Vacuole,
# Golgi apparatus, Peroxisome
head -10 deeploc_output/results.csv
# Filter for extracellular or cell membrane proteins
# (most likely effectors/secreted proteins)
grep -E "Extracellular|Cell membrane" deeploc_output/results.csv \
| cut -d',' -f1 > extracellular_ids.txt
echo "Extracellular/membrane-targeted secreted candidates:"
wc -l extracellular_ids.txt
| 🧠--model Accurate | Use the high-accuracy ensemble model instead of the fast single model. Slower but recommended for the final secretome — you've already filtered to a small set of candidates so runtime isn't a concern. |
| 💻--device cpu/gpu | DeepLoc uses deep learning (protein language model). gpu is ~50× faster if a CUDA GPU is available. On CPU, expect ~1–2 seconds per protein. |
| 🌍grep Extracellular | Filters results to proteins predicted as extracellular — these are the highest-confidence secreted effectors. Also keep "Cell membrane" if you're interested in surface-exposed proteins. |
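The same filter in Python with the csv module, which checks the localization column itself instead of grepping whole rows (the header names here are assumptions — check your actual results.csv):

```python
import csv
import io

# Assumed minimal results.csv layout: Protein_ID,Localization,Confidence
SAMPLE = """Protein_ID,Localization,Confidence
mRNA_42_mature,Extracellular,0.93
mRNA_13_mature,Cytoplasm,0.81
mRNA_7_mature,Cell membrane,0.77
"""

def secreted_ids(csv_text, keep=("Extracellular", "Cell membrane")):
    """Return IDs whose predicted localization is in `keep`,
    testing the Localization column, not the whole row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Protein_ID"] for row in reader if row["Localization"] in keep]

print(secreted_ids(SAMPLE))  # ['mRNA_42_mature', 'mRNA_7_mature']
```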
#!/usr/bin/env python3
"""Combine SignalP, TMHMM, and DeepLoc results into a final secretome table."""
import pandas as pd

# Load SignalP results
signalp = pd.read_csv("Signalp6_out/prediction_results.txt",
                      sep="\t", comment="#",
                      names=["ID", "Prediction", "OTHER", "SP_prob", "CS_pos"])
signalp_pos = signalp[signalp["Prediction"] == "SP"][["ID", "SP_prob", "CS_pos"]]

# Load TMHMM results (short format, saved earlier as tmhmm_short.txt)
tmhmm = pd.read_csv("tmhmm_short.txt", sep="\t", comment="#",
                    names=["ID", "len", "ExpAA", "First60", "PredHel", "Topology"])
tmhmm_neg = tmhmm[tmhmm["PredHel"] == "PredHel=0"]["ID"].tolist()

# Load DeepLoc results; strip the "_mature" suffix added when the signal
# peptides were clipped, so IDs match the SignalP/TMHMM tables again
deeploc = pd.read_csv("deeploc_output/results.csv")
deeploc.columns = ["ID", "Localization", "Confidence"] + list(deeploc.columns[3:])
deeploc["ID"] = deeploc["ID"].str.replace("_mature$", "", regex=True)

# Merge all three filters
secretome = signalp_pos[signalp_pos["ID"].isin(tmhmm_neg)]
secretome = secretome.merge(deeploc[["ID", "Localization", "Confidence"]],
                            on="ID", how="left")

# Sort by signal peptide probability
secretome = secretome.sort_values("SP_prob", ascending=False)

print(f"Total proteome: {len(signalp)} proteins")
print(f"Signal peptide positive: {len(signalp_pos)} proteins")
print(f"After TMHMM filter: {len(secretome)} proteins")
print("\nLocalization breakdown:")
print(secretome["Localization"].value_counts())

secretome.to_csv("final_secretome.csv", index=False)
print("\nSaved: final_secretome.csv")
SignalP 6.0 requires numpy < 2.0. Run pip install "numpy==1.24.3" after installing. If using a shared HPC environment, use a dedicated virtual environment to avoid conflicts with other tools.
TMHMM 2.0c is a legacy Perl-based tool. Ensure Perl is in your PATH. Some HPC systems require loading a Perl module first (module load perl). Alternatively, use the web server at DTU Health Tech for small datasets.
A long candidate list (often several hundred proteins) is expected. To prioritize candidates: (1) compare against known effectors, e.g. with EffectorP or BLAST against curated effector sets, (2) look for expression evidence from the parasitic interaction (ESTs/RNA-seq from infected host-plant tissue), (3) prioritize proteins with conserved domains (InterProScan), (4) compare expression in the parasitic stage vs the free-living stage.
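The prioritization criteria above can be combined into a simple evidence score for ranking; this is a toy sketch in which the boolean inputs stand in for the real BLAST/EST/domain/expression checks:

```python
# Hypothetical prioritization: rank candidates by a simple evidence score.

def priority_score(effector_hit, host_expression, conserved_domain, parasitic_upregulated):
    """Each of the four criteria contributes one point when satisfied."""
    return sum([effector_hit, host_expression, conserved_domain, parasitic_upregulated])

# Placeholder evidence for two made-up candidates:
candidates = {
    "mRNA_42": priority_score(True, False, True, True),   # 3 points
    "mRNA_7":  priority_score(False, False, True, False), # 1 point
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
print(ranked)  # ['mRNA_42', 'mRNA_7']
```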