Estimate genome properties before assembly using k-mer frequency profiles, then perform de novo assembly with SPAdes (bacteria/small genomes) or Canu (large eukaryotic genomes).
Before spending days running a genome assembler, always perform k-mer analysis to understand your genome's properties: estimated size, ploidy level, heterozygosity rate, and repeat content. This informs which assembler and parameters to use.
Genome assembly projects accumulate large files quickly. Keep each step isolated so you can re-run or compare assembler outputs independently.
reads.jf can be 20–60 GB for large genomes — as large as the raw reads themselves! Store it on scratch space and delete it once you have the reads.histo file. The histogram is all GenomeScope needs and is only ~10 KB.A k-mer is a substring of length k in a DNA sequence. For a 21-mer of sequence AATTGGCCGTA..., we slide a window of 21 bases across the entire read dataset and count how many times each unique 21-mer appears.
| k-mer size | Possible k-mers (4 bases) | Notes |
|---|---|---|
| k=17 | 17 billion | Too small — many false matches |
| k=21 | 4.4 trillion | Standard for genome size estimation |
| k=31 | 4.6 × 1018 | Used for high-coverage assemblies |
| k=51 | ~4.5 × 1030 | Long reads only |
The key insight: in a diploid genome sequenced at 30× coverage, each unique genomic position appears ~30 times in the reads. A k-mer frequency histogram shows a prominent peak at ~30×. The genome size = (total k-mers) / (peak coverage).
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH --time=12:00:00
#SBATCH --job-name=jellyfish
#SBATCH --mem=64G
module load jellyfish/2.3.0
# Input: raw genomic reads (can be paired-end or single-end)
# Example: 30x coverage of a ~500 Mb genome
R1="reads_R1.fastq.gz"
R2="reads_R2.fastq.gz"
# Step 1: Count k-mers (k=21)
# -C = canonical k-mers (count both strands together)
# -m = k-mer size
# -s = initial hash size (set to ~genome size; can over-estimate)
# -t = threads
# -o = output file
jellyfish count \
-C -m 21 \
-s 1000000000 \
-t 24 \
-o reads.jf \
<(zcat ${R1}) <(zcat ${R2})
# Step 2: Generate frequency histogram
jellyfish histo \
-t 24 \
reads.jf \
> reads.histo
echo "Histogram generated: reads.histo"
echo "First 20 lines of histogram (coverage: count):"
head -20 reads.histo
# reads.histo format:
# 1 12405832 <- 1x coverage: mostly errors
# 2 8234521
# ...
# 30 4981234 <- main genome peak near the sequencing depth
# ...
| 🔄-C (canonical) | Count Canonical k-mers — treat a k-mer and its reverse complement as the same thing. Since DNA is double-stranded, AATTGG and CCAATT are the same sequence. Without -C, you'd count every k-mer twice. |
| 🔢-m 21 | K-mer size = 21. Longer k-mers are more specific but require more memory. k=21 is the standard for genome size estimation. k=31 is used for assemblies. |
| 💾-s 1000000000 | Initial hash size in bytes (1 GB here). Set this to roughly your expected genome size. If it's too small, Jellyfish does multiple passes (slower). It's safe to overestimate. |
| ⚡-t 24 | Use 24 CPU threads. Jellyfish is one of the most efficiently parallelized tools in bioinformatics — 24 threads is ~20× faster than single-threaded. |
| 🔐<(zcat R1) <(zcat R2) | Process substitution — decompress the gzipped FASTQs on-the-fly and feed them to Jellyfish without writing uncompressed files to disk. Saves ~50 GB of disk space. |
| 📊jellyfish histo | Converts the binary k-mer hash (.jf) into a two-column text histogram: (coverage, count). This is the tiny file that GenomeScope reads. Once you have it, you can delete the large .jf file. |
GenomeScope fits a mathematical model to the k-mer histogram to estimate: genome size (haploid), heterozygosity rate, repeat fraction, and sequencing error rate.
module load R # GenomeScope is an R script
# Download GenomeScope if not installed
# wget https://raw.githubusercontent.com/schatzlab/genomescope/master/genomescope.R
Rscript genomescope.R \
reads.histo \
21 \
150 \
genomescope_output \
1000
# Arguments:
# reads.histo = k-mer histogram from jellyfish histo
# 21 = k-mer size used (must match jellyfish -m)
# 150 = read length (check your FASTQ)
# genomescope_output = output directory
# 1000 = max k-mer coverage to model (exclude high repeats)
# GenomeScope also available as web app: http://genomescope.org
# Upload reads.histo directly for instant results
ls genomescope_output/
| 📊reads.histo | The k-mer frequency histogram from Jellyfish. Two columns: frequency (how many times a k-mer appears) and count (how many k-mers appear that many times). |
| 🔢21 (k-mer size) | Must match the -m value used in Jellyfish. GenomeScope uses this in its mathematical model to correct for overlapping k-mers at read boundaries. |
| 📏150 (read length) | Your sequencing read length. Check with head -8 reads.fastq.gz | zcat | awk 'NR==2{print length($0)}'. Used to estimate the number of k-mers per read. |
| 📁genomescope_output | Output directory name. GenomeScope creates: summary.txt (genome estimates), plot.png (k-mer histogram + fitted model), and model.txt (model parameters). |
| ✂️1000 (max coverage) | Truncate the histogram at 1000× coverage. Highly repetitive k-mers (transposons, rDNA) form a long tail that confuses the model fitter. Truncating focuses the model on the main genome peak. |
reads.histo to genomescope.org for an interactive visualization with sliders to adjust the model. Much easier for exploring the parameters before committing to command-line runs.GenomeScope output files:
GenomeScope version 1.0 k = 21 property min max Heterozygosity 0.502% 0.519% Genome Haploid Length 246,401,629 bp 246,580,171 bp Genome Repeat Length 55,423,551 bp 55,463,789 bp Genome Unique Length 190,978,078 bp 191,116,382 bp Model Fit 97.3% Read Error Rate 0.233% 0.233% # Output plots: # genomescope_output/plot.png -- k-mer histogram fit # genomescope_output/model.txt -- numerical results
genomeSize=).| Assembler | Input | Best for | Key limitation |
|---|---|---|---|
| SPAdes | Short reads (Illumina) | Bacteria, fungi, small eukaryotes | Not recommended for large genomes (>300 Mb) |
| Canu | Long reads (PacBio/ONT) | Large complex eukaryotes | Slow; needs high coverage (>30×) |
| Flye | Long reads | All sizes; faster than Canu | May need polishing with short reads |
| MaSuRCA | Short + optional long reads | Eukaryotes, hybrid | Complex parameter setup |
| hifiasm | PacBio HiFi (CCS) | High-quality diploid assemblies | Requires HiFi reads specifically |
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --job-name=spades
#SBATCH --mem=128G
module load spades/3.15.4
# Standard paired-end assembly
spades.py \
-1 reads_R1.fastq.gz \
-2 reads_R2.fastq.gz \
-o spades_output/ \
--threads 16 \
--memory 128 \
-k 21,33,55,77,99 # multiple k-mers (SPAdes iterates automatically)
# For highly heterozygous diploid (add --careful only for small genomes):
# spades.py -1 R1.fastq.gz -2 R2.fastq.gz -o spades_out/ --careful
# Outputs:
# spades_output/scaffolds.fasta -- final scaffolded assembly
# spades_output/contigs.fasta -- contigs before scaffolding
# spades_output/spades.log -- full log
echo "Assembly complete:"
grep -c ">" spades_output/scaffolds.fasta
grep "^>" spades_output/scaffolds.fasta | awk -F'_' '{print $4}' \
| sort -rn | head -20
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=32
#SBATCH --time=120:00:00
#SBATCH --job-name=canu
#SBATCH --mem=256G
module load canu/2.2
# genomeSize: use GenomeScope estimate (e.g., 246m = 246 Mb)
# -nanopore or -pacbio: specify read type
canu \
-p genome_assembly \
-d canu_output/ \
genomeSize=246m \
-nanopore raw_nanopore_reads.fastq.gz \
maxThreads=32 \
maxMemory=256
# For PacBio CLR reads:
# canu -p asm -d canu_out/ genomeSize=246m -pacbio reads.fastq.gz
# For PacBio HiFi (CCS) reads — use hifiasm instead:
# hifiasm -o asm.asm -t 32 hifi_reads.fastq.gz
# awk '/^S/{print ">"$2; print $3}' asm.asm.bp.p_ctg.gfa > asm.p_ctg.fa
# Canu key output:
ls canu_output/
# genome_assembly.contigs.fasta -- primary contigs
# genome_assembly.unassembled.fasta -- reads that didn't assemble
# genome_assembly.report -- assembly statistics
module load quast
module load busco/5.4.3
ASSEMBLY="spades_output/scaffolds.fasta"
# or: ASSEMBLY="canu_output/genome_assembly.contigs.fasta"
# QUAST: assembly contiguity statistics
quast.py ${ASSEMBLY} \
-o quast_output/ \
--threads 8
# Key QUAST metrics:
# N50: half the genome is in contigs >= this length
# L50: number of contigs needed to cover 50% of genome
# Largest contig, total length, number of contigs
echo "=== QUAST Summary ==="
cat quast_output/report.txt
# BUSCO: gene content completeness
# -l: lineage dataset (check busco --list-datasets)
# Use a lineage appropriate to your organism:
# bacteria_odb10, fungi_odb10, embryophyta_odb10, metazoa_odb10, etc.
busco \
-i ${ASSEMBLY} \
-o busco_output \
-m genome \
-l embryophyta_odb10 \
-c 16
echo "=== BUSCO Summary ==="
cat busco_output/short_summary*.txt
| Metric | Good assembly | Poor assembly |
|---|---|---|
| BUSCO Complete | >90% | <70% |
| BUSCO Duplicated | <5% | >20% (suggests collapse or false duplication) |
| N50 | Close to expected chromosome size | Much shorter than expected |
| Total length | Within 10% of GenomeScope estimate | Very different suggests contamination or collapse |
Try a different k-mer size (k=17 or k=31). Check that you used canonical k-mers (-C flag in jellyfish). Very low coverage (<15×) will produce a flat histogram with no clear peak — more data is needed. Contamination also flattens the histogram.
SPAdes is not designed for large genomes. Switch to MaSuRCA or use a HPC node with >256 GB RAM. For very large genomes (>1 Gb), long reads are essentially mandatory for a useful assembly.
Check: (1) Did you use the correct lineage database? (2) Is total assembly length close to GenomeScope estimate? Low BUSCO often means the assembly is highly fragmented — improve with long reads, scaffolding (Hi-C or optical mapping), or polishing with Pilon.