Everything you need to know before diving into bioinformatics analysis — Linux essentials, file formats, environment management, HPC basics, and the full tool landscape.
Nearly all bioinformatics tools run on Linux. If you're coming from Windows or macOS, you'll live in the terminal. Here are the commands you'll use daily:
# Where am I?
pwd
# List files (long format, human-readable sizes)
ls -lh
# Move around
cd /path/to/directory
cd .. # go up one level
cd ~ # go to home directory
# Create directories and files
mkdir -p project/data project/results project/scripts
touch README.md
# Copy, move, remove
cp file.txt backup.txt
mv file.txt new_name.txt
rm unwanted_file.txt # CAREFUL: no recycle bin!
rm -r unwanted_directory/ # remove directory recursively
# View files
head -20 file.txt # first 20 lines
tail -20 file.txt # last 20 lines
less file.txt # scroll through (press q to quit)
cat file.txt # print entire file
# Count lines, words, characters
wc -l file.txt # line count (critical for checking FASTQ!)
# Search inside files
grep "gene_name" results.txt # find lines containing text
grep -c "^>" sequences.fasta # count FASTA sequences (header lines start with ">")
grep -v "^#" file.vcf | head # skip comment lines
# Column extraction and manipulation
cut -f1,3 file.tsv # extract columns 1 and 3
sort -k2,2n file.tsv # sort by column 2 numerically
awk '{print $1, $3-$2}' genes.bed # interval length: end (col 3) minus start (col 2)
sed 's/old/new/g' file.txt # find-and-replace
# Pipes: chain commands together
cat reads.fastq | awk 'END {print NR/4}' # count FASTQ reads (4 lines per read; grep "^@" would also match quality lines that start with @)
zcat file.fastq.gz | head -8 # peek at gzipped files
The five commands you will use most are grep, awk, sort, cut, and wc. Master these and you can parse most bioinformatics output files without writing a full script.
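Counting reads is the check you will run most often, so here is a minimal Python sketch of the same idea as the line-count approach above. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequences), and the function name `count_fastq_reads` is made up for illustration.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a plain or gzipped FASTQ file, assuming 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A truncated download or a wrapped FASTQ would trip this check.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


if __name__ == "__main__":
    # Two reads; the first quality string starts with "@", which is why
    # counting lines that begin with "@" would over-count.
    records = "@read1\nACGT\n+\n@III\n@read2\nTTGA\n+\nIIII\n"
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.fastq")
        with open(demo, "w") as fh:
            fh.write(records)
        print(count_fastq_reads(demo))  # prints 2
```

Dividing the line count by four is safer than pattern matching because "@" is a valid quality character.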
# Make a script executable
chmod +x my_script.sh
# Connect to a remote server / HPC cluster
ssh username@cluster.university.edu
# Transfer files to/from server
scp local_file.txt username@cluster:~/data/
scp username@cluster:~/results/output.csv ./
# Transfer entire directories
rsync -avz local_dir/ username@cluster:~/project/
Use screen or tmux when running long jobs via SSH. If your connection drops, the job keeps running. Start with screen -S myanalysis, run your command, then detach with Ctrl+A, D. Reattach with screen -r myanalysis.
Knowing your file formats is non-negotiable. Here's every format you'll encounter:
| Format | What It Stores | Extension | Key Commands |
|---|---|---|---|
| FASTA | Reference sequences (genome, protein) | .fa, .fasta, .fna | grep -c "^>" ref.fa |
| FASTQ | Raw sequencing reads + quality scores | .fq, .fastq, .fq.gz | zcat reads.fq.gz \| wc -l (divide by 4) |
| SAM/BAM | Aligned reads (text/binary) | .sam, .bam | samtools view, flagstat, sort, index |
| CRAM | Highly compressed BAM | .cram | samtools view (needs reference) |
| VCF/BCF | Variant calls (SNPs, indels) | .vcf, .vcf.gz, .bcf | bcftools view, stats, query |
| BED | Genomic intervals (chr, start, end) | .bed | bedtools intersect, merge, sort |
| GFF3/GTF | Gene annotations (coordinates, features) | .gff3, .gtf | awk '$3=="gene"' file.gff3 |
| WIG/bigWig | Continuous coverage signal | .wig, .bw | View in IGV / UCSC Genome Browser |
| BIOM | Microbiome OTU/ASV tables | .biom | biom summarize-table |
| h5ad/Loom | Single-cell count matrices | .h5ad, .loom | Python: scanpy.read_h5ad() |
Common SAM flag bits: 0x1 = paired | 0x4 = unmapped | 0x10 = reverse strand | 0x100 = secondary alignment | 0x400 = duplicate | 0x800 = supplementary. A read's flag is the sum of the bits that apply to it. Use Picard's flag explainer to decode any flag.
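Because a SAM flag is a sum of bits, decoding it is a bitwise AND against each bit in turn. The sketch below uses the twelve flag bits from the SAM specification; the short labels and the function name `decode_sam_flag` are our own choices, not an official API.

```python
# Bit values follow the SAM specification; the labels are informal shorthand.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_sam_flag(flag):
    """Return the labels of every bit set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


if __name__ == "__main__":
    print(decode_sam_flag(99))    # ['paired', 'proper_pair', 'mate_reverse', 'read1']
    print(decode_sam_flag(1040))  # ['reverse', 'duplicate']
```

Flag 99 (1 + 2 + 32 + 64) is the typical value for the first read of a properly paired, forward-strand alignment.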
Conda is the package manager for bioinformatics. It handles dependencies, versioning, and isolation so your tools don't conflict.
# Install Miniforge (recommended over full Anaconda)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
# Restart shell, then:
# Create an environment for RNA-Seq analysis
conda create -n rnaseq -c bioconda -c conda-forge \
fastqc fastp star subread samtools multiqc
# Activate / deactivate
conda activate rnaseq
conda deactivate
# List all environments
conda env list
# Export environment for reproducibility
conda env export > rnaseq_env.yml
# Recreate on another machine
conda env create -f rnaseq_env.yml
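Always specify the channels: -c bioconda -c conda-forge.
Use mamba for faster dependency solving: conda install -n base mamba, then mamba create ...
Pin tool versions for reproducibility, for example samtools=1.19.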
Solving environment: failed means the requested package versions conflict. Try creating a fresh env with fewer packages, or use mamba.
If conda activate doesn't work, run conda init bash and restart your shell.
To clear cached packages and tarballs, run conda clean --all.
Most bioinformatics analyses require more resources than a laptop. High-Performance Computing (HPC) clusters use job schedulers (usually SLURM) to manage shared resources.
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --partition=normal
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@university.edu
# Load your conda environment
source activate rnaseq
# Your analysis command
STAR --runThreadN 8 \
--genomeDir /path/to/genome_index \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/sample_
echo "Job finished at: $(date)"
| SLURM Directive | What It Controls | How to Choose |
|---|---|---|
| --cpus-per-task | CPU cores | Match the -t or --threads flag of your tool. Don't request more than you use. |
| --mem | RAM | Check tool docs. Alignment: ~32G. GATK: ~16G. WGCNA: depends on gene count. |
| --time | Wall time limit | Overestimate by 50%. Job gets killed if it exceeds this. |
| --partition | Queue/partition | Ask your HPC admins. Common: short, normal, long, gpu. |
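The "overestimate by 50%" rule for --time is simple arithmetic, sketched here in Python. The helper name `slurm_time_with_margin` is hypothetical, and the observed runtime would come from your own test job.

```python
def slurm_time_with_margin(hours_observed, margin=0.5):
    """Format an observed runtime plus a safety margin as HH:MM:SS for --time."""
    total_seconds = round(hours_observed * 3600 * (1 + margin))
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # A test run that took 8 hours gets a 12-hour request.
    print(slurm_time_with_margin(8))    # 12:00:00
    print(slurm_time_with_margin(2.5))  # 03:45:00
```

An 8-hour test run therefore maps to the --time=12:00:00 request used in the example script.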
# Submit a job
sbatch my_script.sh
# Check your queued/running jobs
squeue -u $USER
# Cancel a job
scancel 12345678
# Check job efficiency after completion
seff 12345678
# Interactive session (for testing)
srun --cpus-per-task=4 --mem=16G --time=2:00:00 --pty bash
Always test your pipeline on a small subset first (head -40000 reads.fq > test.fq gives 10,000 reads, since each read takes 4 lines). Once the test works, scale up. Use seff on completed test jobs to right-size your resource requests for the full run.
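The same subsetting can be done from Python, which is convenient when the input is gzipped. This is a minimal sketch that assumes four lines per FASTQ record; `write_first_reads` is an illustrative name, and it takes the first reads in the file rather than a random sample.

```python
import gzip
import os
import tempfile
from itertools import islice


def write_first_reads(src, dest, n_reads=10_000):
    """Copy the first n_reads FASTQ records (4 lines each) from src to dest."""
    opener = gzip.open if str(src).endswith(".gz") else open
    with opener(src, "rt") as fin, open(dest, "w") as fout:
        # islice stops after 4 * n_reads lines without reading the rest of the file.
        fout.writelines(islice(fin, 4 * n_reads))


if __name__ == "__main__":
    three_reads = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(3))
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "reads.fq")
        dest = os.path.join(tmp, "test.fq")
        with open(src, "w") as fh:
            fh.write(three_reads)
        write_first_reads(src, dest, n_reads=2)
        with open(dest) as fh:
            print(sum(1 for _ in fh))  # prints 8 (2 reads x 4 lines)
```

For paired-end data, apply the same n_reads to both the R1 and R2 files so the pairs stay in sync.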
R is the primary language for statistical analysis in bioinformatics (DESeq2, Seurat, WGCNA, ggplot2 all run in R).
# Install Bioconductor (the bioinformatics R package repository)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Install a package from Bioconductor
BiocManager::install("DESeq2")
# Install from CRAN
install.packages("tidyverse")
# Load packages
library(DESeq2)
library(tidyverse)
# Read data
df <- read.csv("my_data.csv")
df <- read.delim("my_data.tsv") # tab-separated
# Basic data exploration
dim(df) # rows x columns
head(df) # first 6 rows
str(df) # structure / types
summary(df) # statistical summary
# The tidyverse way
df %>%
filter(padj < 0.05) %>%
arrange(desc(abs(log2FoldChange))) %>%
head(20)
Python is dominant for sequence manipulation (Biopython), single-cell analysis (Scanpy), and machine learning. Key packages by category:
| Category | What It Does |
|---|---|
| Sequences | Parse FASTA/FASTQ, BLAST, GenBank records |
| Single-Cell | Python alternative to Seurat for scRNA-Seq |
| Data | DataFrame manipulation (like R's dplyr) |
| Visualization | Plotting (like ggplot2) |
| Genomics | Read BAM/SAM files from Python |
| ML | ML classifiers, clustering, dimensionality reduction |
# Parse a FASTA file with Biopython
from Bio import SeqIO
for record in SeqIO.parse("genome.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
# Read a count matrix with pandas
import pandas as pd
counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)
print(counts.shape)
print(counts.head())
| Database | What It Contains | URL |
|---|---|---|
| NCBI GenBank | All publicly available nucleotide sequences | ncbi.nlm.nih.gov |
| NCBI SRA | Raw sequencing data (FASTQ) | ncbi.nlm.nih.gov/sra |
| Ensembl | Genome assemblies, gene annotations (GTF/GFF) | ensembl.org |
| UCSC Genome Browser | Interactive genome visualization, track hubs | genome.ucsc.edu |
| UniProt | Protein sequences and functional annotations | uniprot.org |
| Gene Ontology (GO) | Functional categories for gene annotation | geneontology.org |
| KEGG | Metabolic and signaling pathway maps | genome.jp/kegg |
| dbSNP | Known genetic variants (SNPs, indels) | ncbi.nlm.nih.gov/snp |
| GEO | Published expression datasets (microarray, RNA-Seq) | ncbi.nlm.nih.gov/geo |
| Phytozome | Plant genomes and gene families | phytozome.jgi.doe.gov |
| FlyBase / WormBase / TAIR | Model organism databases | flybase.org / wormbase.org / arabidopsis.org |
| ClinVar | Clinical significance of human variants | ncbi.nlm.nih.gov/clinvar |
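Download SRA data with fasterq-dump (from SRA Toolkit), or with prefetch followed by fasterq-dump. Example: fasterq-dump --split-3 SRR12345678 -e 8. The --split-3 flag separates paired-end reads into R1 and R2 files (written as _1.fastq and _2.fastq), and -e sets the number of threads.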
Every major bioinformatics tool organized by analysis type.
| What It Does | Tag |
|---|---|
| Visual QC report for FASTQ files | QC |
| Aggregate reports from multiple QC tools | QC |
| Ultra-fast trimmer + QC in one tool | Trimming |
| Java-based flexible trimmer | Trimming |
| Adapter removal specialist | Trimming |
| Swiss army knife for read processing | Multi |
| Short-read DNA aligner (WGS, WES) | DNA |
| Splice-aware RNA-Seq aligner (fast) | RNA |
| Splice-aware aligner (memory efficient) | RNA |
| Fast short-read aligner (ChIP-Seq, small genomes) | DNA |
| Long-read aligner (ONT, PacBio) | Long-Read |
| Quasi-mapping RNA-Seq quantifier (no BAM needed) | RNA |
| Pseudo-alignment based RNA-Seq quantifier | RNA |
| 10X Genomics scRNA-Seq pipeline (alignment + counting) | scRNA |
| Gold standard for bulk RNA-Seq DE | R |
| Fast, flexible DE analysis | R |
| DE for large datasets, microarray background | R |
| DE from kallisto transcript quantification | R |
| DE for single-cell data (hurdle model) | scRNA |
| Python implementation of DESeq2 | Python |
| Most popular scRNA-Seq toolkit (R) | R |
| Python scRNA-Seq framework (AnnData) | Python |
| Deep learning for single-cell (integration, DE) | Python |
| Automated cell type annotation | Annotation |
| Reference-based cell annotation (R) | R |
| Trajectory / pseudotime analysis | R |
| Cell-cell communication inference | R |
| Copy number from scRNA-Seq (cancer) | R |
| Gold standard germline/somatic variant calling | Variants |
| Lightweight variant caller + VCF manipulation | Variants |
| Google's deep learning variant caller | AI |
| Fast germline + somatic calling | Variants |
| Somatic mutation calling (tumor-normal) | Cancer |
| Variant annotation (impact, gene, protein change) | Annotation |
| Multi-database variant annotation | Annotation |
| GWAS, population genetics, linkage | Population |
| Best for PacBio HiFi assembly | Long-Read |
| ONT/PacBio long-read assembler | Long-Read |
| Short-read (Illumina) assembler, also metagenomics | Short-Read |
| Assembly quality assessment (N50, BUSCO) | QC |
| Gene completeness assessment | QC |
| Assembly polishing with short reads | Polish |
| Reference-free assembly QC with k-mers | QC |
| Estimate genome size, heterozygosity from k-mers | Estimation |
| Complete 16S/ITS amplicon pipeline | Amplicon |
| Fast k-mer taxonomic classification | Shotgun |
| Marker-gene based profiling | Shotgun |
| Functional pathway profiling | Function |
| ASV denoising (inside QIIME2 or standalone R) | Amplicon |
| Metagenomic assembly | Assembly |
| MAG quality assessment | QC |
| Prokaryotic genome annotation | Annotation |
| Grammar of graphics (R) | R |
| Advanced heatmaps with annotations | R |
| Interactive charts (R + Python) | Interactive |
| Integrative Genomics Viewer (BAM, VCF, BED) | Desktop |
| Network visualization and analysis | Networks |
| Circular genome plots | Genomics |
| Publication-ready volcano plots | R |
| Interactive taxonomic composition charts | Microbiome |
| DSL2 workflow language (most popular in bio) | Workflow |
| Community Nextflow pipelines (rnaseq, sarek, etc.) | Pipeline |
| Python-based workflow manager | Workflow |
| Web-based GUI for bioinformatics (no coding needed) | GUI |
| Broad Institute workflow definition (GATK pipelines) | Workflow |
AI assistants (ChatGPT, Claude, Copilot) are game-changers for bioinformatics. Here's how to use them effectively:
Copy this template prompt and fill in the bracketed sections:
Memorize these and you'll never feel lost in a bioinformatics pipeline.
"Quality → Trim → Align → Count → Diff" — QTACD
The standard bulk RNA-Seq workflow: QC reads → Trim adapters → Align to genome → Count features → Differential expression. Most alignment-based bulk RNA-Seq analyses follow this pattern.
"FASTQ has Quality, FASTA does Not"
FASTQ = sequences + quality scores (raw reads). FASTA = sequences only (reference genomes, proteins). The "Q" in FASTQ stands for Quality.
"SAM is Human, BAM is Binary"
SAM = text (human-readable alignment). BAM = compressed binary (machine-efficient). CRAM = even smaller. Always store BAM, never SAM.
"Conda Create, Conda Activate" — CCA
conda create -n myenv → conda activate myenv → install tools → do work → conda deactivate. One environment per project. Never install into base.
Right-size your SLURM resource requests using seff <jobid> output. Over-requesting wastes queue priority; under-requesting kills your job.
"Grep Awk Sed — the GAS that powers bioinformatics"
grep = search text. awk = extract columns. sed = find & replace. Master these three and you can wrangle any TSV, GFF, or BED file.
"Padj, Not Pvalue" — PNP
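Always filter by adjusted p-value (padj / FDR / q-value), never raw p-value. With 20,000 genes tested, a raw cutoff of p < 0.05 yields about 1,000 false positives even if no gene is truly differentially expressed (20,000 × 0.05 = 1,000). Filtering on padj < 0.05 instead controls the false discovery rate: about 5% of the genes you call significant are expected to be false positives.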
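To make the arithmetic concrete, here is a small Python sketch of the expected false-positive count and of the Benjamini-Hochberg adjustment, which is the procedure DESeq2 uses by default for padj. The function names and the toy p-values are illustrative only. In real analyses, use the padj column your DE tool reports.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of p < alpha hits when every null hypothesis is true."""
    return n_tests * alpha


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value can be capped
    # by the one above it, which keeps the adjusted values monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(f"{expected_false_positives(20_000, 0.05):.0f}")  # 1000
    toy = [0.01, 0.04, 0.03, 0.005]
    print([round(p, 3) for p in benjamini_hochberg(toy)])  # [0.02, 0.04, 0.04, 0.02]
```

In the toy example every raw p-value is below 0.05, and the adjusted values are larger but still pass. With thousands of genes the adjustment is what removes most chance hits.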
Now pick a tutorial and dive in! We recommend starting with RNA-Seq Differential Expression or Data Visualization.