Bioinformatics 101

1 Linux / Bash Essentials

Nearly all bioinformatics tools run on Linux. If you're coming from Windows or macOS, you'll live in the terminal. Here are the commands you'll use daily:

Navigation & Files

Bash

# Where am I?
pwd

# List files (long format, human-readable sizes)
ls -lh

# Move around
cd /path/to/directory
cd ..           # go up one level
cd ~            # go to home directory

# Create directories and files
mkdir -p project/data project/results project/scripts
touch README.md

# Copy, move, remove
cp file.txt backup.txt
mv file.txt new_name.txt
rm unwanted_file.txt         # CAREFUL: no recycle bin!
rm -r unwanted_directory/    # remove directory recursively

Text Processing Power Tools

Bash

# View files
head -20 file.txt       # first 20 lines
tail -20 file.txt       # last 20 lines
less file.txt           # scroll through (press q to quit)
cat file.txt            # print entire file

# Count lines, words, characters
wc -l file.txt          # line count (critical for checking FASTQ!)

# Search inside files
grep "gene_name" results.txt            # find lines containing text
grep -c ">" sequences.fasta             # count FASTA sequences
grep -v "^#" file.vcf | head            # skip comment lines

# Column extraction and manipulation
cut -f1,3 file.tsv                      # extract columns 1 and 3
sort -k2,2n file.tsv                    # sort by column 2 numerically
awk '{print $1, $3-$2}' genes.bed       # interval length (BED: col 2 = start, col 3 = end)
sed 's/old/new/g' file.txt              # find-and-replace

# Pipes: chain commands together
cat reads.fastq | paste - - - - | wc -l # count FASTQ reads (4 lines per read; "^@" can also match quality lines)
zcat file.fastq.gz | head -8            # peek at gzipped files

The 5 Commands You'll Use Most: grep, awk, sort, cut, and wc. Master these and you can parse any bioinformatics output file without writing a full script.
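Counting FASTQ reads by line count is more reliable than matching `@`, because `@` is also a valid quality character and can start a quality line. Below is a minimal Python sketch of the same idea, assuming well-formed four-line records; the function name `count_fastq_reads` is illustrative and not part of any library.

```python
import gzip

def count_fastq_reads(path):
    """Count reads in a FASTQ file (plain or .gz), assuming 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A truncated download or a corrupted file usually shows up here
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Calling `count_fastq_reads("reads.fastq.gz")` on your own file returns the read count, or raises an error if the file looks truncated.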

Permissions & SSH

Bash

# Make a script executable
chmod +x my_script.sh

# Connect to a remote server / HPC cluster
ssh username@cluster.university.edu

# Transfer files to/from server
scp local_file.txt username@cluster:~/data/
scp username@cluster:~/results/output.csv ./

# Transfer entire directories
rsync -avz local_dir/ username@cluster:~/project/

Use screen or tmux when running long jobs via SSH. If your connection drops, the job keeps running. Start with screen -S myanalysis, run your command, then detach with Ctrl+A, D. Reattach with screen -r myanalysis.

2 Bioinformatics File Formats

Knowing your file formats is non-negotiable. Here's every format you'll encounter:

Format	What It Stores	Extension	Key Commands
FASTA	Reference sequences (genome, protein)	.fa, .fasta, .fna	`grep -c ">" ref.fa`
FASTQ	Raw sequencing reads + quality scores	.fq, .fastq, .fq.gz	`zcat reads.fq.gz | wc -l`, then divide by 4
SAM/BAM	Aligned reads (text/binary)	.sam, .bam	`samtools view, flagstat, sort, index`
CRAM	Reference-compressed alignments (smaller than BAM)	.cram	`samtools view` (needs reference)
VCF/BCF	Variant calls (SNPs, indels)	.vcf, .vcf.gz, .bcf	`bcftools view, stats, query`
BED	Genomic intervals (chr, start, end)	.bed	`bedtools intersect, merge, sort`
GFF3/GTF	Gene annotations (coordinates, features)	.gff3, .gtf	`awk '$3=="gene"' file.gff3`
WIG/bigWig	Continuous coverage signal	.wig, .bw	View in IGV / UCSC Genome Browser
BIOM	Microbiome OTU/ASV tables	.biom	`biom summarize-table`
h5ad/Loom	Single-cell count matrices	.h5ad, .loom	Python: `scanpy.read_h5ad()`
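BED coordinates are 0-based and half-open, so an interval's length is simply end minus start. The sketch below illustrates that arithmetic on two made-up records; the function name and the example intervals are hypothetical.

```python
def bed_interval_lengths(lines):
    """Yield (chrom, length) per BED record; BED is 0-based, half-open."""
    for line in lines:
        # Skip blank lines, comments, and optional track/browser header lines
        if not line.strip() or line.startswith(("#", "track ", "browser ")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        yield chrom, int(end) - int(start)

# Hypothetical records: chrom, start, end, name
example = ["chr1\t100\t250\tgeneA\n", "chr2\t0\t1000\tgeneB\n"]
print(list(bed_interval_lengths(example)))
```

The two example intervals come out as 150 bp and 1000 bp long.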

FASTQ Format Anatomy

Example: One FASTQ Read (4 lines)

@SRR12345678.1 HWI-ST1234:100:AB01:1:1101:1234:2045 length=150
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG...
+
IIIIIIIIIIIIIIIIIIIIIHHHHHHHHHGGGGGFFFFEEEDDDCCCCBBBBAAAA####...

Line 1: Read identifier (starts with @)
Line 2: DNA sequence
Line 3: Separator (+)
Line 4: Quality scores (ASCII-encoded Phred scores)

I = Q40 (99.99% accurate), # = Q2 (37% accurate)
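The quality characters map to Phred scores through their ASCII codes. This is a small sketch of that conversion, assuming the Phred+33 encoding used by modern Illumina data; the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string to Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

for ch in "I#":
    q = phred_scores(ch)[0]
    accuracy = 1 - error_probability(q)
    print(f"{ch!r}: Q{q}, accuracy {accuracy:.4f}")
```

This gives Q40 with accuracy 0.9999 for `I` and Q2 with accuracy 0.3690 for `#`, matching the values quoted above.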

SAM Format Anatomy

Key SAM fields

QNAME	FLAG	RNAME	POS	MAPQ	CIGAR	RNEXT	PNEXT	TLEN	SEQ	QUAL
read1	99	chr1	100	60	150M	=	350	400	ATCG...	IIII...

FLAG 99 = paired, properly paired, mate on reverse strand, first in pair (the read itself is on the forward strand)
MAPQ 60 = very high confidence mapping
CIGAR 150M = 150 aligned bases (M covers both matches and mismatches, so it does not guarantee a perfect match)
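The FLAG field is a sum of bit values, and the CIGAR string is a run of length/operation pairs. The sketch below decodes both, using the bit values from the SAM specification; the helper names `decode_flag` and `parse_cigar` are illustrative, and in practice `samtools flags` or pysam give you the same information.

```python
import re

# Bit values defined by the SAM specification
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

print(decode_flag(99))
print(parse_cigar("5S140M2I3M"))
```

For FLAG 99 (1 + 2 + 32 + 64) the decoder reports paired, proper_pair, mate_reverse_strand, and first_in_pair. The reverse_strand bit is not set, which is why the read itself is on the forward strand.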

3 Conda & Environment Management

Conda is the package manager for bioinformatics. It handles dependencies, versioning, and isolation so your tools don't conflict.

Bash

# Install Miniforge (recommended over full Anaconda)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
# Restart shell, then:

# Create an environment for RNA-Seq analysis
conda create -n rnaseq -c bioconda -c conda-forge \
  fastqc fastp star subread samtools multiqc

# Activate / deactivate
conda activate rnaseq
conda deactivate

# List all environments
conda env list

# Export environment for reproducibility
conda env export > rnaseq_env.yml

# Recreate on another machine
conda env create -f rnaseq_env.yml

Best Practices for Conda

One environment per project or workflow — don't install everything in base
Always specify channels: -c bioconda -c conda-forge
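Use Mamba for faster solving: Miniforge already ships with mamba, so mamba create ... works as a drop-in replacement for conda create ... (recent conda releases also use the same libmamba solver by default)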
Pin versions in your YAML for reproducibility: samtools=1.19
Export your env and include the YAML in your paper's supplementary materials

Common Conda Headaches:

Solving environment: failed — conflicting package versions. Try creating a fresh env with fewer packages, or use mamba.
conda activate doesn't work — run conda init bash and restart your shell.
Disk space: conda caches downloads. Clean with conda clean --all.

4 HPC & SLURM Job Submission

Most bioinformatics analyses require more resources than a laptop. High-Performance Computing (HPC) clusters use job schedulers (usually SLURM) to manage shared resources.

Bash

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --partition=normal
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@university.edu

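# Load your conda environment (non-interactive batch shells need conda's shell hook first)
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate rnaseq

# Create the results directory (logs/ must already exist before you run sbatch)
mkdir -p results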

# Your analysis command
STAR --runThreadN 8 \
  --genomeDir /path/to/genome_index \
  --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
  --readFilesCommand zcat \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix results/sample_

echo "Job finished at: $(date)"

SLURM Directive	What It Controls	How to Choose
`--cpus-per-task`	CPU cores	Match the `-t` or `--threads` flag of your tool. Don't request more than you use.
`--mem`	RAM	Check tool docs. Alignment: ~32G. GATK: ~16G. WGCNA: depends on gene count.
`--time`	Wall time limit	Overestimate by 50%. Job gets killed if it exceeds this.
`--partition`	Queue/partition	Ask your HPC admins. Common: short, normal, long, gpu.
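The 50% padding rule for `--time` is simple arithmetic. The helper below is a hypothetical sketch (the name `slurm_time` is not a SLURM tool) that turns an expected runtime in seconds into an `HH:MM:SS` string with padding added.

```python
import math

def slurm_time(expected_seconds, padding=0.5):
    """Format an expected runtime plus padding as HH:MM:SS for --time."""
    total = math.ceil(expected_seconds * (1 + padding))
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# An 8-hour estimate padded by 50%
print(slurm_time(8 * 3600))
```

An estimated 8-hour job becomes `12:00:00`, the value used in the template script above.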

Essential SLURM Commands

Bash

# Submit a job
sbatch my_script.sh

# Check your queued/running jobs
squeue -u $USER

# Cancel a job
scancel 12345678

# Check job efficiency after completion
seff 12345678

# Interactive session (for testing)
srun --cpus-per-task=4 --mem=16G --time=2:00:00 --pty bash

Pro Workflow: Write a SLURM script for each step. Test with a small subset first (head -40000 reads.fq > test.fq = 10,000 reads). Once the test works, scale up. Use seff on completed test jobs to right-size your resource requests for the full run.
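The same subsetting step can be sketched in Python, which is handy when the input is gzipped or when you want the subset created inside a larger script. This assumes four-line FASTQ records; `head_fastq` is an illustrative name. Like `head`, it takes the first reads in the file rather than a random sample.

```python
import gzip
from itertools import islice

def head_fastq(src, dst, n_reads=10_000):
    """Copy the first n_reads records (4 lines each) from src to dst."""
    opener = gzip.open if str(src).endswith(".gz") else open
    with opener(src, "rt") as fin, open(dst, "w") as fout:
        # islice stops after 4 * n_reads lines without reading the rest of the file
        fout.writelines(islice(fin, 4 * n_reads))
```

For example, `head_fastq("reads.fq.gz", "test.fq")` writes an uncompressed 10,000-read test file.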

5 R & RStudio Quick Start

R is the primary language for statistical analysis in bioinformatics (DESeq2, Seurat, WGCNA, ggplot2 all run in R).

R

# Install Bioconductor (the bioinformatics R package repository)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install a package from Bioconductor
BiocManager::install("DESeq2")

# Install from CRAN
install.packages("tidyverse")

# Load packages
library(DESeq2)
library(tidyverse)

# Read data
df <- read.csv("my_data.csv")
df <- read.delim("my_data.tsv")   # tab-separated

# Basic data exploration
dim(df)       # rows x columns
head(df)      # first 6 rows
str(df)       # structure / types
summary(df)   # statistical summary

# The tidyverse way (assumes df holds DESeq2 results with padj and log2FoldChange columns)
df %>%
  filter(padj < 0.05) %>%
  arrange(desc(abs(log2FoldChange))) %>%
  head(20)

6 Python for Bioinformatics

Python is dominant for sequence manipulation (Biopython), single-cell analysis (Scanpy), and machine learning. Key packages:

Package	What It Does	Category
Biopython	Parse FASTA/FASTQ, BLAST, GenBank records	Sequences
Scanpy	Python alternative to Seurat for scRNA-Seq	Single-Cell
pandas	DataFrame manipulation (like R's dplyr)	Data
matplotlib / seaborn	Plotting (like ggplot2)	Visualization
pysam	Read BAM/SAM files from Python	Genomics
scikit-learn	ML classifiers, clustering, dimensionality reduction	ML

Python

# Parse a FASTA file with Biopython
from Bio import SeqIO

for record in SeqIO.parse("genome.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")

# Read a count matrix with pandas
import pandas as pd
counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)
print(counts.shape)
print(counts.head())

7 Key Biological Databases

Database	What It Contains	URL
NCBI GenBank	All publicly available nucleotide sequences	ncbi.nlm.nih.gov
NCBI SRA	Raw sequencing data (FASTQ)	ncbi.nlm.nih.gov/sra
Ensembl	Genome assemblies, gene annotations (GTF/GFF)	ensembl.org
UCSC Genome Browser	Interactive genome visualization, track hubs	genome.ucsc.edu
UniProt	Protein sequences and functional annotations	uniprot.org
Gene Ontology (GO)	Functional categories for gene annotation	geneontology.org
KEGG	Metabolic and signaling pathway maps	genome.jp/kegg
dbSNP	Known genetic variants (SNPs, indels)	ncbi.nlm.nih.gov/snp
GEO	Published expression datasets (microarray, RNA-Seq)	ncbi.nlm.nih.gov/geo
Phytozome	Plant genomes and gene families	phytozome.jgi.doe.gov
FlyBase / WormBase / TAIR	Model organism databases	flybase.org / wormbase.org / arabidopsis.org
ClinVar	Clinical significance of human variants	ncbi.nlm.nih.gov/clinvar

Downloading Data from SRA: Use fasterq-dump (from SRA Toolkit), either directly or after prefetch. Example: fasterq-dump --split-3 SRR12345678 -e 8. The --split-3 flag writes paired-end reads to separate _1 and _2 files and puts any unpaired reads in a third file; -e sets the number of threads.

8 The Complete Tool Landscape

Every major bioinformatics tool, organized by analysis type.

QC & Trimming

Tool	What It Does	Category
FastQC	Visual QC report for FASTQ files	QC
MultiQC	Aggregate reports from multiple QC tools	QC
fastp	Ultra-fast trimmer + QC in one tool	Trimming
Trimmomatic	Java-based flexible trimmer	Trimming
Cutadapt	Adapter removal specialist	Trimming
BBTools / BBDuk	Swiss army knife for read processing	Multi

Read Alignment

Tool	What It Does	Category
BWA-MEM2	Short-read DNA aligner (WGS, WES)	DNA
STAR	Splice-aware RNA-Seq aligner (fast)	RNA
HISAT2	Splice-aware aligner (memory efficient)	RNA
Bowtie2	Fast short-read aligner (ChIP-Seq, small genomes)	DNA
minimap2	Long-read aligner (ONT, PacBio)	Long-Read
Salmon	Quasi-mapping RNA-Seq quantifier (no BAM needed)	RNA
kallisto	Pseudo-alignment based RNA-Seq quantifier	RNA
CellRanger	10X Genomics scRNA-Seq pipeline (alignment + counting)	scRNA

Differential Expression

Tool	What It Does	Category
DESeq2	Gold standard for bulk RNA-Seq DE	R
edgeR	Fast, flexible DE analysis	R
limma-voom	DE for large datasets, microarray background	R
Sleuth	DE from kallisto transcript quantification	R
MAST	DE for single-cell data (hurdle model)	scRNA
PyDESeq2	Python implementation of DESeq2	Python

Single-Cell Analysis

Tool	What It Does	Category
Seurat (v5)	R toolkit for scRNA-Seq analysis	R
Scanpy	Python scRNA-Seq framework (AnnData)	Python
scvi-tools	Deep learning for single-cell (integration, DE)	Python
CellTypist	Automated cell type annotation	Annotation
SingleR	Reference-based cell annotation (R)	R
Monocle3	Trajectory / pseudotime analysis	R
CellChat	Cell-cell communication inference	R
InferCNV	Copy number from scRNA-Seq (cancer)	R

Variant Calling & Genomics

Tool	What It Does	Category
GATK4	Gold standard germline/somatic variant calling	Variants
bcftools	Lightweight variant caller + VCF manipulation	Variants
DeepVariant	Google's deep learning variant caller	AI
Strelka2	Fast germline + somatic calling	Variants
Mutect2	Somatic mutation calling (tumor-normal)	Cancer
SnpEff / VEP	Variant annotation (impact, gene, protein change)	Annotation
ANNOVAR	Multi-database variant annotation	Annotation
PLINK	GWAS, population genetics, linkage	Population

Genome Assembly

Tool	What It Does	Category
hifiasm	Best for PacBio HiFi assembly	Long-Read
Flye	ONT/PacBio long-read assembler	Long-Read
SPAdes	Short-read (Illumina) assembler, also metagenomics	Short-Read
QUAST	Assembly quality assessment (N50, BUSCO)	QC
BUSCO	Gene completeness assessment	QC
Pilon	Assembly polishing with short reads	Polish
Merqury	Reference-free assembly QC with k-mers	QC
GenomeScope2	Estimate genome size, heterozygosity from k-mers	Estimation

Metagenomics & Microbiome

Tool	What It Does	Category
QIIME2	Complete 16S/ITS amplicon pipeline	Amplicon
Kraken2 / Bracken	Fast k-mer taxonomic classification	Shotgun
MetaPhlAn4	Marker-gene based profiling	Shotgun
HUMAnN3	Functional pathway profiling	Function
DADA2	ASV denoising (inside QIIME2 or standalone R)	Amplicon
MEGAHIT / metaSPAdes	Metagenomic assembly	Assembly
CheckM2	MAG quality assessment	QC
Prokka / Bakta	Prokaryotic genome annotation	Annotation

Visualization

Tool	What It Does	Category
ggplot2	Grammar of graphics (R)	R
ComplexHeatmap	Advanced heatmaps with annotations	R
plotly	Interactive charts (R + Python)	Interactive
IGV	Integrative Genomics Viewer (BAM, VCF, BED)	Desktop
Cytoscape	Network visualization and analysis	Networks
Circos / BioCircos	Circular genome plots	Genomics
EnhancedVolcano	Publication-ready volcano plots	R
Krona	Interactive taxonomic composition charts	Microbiome

Workflow Managers & Pipelines

Tool	What It Does	Category
Nextflow	DSL2 workflow language (most popular in bio)	Workflow
nf-core	Community Nextflow pipelines (rnaseq, sarek, etc.)	Pipeline
Snakemake	Python-based workflow manager	Workflow
Galaxy	Web-based GUI for bioinformatics (no coding needed)	GUI
WDL / Cromwell	Broad Institute workflow definition (GATK pipelines)	Workflow

9 Using AI to Accelerate Your Bioinformatics

AI assistants (ChatGPT, Claude, Copilot) are game-changers for bioinformatics. Here's how to use them effectively:

Example: Adapt a Script to Your Data

Copy this template prompt and fill in the bracketed sections:

I have an RNA-Seq count matrix with [NUMBER] genes and [NUMBER] samples.
My conditions are: [LIST YOUR CONDITIONS, e.g., "3 Control, 3 Drought-Stressed"].
My organism is [ORGANISM, e.g., "Arabidopsis thaliana"].
My count file is tab-separated with gene IDs as row names and sample names as column headers.

Please write a complete DESeq2 R script that:
1. Loads the count matrix from "counts.txt"
2. Creates the sample metadata with my conditions
3. Runs DESeq2 with the appropriate design formula
4. Extracts results with padj < 0.05 and |log2FC| > 1
5. Creates a volcano plot and PCA plot
6. Saves results to a CSV file
7. Runs GO enrichment using clusterProfiler with the correct OrgDb for my organism

Please use the [ORGANISM OrgDb package, e.g., org.At.tair.db] for annotation.

Tips for Better AI Prompts:

Be specific: Include your organism, sample count, file format, and exact column names
Provide context: "I'm running this on an HPC with SLURM" vs. "I'm on my laptop"
Paste error messages: Include the full error text when debugging
Ask for explanations: "Explain each parameter choice" gets better code than "Write a script"
Iterate: Ask follow-ups like "Now add batch correction" or "Convert this to a SLURM script"

Mnemonics & Thumb Rules

Memorize these and you'll never feel lost in a bioinformatics pipeline.

Mnemonic

"Quality → Trim → Align → Count → Diff" — QTACD

The standard RNA-Seq workflow: QC reads → Trim adapters → Align to genome → Count features → Differential expression. Most bulk RNA-Seq analyses follow this pattern.

Mnemonic

"FASTQ has Quality, FASTA does Not"

FASTQ = sequences + quality scores (raw reads). FASTA = sequences only (reference genomes, proteins). The "Q" in FASTQ stands for Quality.

Mnemonic

"SAM is Human, BAM is Binary"

SAM = text (human-readable alignment). BAM = compressed binary (machine-efficient). CRAM = even smaller. Always store BAM, never SAM.

Mnemonic

"Conda Create, Conda Activate" — CCA

conda create -n myenv → conda activate myenv → install tools → do work → conda deactivate. One environment per project. Never install into base.

Thumb Rule — SLURM Resources: Start with 8 CPUs, 32 GB RAM, 24 hours. Adjust after your first run based on seff <jobid> output. Over-requesting wastes queue priority; under-requesting kills your job.

Thumb Rule — File Sizes: Raw FASTQ: ~1 GB per 3M reads. BAM: ~50% of FASTQ. VCF: tiny (MBs). Always compress with gzip. Delete intermediate files after QC.

Mnemonic

"Grep Awk Sed — the GAS that powers bioinformatics"

grep = search text. awk = extract columns. sed = find & replace. Master these three and you can wrangle any TSV, GFF, or BED file.

Mnemonic

"Padj, Not Pvalue" — PNP

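Always filter by adjusted p-value (padj / FDR / q-value), never raw p-value. With 20,000 genes tested, a raw cutoff of p < 0.05 is expected to yield about 1,000 false positives (20,000 × 0.05) even if no gene is truly differentially expressed. Filtering on padj < 0.05 instead controls the false discovery rate: about 5% of the genes you call significant are expected to be false positives.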
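To make the adjustment concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, the default adjustment method behind DESeq2's padj column. The function name and the four example p-values are made up for illustration. In real analyses, rely on the values your DE tool reports rather than this sketch.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes
raw = [0.01, 0.04, 0.03, 0.20]
for p, padj in zip(raw, benjamini_hochberg(raw)):
    print(f"p = {p:.2f} -> padj = {padj:.4f}")
```

In this toy example, the genes with raw p = 0.03 and p = 0.04 would both pass a raw 0.05 cutoff, but after adjustment both land at roughly 0.053 and no longer pass.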

Thumb Rule — Replicates: Minimum 3 biological replicates per condition for differential analysis. 2 replicates = underpowered. Technical replicates are NOT substitutes for biological replicates.

Summary

You're Ready! You now know:

Essential Linux/Bash commands for file manipulation
Every major bioinformatics file format and how to inspect them
Conda environment management for reproducible analyses
SLURM/HPC job submission with proper resource requests
R and Python basics for bioinformatics
Key biological databases and how to download data
The complete landscape of bioinformatics tools
How to use AI effectively for scripting and debugging

Now pick a tutorial and dive in! We recommend starting with RNA-Seq Differential Expression or Data Visualization.
