Recommended Pipeline Directory Structure
Benefits:
- separates workflow logic from data
- easier debugging
- easier collaboration
Common practice:
- config -> parameters and sample tables
- envs -> reproducible environments
- rules -> modular workflow steps
- results -> generated outputs
A clean directory structure makes pipelines easier to maintain and reproduce.
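As a rough illustration, the layout above could be scaffolded with a few lines of Python. This is a minimal sketch, not part of Snakemake: the `LAYOUT` table and the `scaffold` helper are hypothetical names, and the directory names are taken from the list above.

```python
from pathlib import Path
import tempfile

# Directory names and their purpose, copied from the list above.
# LAYOUT and scaffold() are illustrative names, not a Snakemake API.
LAYOUT = {
    "config": "parameters and sample tables",
    "envs": "reproducible environments",
    "rules": "modular workflow steps",
    "results": "generated outputs",
}


def scaffold(root):
    """Create the pipeline directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in LAYOUT:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(tmp):
            print(path.name, "->", LAYOUT[path.name])
```

The Snakefile itself would normally sit at the top level, next to these directories.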
Snakefile Breakdown
- Fastq files that need trimming - input: sample.fastq
- Cutadapt - output: sample-trimmed.fastq
- BWA - align trimmed fastq to assembly - output: sample-aligned.sam
- Samtools sorting, indexing - output: sample-sorted.bam
- Freebayes variant calling - output: sample-variants.vcf
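Example Snakefile
# Default target: request the final output of the pipeline.
rule all:
    input:
        "variants/sample1.vcf"

# Trim adapters from the raw reads with Cutadapt.
# -a gives a 3' adapter for single-end reads
# (-A applies only to the second read of a pair).
rule trim:
    input:
        "reads/sample1.fastq"
    output:
        "trimmed_reads/sample1-trimmed.fastq"
    shell:
        "cutadapt -a TCCGGGTS -o {output} {input}"

# Align the trimmed reads with BWA and convert the SAM stream to BAM.
rule align:
    input:
        "trimmed_reads/sample1-trimmed.fastq"
    output:
        "bam/sample1.bam"
    threads: 1
    shell:
        "bwa mem -t {threads} ref.fa {input} | samtools view -Sb - > {output}"

# The sorting/indexing and variant-calling rules that produce
# variants/sample1.vcf are not shown in this excerpt.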
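When no target is given on the command line, Snakemake takes the first rule in the Snakefile (here, rule all) as the target. It then works backwards from that rule's inputs, matching each required file to the rule whose output produces it, and builds a directed acyclic graph (DAG) of dependencies that determines the execution order.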
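The backward resolution can be sketched in plain Python. This is a simplified model, not Snakemake's implementation: it ignores timestamps, wildcards, and cycle detection. The `RULES` table and the `plan` function are hypothetical, and the "sort" and "call" entries (with their file names) are assumptions added to complete the chain described in the breakdown.

```python
# Toy rule table: each rule lists the files it reads and writes.
# "sort" and "call" and their file names are hypothetical additions.
RULES = {
    "all": {"input": ["variants/sample1.vcf"], "output": []},
    "trim": {
        "input": ["reads/sample1.fastq"],
        "output": ["trimmed_reads/sample1-trimmed.fastq"],
    },
    "align": {
        "input": ["trimmed_reads/sample1-trimmed.fastq"],
        "output": ["bam/sample1.bam"],
    },
    "sort": {"input": ["bam/sample1.bam"], "output": ["bam/sample1-sorted.bam"]},
    "call": {"input": ["bam/sample1-sorted.bam"], "output": ["variants/sample1.vcf"]},
}


def plan(rules, target=None):
    """Return rule names in an order that satisfies every dependency."""
    if target is None:
        target = next(iter(rules))  # the first rule is the default target
    # Map each output file to the rule that produces it.
    producers = {f: name for name, rule in rules.items() for f in rule["output"]}
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for f in rules[name]["input"]:
            # Files with no producing rule are assumed to exist on disk.
            if f in producers:
                visit(producers[f])
        order.append(name)  # all prerequisites are already in the list

    visit(target)
    return order


print(plan(RULES))  # ['trim', 'align', 'sort', 'call', 'all']
```

Asking for an intermediate target, for example `plan(RULES, "align")`, returns only the rules needed for that file, which mirrors how requesting a specific target runs just part of the workflow.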
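Wildcards such as {sample} are placeholders in a rule's input and output file names. Snakemake infers their values by matching the requested output file against the rule's output pattern, so a single rule can process many files instead of one hard-coded sample.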
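The pattern matching behind wildcards can be approximated with a regular expression. The sketch below is an illustration only: `pattern_to_regex` and `match_wildcards` are hypothetical helpers, not Snakemake functions. It assumes each wildcard name appears once per pattern and lets every wildcard match `.+`, which is Snakemake's documented default.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'dir/{sample}.ext' into a regex with one named group per wildcard."""
    # Splitting on a capturing group yields: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text, matched exactly
        else:
            regex += f"(?P<{part}>.+)"      # wildcard, matches one or more chars
    return re.compile(regex)


def match_wildcards(pattern, path):
    """Return the wildcard values if path fits pattern, else None."""
    m = pattern_to_regex(pattern).fullmatch(path)
    return m.groupdict() if m else None


# Output pattern of a generalized trim rule (hypothetical).
output_pattern = "trimmed_reads/{sample}-trimmed.fastq"
input_pattern = "reads/{sample}.fastq"

wildcards = match_wildcards(output_pattern, "trimmed_reads/sample2-trimmed.fastq")
print(wildcards)                            # {'sample': 'sample2'}
print(input_pattern.format(**wildcards))    # reads/sample2.fastq
```

Once the wildcard value is known from the requested output, the same value is substituted into the input pattern, which is how one rule definition can serve sample1, sample2, and so on.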