<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Reproducibility in Bioinformatics | RC Learning Portal</title>
    <link>/notes/bioinfo-reproducibility/</link>
      <atom:link href="/notes/bioinfo-reproducibility/index.xml" rel="self" type="application/rss+xml" />
    <description>Reproducibility in Bioinformatics</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>©&#160;2026 The Rector and Visitors of the University of Virginia</copyright><lastBuildDate>Wed, 25 Mar 2026 19:08:46 +0000</lastBuildDate>
    <image>
      <url>/images/icon_hu13341279237897646923.png</url>
      <title>Reproducibility in Bioinformatics</title>
      <link>/notes/bioinfo-reproducibility/</link>
    </image>
    
    <item>
      <title>Reproducibility in Science</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_4/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_4/</guid>
      <description>&lt;h2 id=&#34;reproducibility-vs-replication&#34;&gt;Reproducibility vs Replication&lt;/h2&gt;
&lt;h3 id=&#34;reproducibility&#34;&gt;Reproducibility&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Redo a scientific experiment &amp;amp; generate similar results&lt;/li&gt;
&lt;li&gt;Same sample, software, data, code - same result?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;replication&#34;&gt;Replication&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Different data, same methods - conclusions consistent?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;reusability&#34;&gt;Reusability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Will someone be able to use your pipeline in the future?&lt;/li&gt;
&lt;li&gt;Will you be able to use it?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;the-reproducibility-problem&#34;&gt;The Reproducibility Problem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Where did you run the analysis: laptop, server, lab computer? In which environment?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are you using the most recent versions (scripts, datasets, analyses)?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;quot;We just used the default settings!&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;studies-in-reproducibility&#34;&gt;Studies in Reproducibility&lt;/h2&gt;
&lt;h3 id=&#34;nature-2016&#34;&gt;Nature (2016)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Found that 70% of researchers have failed to reproduce another researcher’s results&lt;/li&gt;
&lt;li&gt;50% of researchers failed to reproduce their own results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;plos-biology-2024&#34;&gt;PLoS Biology (2024)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Biomedical researchers - 72% reported “reproducibility crisis”&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;genome-biol-2024&#34;&gt;Genome Biol (2024)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Reproducibility in the bioinformatics era&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;challenges-of-bioinformatics&#34;&gt;Challenges of Bioinformatics&lt;/h2&gt;
&lt;p&gt;So many tools, often with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple versions &amp;amp; releases&lt;/li&gt;
&lt;li&gt;Complex dependencies &amp;amp; hidden parameters, starting seeds&lt;/li&gt;
&lt;li&gt;Running tools locally vs on HPC&lt;/li&gt;
&lt;li&gt;Formatting conversions between software&lt;/li&gt;
&lt;li&gt;Scalability - how tools handle datasets increasing in size&lt;/li&gt;
&lt;li&gt;Keeping code organized!&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Aspects of Reproducibility</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_9/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_9/</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Version control&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Environment management&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data storage&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Containers&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tool/software maintenance&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;version-control&#34;&gt;Version Control&lt;/h2&gt;
&lt;h3 id=&#34;github&#34;&gt;GitHub&lt;/h3&gt;
&lt;p&gt;Click the following link to visit the GitHub site:
&lt;a href=&#34;https://github.com&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://github.com&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Track and manage changes to your code &amp;amp; files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Store and label changes at every step&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Small or large projects&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Collaborate on projects and minimize conflicting edits&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Works on multiple platforms (MacOS, Windows, Linux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Example: the cutadapt repository website on GitHub&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;envoronment-management&#34;&gt;Environment Management&lt;/h2&gt;
&lt;h3 id=&#34;condamamba-environments&#34;&gt;Conda/Mamba environments&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Isolated spaces for each project with specific tool versions&lt;/li&gt;
&lt;li&gt;Manage Python versions and dependencies&lt;/li&gt;
&lt;li&gt;Install packages and software directly into environment&lt;/li&gt;
&lt;li&gt;Stable and reproducible place to run code and applications&lt;/li&gt;
&lt;li&gt;Not limited to Python; environments can also provide bash, Rscript, and other tools&lt;/li&gt;
&lt;li&gt;YAML configuration file to create or export and transfer an environment&lt;/li&gt;
&lt;/ul&gt;
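&lt;p&gt;As a sketch, an environment can be captured in and rebuilt from a YAML file; the environment name below is illustrative, and the pinned versions are taken from the module versions used later in these notes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# environment.yml (example)
name: variant-calling
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - cutadapt=4.9
  - samtools=1.21
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then &lt;code&gt;conda env create -f environment.yml&lt;/code&gt; rebuilds the environment, and &lt;code&gt;conda env export &amp;gt; environment.yml&lt;/code&gt; captures an existing one for transfer.&lt;/p&gt;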
&lt;h2 id=&#34;storing-results&#34;&gt;Storing Results&lt;/h2&gt;
&lt;h3 id=&#34;public-repositories-for-sequence-data---required-for-most-journals&#34;&gt;Public repositories for sequence data - required for most journals&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Click the following link for the NCBI database:
&lt;a href=&#34;https://www.ncbi.nlm.nih.gov&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://www.ncbi.nlm.nih.gov&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Click the following link for the Ensembl database:
&lt;a href=&#34;https://www.ensembl.org/index.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://www.ensembl.org/index.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Always document and archive changes, especially if unpublished:
&lt;ul&gt;
&lt;li&gt;genome assembly versions&lt;/li&gt;
&lt;li&gt;sequence data: SNPs, isoforms&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;containers&#34;&gt;Containers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Containers are portable environments that run across different computing environments&lt;/li&gt;
&lt;li&gt;They contain packages, software and dependencies that remain isolated from host infrastructure&lt;/li&gt;
&lt;li&gt;A standalone unit of software that can produce the same results on a different machine or server&lt;/li&gt;
&lt;/ul&gt;
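&lt;p&gt;On HPC systems, containers are typically run with Apptainer (formerly Singularity) rather than Docker. As a hedged sketch, assuming Apptainer is available and using the biocontainers bwa image referenced later in these notes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# pull a Docker image and convert it to a local .sif file
apptainer pull docker://biocontainers/bwa

# run a command inside the container, isolated from host software
apptainer exec bwa_latest.sif bwa
&lt;/code&gt;&lt;/pre&gt;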
</description>
    </item>
    
    <item>
      <title>Bioinformatic Pipelines</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_14/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_14/</guid>
      <description>&lt;h2 id=&#34;typical-bioinformatics-workflows-involve-many-steps&#34;&gt;Typical bioinformatics workflows involve many steps:&lt;/h2&gt;
&lt;h3 id=&#34;fastq--qc--alignment--sorting--variant-calling--annotation&#34;&gt;FASTQ → QC → Alignment → Sorting → Variant Calling → Annotation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;FASTQ files need quality check&lt;/li&gt;
&lt;li&gt;Cutadapt for trimming&lt;/li&gt;
&lt;li&gt;BWA - genome alignment&lt;/li&gt;
&lt;li&gt;Samtools - file formatting and conversions&lt;/li&gt;
&lt;li&gt;Freebayes - variant calling&lt;/li&gt;
&lt;li&gt;VCFtools - manipulating files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Create a pipeline to string software together for a “final” output&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;bioinformatic-pipeline-challenges&#34;&gt;Bioinformatic Pipeline Challenges&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Complex dependencies between steps&lt;/li&gt;
&lt;li&gt;Formatting inconsistencies&lt;/li&gt;
&lt;li&gt;Hard to reproduce results - scalability, parameters, version changes&lt;/li&gt;
&lt;li&gt;Difficult to parallelize efficiently&lt;/li&gt;
&lt;li&gt;Manual scripts often fail on HPC&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;bioinformatic-pipelines-on-hpc&#34;&gt;Bioinformatic Pipelines on HPC&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Which modules were loaded?&lt;/li&gt;
&lt;li&gt;Where are scripts being run?&lt;/li&gt;
&lt;li&gt;Tracking paths - hard-coded in scripts?&lt;/li&gt;
&lt;li&gt;Out/error files - software vs slurm conflicts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Automate and track these workflows&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Snakemake</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_17/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_17/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Snakemake&lt;/strong&gt; is a workflow management system designed for scientific pipelines&lt;/p&gt;
&lt;p&gt;Click the following link to visit the snakemake site:
&lt;a href=&#34;https://snakemake.github.io/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://snakemake.github.io/&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Created by Johannes Köster, first released in 2012&lt;/li&gt;
&lt;li&gt;Based on UNIX make - originally created in 1976 but still in standard use&lt;/li&gt;
&lt;li&gt;Python based - “ &lt;em&gt;snake-make&lt;/em&gt; ”&lt;/li&gt;
&lt;li&gt;Free and open source, available on Mac, Windows, Unix&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;Make&lt;/code&gt; is a command-line interface software tool that performs actions ordered by configured dependencies as defined in a configuration file called a makefile. It is commonly used for build automation to build executable code from source code. &lt;/p&gt;
&lt;h3 id=&#34;snakemake-format&#34;&gt;Snakemake Format&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Similar to writing shell scripts, but snakefiles contain sets of rules&lt;/li&gt;
&lt;li&gt;Format is based on Python structure&lt;/li&gt;
&lt;li&gt;Snakemake reads from snakefile that defines the rules&lt;/li&gt;
&lt;li&gt;Snakefile rules have a target output&lt;/li&gt;
&lt;li&gt;Snakemake uses pattern matching to follow the inputs, outputs and commands contained in rules to reach final target output&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;snakemake-core-idea&#34;&gt;Snakemake Core Idea&lt;/h3&gt;
&lt;p&gt;Instead of defining steps, you define &lt;strong&gt;rules that produce files&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rule align:
    input:
        &amp;quot;reads.fastq&amp;quot;
    output:
        &amp;quot;aligned.bam&amp;quot;
    shell:
        &amp;quot;bwa mem ref.fa {input} &amp;gt; {output}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Snakemake builds a &lt;strong&gt;directed acyclic graph (DAG)&lt;/strong&gt;  automatically.&lt;/p&gt;
&lt;p&gt;Fastq → Cutadapt → BWA → Sorted BAM → Freebayes → VCF&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Recommended Pipeline Directory Structure</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_20/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_20/</guid>
      <description>&lt;h2 id=&#34;benefits&#34;&gt;Benefits:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;separates  &lt;strong&gt;workflow logic from data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;easier debugging&lt;/li&gt;
&lt;li&gt;easier collaboration&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;common-practice&#34;&gt;Common practice:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;config → parameters and sample tables&lt;/li&gt;
&lt;li&gt;envs → reproducible environments&lt;/li&gt;
&lt;li&gt;rules → modular workflow steps&lt;/li&gt;
&lt;li&gt;results → generated outputs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A clean directory structure makes pipelines easier to maintain and reproduce.&lt;/p&gt;
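&lt;p&gt;Putting the common practice above together, a project might be laid out like this (file names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;project/
├── Snakefile
├── config/
│   └── config.yaml
├── envs/
│   └── variant.yaml
├── rules/
│   └── align.smk
└── results/
&lt;/code&gt;&lt;/pre&gt;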
&lt;h2 id=&#34;snakefile-breakdown&#34;&gt;Snakefile Breakdown&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Fastq files that need trimming - input:  sample.fastq&lt;/li&gt;
&lt;li&gt;Cutadapt - output: sample-trimmed.fastq&lt;/li&gt;
&lt;li&gt;BWA - align trimmed fastq to assembly output: sample-aligned.sam&lt;/li&gt;
&lt;li&gt;Samtools sorting, indexing - output: sample-sorted.bam&lt;/li&gt;
&lt;li&gt;Freebayes variant calling - output: sample-variants.vcf&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;example-snakefile&#34;&gt;Example snakefile&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;rule all:
    input:
        &amp;quot;variants/sample1.vcf&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Snakemake takes the first rule as the target, then constructs a graph of dependencies&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rule trim:
    input:
        &amp;quot;reads/sample1.fastq&amp;quot;
    output:
        &amp;quot;trimmed_reads/sample1-trimmed.fastq&amp;quot;
    shell:
        &amp;quot;cutadapt -a TCCGGGTS -o {output} {input}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;rule align:
    input:
        &amp;quot;trimmed_reads/sample1-trimmed.fastq&amp;quot;
    output:
        &amp;quot;bam/sample1.bam&amp;quot;
    threads: 1
    shell:
        &amp;quot;bwa mem -t {threads} ref.fa {input} | samtools view -Sb - &amp;gt; {output}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;wildcards&lt;/strong&gt; serve as placeholders within rules to operate on multiple files via pattern matching&lt;/p&gt;
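&lt;p&gt;For example, rewriting the trim rule with a {sample} wildcard lets one rule handle every sample; the adapter sequence and sample names here are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rule all:
    input:
        expand(&amp;quot;trimmed_reads/{sample}-trimmed.fastq&amp;quot;, sample=[&amp;quot;sample1&amp;quot;, &amp;quot;sample2&amp;quot;])

rule trim:
    input:
        &amp;quot;reads/{sample}.fastq&amp;quot;
    output:
        &amp;quot;trimmed_reads/{sample}-trimmed.fastq&amp;quot;
    shell:
        &amp;quot;cutadapt -a AACCGGTT -o {output} {input}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;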
</description>
    </item>
    
    <item>
      <title>Snakemake Exercises on HPC</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_23/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_23/</guid>
      <description>&lt;h2 id=&#34;class-data&#34;&gt;Class data:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;GCF_000005845.2_ASM584v2_genomic.fna - genome assembly&lt;/li&gt;
&lt;li&gt;SRR2584863_1.fastq - fastq sequence file, paired-1&lt;/li&gt;
&lt;li&gt;SRR2584863_2.fastq - fastq sequence file, paired-2&lt;/li&gt;
&lt;li&gt;*.smk - snakemake files&lt;/li&gt;
&lt;li&gt;config_variant.yml - configuration file&lt;/li&gt;
&lt;li&gt;submit_snakemake.sh - sample slurm file&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;run-interactively---good-for-testing&#34;&gt;Run interactively - good for testing&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;ijob -c 1 -A allocation -p interactive -v -t 2:00:00
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;modules&#34;&gt;Modules&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;module spider &amp;lt;package&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;specifics and version of package available&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;module spider snakemake
module load snakemake/9.8.1
module list
snakemake --help
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;other-modules-needed&#34;&gt;Other modules needed&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;module load bwa/0.7.17
module load  cutadapt/4.9
module load snakemake/9.8.1
module load freebayes/1.3.10
module load  samtools/1.21

&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;running-snakemake---genome-alignment&#34;&gt;Running snakemake - genome alignment&lt;/h2&gt;
&lt;p&gt;Snakefile - file.smk, contains rules for snakemake&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;snakemake -c 1 -s align.smk
# --dry-run (-np) good to test first without producing output
# -n only show steps, don&#39;t run; -p print shell commands
# -c number of cores
# -s needed if using a named snakefile (if just called &amp;quot;Snakefile&amp;quot;, the -s flag isn&#39;t needed)

snakemake --dag -s align.smk | dot -Tpng &amp;gt; dag_align.png
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;running-snakemake---variant-detection&#34;&gt;Running snakemake - variant detection&lt;/h2&gt;
&lt;p&gt;Snakefile - file.smk, contains rules for snakemake&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;snakemake -c 1 -s variant-call.smk
# --dry-run to test first
# -c number of cores
# -s needed if using a named snakefile (if just called &amp;quot;Snakefile&amp;quot;, not needed)

snakemake --dag -s variant-call.smk | dot -Tpng &amp;gt; dag_variant.png
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Snakemake Examples on HPC</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_29/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_29/</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Not recommended to hard-code files within snake file&lt;/li&gt;
&lt;li&gt;Can organize sample names, file paths, and software parameters in a YAML configuration file&lt;/li&gt;
&lt;li&gt;YAML - serialization language that transforms data into a format that can be shared between systems&lt;/li&gt;
&lt;li&gt;With snakemake, configuration file is a reference for the workflow&lt;/li&gt;
&lt;/ul&gt;
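&lt;p&gt;As a sketch, a config file might hold the sample name and genome from the class data, which the snakefile then looks up through the config dictionary (the key names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# config_variant.yml (example)
sample: SRR2584863
genome: GCF_000005845.2_ASM584v2_genomic.fna
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside the snakefile, values are referenced as config[&amp;quot;sample&amp;quot;] and config[&amp;quot;genome&amp;quot;] instead of hard-coded paths.&lt;/p&gt;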
&lt;h2 id=&#34;running-snakemake-with-config-file&#34;&gt;Running Snakemake with Config File&lt;/h2&gt;
&lt;p&gt;Snakefile - file.smk, contains rules for snakemake&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;snakemake -c 1 -s variant-yml.smk --configfile config_variant.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;--configfile - directs snakemake to a config file&lt;/li&gt;
&lt;li&gt;-c number of cores&lt;/li&gt;
&lt;li&gt;-s needed if using a named snakefile&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;reproducible-environments&#34;&gt;Reproducible Environments&lt;/h2&gt;
&lt;h3 id=&#34;snakemake-supports-reproducible-environments&#34;&gt;Snakemake supports reproducible environments&lt;/h3&gt;
&lt;p&gt;Example with Conda:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rule fastqc:
    input:
        &amp;quot;reads.fastq&amp;quot;
    output:
        &amp;quot;qc.html&amp;quot;
    conda:
        &amp;quot;~/.conda/envs/fastqc_env&amp;quot;  # path to conda environment
    shell:
        &amp;quot;fastqc {input}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benefits: Easy dependency management, portable workflows&lt;/p&gt;
&lt;p&gt;You can also create an environment.yml file listing conda environments and what to install&lt;/p&gt;
&lt;h2 id=&#34;snakemake-with-conda-environment&#34;&gt;Snakemake with Conda Environment&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;module load miniforge
conda create ...     # create an environment
conda activate ...   # activate it
snakemake ...        # run your snakemake command
screen/tmux          # keep the session running
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Keeps session running when disconnected&lt;/p&gt;
&lt;p&gt;You can create different conda environments for different rules&lt;/p&gt;
&lt;h2 id=&#34;smakemake-and-containers&#34;&gt;Snakemake and Containers&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;rule align:
    container:
        &amp;quot;docker://biocontainers/bwa&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;identical software environments&lt;/li&gt;
&lt;li&gt;portable across HPC systems&lt;/li&gt;
&lt;li&gt;easier collaboration&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Best Practices for HPC</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_35/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_35/</guid>
      <description>&lt;h2 id=&#34;recommendations&#34;&gt;Recommendations:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Use threads and resources properly&lt;/li&gt;
&lt;li&gt;Avoid huge single jobs&lt;/li&gt;
&lt;li&gt;Break workflows into modular rules&lt;/li&gt;
&lt;li&gt;Use conda or containers&lt;/li&gt;
&lt;li&gt;Use --dry-run before submitting large workflows&lt;/li&gt;
&lt;li&gt;Store configuration in YAML files&lt;/li&gt;
&lt;/ul&gt;
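&lt;p&gt;For instance, threads and memory can be declared per rule so the scheduler knows what each step needs; the values below are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rule align:
    input:
        &amp;quot;trimmed_reads/sample1-trimmed.fastq&amp;quot;
    output:
        &amp;quot;bam/sample1.bam&amp;quot;
    threads: 4
    resources:
        mem_mb=8000
    shell:
        &amp;quot;bwa mem -t {threads} ref.fa {input} | samtools view -Sb - &amp;gt; {output}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;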
&lt;h2 id=&#34;common-hpc-pitfalls-with-workflow-managers&#34;&gt;Common HPC Pitfalls with Workflow Managers&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requesting too many cores per rule&lt;/li&gt;
&lt;li&gt;Forgetting to specify memory&lt;/li&gt;
&lt;li&gt;Submitting thousands of tiny jobs&lt;/li&gt;
&lt;li&gt;Running Snakemake or Nextflow themselves on a login node&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;key-takeaways-with-workflow-managers&#34;&gt;Key Takeaways with Workflow Managers&lt;/h2&gt;
&lt;p&gt;Snakemake &amp;amp; Nextflow provide:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reproducible pipelines&lt;/li&gt;
&lt;li&gt;Automatic dependency tracking&lt;/li&gt;
&lt;li&gt;Scalable HPC execution&lt;/li&gt;
&lt;li&gt;Environment management&lt;/li&gt;
&lt;li&gt;Workflow portability&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>What is Nextflow?</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_39/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_39/</guid>
      <description>&lt;p&gt;Nextflow is a workflow management system that helps automate and organize multi-step computational pipelines&lt;/p&gt;
&lt;p&gt;At a high level, it connects software steps together, manages how data moves between them,
and handles execution across local machines, HPC schedulers like SLURM, or cloud platforms&lt;/p&gt;
&lt;h2 id=&#34;nextflow-pipelines&#34;&gt;Nextflow pipelines&lt;/h2&gt;
&lt;p&gt;Key concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Processes, workflows, and parameters&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In general, we are going to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create processes to execute desired commands&lt;/li&gt;
&lt;li&gt;Specify parameters to represent workflow settings&lt;/li&gt;
&lt;li&gt;Define a workflow to execute processes in a specific order&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;main.nf and nextflow.config&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Nextflow Example</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_41/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_41/</guid>
      <description>&lt;h2 id=&#34;example&#34;&gt;Example&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s start with a very simple toy example that echoes the text &amp;ldquo;Hello World!&amp;rdquo;, and then we&amp;rsquo;ll build up to our bioinformatics example.&lt;/p&gt;
&lt;p&gt;First, create a process called HELLO with our shell command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process HELLO {
 script:
 &amp;quot;&amp;quot;&amp;quot;
 echo &amp;quot;Hello World!&amp;quot;
 &amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we execute this process in our workflow:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;workflow {
 HELLO()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;create-a-new-file-called-mainnf&#34;&gt;Create a New File called main.nf&lt;/h2&gt;
&lt;p&gt;We can create a new file called main.nf with these lines.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process HELLO {
 script:
 &amp;quot;&amp;quot;&amp;quot;
 echo &amp;quot;Hello World!&amp;quot;
 &amp;quot;&amp;quot;&amp;quot;
}

workflow {
 HELLO()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;make-some-changes&#34;&gt;Make some changes&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process  hello {
 output:
 path &#39;hello.txt&#39;
 script:
 &amp;quot;&amp;quot;&amp;quot;
 echo &#39;Hello world!&#39; &amp;gt; hello.txt
 &amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We want to send the text to a file called &amp;lsquo;hello.txt&amp;rsquo;. We update the shell command to redirect the text to that file, and we add an output block to the process to define the file name. Since our output is a file, we specify the output type as a path.&lt;/p&gt;
&lt;p&gt;Run &lt;code&gt;main.nf&lt;/code&gt; in the terminal: the output still goes to the &amp;lsquo;work&amp;rsquo; directory&lt;/p&gt;
&lt;p&gt;This was better, but we still have to dig around for the file, so let&amp;rsquo;s add one more thing to our process.&lt;/p&gt;
&lt;h2 id=&#34;adda-publishdir&#34;&gt;Add a publishDir&lt;/h2&gt;
&lt;p&gt;Now let&amp;rsquo;s try sending our output to a directory called &amp;lsquo;results&amp;rsquo; by adding a publishDir to our process. Specifying the mode &amp;ldquo;copy&amp;rdquo; is safest, but you can do other things like move the file or create links to it.&lt;/p&gt;
&lt;p&gt;Re-run main.nf in the terminal: the file now appears in &amp;lsquo;results&amp;rsquo;, but since we used copy mode it also remains in &amp;lsquo;work&amp;rsquo;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process  hello {
 publishDir &amp;quot;results/&amp;quot; , mode: &amp;quot;copy&amp;quot;

 output:
 path &#39;hello.txt&#39;
 script:
 &amp;quot;&amp;quot;&amp;quot;
 echo &#39;Hello world!&#39; &amp;gt; hello.txt
 &amp;quot;&amp;quot;&amp;quot;
}&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Look at a Trim Rule</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_45/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_45/</guid>
      <description>&lt;h2 id=&#34;lets-look-at-our-snakemake-trim-rule-from-earlier&#34;&gt;Let&amp;rsquo;s look at our snakemake &amp;ldquo;trim&amp;rdquo; rule from earlier:&lt;/h2&gt;
&lt;p&gt;Here we specified our inputs/outputs and our shell command.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;rule trim:
 input:
  &amp;quot;reads/sample1.fastq&amp;quot;

 output:
  &amp;quot;trimmed_reads/sample1-trimmed.fastq&amp;quot;

 shell:
  &amp;quot;cutadapt -a TCCGGGTS -o {output} {input}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;what-to-update-in-nextflow&#34;&gt;What to Update in Nextflow?&lt;/h2&gt;
&lt;p&gt;In looking at our HELLO process, what do we need to add? We already have a publishDir, an output, and script, so let&amp;rsquo;s update those for cutadapt.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process HELLO {
 publishDir &amp;quot;results/&amp;quot; , mode: &amp;quot;copy&amp;quot;
 output:
 path &#39;hello.txt&#39;

 script:
 &amp;quot;&amp;quot;&amp;quot;
 echo &#39;Hello world!&#39; &amp;gt; hello.txt
 &amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;update-for-running-cutadapt&#34;&gt;Update for Running Cutadapt&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process  CUTADAPT {
 publishDir &amp;quot;results/&amp;quot; , mode: &amp;quot;copy&amp;quot;

 output:
 path &#39;trimmed.fastq&#39;

 script:
 &amp;quot;&amp;quot;&amp;quot;
 cutadapt -a AACCGGTT -o trimmed.fastq ~/sample1.fastq
 &amp;quot;&amp;quot;&amp;quot;
}

workflow {

CUTADAPT()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can keep &amp;lsquo;results&amp;rsquo; as our publishDir for this example, but we need to change the output to trimmed.fastq and change the command to run cutadapt with our adapter and our input and output file names. Because Nextflow executes each task in its own work directory, we need to provide the full path to the input. Our workflow then just runs the CUTADAPT process.&lt;/p&gt;
&lt;p&gt;Does this work? Yes, it does. However, note that we are hard-coding everything; this is not flexible and does not allow us to scale.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A More Common Approach</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_48/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_48/</guid>
      <description>&lt;p&gt;A better approach is to pass the file into the process with Channel.fromPath() and use input: path reads. The &amp;ldquo;input:&amp;rdquo; declares an input variable, not a literal source file location. And we use this variable &amp;ldquo;reads&amp;rdquo; our shell command and here $reads means: the local process input variable and use the actual input file that was provided to Nextflow for this task via our workflow. We can also use the reads variable to other things like dynamically name files&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process  CUTADAPT {
 publishDir &amp;quot;results/&amp;quot; , mode: &amp;quot;copy&amp;quot;
 input:
 path reads_var

 output:
 path &#39;trimmed.fastq&#39;

 script:
 &amp;quot;&amp;quot;&amp;quot;
 cutadapt -a AACCGGTT -o trimmed.fastq $reads_var
 &amp;quot;&amp;quot;&amp;quot;
}

workflow {
CUTADAPT(Channel.fromPath(&#39;~/sample1.fastq&#39;, checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;dynamically-scaling-to-many-samples&#34;&gt;Dynamically Scaling to Many Samples&lt;/h2&gt;
&lt;p&gt;Now we can start to use the flexibility Nextflow provides to name our output files dynamically based on the sample name, and we can also start to scale up by using a wildcard to grab all the fastq files in our example &amp;lsquo;reads&amp;rsquo; directory. Nextflow will create a new, separate task for each of our samples.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process CUTADAPT {
 publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;copy&amp;quot;

 input:
 path reads_var

 output:
 path &amp;quot;${reads_var.simpleName}_trimmed.fastq&amp;quot;

 script:
 &amp;quot;&amp;quot;&amp;quot;
 cutadapt -a AACCGGTT -o ${reads_var.simpleName}_trimmed.fastq $reads_var
 &amp;quot;&amp;quot;&amp;quot;
}

workflow {
 CUTADAPT(Channel.fromPath(&#39;reads/*.fastq&#39;, checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;parameter-options-for-input-files&#34;&gt;Parameter Options for Input Files&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Add a parameter for &amp;lsquo;--reads&amp;rsquo; in your &amp;lsquo;nextflow run&amp;rsquo; command&lt;/li&gt;
&lt;li&gt;Add a params.reads at the top of your main.nf file&lt;/li&gt;
&lt;li&gt;Add a params.reads to a nextflow.config file&lt;/li&gt;
&lt;li&gt;Works for one file (&amp;lsquo;reads/sample1.fastq&amp;rsquo;) or many (&amp;lsquo;reads/*.fastq&amp;rsquo;)&lt;/li&gt;
&lt;/ul&gt;
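&lt;p&gt;A minimal sketch of these options (the parameter name and paths are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# on the command line:
nextflow run main.nf --reads &#39;reads/*.fastq&#39;

# or at the top of main.nf, or in nextflow.config:
params.reads = &#39;reads/*.fastq&#39;
&lt;/code&gt;&lt;/pre&gt;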
&lt;h2 id=&#34;less-hard-coding--more-reproducibility&#34;&gt;Less Hard-coding = More Reproducibility&lt;/h2&gt;
&lt;p&gt;If we use one of those parameter methods, instead of our workflow having a hard-coded path for our inputs, we can dynamically provide our input file names and clean things up in our workflow even further.&lt;/p&gt;
&lt;p&gt;From:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;workflow {
CUTADAPT(Channel.fromPath(&#39;~/sample1.fastq&#39;, checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;workflow {
CUTADAPT(Channel.fromPath(params.reads, checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Loading Software</title>
      <link>/notes/bioinfo-reproducibility/bioinfo-reproducibility_52/</link>
      <pubDate>Wed, 25 Mar 2026 19:08:46 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/bioinfo-reproducibility_52/</guid>
      <description>&lt;p&gt;main.nf&lt;/p&gt;
&lt;p&gt;Use a &amp;lsquo;beforeScript&amp;rsquo; in the CUTADAPT process in main.nf&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;beforeScript runs specified shell command(s) before running the script command&lt;/li&gt;
&lt;li&gt;Load the cutadapt module: beforeScript &amp;lsquo;module load cutadapt&amp;rsquo;&lt;/li&gt;
&lt;li&gt;Can also do other things like export variables or create directories&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;beforeScript &#39;&#39;&#39;
 module purge
 module load cutadapt
 mkdir -p results
 export PATH=&amp;quot;$PATH:/opt/tools&amp;quot;
&#39;&#39;&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can definitely load the software in our process, but we just cleaned that thing up, so let&amp;rsquo;s put it somewhere better to keep our main.nf focused on workflow logic. To do this, let&amp;rsquo;s go ahead and start to build a nextflow.config file.&lt;/p&gt;
&lt;h2 id=&#34;loading-software--nextflowconfig&#34;&gt;Loading Software – nextflow.config&lt;/h2&gt;
&lt;p&gt;Again, we use a &amp;lsquo;beforeScript&amp;rsquo; specific to the CUTADAPT process&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process {
 withName: CUTADAPT {
  beforeScript = &#39;&#39;&#39;
  module purge
  module load cutadapt
  &#39;&#39;&#39;
 }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now when we run our CUTADAPT process, these commands will execute before the script command and set up the process environment. With our software dialed in for cutadapt, we need to think about where we are running these processes. By default, Nextflow runs shell commands locally, so if we launch it from the command line we would be running the processes on the login nodes, which is discouraged.&lt;/p&gt;
&lt;h2 id=&#34;adding-slurm-options&#34;&gt;Adding SLURM Options&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process {
 withName: CUTADAPT {
  beforeScript = &#39;&#39;&#39;
  module purge
  module load cutadapt
  &#39;&#39;&#39;

  executor = &#39;slurm&#39;
  queue = &#39;standard&#39;
  cpus = 2
  memory = &#39;16 GB&#39;
  time = &#39;1h&#39;
  clusterOptions = &#39;--account=hpc_build&#39;
 }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We need to let Nextflow know that we want SLURM to execute our processes, and we do this by specifying SLURM as our executor. We can also use this block to specify various other options. Nextflow does not have explicit directives for every possible SLURM option, so we can supplement with any additional options we need via &amp;lsquo;clusterOptions&amp;rsquo;.&lt;/p&gt;
&lt;h2 id=&#34;now-we-have&#34;&gt;Now we have&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Workflow logic in main.nf&lt;/li&gt;
&lt;li&gt;Software and slurm options in nextflow.config&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;extendtocutadapt-bwa_align--freebayes&#34;&gt;Extend to CUTADAPT → BWA_ALIGN → FREEBAYES&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Same rules apply – largely rinse and repeat for additional processes&lt;/li&gt;
&lt;li&gt;Create processes for each step: inputs/outputs, commands, etc.&lt;/li&gt;
&lt;li&gt;Software and slurm options in nextflow.config&lt;/li&gt;
&lt;li&gt;Main difference is our workflow - more processes and channels
&lt;ul&gt;
&lt;li&gt;Send channel into process&lt;/li&gt;
&lt;li&gt;Process produces output&lt;/li&gt;
&lt;li&gt;Output becomes new channel for next process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With Nextflow, channels carry data and processes do work on that data. You link them together by sending a channel into a process, and if that process produces output, its output can become a new channel for the next process.&lt;/p&gt;
&lt;h2 id=&#34;workflow-forcutadapt-bwa_align--freebayes&#34;&gt;Workflow for CUTADAPT → BWA_ALIGN → FREEBAYES&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s how we could link the trim, align and variant calling together. So now we&amp;rsquo;ll put it all together and run the entire workflow from end to end on the system.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;workflow {
 reads_ch = Channel.fromPath(&amp;quot;${params.reads_dir}/*.fastq&amp;quot;, checkIfExists:  true)
 trimmed_ch = CUTADAPT(reads_ch)
 aligned_ch = BWA_ALIGN(trimmed_ch)
 FREEBAYES(aligned_ch)
}
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>/notes/bioinfo-reproducibility/src/out/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/notes/bioinfo-reproducibility/src/out/</guid>
      <description>&lt;h1 id=&#34;reproducibility-in-bioinformatics&#34;&gt;Reproducibility in Bioinformatics&lt;/h1&gt;
&lt;h1 id=&#34;deb-triant--marcus-bobar&#34;&gt;Deb Triant &amp;amp; Marcus Bobar&lt;/h1&gt;
&lt;p&gt;Research Computing, University of Virginia
&lt;a href=&#34;mailto:dtriant@virginia.edu&#34;&gt;dtriant@virginia.edu&lt;/a&gt;,
&lt;a href=&#34;mailto:mb5wt@virginia.edu&#34;&gt;mb5wt@virginia.edu&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_0.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Introductions&lt;/p&gt;
&lt;h1 id=&#34;workshop-outline&#34;&gt;Workshop outline&lt;/h1&gt;
&lt;p&gt;Difficulties in achieving reproducibility&lt;/p&gt;
&lt;p&gt;Potential problems with bioinformatics pipelines&lt;/p&gt;
&lt;p&gt;Some helpful tools&lt;/p&gt;
&lt;p&gt;Snakemake &amp;amp; Nextflow examples&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_1.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;reproducibility-in-science&#34;&gt;Reproducibility in science&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Reproducibility - redo a scientific experiment &amp;amp; generate similar results
&lt;ul&gt;
&lt;li&gt;Same sample, software, data, code - same result?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Replication - different data, same methods - conclusions consistent?&lt;/li&gt;
&lt;li&gt;Reusability - Will someone be able to use your pipeline in the future?
&lt;ul&gt;
&lt;li&gt;Will you be able to use it?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_2.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;reproducibility-problem&#34;&gt;Reproducibility Problem&lt;/h1&gt;
&lt;p&gt;Where did you do the analysis - laptop, server, lab computer, environment&lt;/p&gt;
&lt;p&gt;Are you using the most recent version (scripts, datasets, analyses)?&lt;/p&gt;
&lt;p&gt;We just used the default settings!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_3.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_4.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;studies-in-reproducibility&#34;&gt;Studies in reproducibility&lt;/h1&gt;
&lt;p&gt;Nature (2016) - Found that 70% of researchers have failed to reproduce another researcher’s results &amp;amp; &amp;gt;50% failed to reproduce their own&lt;/p&gt;
&lt;p&gt;PLoS Biology (2024) - Biomedical researchers - 72% reported “reproducibility crisis”&lt;/p&gt;
&lt;p&gt;Genome Biol (2024) - Reproducibility in bioinformatics era&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_5.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_6.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;challenges-of-bioinformatics&#34;&gt;Challenges of Bioinformatics&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;So many tools, often with:
&lt;ul&gt;
&lt;li&gt;Multiple versions &amp;amp; releases&lt;/li&gt;
&lt;li&gt;Complex dependencies &amp;amp; hidden parameters, starting seeds&lt;/li&gt;
&lt;li&gt;Running tools locally vs on HPC&lt;/li&gt;
&lt;li&gt;Formatting conversions between software&lt;/li&gt;
&lt;li&gt;Scalability - how tools handle datasets increasing in size&lt;/li&gt;
&lt;li&gt;Keeping code organized!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_7.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;aspects-of-reproducibility&#34;&gt;Aspects of reproducibility&lt;/h1&gt;
&lt;p&gt;Version control&lt;/p&gt;
&lt;p&gt;Environment management&lt;/p&gt;
&lt;p&gt;Data storage&lt;/p&gt;
&lt;p&gt;Containers&lt;/p&gt;
&lt;p&gt;Tool/software maintenance&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_8.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;saving-document-versions&#34;&gt;Saving document versions&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_9.gif&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_10.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;version-control&#34;&gt;Version Control&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_11.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;GitHub: https://github.com&lt;/p&gt;
&lt;p&gt;Track and manage changes to your code &amp;amp; files&lt;/p&gt;
&lt;p&gt;Store and label changes at every step&lt;/p&gt;
&lt;p&gt;Small or large projects&lt;/p&gt;
&lt;p&gt;Collaborate on projects and minimize conflicting edits&lt;/p&gt;
&lt;p&gt;Works on multiple platforms (MacOS, Windows, Linux)&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_12.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Website for github, cutadapt repository&lt;/p&gt;
&lt;h1 id=&#34;environment-management&#34;&gt;Environment Management&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Conda/Mamba environments
&lt;ul&gt;
&lt;li&gt;Isolated spaces for each project with specific tool versions&lt;/li&gt;
&lt;li&gt;Manage Python versions and dependencies&lt;/li&gt;
&lt;li&gt;Install packages and software directly into environment&lt;/li&gt;
&lt;li&gt;Stable and reproducible place to run code and applications&lt;/li&gt;
&lt;li&gt;Not limited to Python, can run bash, Rscript&lt;/li&gt;
&lt;li&gt;YAML configuration file to create or export and transfer an environment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
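&lt;p&gt;The export-and-transfer step mentioned above can be sketched with two commands (the file name is conventional, not required):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# capture the active environment, with pinned versions, in a YAML file
conda env export &amp;gt; environment.yml
# recreate the same environment on another machine
conda env create -f environment.yml
&lt;/code&gt;&lt;/pre&gt;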
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_13.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;storing-results&#34;&gt;Storing results&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Public repositories for sequence data - required for most journals
&lt;ul&gt;
&lt;li&gt;NCBI: https://www.ncbi.nlm.nih.gov&lt;/li&gt;
&lt;li&gt;Ensembl: https://www.ensembl.org/index.html&lt;/li&gt;
&lt;li&gt;Always document and archive changes, especially if unpublished:
&lt;ul&gt;
&lt;li&gt;genome assembly versions&lt;/li&gt;
&lt;li&gt;sequence data: SNPs, isoforms&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_14.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Websites: NCBI, Ensembl, Santa Cruz&lt;/p&gt;
&lt;h1 id=&#34;containers&#34;&gt;Containers&lt;/h1&gt;
&lt;p&gt;Portable environments that run across different computing environments&lt;/p&gt;
&lt;p&gt;Contain packages, software and dependencies that remain isolated from host infrastructure&lt;/p&gt;
&lt;p&gt;Standalone unit of software and can produce same results on different machine or server&lt;/p&gt;
&lt;p&gt;&lt;span style=&#34;color:#002060&#34;&gt;&lt;strong&gt;Ruoshi Sun - Research Computing Workshop Series&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;1. Using Containers on HPC - Monday, March 30, 2026 - 9:00AM&lt;/p&gt;
&lt;p&gt;2. Building Containers on HPC - Monday April 6, 2026 - 9:00AM&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_15.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;bioinformatic-pipelines&#34;&gt;Bioinformatic Pipelines&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Typical bioinformatics workflows involve many steps:&lt;/li&gt;
&lt;li&gt;FASTQ → QC → Alignment → Sorting → Variant Calling → Annotation
&lt;ul&gt;
&lt;li&gt;FASTQ files need quality check and trimming&lt;/li&gt;
&lt;li&gt;Cutadapt&lt;/li&gt;
&lt;li&gt;BWA&lt;/li&gt;
&lt;li&gt;Samtools&lt;/li&gt;
&lt;li&gt;Freebayes&lt;/li&gt;
&lt;li&gt;VCFtools&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Create pipeline to string software together for “final” output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_16.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;bioinformatic-pipeline-challenges&#34;&gt;Bioinformatic Pipeline challenges&lt;/h1&gt;
&lt;p&gt;Complex dependencies between steps&lt;/p&gt;
&lt;p&gt;Formatting inconsistencies&lt;/p&gt;
&lt;p&gt;Hard to reproduce results - scalability, parameters, version changes&lt;/p&gt;
&lt;p&gt;Difficult to parallelize efficiently&lt;/p&gt;
&lt;p&gt;Manual scripts often fail on HPC&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_17.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;bioinformatic-pipelines-on-hpc&#34;&gt;Bioinformatic Pipelines on HPC&lt;/h1&gt;
&lt;p&gt;Which modules were loaded?&lt;/p&gt;
&lt;p&gt;Where are scripts being run?&lt;/p&gt;
&lt;p&gt;Tracking paths - hard-coded in scripts?&lt;/p&gt;
&lt;p&gt;Out/error files - software vs slurm conflicts&lt;/p&gt;
&lt;p&gt;&lt;span style=&#34;color:#002060&#34;&gt; &lt;strong&gt;Goal:&lt;/strong&gt; &lt;/span&gt;  &lt;span style=&#34;color:#002060&#34;&gt; &lt;/span&gt; Automate and track these workflows&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_18.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_19.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;snakemake&#34;&gt;Snakemake&lt;/h1&gt;
&lt;p&gt;https://snakemake.github.io/&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snakemake&lt;/strong&gt;  is a workflow management system designed for scientific pipelines&lt;/p&gt;
&lt;p&gt;Created by Johannes Köster, first released in 2012&lt;/p&gt;
&lt;p&gt;Based on UNIX make - originally created in 1976 but still in standard use&lt;/p&gt;
&lt;p&gt;Python based - “&lt;em&gt;snake-make&lt;/em&gt;”&lt;/p&gt;
&lt;p&gt;Free and open source, available on Mac, Windows, Unix&lt;/p&gt;
&lt;p&gt;https://snakemake.readthedocs.io/en/stable/&lt;/p&gt;
&lt;p&gt;https://github.com/snakemake&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Make is a command-line interface software tool that performs actions ordered by configured dependencies as defined in a configuration file called a makefile. It is commonly used for build automation to build executable code from source code. &lt;/p&gt;
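&lt;p&gt;As a point of comparison, a make rule also maps prerequisites to a target and rebuilds only what is out of date. An illustrative makefile fragment (file names are hypothetical; the recipe line must begin with a tab):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# aligned.bam depends on reads.fastq; make reruns the recipe only when the input is newer
aligned.bam: reads.fastq
	bwa mem ref.fa reads.fastq &amp;gt; aligned.bam
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Snakemake rules generalize exactly this target/prerequisite idea.&lt;/p&gt;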
&lt;h1 id=&#34;snakemake-format&#34;&gt;Snakemake format&lt;/h1&gt;
&lt;p&gt;Similar to writing shell scripts, but snakefiles contain sets of rules&lt;/p&gt;
&lt;p&gt;Format is based on Python structure&lt;/p&gt;
&lt;p&gt;Snakemake reads from snakefile that defines the rules&lt;/p&gt;
&lt;p&gt;Snakefile rules have a target output&lt;/p&gt;
&lt;p&gt;Snakemake uses pattern matching to follow the inputs, outputs and commands contained in rules to reach final target output&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_20.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;snakemake-core-idea&#34;&gt;Snakemake Core Idea&lt;/h1&gt;
&lt;p&gt;Instead of defining  &lt;em&gt;steps&lt;/em&gt; , you define  &lt;strong&gt;rules that produce files&lt;/strong&gt; .&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;rule align:
    input:
        &amp;quot;reads.fastq&amp;quot;
    output:
        &amp;quot;aligned.bam&amp;quot;
    shell:
        &amp;quot;bwa mem ref.fa {input} &amp;gt; {output}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Snakemake builds a  &lt;strong&gt;directed acyclic graph (DAG)&lt;/strong&gt;  automatically.&lt;/p&gt;
&lt;p&gt;Fastq → Cutadapt → BWA → Sorted BAM → Freebayes → VCF&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_21.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;recommended-pipeline-directory-structure&#34;&gt;Recommended Pipeline Directory Structure&lt;/h1&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;p&gt;separates  &lt;strong&gt;workflow logic from data&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;easier debugging&lt;/p&gt;
&lt;p&gt;easier collaboration&lt;/p&gt;
&lt;p&gt;Common practice:&lt;/p&gt;
&lt;p&gt;config/ → parameters and sample tables&lt;/p&gt;
&lt;p&gt;envs/ → reproducible environments&lt;/p&gt;
&lt;p&gt;rules/ → modular workflow steps&lt;/p&gt;
&lt;p&gt;results/ → generated outputs&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;bioinformatics_pipeline/
├── Snakefile
├── config/
│   └── config.yml
├── envs/
│   └── bwa.yml
├── rules/
│   ├── alignment.smk
│   ├── qc.smk
│   └── variant_calling.smk
├── scripts/
│   └── custom_processing.py
├── data/
│   └── raw/
├── results/
│   ├── bam/
│   ├── qc/
│   └── variants/
└── logs/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A clean directory structure makes pipelines easier to maintain and reproduce.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;.yml file can indicate how to make conda environment and what packages and dependencies you need&lt;/p&gt;
&lt;h1 id=&#34;snakefile-breakdown&#34;&gt;Snakefile breakdown&lt;/h1&gt;
&lt;p&gt;Fastq files that need trimming - input:  sample.fastq&lt;/p&gt;
&lt;p&gt;Cutadapt - output: sample-trimmed.fastq&lt;/p&gt;
&lt;p&gt;BWA - align trimmed fastq to assembly output: sample-aligned.sam&lt;/p&gt;
&lt;p&gt;Samtools sorting, indexing - output: sample-sorted.bam&lt;/p&gt;
&lt;p&gt;Freebayes variant calling - output: sample-variants.vcf&lt;/p&gt;
&lt;h1 id=&#34;example-snakefile&#34;&gt;Example snakefile&lt;/h1&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;rule all:
    input:
        &amp;quot;variants/sample1.vcf&amp;quot;

rule trim:
    input:
        &amp;quot;reads/sample1.fastq&amp;quot;
    output:
        &amp;quot;trimmed_reads/sample1-trimmed.fastq&amp;quot;
    shell:
        &amp;quot;cutadapt -A TCCGGGTS -o {output} {input}&amp;quot;

rule align:
    input:
        &amp;quot;trimmed_reads/sample1-trimmed.fastq&amp;quot;
    output:
        &amp;quot;bam/sample1.bam&amp;quot;
    threads: 1
    shell:
        &amp;quot;bwa mem -t {threads} ref.fa {input} | samtools view -Sb - &amp;gt; {output}&amp;quot;

rule call_variants:
    input:
        &amp;quot;bam/sample1.bam&amp;quot;
    output:
        &amp;quot;variants/sample1.vcf&amp;quot;
    shell:
        &amp;quot;freebayes -f ref.fa {input} &amp;gt; {output}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style=&#34;color:#0070c0&#34;&gt;Snakemake takes the first rule as the target, then constructs a graph of dependencies. {wildcards} serve as placeholders within rules to operate on multiple files via pattern matching.&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Snakemake builds the entire pipeline graph automatically.&lt;/p&gt;
&lt;h1 id=&#34;snakemake-exercises-on-hpc&#34;&gt;Snakemake exercises on HPC&lt;/h1&gt;
&lt;p&gt;Class data:&lt;/p&gt;
&lt;p&gt;/project/hpc_training/reproducibility/snakemake&lt;/p&gt;
&lt;p&gt;$ cp -r /project/hpc_training/reproducibility/snakemake .&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GCF_000005845.2_ASM584v2_genomic.fna - genome assembly&lt;/li&gt;
&lt;li&gt;SRR2584863_1.fastq - fastq sequence file, paired-1&lt;/li&gt;
&lt;li&gt;SRR2584863_2.fastq - fastq sequence file, paired-2&lt;/li&gt;
&lt;li&gt;*.smk - snakemake files&lt;/li&gt;
&lt;li&gt;config_variant.yml - configuration file&lt;/li&gt;
&lt;li&gt;submit_snakemake.sh - sample slurm file&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_22.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;YAML - originally “Yet Another Markup Language,” now “YAML Ain&amp;rsquo;t Markup Language”&lt;/p&gt;
&lt;h1 id=&#34;running-jobs-on-interactive-node&#34;&gt;Running jobs on interactive node&lt;/h1&gt;
&lt;p&gt;Run interactively - good for testing&lt;/p&gt;
&lt;p&gt;$ ijob -c 1 -A hpc_training -p interactive -v -t 2:00:00&lt;/p&gt;
&lt;p&gt;$ cp -r /project/hpc_training/reproducibility/snakemake .&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_23.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Default execution here is local so everything is running in my ijob session on a compute node. If we wanted to have these processes run non-interactively we would want to make sure we are using the executor flag in our snakemake call: &amp;ldquo;&amp;ndash;executor slurm&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Work in scratch&lt;/p&gt;
&lt;h1 id=&#34;modules&#34;&gt;Modules&lt;/h1&gt;
&lt;p&gt;$ module spider &amp;lt;package&amp;gt;&lt;/p&gt;
&lt;p&gt;- specifics and version of package available&lt;/p&gt;
&lt;p&gt;$ module spider snakemake&lt;/p&gt;
&lt;p&gt;$ module load snakemake/9.8.1&lt;/p&gt;
&lt;p&gt;$ module list&lt;/p&gt;
&lt;p&gt;$ snakemake --help&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_24.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;other-modules-needed-for-today&#34;&gt;Other modules needed for today&lt;/h1&gt;
&lt;p&gt;$ module load bwa/0.7.17&lt;/p&gt;
&lt;p&gt;$ module load  cutadapt/4.9&lt;/p&gt;
&lt;p&gt;$ module load snakemake/9.8.1&lt;/p&gt;
&lt;p&gt;$ module load freebayes/1.3.10&lt;/p&gt;
&lt;p&gt;$ module load  samtools/1.21&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_25.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;running-snakemake--genome-alignment&#34;&gt;Running snakemake - genome alignment&lt;/h1&gt;
&lt;p&gt;Snakefile - file.smk, contains rules for snakemake&lt;/p&gt;
&lt;p&gt;$ snakemake -c 1 -s align.smk&lt;/p&gt;
&lt;p&gt;-np (--dry-run plus print) - good to test first without producing output&lt;/p&gt;
&lt;p&gt;-n only show steps, don&amp;rsquo;t run; -p print shell commands&lt;/p&gt;
&lt;p&gt;-c number of cores&lt;/p&gt;
&lt;p&gt;-s needed if using a named snakefile (if just called &amp;ldquo;snakefile&amp;rdquo;,  don&amp;rsquo;t need the -s flag)&lt;/p&gt;
&lt;p&gt;$ snakemake --dag -s align.smk | dot -Tpng &amp;gt; dag_align.png&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_26.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;running-snakemake--variant-detection&#34;&gt;Running snakemake - variant detection&lt;/h1&gt;
&lt;p&gt;Snakefile - file.smk, contains rules for snakemake&lt;/p&gt;
&lt;p&gt;$ snakemake -c 1 -s variant-call.smk&lt;/p&gt;
&lt;p&gt;--dry-run&lt;/p&gt;
&lt;p&gt;-c number of cores&lt;/p&gt;
&lt;p&gt;-s needed if using a named snakefile (if just called &amp;ldquo;snakefile&amp;rdquo;, don&amp;rsquo;t need)&lt;/p&gt;
&lt;p&gt;$ snakemake --dag -s variant-call.smk | dot -Tpng &amp;gt; dag_variant.png&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_27.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;snakemake-examples-on-hpc&#34;&gt;Snakemake Examples on HPC&lt;/h1&gt;
&lt;p&gt;Not recommended to hard-code files within the snakefile&lt;/p&gt;
&lt;p&gt;Can organize sample names, file paths, and software parameters in a YAML configuration file&lt;/p&gt;
&lt;p&gt;YAML - serialization language that transforms data into a format that can be shared between systems&lt;/p&gt;
&lt;p&gt;With snakemake, configuration file is a reference for the workflow&lt;/p&gt;
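&lt;p&gt;As an illustrative sketch (these keys are hypothetical, not the actual contents of config_variant.yml), a config file can hold paths and sample names that rules then read from Snakemake&amp;rsquo;s config dictionary:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# hypothetical config file contents
genome: GCF_000005845.2_ASM584v2_genomic.fna
samples:
  - SRR2584863
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Within a rule, a value is then retrieved with, e.g., config[&amp;quot;genome&amp;quot;].&lt;/p&gt;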
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_28.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;YAML - originally “Yet Another Markup Language,” now “YAML Ain&amp;rsquo;t Markup Language.”
Easy to keep things organized within a single file.
When sharing, it is good to have separate config files rather than one huge file with sections commented out.&lt;/p&gt;
&lt;h1 id=&#34;running-snakemake-with-config-file&#34;&gt;Running snakemake with config file&lt;/h1&gt;
&lt;p&gt;Snakefile - file.smk, contains rules for snakemake&lt;/p&gt;
&lt;p&gt;$ snakemake -c 1 -s variant-yml.smk --configfile config_variant.yml&lt;/p&gt;
&lt;p&gt;--configfile - directing snakemake to a config file&lt;/p&gt;
&lt;p&gt;-c number of cores&lt;/p&gt;
&lt;p&gt;-s needed if using a named snakefile&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_29.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;reproducible-environments&#34;&gt;Reproducible environments&lt;/h1&gt;
&lt;p&gt;Snakemake supports reproducible environments.&lt;/p&gt;
&lt;p&gt;Example with Conda:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;rule fastqc:
    input: &amp;quot;reads.fastq&amp;quot;
    output: &amp;quot;qc.html&amp;quot;
    conda:
        &amp;quot;~/.conda/envs/fastqc_env&amp;quot; # path to conda environment
    shell:
        &amp;quot;fastqc {input}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benefits: Easy dependency management, portable workflows&lt;/p&gt;
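&lt;p&gt;To have Snakemake actually build and activate those environments at run time, a deployment flag is passed on the command line. As a sketch (the flag spelling changed across major versions, so check your installed version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Snakemake 7 and earlier
snakemake -c 1 --use-conda -s variant-call.smk
# Snakemake 8 and later
snakemake -c 1 --software-deployment-method conda -s variant-call.smk
&lt;/code&gt;&lt;/pre&gt;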
&lt;hr&gt;
&lt;p&gt;.yml file can indicate how to make conda environment and what packages and dependencies you need&lt;/p&gt;
&lt;h1 id=&#34;using-environments&#34;&gt;Using Environments&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;├── Snakefile
├── config/
│   └── config.yml
├── envs/
│   └── bwa.yml
├── rules/
│   ├── alignment.smk
│   ├── qc.smk
│   └── variant_calling.smk
├── scripts/
│   └── custom_processing.py
├── data/
│   └── raw/
├── results/
│   ├── bam/
│   ├── qc/
│   └── variants/
└── logs/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Can also create an environment.yml file listing the conda environment and what to install&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# envs/bwa.yml
name: bwa
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa=0.7.17
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_30.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;.yml file can indicate how to make conda environment and what packages and dependencies you need&lt;/p&gt;
&lt;h1 id=&#34;snakemake-with-conda-environment&#34;&gt;Snakemake with conda environment&lt;/h1&gt;
&lt;p&gt;$ module load miniforge&lt;/p&gt;
&lt;p&gt;$ conda create&lt;/p&gt;
&lt;p&gt;$ conda activate&lt;/p&gt;
&lt;p&gt;$ snakemake command&lt;/p&gt;
&lt;p&gt;$ screen/tmux&lt;/p&gt;
&lt;p&gt;- keeps session running when disconnected&lt;/p&gt;
&lt;p&gt;- make sure to connect to same login node,&lt;/p&gt;
&lt;p&gt;- confirm login node with:  hostname&lt;/p&gt;
&lt;p&gt;Can create different conda environment for different rules&lt;/p&gt;
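&lt;p&gt;The screen workflow above can be sketched as a short session (the session name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;hostname                      # note which login node you are on
screen -S smk                 # start a named session
snakemake -c 1 -s variant-call.smk
# detach with Ctrl-a d; later, from the same login node, reattach:
screen -r smk
&lt;/code&gt;&lt;/pre&gt;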
&lt;h1 id=&#34;smakemake-and-containers&#34;&gt;Snakemake and containers&lt;/h1&gt;
&lt;p&gt;Snakemake also supports containers:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;rule align:
    container:
        &amp;quot;docker://biocontainers/bwa&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Advantages:&lt;/p&gt;
&lt;p&gt;identical software environments&lt;/p&gt;
&lt;p&gt;portable across HPC systems&lt;/p&gt;
&lt;p&gt;easier collaboration&lt;/p&gt;
&lt;p&gt;&lt;span style=&#34;color:#002060&#34;&gt;&lt;strong&gt;Ruoshi Sun - Research Computing Workshop Series&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;1. Using Containers on HPC - Monday, March 30, 2026 - 9:00AM&lt;/p&gt;
&lt;p&gt;2. Building Containers on HPC - Monday April 6, 2026 - 9:00AM&lt;/p&gt;
&lt;h1 id=&#34;best-practices-for-hpc&#34;&gt;Best Practices for HPC&lt;/h1&gt;
&lt;p&gt;Recommendations:&lt;/p&gt;
&lt;p&gt;Use threads and resources properly&lt;/p&gt;
&lt;p&gt;Avoid huge single jobs&lt;/p&gt;
&lt;p&gt;Break workflows into modular rules&lt;/p&gt;
&lt;p&gt;Use conda or containers&lt;/p&gt;
&lt;p&gt;Use --dry-run before submitting large workflows&lt;/p&gt;
&lt;p&gt;Store configuration in YAML files&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_31.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;common-hpc-pitfalls-with-workflow-managers&#34;&gt;Common HPC Pitfalls with workflow managers&lt;/h1&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;p&gt;requesting too many cores per rule&lt;/p&gt;
&lt;p&gt;forgetting to specify memory&lt;/p&gt;
&lt;p&gt;submitting thousands of tiny jobs&lt;/p&gt;
&lt;p&gt;running Snakemake or Nextflow themselves on a login node&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_32.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;key-takeaways-with-workflow-managers&#34;&gt;Key Takeaways with workflow managers&lt;/h1&gt;
&lt;p&gt;Snakemake &amp;amp; Nextflow provide:&lt;/p&gt;
&lt;p&gt;reproducible pipelines&lt;/p&gt;
&lt;p&gt;automatic dependency tracking&lt;/p&gt;
&lt;p&gt;scalable HPC execution&lt;/p&gt;
&lt;p&gt;environment management&lt;/p&gt;
&lt;p&gt;workflow portability&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_33.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;nextflow&#34;&gt;Nextflow&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_34.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;what-is-nextflow&#34;&gt;What is Nextflow?&lt;/h1&gt;
&lt;p&gt;Nextflow is a workflow management system that helps automate and organize multi-step computational pipelines.&lt;/p&gt;
&lt;p&gt;At a high level, it connects software steps together, manages how data moves between them, and handles execution across local machines, HPC schedulers like SLURM, or cloud platforms.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_35.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;nextflow-pipelines&#34;&gt;Nextflow Pipelines&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Key concepts:
&lt;ul&gt;
&lt;li&gt;Processes, workflows, and parameters&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In general, we are going to:
&lt;ul&gt;
&lt;li&gt;Create processes to execute desired commands&lt;/li&gt;
&lt;li&gt;Specify parameters to represent workflow settings&lt;/li&gt;
&lt;li&gt;Define a workflow to execute processes in a specific order&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Key files:
&lt;ul&gt;
&lt;li&gt;main.nf and nextflow.config&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_36.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Parameters are user-adjustable values that control how a workflow runs. They can specify input files, output locations, software options, reference files, or general pipeline behavior.&lt;/p&gt;
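&lt;p&gt;This can be sketched in two lines (the parameter name and default pattern are illustrative): a default is declared in the script, and the same name becomes a double-dash option at launch.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;// in main.nf: declare a default value
params.reads = &#39;data/*.fastq&#39;

// at launch, override it from the command line:
// nextflow run main.nf --reads &#39;reads/*.fastq&#39;
&lt;/code&gt;&lt;/pre&gt;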
&lt;h1 id=&#34;toy-example-print-the-text-hello-world&#34;&gt;Toy example: print the text &amp;ldquo;Hello World!&amp;rdquo;&lt;/h1&gt;
&lt;p&gt;First, create a process called HELLO with our shell command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;process HELLO {
    script:
    &amp;quot;&amp;quot;&amp;quot;
    echo &amp;quot;Hello World!&amp;quot;
    &amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we execute this process in our workflow:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;workflow {
    HELLO()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_37.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Let&amp;rsquo;s start with a very simple toy example that echoes the text &amp;ldquo;Hello World!&amp;rdquo; Then we&amp;rsquo;ll build up to our bioinformatics example.&lt;/p&gt;
&lt;h1 id=&#34;create-a-new-file-called-mainnf&#34;&gt;Create a new file called main.nf&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process HELLO {
    script:
    &amp;quot;&amp;quot;&amp;quot;
    echo &amp;quot;Hello World!&amp;quot;
    &amp;quot;&amp;quot;&amp;quot;
}

workflow {
    HELLO()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_38.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;We can create a new file called main.nf with these lines.&lt;/p&gt;
&lt;p&gt;Execute main.nf in the terminal and note where the output goes: the text is written to the .command.out file in the &amp;lsquo;work&amp;rsquo; directory created for that specific process.&lt;/p&gt;
&lt;h1 id=&#34;lets-make-some-changes&#34;&gt;Let&amp;rsquo;s make some changes&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process HELLO {
    output:
    path 'hello.txt'
    script:
    &amp;quot;&amp;quot;&amp;quot;
    echo 'Hello world!' &amp;gt; hello.txt
    &amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_39.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;We want to send the text to a file called &amp;lsquo;hello.txt.&amp;rsquo; We can update our shell command to redirect the text to that file, and add an output block to the process to declare the file name. Since the output is a file, we specify its type as a path.&lt;/p&gt;
&lt;p&gt;Run main.nf in the terminal and note that the file still goes to the &amp;lsquo;work&amp;rsquo; directory.&lt;/p&gt;
&lt;p&gt;This is better, but we still have to dig around for the file, so let&amp;rsquo;s add one more thing to our process.&lt;/p&gt;
&lt;h1 id=&#34;adda-publishdir-for-output-file-destination&#34;&gt;Add a publishDir for output file destination&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process HELLO {
    publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;copy&amp;quot;
    output:
    path 'hello.txt'
    script:
    &amp;quot;&amp;quot;&amp;quot;
    echo 'Hello world!' &amp;gt; hello.txt
    &amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_40.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now let&amp;rsquo;s try sending our output to a directory called &amp;lsquo;results&amp;rsquo;: we can add a publishDir to our process and specify the mode. The mode &amp;ldquo;copy&amp;rdquo; is safest, but you can do other things, like move the file or create links to it.&lt;/p&gt;
&lt;p&gt;Re-run main.nf in the terminal and note that the file now goes to results; since we chose copy, it also still goes to work. We need to be mindful of any extra data we create so we don&amp;rsquo;t end up with unnecessary duplicates of everything.&lt;/p&gt;
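&lt;p&gt;The mode options mentioned above can be sketched as follows (copy, move, and symlink are standard Nextflow publishDir modes; the directory name is from our example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;copy&amp;quot;     // safest: duplicates the file
publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;move&amp;quot;     // saves space, but downstream processes can no longer read the file from work
publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;symlink&amp;quot;  // no duplication, but links break if the work directory is cleaned
&lt;/code&gt;&lt;/pre&gt;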
&lt;h1 id=&#34;lets-look-at-our-snakemake-trim-rule-from-earlier&#34;&gt;Let&amp;rsquo;s look at our snakemake &amp;ldquo;trim&amp;rdquo; rule from earlier&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;rule trim:
    input:
        &amp;quot;reads/sample1.fastq&amp;quot;
    output:
        &amp;quot;trimmed_reads/sample1-trimmed.fastq&amp;quot;
    shell:
        &amp;quot;cutadapt -A TCCGGGTS -o {output} {input}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_41.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Here we specified our inputs/outputs and our shell command.&lt;/p&gt;
&lt;h1 id=&#34;what-do-we-need-to-update-innextflow&#34;&gt;What do we need to update in Nextflow?&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process HELLO {
    publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;copy&amp;quot;
    output:
    path 'hello.txt'
    script:
    &amp;quot;&amp;quot;&amp;quot;
    echo 'Hello world!' &amp;gt; hello.txt
    &amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_42.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So, looking at our HELLO process, what do we need to add? We already have a publishDir, an output, and a script, so let&amp;rsquo;s update those for cutadapt.&lt;/p&gt;
&lt;h1 id=&#34;update-for-running-cutadapt&#34;&gt;Update for running cutadapt&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process CUTADAPT {
    publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;copy&amp;quot;
    output:
    path 'trimmed.fastq'
    script:
    &amp;quot;&amp;quot;&amp;quot;
    cutadapt -a AACCGGTT -o trimmed.fastq ~/sample1.fastq
    &amp;quot;&amp;quot;&amp;quot;
}

workflow {
    CUTADAPT()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_43.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;We can keep &amp;lsquo;results&amp;rsquo; as our publishDir for this example, but we need to change our output to trimmed.fastq, and we change the command to run cutadapt with our adapter and our input and output file names. Because Nextflow executes each task in its own work directory, we need to provide the full path to the input. Our workflow simply runs the CUTADAPT process.&lt;/p&gt;
&lt;p&gt;Does this work? Yes, it does. However, we are hard-coding everything, which is not very flexible and does not allow us to scale.&lt;/p&gt;
&lt;h1 id=&#34;more-common-approach-for-input-files&#34;&gt;More common approach for input files&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process CUTADAPT {
    publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;copy&amp;quot;
    input:
    path reads_var
    output:
    path 'trimmed.fastq'
    script:
    &amp;quot;&amp;quot;&amp;quot;
    cutadapt -a AACCGGTT -o trimmed.fastq $reads_var
    &amp;quot;&amp;quot;&amp;quot;
}

workflow {
    CUTADAPT(Channel.fromPath('~/sample1.fastq', checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_44.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A better approach is to pass the file into the process with Channel.fromPath() and declare &amp;ldquo;input: path reads_var&amp;rdquo;. The &amp;ldquo;input:&amp;rdquo; block declares an input variable, not a literal source file location. We then use this variable in our shell command, where $reads_var refers to the actual input file that was provided to this task via the workflow. We can also use the variable for other things, such as dynamically naming output files.&lt;/p&gt;
&lt;h1 id=&#34;dynamically-scaling-to-many-samples&#34;&gt;Dynamically scaling to many samples&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process CUTADAPT {
    publishDir &amp;quot;results/&amp;quot;, mode: &amp;quot;copy&amp;quot;
    input:
    path reads_var
    output:
    path &amp;quot;${reads_var.simpleName}_trimmed.fastq&amp;quot;
    script:
    &amp;quot;&amp;quot;&amp;quot;
    cutadapt -a AACCGGTT -o ${reads_var.simpleName}_trimmed.fastq $reads_var
    &amp;quot;&amp;quot;&amp;quot;
}

workflow {
    CUTADAPT(Channel.fromPath('*.fastq', checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_45.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now we can start to use the flexibility Nextflow provides to name our output files dynamically based on the sample name, and we can start to scale up by using a wildcard to grab all the fastq files in our example &amp;lsquo;reads&amp;rsquo; directory. Nextflow will create a separate task for each of our samples.&lt;/p&gt;
&lt;h1 id=&#34;parameter-options-for-input-files&#34;&gt;Parameter options for input files&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Add a parameter for &amp;lsquo;--reads&amp;rsquo; in your &amp;lsquo;nextflow run&amp;rsquo; command&lt;/li&gt;
&lt;li&gt;Add a params.reads at the top of your main.nf file&lt;/li&gt;
&lt;li&gt;Add a params.reads to a nextflow.config file&lt;/li&gt;
&lt;li&gt;Works for one file (&amp;lsquo;reads/sample1.fastq&amp;rsquo;) or many (&amp;lsquo;reads/*.fastq&amp;rsquo;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_46.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;As with many things in Nextflow, we have multiple ways to accomplish this. We will talk about nextflow.config shortly.&lt;/p&gt;
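&lt;p&gt;The three parameter options can be sketched as follows (params.reads is the name used in this note; the glob pattern is one of the examples above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Option 1: on the command line
//   nextflow run main.nf --reads 'reads/*.fastq'

// Option 2: at the top of main.nf
params.reads = 'reads/*.fastq'

// Option 3: in nextflow.config
params {
    reads = 'reads/*.fastq'
}
&lt;/code&gt;&lt;/pre&gt;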
&lt;h1 id=&#34;less-hard-coding--more-reproducibility&#34;&gt;Less hard-coding = more reproducibility&lt;/h1&gt;
&lt;p&gt;From:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;workflow {
    CUTADAPT(Channel.fromPath('~/sample1.fastq', checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;workflow {
    CUTADAPT(Channel.fromPath(params.reads, checkIfExists: true))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_47.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;And if we use one of those parameter methods, instead of our workflow having a hard-coded path for our inputs, we can dynamically provide our input file names and clean things up in our workflow even further.&lt;/p&gt;
&lt;h1 id=&#34;loading-software--mainnf&#34;&gt;Loading software – main.nf&lt;/h1&gt;
&lt;p&gt;Use a &amp;lsquo;beforeScript&amp;rsquo; in the CUTADAPT process in main.nf&lt;/p&gt;
&lt;p&gt;beforeScript runs specified shell command(s) before running the script command&lt;/p&gt;
&lt;p&gt;Load the cutadapt module: beforeScript &amp;lsquo;module load cutadapt&amp;rsquo;&lt;/p&gt;
&lt;p&gt;Can also do other things like export variables or create directories&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;beforeScript &amp;quot;&amp;quot;&amp;quot;
    module purge
    module load cutadapt
    mkdir results
    export PATH=&amp;quot;$PATH:/opt/tools&amp;quot;
&amp;quot;&amp;quot;&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_48.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;We can definitely load the software in our process, but we just cleaned that file up, so let&amp;rsquo;s put it somewhere better to keep our main.nf focused on workflow logic. To do this, let&amp;rsquo;s start building a nextflow.config file.&lt;/p&gt;
&lt;h1 id=&#34;loading-software--nextflowconfig&#34;&gt;Loading software – nextflow.config&lt;/h1&gt;
&lt;p&gt;Again, we use a &amp;lsquo;beforeScript&amp;rsquo; specific to the CUTADAPT process&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;process {
    withName: CUTADAPT {
        beforeScript = '''
        module purge
        module load cutadapt
        '''
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_49.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now when we specifically run our CUTADAPT process, these commands will run before our script command and set up our process environment.  Ok, so now we have our software dialed in for cutadapt. But we need to think about where we are running these processes. By default, nextflow is running shell commands locally, so that means if we&amp;rsquo;re just at the command line, we&amp;rsquo;d be running the processes on the login nodes, which is a no-no.&lt;/p&gt;
&lt;h1 id=&#34;adding-slurm-options--nextflowconfig&#34;&gt;Adding SLURM options – nextflow.config&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;process {
    withName: CUTADAPT {
        beforeScript = '''
        module purge
        module load cutadapt
        '''
        executor = 'slurm'
        queue = 'standard'
        cpus = 2
        memory = '16 GB'
        time = '1h'
        clusterOptions = '--account=hpc_build'
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_50.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So, we need to let Nextflow know that we want to use SLURM to execute our processes, and we do this by specifying SLURM as our executor. We can also use this block to specify various other options. Nextflow doesn&amp;rsquo;t have explicit options for every possible SLURM setting, so we can supplement with any additional options we need via &amp;lsquo;clusterOptions.&amp;rsquo; Again, there are multiple ways to configure everything: you can also set global SLURM options, but often different parts of the workflow need different resources. We could also specify these SLURM options in the CUTADAPT process in main.nf, but we&amp;rsquo;re trying to keep that file tidy and focused on the workflow logic.&lt;/p&gt;
&lt;h1 id=&#34;now-we-have&#34;&gt;Now we have:&lt;/h1&gt;
&lt;p&gt;Workflow logic in main.nf&lt;/p&gt;
&lt;p&gt;Software and SLURM options in nextflow.config&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_51.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;As you can imagine, there are also multiple ways to set this up.&lt;/p&gt;
&lt;h1 id=&#34;extendtocutadapt-bwa_align--freebayes&#34;&gt;Extend to CUTADAPT → BWA_ALIGN → FREEBAYES&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Same rules apply – largely rinse and repeat for additional processes&lt;/li&gt;
&lt;li&gt;Create processes for each step: inputs/outputs, commands, etc.&lt;/li&gt;
&lt;li&gt;Software and slurm options in nextflow.config&lt;/li&gt;
&lt;li&gt;Main difference is our workflow - more processes and channels
&lt;ul&gt;
&lt;li&gt;Send channel into process&lt;/li&gt;
&lt;li&gt;Process produces output&lt;/li&gt;
&lt;li&gt;Output becomes new channel for next process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_52.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;With Nextflow, channels carry data and processes do work on that data. You link them together by sending a channel into a process, and if that process produces output, its output can become a new channel for the next process.&lt;/p&gt;
&lt;h1 id=&#34;workflow-forcutadapt-bwa_align--freebayes&#34;&gt;Workflow for CUTADAPT → BWA_ALIGN → FREEBAYES&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;workflow {
    reads_ch = Channel.fromPath(&amp;quot;${params.reads_dir}/*.fastq&amp;quot;, checkIfExists: true)
    trimmed_ch = CUTADAPT(reads_ch)
    aligned_ch = BWA_ALIGN(trimmed_ch)
    FREEBAYES(aligned_ch)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_53.png&#34;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Here&amp;rsquo;s how we can link the trim, align, and variant-calling steps together. Now we can put it all together and run the entire workflow end to end on the system.&lt;/p&gt;
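&lt;p&gt;Launching the full pipeline might then look like this (a sketch; --reads_dir matches the parameter used in the workflow above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;nextflow run main.nf --reads_dir reads
&lt;/code&gt;&lt;/pre&gt;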
&lt;h1 id=&#34;additional-links&#34;&gt;Additional links&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://nf-co.re/rnaseq/3.23.0/&#34;
    target=&#34;_blank&#34;
    rel=&#34;noopener&#34;
&gt;https://nf-co.re/rnaseq/3.23.0/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://training.nextflow.io&#34;

 
    target=&#34;_blank&#34; 
    rel=&#34;noopener&#34;

&gt;https://training.nextflow.io&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;https://github.com/nextflow-io/nextflow&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_54.png&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;workflows-for-computational-data-analysis&#34;&gt;Workflows for computational data analysis&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems&lt;/li&gt;
&lt;li&gt;https://github.com/pditommaso/awesome-pipeline&lt;/li&gt;
&lt;li&gt;Galaxy platform - bioinformatic software, pipeline and workflows:
&lt;ul&gt;
&lt;li&gt;https://usegalaxy.org&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;img/Triant-Bobar_Reproducibility_55.png&#34;&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
