BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool used to assess the completeness of genome assemblies and annotations. It does this by looking for single-copy orthologs in the DNA sequence.

Key properties of BUSCO’s benchmark orthologs

Running Busco Interactively

Running BUSCO interactively is good for testing.

To start an interactive session, use the ijob command:

ijob -c 1 -A hpc_training -p standard -v

To prepare to copy the test files, first move into your desired directory with the cd command:

cd /yourdirectory

Use the pwd command to check that you are in your desired directory:

pwd

Then copy the test files using the cp command (the dot at the end of the command means that the destination of the copied files will be the directory you are currently in):

cp /project/rivanna-training/genomics-hpc .

To check that all the files transferred correctly:

ls -lh

Now we can begin to load BUSCO. To check what versions of BUSCO are available on our system, run:

module spider busco

For this example, we will load version 5.8.2:

module load busco/5.8.2

To verify that the module was loaded, run:

module list

To actually run BUSCO on our Cyglarus-subset.fasta file, enter the command:

busco -i Cyglarus-subset.fasta -o Lep-Blue_out -m genome -l lepidoptera_odb10

The -l option indicates the lineage dataset used. This dataset will be downloaded automatically the first time you use it. The -o option will make a new output directory, or it will throw an error if the directory name already exists.

Running Busco with a Slurm Script

Slurm is a resource manager that can be used to run your code for you. Below is a Slurm script called busco_slurm_submit.sh.

#!/bin/bash
#SBATCH -A hpc_training                    # account name (--account)
#SBATCH -p standard                        # partition/queue (--partition)
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # 1 task – how many copies of code to run
#SBATCH --cpus-per-task=4                  # total cores per task – for multithreaded code
##SBATCH --mem=3200                        # total memory (Mb)
#SBATCH -t 01:00:00                        # time limit: 1-hour
#SBATCH -J busco-test                      # job name
#SBATCH -o busco-test-%A.out               # output file
#SBATCH -e busco-test-%A.err               # error file
#SBATCH --mail-user=yourid@virginia.edu    # where to send email alerts
#SBATCH --mail-type=ALL                    # receive email when starts/stops/fails

module purge   # good practice to purge all modules
module load busco/5.8.2

cd /project/rivanna-training/genomics-hpc/busco

# make new output directory or rename each time or throws an error if already exists!

busco -i Cyglarus-subset.fasta -o Lep-Blue_out -m genome -l lepidoptera_odb10

To submit the Slurm script and start a Slurm job, enter:

sbatch busco_slurm_submit.sh

When you run a Slurm script with the #SBATCH --mail-user=yourid@virginia.edu line and #SBATCH --mail-type=ALL line, Slurm will send you email notifications when your job starts, stops, or fails.

To immediately monitor the status of your Slurm job, enter:

squeue -u user_id

PD means pending, R means running, and CG means exiting.

Overall, the Slurm Script method should result in the same output files as when run interactively.

N50 Size

BUSCO is used to measure the completeness of a genome assembly. To measure the contiguity of a genome assembly (how continuous or unbroken the sequences are), you can use the N50 metric.

To calculate the N50 value for a genome, we place our contigs from largest to smallest on the genome. We then start with the longest contig and add the lengths of the next configs until the cumulative length reaches at least 50% of the total genome size. The length of the shortest contig added is the N50 value.

The N50 value is then compared with N50 values from genomes of similar size. A greater N50 is usually a sign of assembly improvement. Keep in mind that genome composition can bias comparisons. Another metric used alongside N50 is L50, which is the number of contigs that make up the N50. For example, a highly-contiguous assembly will have a high N50 value and a low L50 value.

Resources

Bioinformatics. 2015;31(19):3210-3212. doi:10.1093/bioinformatics/btv351
https://busco.ezlab.org

Last updated on Jul 10, 2025