Sequence File Formats

http://xkcd.com/927/

Many different file formats are used to store information about DNA sequences. A format's name usually doubles as its file suffix.

Format Name   Suffix              Info
FASTA         .fasta, .fna, .fa   A standard file format
FASTQ         .fastq              Like FASTA, but also stores quality scores
SAM/BAM       .sam/.bam           Stands for Sequence Alignment Map, developed for NGS data and stores alignment information
VCF           .vcf                Stands for Variant Call Format
GFF3          .gff3               Stands for Generic Feature Format ver. 3
GTF           .gtf                Stands for Gene Transfer Format, similar to GFF3 but contains additional gene annotation info

FASTA

Here is an example of what a FASTA file looks like when an assembly is first opened.
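As a rough illustration of the layout, a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below reads such records into a dictionary. The function name `parse_fasta` and the contig names are hypothetical, chosen only for this example, and the parser assumes well-formed input.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (the text after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence may wrap over several lines; join them.
            records[header] += line
    return records


# Made-up records for illustration only.
example = """>contig_1 hypothetical assembly contig
ACGTACGTAC
GGTTAACC
>contig_2
TTGACA
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

For real data, an established library such as Biopython is usually a safer choice than a hand-written parser.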

FASTQ

Most modern sequencing platforms perform base calling and then return that data in FASTQ format, with quality scores included. (These quality scores are encoded using ASCII characters.) The FASTQ data can then be used for quality control and trimming.
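To make the ASCII encoding concrete, here is a minimal sketch that converts a quality string into Phred scores and error probabilities. It assumes the Phred+33 offset, which is an assumption about the data (some older files use an offset of 64). The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode each ASCII character into a Phred score (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Made-up quality string: 'I' is ASCII 73, '5' is 53, '!' is 33.
scores = phred_scores("II5!")
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

A Phred score of 20 corresponds to a 1 in 100 chance of an incorrect base call, which is why quality trimming often uses a threshold in that range.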

The basic structure of a FASTQ file
An example of a FASTQ file with multiple sequencing reads
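In text form, each read in a FASTQ file typically occupies four lines: a header starting with "@", the sequence, a separator line starting with "+", and a quality string of the same length as the sequence. The sketch below reads records under that strict four-line assumption. The names `FastqRecord` and `parse_fastq` and the reads themselves are hypothetical.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per read."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected four lines per record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# Made-up reads for illustration only.
example = """@read_1
ACGT
+
IIII
@read_2
GGCA
+
II5!
"""

reads = list(parse_fastq(example.splitlines()))
for read in reads:
    print(read.name, read.sequence, read.quality)
```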

SAM/BAM

Both SAM (Sequence Alignment Map) and BAM (Binary Alignment Map) are file formats for sequence alignments. These alignment files provide context for the raw data by recording where each read maps on a reference. One alignment is recorded per line, and each line has eleven mandatory tab-delimited columns, which may be followed by optional fields. SAM is a plain-text format (human readable), while BAM is its compressed binary equivalent. SAM/BAM files are often used with SAMtools (a suite of utilities for SAM/BAM files) and Picard (a collection of tools for sequencing data).
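As a sketch of the column layout, the code below splits one SAM alignment line into its eleven mandatory fields, using the field names from the SAM specification. The function name `parse_sam_line` and the alignment line are hypothetical. Header lines (which begin with "@") are not handled here and would need to be skipped first.

```python
from typing import Dict

# The eleven mandatory SAM fields, in column order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into named fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least eleven columns")
    record: Dict[str, object] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything past column eleven is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record


# Made-up alignment for illustration only.
line = "read_1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
aln = parse_sam_line(line)
print(aln["QNAME"], aln["RNAME"], aln["POS"], aln["CIGAR"])
```

BAM files are binary, so they cannot be read this way. A tool such as SAMtools is typically used to convert or inspect them.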

GTF

GTF (Gene Transfer Format) is a common format for describing gene annotations on a genome.

An example of a GTF file
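In text form, each GTF line has nine tab-delimited columns, the last of which holds key/value attributes such as `gene_id` and `transcript_id`. The sketch below parses one such line. The function name `parse_gtf_line` and the feature shown are hypothetical, and the attribute parsing is a simplification that assumes no semicolons appear inside quoted values.

```python
from typing import Dict

GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]


def parse_gtf_line(line: str) -> Dict[str, object]:
    """Parse one GTF feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(GTF_COLUMNS):
        raise ValueError("a GTF line needs exactly nine columns")
    record: Dict[str, object] = dict(zip(GTF_COLUMNS[:8], cols[:8]))
    record["start"] = int(cols[3])
    record["end"] = int(cols[4])
    attributes: Dict[str, str] = {}
    # Attributes look like: key "value"; key "value";
    for item in cols[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# Made-up feature for illustration only.
line = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "gene_1"; transcript_id "tx_1";'
feat = parse_gtf_line(line)
print(feat["feature"], feat["start"], feat["end"], feat["attributes"])
```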

© 2025 The Rector and Visitors of the University of Virginia