Sequence File Formats
Introduction to Sequence File Formats
As soon as biological data could be stored digitally, a multitude of file formats arose. The very first files contained raw DNA sequence reads in a regular .txt file, but as the range of information broadened, so did the types of files.
Several different file formats arose, each with their own purpose.
- Compatibility with specific software (for visualizations, diagrams, mappings).
- Simple text for easy data processing, parsing, and human readability (comma- or tab-delimited files, e.g. tsv, csv).
- Improve efficiency for computers. Usually these are in a non-human readable binary format. You'll see some binary files have a corresponding "index" file which is useful for searching.
Purpose of this lesson
In this series, we'll go over the most common sequence file formats you'll come across in bioinformatics. The purpose of this lesson isn't to learn the intricate details of each format, but simply to become familiar with the different file types used to store biological data. This will help calm that overwhelming feeling you get when sifting through bioinformatics forums, books and software tools.
Typically, to truly understand a file format (especially raw output files), you'll need to know the technology used to generate that format. In these cases, we'll link you to the corresponding tutorial.
We'll mainly go over DNA sequence file types, and save database file formats such as EMBL or SWISS-PROT for another lesson.
What is a file format?
A file format is a way for computers (and humans) to standardize how data is organized. For example, this page was written as an .html file. HTML files contain special tags that tell the browser what each block of text is, and how to display it on the page.
Additionally, computers are able to check a file's format and immediately determine whether the file should be opened in a text editor (for editing), a modern browser (for viewing) or some other software.
File types can also indicate which algorithm to use to view (or open) that file. For example, .gif, .jpg and .png all display images, but the level of compression, size and resolution differ.
Plain text files
Early on, scientists held sequence information in plain text (.txt) with descriptive file names. This became limiting once researchers needed to include annotations and additional information about the sequences.
More common (yet still primitive) file types include csv and tsv. The former stands for comma-separated values, meaning that there is a comma between each value. The simplicity of this format allows researchers to easily exchange data among computers, a property known as portability.
    id,fname,lname,occupation
    314,Peter,Ignasius,Bioinformaticist
    232,Sarah,Carlito,Mathematician
    412,Enrique,Menezes,Microbiologist
A tsv (tab-separated values) file is similar, but data is separated by tabs instead of commas. Many of the biological filetypes covered in this lesson have tab-separated values.
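As a minimal sketch of how such files are read in practice, Python's standard csv module can parse the sample table above. The in-memory string stands in for a real file, and the helper name read_delimited is only illustrative.

```python
import csv
import io

# The sample table from above, held in memory instead of a file on disk.
CSV_TEXT = """id,fname,lname,occupation
314,Peter,Ignasius,Bioinformaticist
232,Sarah,Carlito,Mathematician
412,Enrique,Menezes,Microbiologist
"""


def read_delimited(text, delimiter=","):
    """Return a list of dicts, one per data row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = read_delimited(CSV_TEXT)
for row in rows:
    print(row["id"], row["fname"], row["occupation"])

# The same function reads tab-separated values when given a tab delimiter.
tsv_rows = read_delimited(CSV_TEXT.replace(",", "\t"), delimiter="\t")
```

Only the delimiter changes between the two formats, which is why many tools treat csv and tsv as one family.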
What is a newline (EOL) character?
The newline (aka end of line or EOL) is a special character or sequence of special characters that signify the end of a line in text. On a normal text editor such as Notepad++, these characters are hidden.
What's tricky about the EOL character is that it differs by platform: UNIX-like systems use a single line feed (LF, \n), while MS Windows uses a carriage return followed by a line feed (CRLF, \r\n). On the command line, you can convert files between the two conventions with the dos2unix and unix2dos commands.
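To make the difference concrete, here is a small sketch that detects and converts line endings on raw bytes, assuming the usual conventions of LF for UNIX and CR+LF for Windows. The function names are illustrative and only mimic what dos2unix and unix2dos do; they are not those tools.

```python
def detect_newline(data: bytes) -> str:
    """Classify the line endings found in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # line feeds not preceded by a carriage return
    if crlf and lf:
        return "mixed"
    if crlf:
        return "windows"
    if lf:
        return "unix"
    return "none"


def to_unix(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to UNIX (LF)."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data: bytes) -> bytes:
    """Convert to Windows (CRLF) line endings."""
    # Normalize first so existing CRLF pairs are not doubled.
    return to_unix(data).replace(b"\n", b"\r\n")


print(detect_newline(b"ACGT\r\nTTGA\r\n"))
print(to_unix(b"ACGT\r\nTTGA\r\n"))
```

Working on bytes rather than text avoids Python's own newline translation when files are opened in text mode.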
Another common text-based format that is becoming more and more popular is the markdown format. These files are indicated with an extension of .md.
The markdown format is a markup language, just like HTML. All it does is mark up the text within the document to indicate which lines are headers, paragraphs, and so on. For example,
    # This is a first-level header
    ## This is a second-level header
    ### This is a third-level header

    > This is a blockquote

        Four spaces / 1 tab = line for code

    *italics*, **bold**, `inline code`, [Google](http://google.com)

    Unordered List:
    - Illumina
    - PacBio
    - IonTorrent

    Ordered List:
    1. FASTA
    2. FASTQ
    3. SAM/BAM/CRAM
There are command-line tools that help convert from markdown to html, such as pandoc:
$ pandoc --from markdown --to html README.md > README.html
You'll often see these files as README.md when you download a source file from GitHub or another repository. The markdown format allows the page to load with proper formatting.
Great! Now that we've covered basic text-based file formats, let's move into more bioinformatics-specific file types.
FASTA (pronounced "fast-A") format is a simple type of format that bioinformaticians use to represent either nucleotide or protein sequences. It is written in text format, allowing for processing tools to easily parse the data. The general file extension is .fas.
The FASTA file format originated from a DNA and protein sequence alignment software package called FASTP, created in the mid-1980s. The format allows you to precede each sequence with a comment.
Each record has two parts: 1) the identifier line (comments, annotations) and 2) the sequence itself, which may be wrapped across several lines.
Sample FASTA sequence
Before we dig into a FASTA sequence, let's see what one looks like. Here is an example of a standard FASTA format. Pretty simple, right?
    >gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
    MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
    MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
    VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
    NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP
The top line holds information pertaining to the sequence below. It is preceded by a ">". Without this informative first line, we just have a raw format.
When the FASTA sequence comes from a biological database, the identifier marks which database. Here is a list of major database sequence identifiers:
The * is gb, embl, or dbj depending on the database.
- NCBI refseq
Protein Research Foundation
Non-coding RNA regions for a genome.
- Protein Data Bank
The line immediately following the identifier is the raw sequence. For both DNA and proteins, standard nucleic acid and amino acid IUB/IUPAC codes are used.
Additionally, there are a few more notes to consider:
- Lower-case letters are mapped to upper-case.
- Hyphens represent a gap character.
- In amino acid sequences, U and * are acceptable.
- It is recommended that each line be shorter than 80 characters.
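The rules above translate into a short parser. This is a minimal sketch, not a replacement for an established library: it assumes well-formed input, joins wrapped sequence lines, and maps lower-case letters to upper-case as described. The function name parse_fasta and the two example records are made up for illustration.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; lower-case maps to upper-case.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, used only to exercise the parser.
example = """>seq1 hypothetical example record
acgtac
GTAC-N
>seq2
MVLIRV
""".splitlines()

records = list(parse_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

Because the generator starts a new record at every ">" line, the same function reads a file holding one sequence or many.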
IUB/IUPAC DNA nucleic acid code
Here is a list of the standard IUB/IUPAC nucleic acid codes.
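| Code | Meaning |
|---|---|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| U | Uracil |
| R | A or G (puRine) |
| Y | C, T or U (pYrimidine) |
| K | G, T or U (bases with a Ketone group) |
| M | A or C (bases with an aMino group) |
| S | C or G (Strong interaction) |
| W | A, T or U (Weak interaction) |
| B | not A (B comes after A) |
| D | not C (D comes after C) |
| H | not G (H comes after G) |
| V | neither T nor U (V comes after U) |
| N | A C G T U (Nucleic acid) |
| - | Gap of unknown length |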
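One way to see what the ambiguity codes mean is to expand them into the concrete bases they stand for. The sketch below follows the standard IUPAC definitions and, as a simplifying assumption, treats U as T and does not handle the gap character. The names IUPAC_DNA, matches, and expand are illustrative.

```python
from itertools import product

# Each IUPAC code mapped to the DNA bases it can represent (U treated as T).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(code, base):
    """Return True if the concrete base is allowed by the IUPAC code."""
    return IUPAC_DNA[base.upper()] in IUPAC_DNA[code.upper()]


def expand(sequence):
    """List every unambiguous sequence an IUPAC-coded sequence can stand for."""
    options = [IUPAC_DNA[code] for code in sequence.upper()]
    return ["".join(combo) for combo in product(*options)]


print(expand("ARY"))
print(matches("K", "G"), matches("K", "C"))
```

The number of expansions grows multiplicatively with each ambiguous position, so expand is only practical for short motifs.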
IUB/IUPAC amino acid residue code
Here's a list of the 24 amino acids and 3 special codons.
| Code | Meaning |
|---|---|
| B | Aspartic Acid (D) or Asparagine (N) |
| J | Leucine (L) or Isoleucine (I) |
| Z | Glutamic acid (E) or Glutamine (Q) |
| - | Gap of unknown length |
Specific file extensions
The generic form of a FASTA file has the .fas extension. For more specific types, we can use the following:
- .fna: FASTA nucleic acid
Specifies nucleic acids.
- .ffn: FASTA nucleotide coding regions
Contains coding regions for a genome.
- .faa: FASTA amino acid
Contains amino acids.
- .frn: FASTA non-coding RNA
Non-coding RNA regions for a genome.
If we just append multiple sequences in FASTA format, we get multi-FASTA format. This is a single file with several sequences, and is often used for multi-alignment programs like ClustalW or multialign.
To get a FASTA-formatted sequence from the NCBI GenBank database, simply click the display option near the top of the record and click FASTA.
Converting FASTA sequences
Keep in mind that there are programs out there like READSEQ that allow you to convert formats to and from FASTA.
The FASTA format is extremely simple, with just two parts per sequence: the first is the description, the other the raw sequence. The simplicity is nice when running a quick pairwise alignment, but limiting when we need more information per sequence.
With next-generation sequencing instruments pumping out millions of reads per run, scientists needed a way to check the quality of each base call. To document both the sequence and the probability of each base call being correct, scientists came up with the FASTQ format. The "Q" comes from quality, as in the quality of the read.
The file extension for FASTQ is .fq or .fastq.
The FASTQ format was developed by the Wellcome Trust Sanger Institute, and became the de facto standard for high-throughput sequencing instrument outputs.
In addition to storing biological sequence information, it also adds a line for the quality scores. Each score is encoded with a single ASCII character.
Let's take a look at an example FASTQ format, then look at each line.
    @SEQ_ID
    TTCAACTCGTTAGTAAATATCAAACGATCAGTACCATTTTGGGGTTCAAAGTGACAGTTT
    +
    !'>>>>CCC'*((((***(***-+*'')+))%%%++))**55CCF>>%%%%).1CCCC65
1) Sequence identifier and description
The first line begins with an '@' character and contains the sequence identifier with an optional description. This is just like FASTA's first line.
Illumina sequence identifiers
Here is an example sequence identifier from Illumina
- Unique instrument name
- Flowcell lane
- Tile number within the flow cell lane
- x-coordinate of the cluster within the tile.
- y-coordinate of cluster within the tile.
- Index number for multiplexed sample
- Member of a pair
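The fields listed above can be pulled apart with a regular expression. This sketch assumes the colon-delimited layout written by older Illumina pipelines, instrument:lane:tile:x:y#index/pair; newer Illumina software uses a different header layout that this pattern will not match. The identifier in the example is made up for illustration.

```python
import re

# Assumed layout: @instrument:lane:tile:x:y#index/pair
ILLUMINA_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(line):
    """Split an older-style Illumina identifier line into named fields."""
    match = ILLUMINA_ID.match(line.strip())
    if match is None:
        raise ValueError("not an older-style Illumina identifier: %r" % line)
    fields = match.groupdict()
    # Numeric fields become integers; instrument and index stay as text.
    for key in ("lane", "tile", "x", "y", "pair"):
        fields[key] = int(fields[key])
    return fields


# Hypothetical identifier, not taken from a real run.
fields = parse_illumina_id("@EXAMPLE-INSTR01:6:73:941:1973#0/1")
print(fields)
```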
2) Raw sequence letters
The second line contains raw sequence reads, also similar to FASTA files.
3) Line 3: +
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
4) Quality scores
The fourth line encodes the quality score for each base call. This line must have the same length as the sequence in line 2.
Scores range from ! (the lowest quality) to ~ (the highest). These characters correspond to the ASCII table values 33-126.
Subtracting 33 shifts the values down to the range 0 to 93, but we rarely have a Phred score of over 60.
To map the quality score to the probability that a base call is wrong, we use a bit of math:

$$Q = -10 \log_{10} p$$

Here p is the probability that the corresponding base call is incorrect, and Q is the Phred quality score, which can range from 0 to 93.
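The decoding can be sketched in a few lines of Python. The sketch assumes the offset of 33 described above and the standard Phred relation p = 10^(-Q/10), and it assumes every record occupies exactly four lines. The function names are illustrative; the record is the example shown earlier.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) from 4-line FASTQ records."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Pulling from the same iterator four times groups lines into records;
    # an incomplete trailing record is silently dropped.
    for header, sequence, plus, quality in zip(stripped, stripped, stripped, stripped):
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


FASTQ_LINES = [
    "@SEQ_ID",
    "TTCAACTCGTTAGTAAATATCAAACGATCAGTACCATTTTGGGGTTCAAAGTGACAGTTT",
    "+",
    "!'>>>>CCC'*((((***(***-+*'')+))%%%++))**55CCF>>%%%%).1CCCC65",
]

identifier, sequence, quality = next(read_fastq(FASTQ_LINES))
scores = phred_scores(quality)
print(identifier, scores[:5], error_probability(scores[-1]))
```

A score of 20 therefore means a 1 in 100 chance of a wrong call, and a score of 30 means 1 in 1000.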
For a more complete guide on FASTQ, visit the FASTQ format Wikipedia page.
SAM, BAM and CRAM
Before we talk about SAM, BAM and CRAM, we must discuss the software, SAMtools, from which these formats originate.
What is SAMtools?
SAMtools is a suite of utilities that allow for efficient post-processing of short DNA sequence read alignments. It includes several command-line programs, such as index, for processing next-generation sequence data.
The SAM, BAM and CRAM file formats come from the use of SAMtools.
What is the SAM format?
The name SAM comes from Sequence Alignment/Map. In addition to regular sequence reads, SAM includes alignment data that link short reads to a reference sequence. This makes SAM files the format of choice when visualizing short read sequences in genome browsers such as IGV (Integrative Genomics Viewer).
What are BAM and CRAM?
The SAM format is simple to parse, generate and check for errors. However, its large file size (~10 GB on average) gets in the way of efficiency. Thus, researchers found a way to compress it into a binary format without losing the ability to manipulate it. BAM is an indexable binary representation of nucleotide sequence alignments, allowing for intensive data processing in production pipelines.
CRAM is a restructured, column-oriented version of the binary format.
For more reading on SAM and BAM, head over to the Center for Statistical Genetics.
BED is a tab-delimited file format that allows users to define how the data lines of an annotation track are displayed.
If you're unfamiliar with annotation tracks, they're simply the lines that are displayed on a genome browser.
BED files can have up to 12 columns, but only three are required for the UCSC browser, Galaxy browser and bedtools. The number of columns must be consistent throughout each row of the file.
Let's look at all 12 BED fields, as explained by the UCSC Genome Browser Information section.
3 Required BED fields
The following 3 fields are required for all BED files.
- chrom: Name of the chromosome (chr5, chrX, chr2_random) or scaffold (scaffold10671).
- chromStart: Starting position of the feature on the chromosome or scaffold.
The first base is numbered 0.
- chromEnd: Ending position of the feature.
The base at this position is not included in the display of the feature. For example, the first 20 bases would have a chromStart value of 0 and a chromEnd value of 20.
9 Optional BED fields
These 9 BED fields are optional.
- name: Name of the BED line.
- score: Score between 0 and 1000. If useScore is set to 1, the score will determine the level of gray that is displayed. A higher number equates to a darker shade.
- strand: Which strand, either '+' or '-'.
- thickStart: Starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
- thickEnd: Ending position at which the feature is drawn thickly.
- itemRgb: Determines the color of the data contained in the BED line, e.g. 255,0,0 for red.
Use the Color Picker to translate a color.
- blockCount: Number of blocks (exons) in the BED line.
- blockSizes: A comma-separated list of block sizes.
Size of list should correspond to blockCount.
- blockStarts: A comma-separated list of block starts.
Should be calculated relative to chromStart.
Size of list should correspond to blockCount.
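To tie the twelve fields together, here is a minimal sketch of a BED line parser. The field names follow UCSC's description of the format; the function names and the example line are made up, and the sketch assumes the score column holds an integer as the format describes.

```python
BED_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line):
    """Parse one tab-delimited BED line with 3 to 12 columns into a dict."""
    values = line.rstrip("\r\n").split("\t")
    if not 3 <= len(values) <= 12:
        raise ValueError("expected 3 to 12 columns, got %d" % len(values))
    record = dict(zip(BED_FIELDS, values))  # only the columns present
    for key in ("chromStart", "chromEnd", "score", "thickStart", "thickEnd", "blockCount"):
        if key in record:
            record[key] = int(record[key])
    for key in ("blockSizes", "blockStarts"):
        if key in record:
            # Lists may carry a trailing comma, so empty items are skipped.
            record[key] = [int(v) for v in record[key].split(",") if v]
    return record


def block_intervals(record):
    """Absolute (start, end) of each block; blockStarts are relative to chromStart."""
    origin = record["chromStart"]
    return [(origin + start, origin + start + size)
            for start, size in zip(record["blockStarts"], record["blockSizes"])]


# Hypothetical BED12 line with two blocks.
line = "chr5\t1000\t5000\tgeneA\t960\t+\t1200\t4800\t255,0,0\t2\t500,1000\t0,3000"
bed = parse_bed_line(line)
print(bed["chromEnd"] - bed["chromStart"], block_intervals(bed))
```

Because chromStart is 0-based and chromEnd is excluded, the feature length is simply chromEnd minus chromStart.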
Wig and BigWig
The Wiggle format (.wig) is an efficient way to store dense, continuous blocks of data. It is primarily used to store values such as GC percentage, probability scores and transcriptome data. Instead of specifying a value for each nucleotide position, wig allows you to bind values to entire regions that follow a certain pattern.
Like SAM and BAM, wig has an indexed binary equivalent called bigWig. This allows for efficient data handling, as only parts of the file are extracted and processed when viewing particular regions in a genome browser. To convert a wig file, use the wigToBigWig program.
The .wig filetype contains one or more blocks. On the top of each block is the track declaration line, which defines the data elements with a number of options.
Track definition line
There are several options we can place on the first line which characterizes that particular block of information. Each variable should be formatted as a key=value pair.
- name: Name of block.
- description: Describes the region in detail.
- priority: Integer describing the order to display tracks.
- color: Color per track in RGB or hexadecimal.
- graphType: Bar or point graph.
The two main formatting options per block are variableStep and fixedStep.
The variableStep option is the more common option. It includes the chromosome position in one column, and data values in another.
    variableStep chrom=chr4
    400001 13
    400002 13
    400003 13
    400004 13
    400005 13
The declaration line gives the chromosome and, optionally, a parameter known as span, which tells us the number of bases each value should cover.
The use of the "span" parameter can help us save space. The following is equivalent to the data block above, but takes up far less space.
    variableStep chrom=chr4 span=5
    400001 13
In case you have data blocks with regular intervals between each position, you can use the fixedStep option. This allows you to place the positions on the track definition line, along with the interval length. Thus, only one column is necessary for the data parameters.
    fixedStep chrom=chr4 start=400001 step=100
    13
    14
    15
The above block would feature chromosome 4, position 400001 as having a value of 13, position 400101 having the value 14, and position 400201 having value 15.
You may also specify a span, indicating the number of bases each value covers.
    fixedStep chrom=chr4 start=400001 step=100 span=5
    13
    14
    15
This is similar, but each value covers five nucleotides instead of just one. Thus we have 13 for 400001-400005, 14 for 400101-400105, and 15 for 400201-400205.
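The following sketch expands a wig block into one value per base, which makes the effect of step and span easy to check. It assumes well-formed declaration lines and skips track lines; the function name expand_wig is illustrative, and the input blocks are the examples from this section.

```python
def expand_wig(lines):
    """Yield (chrom, position, value) for every base covered by wig data lines."""
    mode = chrom = None
    span = step = position = None
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == "track" or tokens[0].startswith("#"):
            continue  # skip blank lines, track lines and comments
        if tokens[0] in ("variableStep", "fixedStep"):
            mode = tokens[0]
            options = dict(token.split("=", 1) for token in tokens[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))  # span defaults to one base
            if mode == "fixedStep":
                position = int(options["start"])
                step = int(options["step"])
            continue
        if mode == "variableStep":
            start, value = int(tokens[0]), float(tokens[1])
        elif mode == "fixedStep":
            start, value = position, float(tokens[0])
            position += step  # the next value begins one step later
        else:
            raise ValueError("data line found before a declaration line")
        for base in range(start, start + span):
            yield chrom, base, value


fixed = list(expand_wig([
    "fixedStep chrom=chr4 start=400001 step=100 span=5",
    "13", "14", "15",
]))
print(fixed[0], fixed[5], fixed[-1])
```

Running it on the span=5 variableStep block and on the five-line block without span yields the same per-base values, which is the space saving described above.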
References
Ensembl WIG File Format - Definition and support options
GFF and GTF formats
GFF, or the General Feature Format, is used to describe genes and other features of DNA, RNA and protein sequences. It comes with the .gff extension.
What exactly is GFF?
GFF is an extension of a basic file with the name, start and end parameters (NSE). For example, an NSE (Chromosome2,2000,4000) specifies two kilobases found on chromosome 2. GFF allows the annotation of these segments.
GFF allows for users to perform common operations such as intersection, exclusion, union, filtration, sorting, transformation and dereferencing.
What types of software use GFF?
There are several versions of GFF. The ones used today are GFF2, GTF and GFF3.
GFF2 (General Feature Format version 2) was limited in that it could only handle two-level feature hierarchies, not three-level ones such as gene -> transcript -> exon. Thus the Sequence Ontology and GMOD projects expanded on this with GFF3.
GTF (Gene Transfer Format) has also been known as GFF version 2.5, since it improves on version 2, but not as much as version 3 does.
GFF consists of one line per feature, each containing 9 columns of data. Each column is separated by a tab, making it a tab-delimited file.
Optional track lines
Within the file, we can also include optional track definition lines. These go at the beginning of the list of features they are to affect.
- refseq name
- Name of chromosome or scaffold. Chromosomes can be given without the 'chr' prefix.
Must be one used within Ensembl.
- Source of annotation, name of program that generated this feature.
- Feature type name.
Gene, variation, similarity
- Start position, with the first base numbered 1.
- End position, also 1-based and included in the feature.
- Floating point value.
For scores such as similarity, identity, etc.
- '+' for forward and '-' for reverse.
- Either 0, 1 or 2.
0 indicates first base of the feature is first base of codon, 1 indicates second base of feature is the first base of a codon, etc.
- Semicolon-separated list of tag-value pairs.
Provides additional information about each feature.
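The nine columns can be parsed with a short function. This is a sketch under a few assumptions: the column names are the conventional ones (seqname, source, feature, start, end, score, strand, frame, attribute), a "." marks a missing score or frame, and attribute values contain no ";" or "=" inside quotes. The function names and the example line are made up for illustration.

```python
GFF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def parse_attributes(text):
    """Parse 'tag=value;...' (GFF3 style) or 'tag "value"; ...' (GTF style)."""
    attributes = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        if "=" in item:
            key, value = item.split("=", 1)
        else:
            key, _, value = item.partition(" ")
        attributes[key.strip()] = value.strip().strip('"')
    return attributes


def parse_gff_line(line):
    """Parse one tab-delimited, 9-column GFF line into a dict."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != 9:
        raise ValueError("expected 9 columns, got %d" % len(values))
    record = dict(zip(GFF_COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["frame"] = None if record["frame"] == "." else int(record["frame"])
    record["attribute"] = parse_attributes(record["attribute"])
    return record


def to_bed_interval(record):
    """Convert 1-based inclusive GFF coordinates to a 0-based, end-exclusive interval."""
    return record["seqname"], record["start"] - 1, record["end"]


# Hypothetical feature line.
gff = parse_gff_line("chr2\texample_source\texon\t2000\t4000\t.\t+\t0\tID=exon1;Parent=transcript1")
print(gff["attribute"], to_bed_interval(gff))
```

The last helper highlights a common pitfall: GFF positions start at 1 and include the end base, while BED positions start at 0 and exclude it.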
Validators allow us to ensure that a file is formatted properly. To validate a GFF3 file, go to the GFF3 validator.
With so many different filetypes, bioinformaticists need a quick way to convert among the types.
Here is a list of widely used converters.
- Converts between a selection of biological sequence formats.
- Free sequence file converter
Available on GeneStudio.com
- Not a filetype converter, but provides a number of functions for a sequence.
EMBOSS Seqret webpage
Hopefully this lesson gave you a good idea of some of the more commonly used filetypes in bioinformatics. Any thoughts, questions or concerns? Please leave a comment below!