As soon as biological data could be stored digitally, a multitude of file formats arose. The very first files contained raw DNA sequence reads in a regular .txt file, but as the range of information broadened, so did the types of files.
Several different file formats arose, each with their own purpose.
In this series, we'll go over the most common sequence file formats you'll come across in bioinformatics. The purpose of this lesson won't be to learn the intricate details per format, but to simply become familiarized with the different file types that are used to store biological data. This will help calm that overwhelming feeling you get when sifting through bioinformatics forums, books and software tools.
Typically, to truly understand a file format (especially raw output files), you'll need to know the technology used to generate that format. In these cases, we'll link you to the corresponding tutorial.
We'll mainly go over DNA sequence file types, and save database file formats such as EMBL or SWISS-PROT for another lesson.
A file format is a way for computers (and humans) to standardize how data is organized. For example, this page was written as an HTML file, with the .html extension. HTML files contain special tags that tell the browser what each block of text is, and how to display it on the page.
Additionally, computers can check a file's format and immediately determine whether the file should be opened in a text editor (for editing), a modern browser (for viewing) or some other software.
File types can also indicate which algorithm to use to view (or open) that file. For example, .gif, .jpg and .png all display images, but the level of compression, size and resolution differ.
Early on, scientists held sequence information in plain text (.txt) with descriptive file names. Researchers soon found this limiting when they needed to include annotations and additional information about the sequences.
More common (yet still primitive) file types include csv and tsv. The former stands for comma-separated values, meaning that there is a comma between each value. The simplicity of this format allows researchers to easily exchange data among computers, a term known as portability.
id,fname,lname,occupation
314,Peter,Ignasius,Bioinformaticist
232,Sarah,Carlito,Mathematician
412,Enrique,Menezes,Microbiologist
A tsv (tab-separated values) file is similar, but data is separated by tabs instead of commas. Many of the biological filetypes covered in this lesson have tab-separated values.
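As a rough illustration of how little separates the two formats, here is a minimal Python sketch that reads delimited text with the standard csv module and rewrites comma-separated data as tab-separated data. The function names read_delimited and csv_to_tsv are made up for this example.

```python
import csv
import io

# Sample data copied from the CSV example in this lesson.
CSV_TEXT = """id,fname,lname,occupation
314,Peter,Ignasius,Bioinformaticist
232,Sarah,Carlito,Mathematician
412,Enrique,Menezes,Microbiologist
"""


def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def csv_to_tsv(text):
    """Re-emit comma-separated text as tab-separated text."""
    rows = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    for record in read_delimited(CSV_TEXT):
        print(record["fname"], record["occupation"])
    print(csv_to_tsv(CSV_TEXT))
```

The only thing that changes between the two calls is the delimiter argument, which is why so many tools can read both formats.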
The newline (aka end of line or EOL) is a special character or sequence of special characters that signifies the end of a line of text. In a typical text editor such as Notepad++, these characters are hidden by default.
What's tricky about the EOL character is that it differs by platform: UNIX-like systems use a single line feed (LF), while MS Windows uses a carriage return followed by a line feed (CRLF). On the command line, you can convert files between the two conventions with the dos2unix and unix2dos commands.
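If those commands are not available, the same conversion can be sketched in a few lines of Python. This is only an illustration, not a replacement for dos2unix or unix2dos. It assumes the usual conventions (a line feed on UNIX-like systems, a carriage return plus line feed on Windows), and the function names are made up for this example.

```python
def to_unix(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to UNIX (LF) line endings."""
    return data.replace(b"\r\n", b"\n")


def to_dos(data: bytes) -> bytes:
    """Convert to Windows (CRLF) endings without doubling existing CRs."""
    # Normalize to LF first so lines that are already CRLF are not
    # turned into CR CR LF.
    return to_unix(data).replace(b"\n", b"\r\n")


def detect_eol(data: bytes) -> str:
    """Report which newline convention appears in the data."""
    if b"\r\n" in data:
        return "CRLF"
    if b"\n" in data:
        return "LF"
    return "none"


if __name__ == "__main__":
    sample = b">seq1\r\nACGT\r\n"
    print(detect_eol(sample))
    print(detect_eol(to_unix(sample)))
```

Working on bytes rather than text keeps Python from silently translating the line endings while the file is being read.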
Another common text-based format that is becoming more and more popular is the markdown format. These files are indicated with the .md extension.
The markdown format is a markup language, just like HTML. All it does is mark up the text within the document to indicate which lines are headers, paragraphs, and so on. For example,
# This is a first-level header
## This is a second-level header
### This is a third-level header

> This is a blockquote

    Four spaces / 1 tab = line for code

*italics*, **bold**, `inline code`, [Google](http://google.com)

Unordered List:
- Illumina
- PacBio
- IonTorrent

Ordered List:
1. FASTA
2. FASTQ
3. SAM/BAM/CRAM
There are command-line tools that help convert from markdown to html, such as pandoc:
$ pandoc --from markdown --to html README.md > README.html
You'll often see these files as README.md when you download a source file from GitHub or another repository. The markdown format allows the page to load with proper formatting.
Great! Now that we've covered basic text-based file formats, let's move into more bioinformatics-specific file types.
FASTA (pronounced "fast-A") format is a simple type of format that bioinformaticians use to represent either nucleotide or protein sequences. It is written in text format, allowing for processing tools to easily parse the data. The general file extension is .fas.
The FASTA file format originated from a DNA and protein sequence alignment software package called FASTP, created in the mid-1980s. The format allows you to precede each sequence with a comment.
Each sequence record has two parts: 1) the identifier line (comments, annotations) and 2) the sequence itself, which may be wrapped across several lines.
Before we dig into a FASTA sequence, let's see what one looks like. Here is an example of a standard FASTA format. Pretty simple, right?
>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP
The top line holds information pertaining to the sequence below. It is preceded by a ">". Without this informative first line, we just have a raw format.
When the FASTA sequence comes from a biological database, the identifier marks which database. Here is a list of major database sequence identifiers:
The lines immediately following the identifier hold the raw sequence. For both DNA and proteins, standard nucleic acid and amino acid IUB/IUPAC codes are used.
Additionally, there are a few more notes to consider:
Here is a list of the standard IUB/IUPAC nucleic acid codes.
|R||A or G (puRine)|
|K||G, T, or U (bases with Ketone)|
|M||A or C (bases with an aMino group)|
|S||C or G (Strong interaction)|
|W||A, T or U (Weak interaction)|
|B||not A (B comes after A)|
|D||not C (D comes after C)|
|H||not G (H comes after G)|
|V||neither T nor U (V comes after U)|
|N||A C G T U (Nucleic acid)|
|-||Gap of unknown length|
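To make the ambiguity codes a little more concrete, here is a small Python sketch that stores them in a lookup table. It is an illustration only. It covers the DNA alphabet (in RNA, U takes the place of T), the code Y (C or T, pYrimidine) is included from the standard IUPAC set, and the helper names are made up for this example.

```python
# Nucleotide codes mapped to the DNA bases they can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "K": "GT",    # Keto
    "M": "AC",    # aMino
    "S": "CG",    # Strong interaction
    "W": "AT",    # Weak interaction
    "B": "CGT",   # not A
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "V": "ACG",   # not T
    "N": "ACGT",  # any base
}


def code_matches(code, base):
    """Return True if the IUPAC code can represent the concrete base."""
    return base.upper() in IUPAC_DNA[code.upper()]


def count_ambiguous(sequence):
    """Count positions that are not a plain A, C, G or T (gaps are skipped)."""
    return sum(
        1
        for ch in sequence.upper()
        if ch != "-" and len(IUPAC_DNA[ch]) > 1
    )


if __name__ == "__main__":
    print(code_matches("R", "G"))
    print(count_ambiguous("ACGTRYN-A"))
```

A character outside the table raises a KeyError, which is a blunt but useful way to catch sequences that contain unexpected symbols.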
Here's a list of the amino acid codes, including the ambiguity codes and special characters.
|B||Aspartic Acid (D) or Asparagine (N)|
|J||Leucine (L) or Isoleucine (I)|
|Z||Glutamic acid (E) or Glutamine (Q)|
|-||Gap of unknown length|
The generic form of a FASTA file has the .fas extension. For more specific types, we can use the following:
If we just append multiple sequences in FASTA format, we get multi-FASTA format. This is a single file with several sequences, and is often used for multi-alignment programs like ClustalW or multialign.
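Reading a multi-FASTA file mostly comes down to watching for the ">" character. Below is a minimal Python sketch of that idea. The function name parse_fasta and the two sample records are hypothetical, and real projects would more likely rely on an established library.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record multi-FASTA text, for illustration only.
EXAMPLE = """>seq1 first hypothetical record
ACGTACGT
ACGT
>seq2 second hypothetical record
MVLIRV
"""

if __name__ == "__main__":
    for name, sequence in parse_fasta(EXAMPLE.splitlines()):
        print(name, len(sequence))
```

Because the function is a generator, it handles one record at a time and does not need to hold a large file in memory.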
To get FASTA-formatted sequence from GenBank NCBI database, simply click the display near the top of the record and click FASTA.
Keep in mind that there are programs out there like READSEQ that allow you to convert formats to and from FASTA.
The FASTA format is extremely simple, with just two parts per sequence: the first is the description line, the other is the raw sequence. The simplicity is nice when running a quick pairwise alignment, but limiting when we need more information per sequence.
With next-generation sequencing instruments pumping out millions of reads per run, scientists needed a way to check the quality of each base call. To document both the sequence and the probability of each base call being correct, scientists came up with the FASTQ format. The "Q" comes from quality, as in the quality of the read.
The file extension for FASTQ is .fq or .fastq.
The FASTQ format was developed by the Wellcome Trust Sanger Institute, and became the de facto standard for high-throughput sequencing instrument outputs.
In addition to storing biological sequence information, it also adds a line for the quality scores. Each score is encoded with a single ASCII character.
Let's take a look at an example FASTQ format, then look at each line.
@SEQ_ID
TTCAACTCGTTAGTAAATATCAAACGATCAGTACCATTTTGGGGTTCAAAGTGACAGTTT
+
!'>>>>CCC'*((((***(***-+*'')+))%%%++))**55CCF>>%%%%).1CCCC65
The first line begins with an '@' character and contains the sequence identifier with an optional description. This is just like FASTA's first line.
Here is an example sequence identifier from Illumina
The second line contains raw sequence reads, also similar to FASTA files.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
The fourth line encodes the quality scores per each base call. This line must have the same length as the sequence in line 2.
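The four-line layout can be sketched as a small Python reader. This is a simplified illustration that assumes every record occupies exactly four lines (no wrapped sequences) and that the input has no blank lines. The name parse_fastq is made up for this example.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(iterator, 4)]
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, plus, quality = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# The example record from this lesson.
EXAMPLE = [
    "@SEQ_ID",
    "TTCAACTCGTTAGTAAATATCAAACGATCAGTACCATTTTGGGGTTCAAAGTGACAGTTT",
    "+",
    "!'>>>>CCC'*((((***(***-+*'')+))%%%++))**55CCF>>%%%%).1CCCC65",
]

if __name__ == "__main__":
    for identifier, sequence, quality in parse_fastq(EXAMPLE):
        print(identifier, len(sequence), len(quality))
```

The length check mirrors the rule that line 4 must be exactly as long as line 2.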
Quality characters range from ! (the lowest quality) to ~ (the highest). These correspond to ASCII table values 33-126.
Subtracting 33 shifts the values down to a range of 0 to 93, but we rarely have a Phred score of over 60.
To map the quality to the probability that a base call is correct, we use a bit of math.
$$Q = -10 \log_{10} p$$
Here p is the probability that the corresponding base call is incorrect (so 1 - p is the probability that it is correct), and Q is the Phred quality score, which can range from 0 to 93.
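Here is a short Python sketch of that arithmetic. It assumes the encoding described above (ASCII value minus 33) and the standard Phred relationship between Q and p. The function names are made up for this example.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in phred_scores("!5I"):
        print(q, error_probability(q))
```

Each step of 10 in Q cuts the chance of error by a factor of ten: a score of 20 means a 1 in 100 chance, and a score of 40 means a 1 in 10,000 chance.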
For a more complete guide on FASTQ, visit the FASTQ format Wikipedia page.
Before we talk about SAM, BAM and CRAM, we must discuss the software, SAMtools, from which these formats originate.
SAMtools is a suite of utilities that allow for efficient post-processing of short DNA sequence read alignments. The suite includes several command-line programs, such as index, that allow for next-generation sequence data processing.
The SAM, BAM and CRAM file formats come from the use of SAMtools.
The name SAM comes from Sequence Alignment/Map. In addition to regular sequence reads, SAM includes alignment data that link short reads to a reference sequence. This makes SAM files the format of choice when visualizing short read sequences in genome browsers such as IGV (Integrative Genomics Viewer).
The SAM format is simple to parse, generate and check for errors. However, its large file size (~10 GB on average) gets in the way of efficiency. Thus, researchers found a way to compress it into a binary format, BAM, without losing the ability to manipulate it. BAM is an indexable binary representation of nucleotide sequence alignments, allowing for intensive data processing in production pipelines.
CRAM is a restructured, column-oriented version of the binary format.
For more reading on SAM and BAM, head over to the Center for Statistical Genetics.
BED is a tab-delimited file format that allows users to define how the data lines of an annotation track are displayed.
If you're unfamiliar with annotation tracks, they're simply the lines that are displayed in a genome browser.
BED files can have up to 12 columns, but only three are required for the UCSC browser, Galaxy browser and bedtools. The number of columns must be consistent throughout each row of the file.
Let's look at all 12 BED fields, as explained by the UCSC Genome Browser Information section.
The following 3 fields are required for all BED files.
These 9 BED fields are optional.
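As a rough sketch of how these rules translate into code, the Python below reads BED lines into dictionaries. The field names follow the UCSC description of the format (three required fields, then nine optional ones). The function name parse_bed and the sample values are hypothetical.

```python
# Field names as listed in the UCSC BED description: the first three
# are required, the remaining nine are optional.
BED_FIELDS = (
    "chrom", "chromStart", "chromEnd",
    "name", "score", "strand", "thickStart", "thickEnd",
    "itemRgb", "blockCount", "blockSizes", "blockStarts",
)


def parse_bed(lines):
    """Parse BED data lines into dicts, enforcing a consistent column count."""
    records = []
    width = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip blank lines, comments and browser/track definition lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        values = line.split("\t")
        if not 3 <= len(values) <= len(BED_FIELDS):
            raise ValueError("a BED line needs between 3 and 12 columns")
        if width is None:
            width = len(values)
        elif len(values) != width:
            raise ValueError("inconsistent number of columns")
        record = dict(zip(BED_FIELDS, values))
        record["chromStart"] = int(record["chromStart"])
        record["chromEnd"] = int(record["chromEnd"])
        records.append(record)
    return records


if __name__ == "__main__":
    # Hypothetical four-column data, for illustration only.
    sample = ["chr4\t400000\t400005\tregionA", "chr4\t400100\t400105\tregionB"]
    for row in parse_bed(sample):
        print(row["chrom"], row["chromStart"], row["chromEnd"], row["name"])
```

Keep in mind that BED start coordinates are zero-based while the end coordinate is exclusive, so a feature's length is simply chromEnd minus chromStart.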
The Wiggle format (.wig) is an efficient way to store dense, continuous blocks of data. It is primarily used to store values such as GC percentage, probability scores and transcriptome data. Instead of specifying a value for each nucleotide position, wig allows you to bind values to entire regions that follow a certain pattern.
Like SAM and BAM, wig has an indexed binary equivalent called bigWig. This allows for efficient data handling, as only parts of the file are extracted and processed when viewing particular regions in a genome browser. For a conversion, use the wigToBigWig program.
The .wig filetype contains one or more blocks. On the top of each block is the track declaration line, which defines the data elements with a number of options.
There are several options we can place on the first line to characterize that particular block of information. Each option should be formatted as a key=value pair.
The two main formatting options per block are variableStep and fixedStep.
The variableStep option is the more common option. It includes the chromosome position in one column, and data values in another.
variableStep chrom=chr4
400001 13
400002 13
400003 13
400004 13
400005 13
We may have the chromosome number and an optional parameter known as span, which tells us the number of bases each value should cover.
The use of the "span" parameter can help us save space. The following is identical to the data block above, but saves much more space.
variableStep chrom=chr4 span=5
400001 13
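To check that the two blocks really describe the same data, here is a minimal Python sketch that expands a variableStep block into one value per position. It handles a single block only, and the function name expand_variable_step is made up for this example.

```python
def expand_variable_step(lines):
    """Expand one variableStep block into (chrom, {position: value})."""
    lines = [line.strip() for line in lines if line.strip()]
    declaration = lines[0].split()
    if declaration[0] != "variableStep":
        raise ValueError("expected a variableStep declaration line")
    # Everything after the keyword is a key=value option.
    options = dict(token.split("=", 1) for token in declaration[1:])
    span = int(options.get("span", 1))  # span defaults to one base
    values = {}
    for line in lines[1:]:
        position, value = line.split()
        for offset in range(span):
            values[int(position) + offset] = float(value)
    return options["chrom"], values


LONG_FORM = [
    "variableStep chrom=chr4",
    "400001 13",
    "400002 13",
    "400003 13",
    "400004 13",
    "400005 13",
]
SHORT_FORM = ["variableStep chrom=chr4 span=5", "400001 13"]

if __name__ == "__main__":
    print(expand_variable_step(LONG_FORM) == expand_variable_step(SHORT_FORM))
```

Both inputs expand to the same five positions on chr4, each with a value of 13.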
In case you have data blocks with regular intervals between each position, you can use the fixedStep option. This allows you to place the positions on the track definition line, along with the interval length. Thus, only one column is necessary for the data parameters.
fixedStep chrom=chr4 start=400001 step=100
13
14
15
The above block would feature chromosome 4, position 400001 as having a value of 13, position 400101 having the value 14, and position 400201 having value 15.
You may also specify a span, indicating the number of bases each value covers.
fixedStep chrom=chr4 start=400001 step=100 span=5
13
14
15
This is similar, but each value covers five nucleotides instead of just one. Thus we have 13 for 400001-400005, 14 for 400101-400105, and 15 for 400201-400205.
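The position arithmetic for fixedStep can be sketched in Python as follows. This handles a single block, assumes that start and step are both present on the declaration line, and uses a made-up function name.

```python
def expand_fixed_step(lines):
    """Expand one fixedStep block into (chrom, first, last, value) tuples."""
    lines = [line.strip() for line in lines if line.strip()]
    declaration = lines[0].split()
    if declaration[0] != "fixedStep":
        raise ValueError("expected a fixedStep declaration line")
    options = dict(token.split("=", 1) for token in declaration[1:])
    start = int(options["start"])
    step = int(options["step"])
    span = int(options.get("span", 1))  # span defaults to one base
    intervals = []
    for index, line in enumerate(lines[1:]):
        first = start + index * step   # each value begins one step later
        last = first + span - 1        # and covers span bases, inclusive
        intervals.append((options["chrom"], first, last, float(line)))
    return intervals


BLOCK = ["fixedStep chrom=chr4 start=400001 step=100 span=5", "13", "14", "15"]

if __name__ == "__main__":
    for chrom, first, last, value in expand_fixed_step(BLOCK):
        print(chrom, first, last, value)
```

The first value starts at the declared start position, and every later value starts one step further along.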
GFF, or the General Feature Format, is used to describe genes and other features of DNA, RNA and protein sequences. It comes with the .gff extension.
GFF is an extension of a basic file with the name, start and end parameters (NSE). For example, an NSE (Chromosome2,2000,4000) specifies two kilobases found on chromosome 2. GFF allows the annotation of these segments.
GFF allows for users to perform common operations such as intersection, exclusion, union, filtration, sorting, transformation and dereferencing.
There are several versions of GFF. The ones used today are GFF2, GTF and GFF3.
GFF2 (General Feature Format version 2) was limited in that it could only handle two-level feature hierarchies, rather than three-level ones such as gene -> transcript -> exon. The Sequence Ontology and GMOD projects addressed this limitation in GFF3.
GTF (Gene Transfer Format) has also been known as GFF Version 2.5 since it improves on version 2, but not as much as version 3.
GFF consists of one line per feature, each containing 9 columns of data. Each column is separated by a tab, making it a tab-delimited file.
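The nine-column layout can be sketched in Python as shown below. The column names follow the GFF3 naming (GFF2 and GTF label some columns differently), the attribute parsing assumes GFF3-style key=value pairs separated by semicolons, and the sample line is hypothetical.

```python
# Column names as used by GFF3.
GFF_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Split one tab-delimited GFF3 feature line into a dict."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError("a GFF feature line must have exactly 9 columns")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # GFF3 attributes look like "ID=gene0001;Name=example".
    record["attributes"] = dict(
        pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
    )
    return record


# Hypothetical feature line, for illustration only.
EXAMPLE = "Chromosome2\texample\tgene\t2000\t4000\t.\t+\t.\tID=gene0001;Name=hypothetical"

if __name__ == "__main__":
    feature = parse_gff3_line(EXAMPLE)
    print(feature["seqid"], feature["type"], feature["start"], feature["end"])
    print(feature["attributes"])
```

A dot in a column, as in the score and phase columns of the sample line, marks a value that is not set.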
Within the file, we can also include optional track definition lines. These go at the beginning of the list of features they are to affect.
Validators allow us to ensure that a file is formatted properly. To validate a GFF3 file, go to the GFF3 validator.
With so many different filetypes, bioinformaticists need a quick way to convert among the types.
Here is a list of widely used converters.
Hopefully this lesson gave you a good idea of some of the more commonly used filetypes in bioinformatics. Any thoughts, questions or concerns? Please leave a comment below!