In the previous lesson on basic DNA sequencing techniques, we covered a variety of sequencing methods that were used from the mid-1980s to the early 2000s.
Although these techniques allowed us to sequence the first human genome, they were too costly and time-intensive. Because of this, there were a limited number of sample human genomes to base genetic studies on, making it difficult to come up with robust phenotype-genotype correlations.
It wasn't until pyrosequencing and other NGS techniques arrived that the price could drop to $1,000. By 2008, consumer genomics began to take hold, with hard data showing genetic mutations correlating with specific diseases.
Before we talk about how much cheaper DNA sequencing has gotten, let's look at these two terms used to describe sequencing methodologies: resequencing and de novo sequencing.
Resequencing is the term for sequencing an organism that has already been sequenced. We only need to align our reads to a reference genome. Thus, our reads need only be a few hundred base pairs long. Many Next-Generation Sequencing platforms provide short reads that can be aligned to a reference genome.
De novo sequencing, on the other hand, is the term for sequencing a genome from scratch. It is much more costly, time-intensive, and limited to select techniques. Reads must be at least 1,000 bp long. The first human genome sequenced relied on these methods, which is one of the reasons it was so costly and time-intensive.
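To make the difference concrete, here is a minimal sketch of both ideas. It is a toy under stated assumptions: reads contain no errors, alignment means exact matching, and the function names `align_read` and `merge_by_overlap` are invented for this illustration. Real aligners and assemblers are far more sophisticated.

```python
from typing import Optional


def align_read(reference: str, read: str) -> int:
    """Resequencing sketch: return the 0-based position where the read
    matches the reference exactly, or -1 if it does not occur."""
    return reference.find(read)


def merge_by_overlap(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """De novo sketch: join two reads when a suffix of `left` equals a
    prefix of `right`. The longest overlap wins; None means no overlap."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


reference = "ACGTTAGCCATG"

# Resequencing: the reference tells us where a short read belongs.
print(align_read(reference, "TAGCC"))                 # 4

# De novo: no reference, so reads must overlap each other to be joined.
print(merge_by_overlap("ACGTTAGC", "TAGCCATG"))       # ACGTTAGCCATG
```

Notice that the de novo case only works when the reads share enough sequence to overlap unambiguously, which is one reason longer reads are needed.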
Throughout the 2000s, scientists came up with a class of novel techniques to lower the cost of DNA sequencing. These methods were successful not only because of the new chemistries available, but also because of cheaper and more powerful computers.
Some may argue that computers allowed for the emergence of NGS technology, as faster processors allowed genomes to be assembled at a rate much higher than before. Additionally, affordable data storage allowed for genomes to be stored and accessible through public databases, and novel algorithms provided immediate analysis and results.
The problem we face in bioinformatics is now not the lack of information, but the wealth of it! Scientists simply have too much data and not enough time to curate all of it. This is why many biological databases are separated into primary (unfiltered) and secondary (curated) databases. In order to make sense of all this data, we are in need of well-trained and knowledgeable bioinformaticians.
The textbook definition of Next-Generation Sequencing is a high-throughput DNA sequencing methodology that makes use of parallelization to process up to half a million sequences concurrently. The process of running thousands of analytes at a time is known as multiplexing.
NGS can also be used to describe a new era. In this new time, we can see sequencing one's genome becoming commonplace. Imagine going to the doctor's office with some illness or concerns, and simply ordering a genetic test. The process will be affordable and easy - just like taking an MRI or performing a blood test. An era where this is commonplace is what some say NGS refers to.
A commonality of Next-Generation Sequencing methods is the simplified workflow used to prepare DNA for sequencing. With the advent of PCR and its variations, there is no longer any need to transform DNA fragments into bacterial cells to replicate them. Library preparation includes the following:
Previous methods relied on capillary electrophoresis, which could only read up to 96 wells at a time. NGS's massively parallel technique allowed for millions of reads to run simultaneously; however, most reads come out as short, unless additional techniques such as mate-pair sequencing are used.
Instead of conventional PCR or amplification through bacterial species, NGS techniques use two different flavors of PCR to set the stage for sequencing.
There are two ways we are able to prepare the library: through emulsion PCR (ePCR) and bridge PCR.
With ePCR we have technologies such as Ion Torrent semiconductor sequencing, Roche 454 pyrosequencing, and sequencing by ligation (as used in polony sequencing and SOLiD).
With bridge PCR, we have technologies such as Illumina's sequencing by synthesis.
We'll first cover ePCR and the technologies that use it, then move on to bridge PCR.
There are alternative methods used to sequence the actual DNA. We have seen sequencing by synthesis already, where the base calls are read at the addition of each nucleotide. There is another type of technique called sequencing by ligation, which we'll see soon.
Here are some important NGS terms you should familiarize yourselves with.
The term Next-Generation Sequencing is somewhat of a misnomer since it implies some technology of the future. However, as you're going through this lesson, note the limitations of NGS, as they exist. There is a Third-Generation Sequencing, which is supposed to be the next-Next-Generation of sequencing platforms, and improve upon these limitations. We will cover this in the future.
Emulsion PCR is a PCR variation that some NGS technologies use to replicate DNA sequences. It is conducted on a bead surface within tiny water droplets suspended in an oil solution.
This is a very important concept to understand, as all NGS techniques replicate DNA before sequencing is done. In short, DNA is replicated in order to amplify signals. No matter the method of sequencing, without a proper amount of amplification, it's near impossible to detect each base call.
The DNA is first fragmented, either by sonication (high-energy sound waves) or nebulization (forcing DNA through a small hole), into fragments ranging from 300 to 800 bp.
Adapters are then ligated onto the DNA fragments. These allow the strands to bind to the emulsion beads.
The double-stranded DNA molecules with adapters are then denatured by heating them to 95 °C. Denaturing DNA simply means going from double-stranded DNA (dsDNA) to two single strands (ssDNA): the hydrogen bonds keeping the two strands together are broken.
Each bead is coated with streptavidin, which is resistant to organic solvents, denaturants, detergents, proteolytic enzymes and extremes of temperature and pH.
Over a billion beads are used with a primer that matches the adapters attached earlier. The ssDNA is then attached to these beads.
Each bead is emulsified in a water-in-oil droplet with PCR reagents (DNA polymerase, primers, buffers, dNTPs).
Within these droplets, PCR is conducted. This involves the steps of denaturation, annealing, and elongation. First, the strand is elongated by DNA polymerase using dNTPs. Then the double strand is denatured, allowing the freed strand to anneal to another primer site on the surface of the bead. Eventually, about 1 million copies of the target are amplified on the surface of each bead. The water-in-oil droplet is approximately 1 µm in size.
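To get a feel for how many PCR cycles it takes to reach about a million copies, here is a rough back-of-the-envelope sketch. It assumes idealized exponential amplification, where every cycle multiplies the copy number by (1 + efficiency). Amplification on a bead surface will not follow this model exactly, and the function names are made up for this example.

```python
import math


def copies_after(cycles: int, start_copies: int = 1, efficiency: float = 1.0) -> float:
    """Copies present after `cycles` rounds of idealized PCR.
    efficiency=1.0 means every strand is copied in every cycle (perfect doubling)."""
    return start_copies * (1.0 + efficiency) ** cycles


def cycles_to_reach(target_copies: float, start_copies: int = 1, efficiency: float = 1.0) -> int:
    """Smallest whole number of cycles that yields at least `target_copies`."""
    ratio = target_copies / start_copies
    return math.ceil(math.log(ratio) / math.log(1.0 + efficiency))


# Starting from a single template strand on a bead:
print(cycles_to_reach(1_000_000))                    # 20
print(copies_after(20))                              # 1048576.0

# A less efficient reaction needs more cycles to reach the same yield.
print(cycles_to_reach(1_000_000, efficiency=0.8))
```

Under perfect doubling, 20 cycles are already enough to pass the one-million mark, since 2^20 = 1,048,576.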
Follow the figure to see how each bead is able to replicate DNA on its surface.
After the DNA strands are amplified, the emulsion from the preceding step is broken using isopropanol and detergent buffer. The solution is then vortexed, centrifuged, and magnetically separated. The resulting solution is a suspension of empty, clonal and non-clonal beads, which will be filtered in the next step.
After PCR is conducted, you are left with a mixture of beads: some have amplified DNA attached to their surfaces, and some do not.
We may take out the enriched beads by attaching streptavidin-coated magnetic enrichment beads. With a magnet, we can then pull out the beads with amplified DNA.
There are other methods of bead enrichment, such as using larger beads that bind to the beads with amplified DNA. After centrifugation, the beads with and without amplified DNA can then be separated.
Attach a capping oligonucleotide to the 3' end of both unextended forward ePCR primers and the RDV segment of template DNA. This helps in coverslip arraying, which is used in polony sequencing, and prevents fluorescent probes from ligating to the ends.
The beads with amplified sequences are then placed on a slide and sequenced. Because each bead carries a high density of the same DNA molecule, the signal is amplified, allowing computers to read the sequencing data.
Thus far we have seen methods that add a single base per cycle, known as sequencing by synthesis. In contrast, sequencing by ligation uses short segments of DNA called oligonucleotides instead of single bases to sequence DNA. Take a look at the diagram below to see the difference:
Since this is ligation, we use the enzyme DNA ligase rather than DNA polymerase. This enzyme joins together ends of DNA molecules.
Note that ligation is performed in the 3'-5' direction for multiple cycles, which is the opposite of how polymerase works.
Because DNA ligase has a low efficiency when there are mismatches between bases, we can be sure that only the oligonucleotides that match are ligated.
There are five main steps to sequencing by ligation, as outlined below.
A known sequence is flanked onto the target DNA strand. A short anchor sequence is then brought in to bind to this known sequence.
Oligonucleotides are short segments of DNA, and are characterized by a number of features, as outlined below.
The oligonucleotides are either 8 (octamer) or 9 (nonamer) bases long.
The oligonucleotides are partially degenerate, meaning that only one of their positions holds a known nucleotide. For example, one oligonucleotide can have a known query position at 1, but unknown positions for 2-9. Another oligonucleotide can have a known position at 4, but unknown nucleotides at 1-3 and 5-9.
For our example, let's assume we have a nonamer whose known position is at query position 1.
Each oligonucleotide is tagged at the 3' end with a fluorescent dye. The color depends on the nucleotide at the known query position.
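To see what "partially degenerate" means in practice, we can enumerate the concrete sequences hiding behind a single probe. This is only an illustrative sketch: the helper names are invented here, and in the lab the degenerate positions are made as mixtures of bases rather than listed one by one.

```python
from itertools import product

BASES = "ACGT"


def expand_degenerate_probe(length: int, query_position: int, query_base: str):
    """Yield every concrete sequence for a probe that has one known base
    at `query_position` (1-based) and any base at all other positions."""
    for other_bases in product(BASES, repeat=length - 1):
        sequence = list(other_bases)
        sequence.insert(query_position - 1, query_base)
        yield "".join(sequence)


def pool_size(length: int) -> int:
    """Number of concrete sequences sharing one dye: 4 choices for each
    of the (length - 1) degenerate positions."""
    return 4 ** (length - 1)


# A tiny probe of length 3 with a known 'A' at query position 2:
print(list(expand_degenerate_probe(3, 2, "A")))

# The nonamer from our example (known base at position 1):
print(pool_size(9))    # 65536
```

Each of the four dyes therefore labels its own pool of 65,536 nonamers, and only the members whose bases match the template can be ligated.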
The pool of oligonucleotides is mixed with the target DNA and allowed to hybridize with the target DNA sequence.
DNA ligase joins an oligonucleotide to the anchor when its bases match the unknown DNA sequence. Based on the color of the light emitted, we can tell which nucleotide sits at the query position of the unknown sequence.
The fluorescent labels are cleaved away, regenerating a 5'-phosphate group on the ends of the ligated probes. This allows the next oligonucleotides to be ligated onto the rest of the unknown sequence.
Steps 4 and 5 are repeated until the nonamers have reached the end of the unknown DNA sequence. After this, the anchor sequence is shortened by one nucleotide, and the process is repeated.
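The bookkeeping behind these rounds can be sketched in a few lines. The toy model below assumes error-free ligation, nonamers with a single known query position, and an anchor that is shortened by one base per round. For simplicity it records the template base directly, whereas the real probe carries the complementary base. All names are hypothetical.

```python
PROBE_LENGTH = 9  # nonamers, as in the running example


def interrogated_positions(template_length, query_position, anchor_shift=0):
    """1-based template positions read during one round of chained ligations.
    `anchor_shift` is how many bases the anchor has been shortened by."""
    positions = []
    cycle = 0
    while True:
        position = cycle * PROBE_LENGTH + query_position - anchor_shift
        if position > template_length:
            break
        if position >= 1:  # positions below 1 fall inside the known flanking sequence
            positions.append(position)
        cycle += 1
    return positions


def sequence_by_ligation(template, query_position=1):
    """Rebuild the template by combining the calls from every anchor shift."""
    calls = {}
    for anchor_shift in range(PROBE_LENGTH):
        for position in interrogated_positions(len(template), query_position, anchor_shift):
            calls[position] = template[position - 1]  # the dye color reveals this base
    return "".join(calls[position] for position in sorted(calls))


print(interrogated_positions(30, query_position=1))                   # [1, 10, 19, 28]
print(interrogated_positions(30, query_position=1, anchor_shift=1))   # [9, 18, 27]
print(sequence_by_ligation("ACGTTAGCCATGGATCCA"))
```

A single round only reads every ninth base, so it takes nine rounds with different anchor lengths to fill in every position. This is part of why the method is slow.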
A downside to this method is its limitation to short reads, and the time it takes to ligate oligonucleotides on and off. Additionally, there have been problems sequencing palindromic strands.
The upside of this method is that it is easy to implement with off-the-shelf reagents.
Polony sequencing, developed by George M. Church at Harvard Medical School, is a sequencing technique that uses paired-tag library emulsion PCR to amplify the target DNA, and sequencing by ligation to detect DNA bases. This is a combination of concepts we covered in the two previous pages.
When polony sequencing was released in 2003, its cost was less than 10% of that of Sanger sequencing. It was used to sequence a full E. coli genome in 2005 with an error rate of less than one per million bases (0.0001%).
One unique aspect of polony sequencing is that its technology is an open-source platform. This means the software and protocols are free and don't require licensing or a fee for use. Any modifications or improvements to the system are also made available. Additionally, the only machinery required is a computer-controlled fluidics system and an epifluorescence microscope.
The procedure takes a total of 9 steps, but the most important parts (emulsion PCR and sequencing by ligation) were already covered in an earlier lesson.
The first step, as in any other NGS technique, is the library construction. We break apart the genomic DNA.
Next we want to perform end-repair to fix any damaged or incompatible edges. We want to make our DNA ends blunt-ended with a phosphate group attached at the 5'. This allows us to ligate any adapter oligonucleotides.
The DNA fragments also undergo A-tailing, which adds an A to the 3' end of the sheared DNA.
After the DNA molecules are repaired, those of length 1kb are selected by loading them onto a 6% TBE PAGE gel.
The next step is to circularize the DNA. We do this with T-tailed, 30 bp long synthetic oligonucleotides (T30), each containing two outward-facing MmeI recognition sites.
Restriction enzymes are biomolecules that recognize a specific sequence and cut either at that particular spot or a certain number of nucleotides away from it. The cuts may be "sticky" or "blunt," depending on the type of restriction enzyme.
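Here is a small sketch of how a restriction digest can be modelled on one strand. It uses two well-known enzymes, EcoRI (cuts G^AATTC, leaving sticky ends) and EcoRV (cuts GAT^ATC, leaving blunt ends), purely as examples; neither is part of the polony protocol. The `ENZYMES` table and function names are assumptions for this illustration, and only the top strand is considered.

```python
import re

# name: (recognition site, cut offset from the start of the site, end type)
ENZYMES = {
    "EcoRI": ("GAATTC", 1, "sticky"),  # G^AATTC
    "EcoRV": ("GATATC", 3, "blunt"),   # GAT^ATC
}


def cut_positions(sequence: str, enzyme: str) -> list:
    """Indices in `sequence` where the enzyme cuts the top strand."""
    site, offset, _end_type = ENZYMES[enzyme]
    # The lookahead lets us find overlapping occurrences of the site as well.
    return [match.start() + offset for match in re.finditer(f"(?={site})", sequence)]


def digest(sequence: str, enzyme: str) -> list:
    """Split `sequence` into the fragments produced by the enzyme."""
    fragments, previous = [], 0
    for cut in cut_positions(sequence, enzyme):
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])
    return fragments


dna = "AAGAATTCTTGATATCGG"
print(digest(dna, "EcoRI"))   # ['AAG', 'AATTCTTGATATCGG']
print(digest(dna, "EcoRV"))   # ['AAGAATTCTTGAT', 'ATCGG']
```

An enzyme that cuts away from its recognition site could be modelled in the same table by giving it an offset larger than the length of the site.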
The circularized DNA undergoes rolling circle replication. This is a type of nucleic acid replication that rapidly synthesizes multiple copies of circular molecules of DNA.
The newly generated circularized DNA is then digested by the restriction enzyme MmeI (a type IIS restriction endonuclease), which cuts at a distance from its recognition site. This releases the T30 fragment, flanked by 17-18 bp tags of the sequence (70 bp in total).
The resulting DNA is repaired, and FDV2 and RDV2 are added to its ends. In total, this results in 135 bp library molecules.
We now have DNA templates with a 44 bp FDV sequence, a 17-18 bp proximal tag, the T30 sequence, a 17-18 bp distal tag, and a 25 bp RDV sequence.
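As a quick sanity check, we can add up the pieces listed above and compare the total with the quoted 135 bp. The snippet simply sums the lengths given in the text; the variable names are made up.

```python
# (component, shortest length in bp, longest length in bp)
LIBRARY_PARTS = [
    ("FDV sequence", 44, 44),
    ("proximal tag", 17, 18),
    ("T30 sequence", 30, 30),
    ("distal tag", 17, 18),
    ("RDV sequence", 25, 25),
]

shortest = sum(low for _name, low, _high in LIBRARY_PARTS)
longest = sum(high for _name, _low, high in LIBRARY_PARTS)

print(f"Library molecule length: {shortest}-{longest} bp")   # 133-135 bp
```

The 135 bp figure corresponds to the case where both tags are 18 bp long.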
ePCR is used to amplify the 135 bp paired end-tag library molecules. This process takes place within a water droplet embedded within an oil solution.
Coverslips are washed and treated with aminosilane. This eliminates fluorescent contamination and allows template DNA and beads to be covalently coupled to the surface.
The resulting beads from ePCR are mixed with acrylamide and poured into a teflon-masked microscope slide. The coverslip is placed on top of the acrylamide gel for 45 minutes to allow for polymerization.
The beads bind to the aminosilane coating of the coverslip, spreading out in a monolayer within the acrylamide gel. The coverslip with the gel, beads and template DNA is then inverted. The sequencing reagents will flow beneath it.
The DNA sequencing method used is sequencing by ligation. In short, a series of anchor primers are hybridized to the synthetic oligonucleotide sequences next to the genomic DNA tags.
A group of degenerate nonamers (oligonucleotides of length 9) is used, each with a particular known query position and fluorescent marker. In this round, the known query is at position 9:
Depending on which nonamer binds, we can see which nucleotide is at position 9. We can then do this again to get the nucleotide at position 18, then 27, and so on. Now we can use a pool of nonamers whose known query position is moved down by one nucleotide:
We may either use these, or simply shift the starting position by one base and again use nonamers with the same known query position.
We throw in this pool of degenerate nonamers again to see the nucleotides at positions 8, 17, 26, 35 and so on. We repeat this with different known query positions until we are through with the sequence.
Pyrosequencing is considered to be one of the first of the second-generation sequencing technologies. It was commercialized through Roche's 454 sequencing instrument, and allowed scientists to garner large amounts of sequencing data in a single run.
Unlike polony sequencing, pyrosequencing falls under sequencing by synthesis, meaning the sequence is resolved while forming the sample's complementary strand. However, similar to polony sequencing, pyrosequencing uses emulsion PCR.
At its core, the pyrosequencing technique relies on the detection of pyrophosphate molecules that are released during DNA synthesis. This allows for the generation of light, which is then detected by a sensor.
In a typical dNTP molecule, there are three phosphates attached to the 5' carbon of the deoxyribose sugar. The first (which is attached to the sugar) is called the α-phosphate. The next is the β-phosphate and the last is the γ-phosphate.
During replication, the α-phosphate of each incoming complementary nucleotide is joined enzymatically by a phosphodiester linkage to the 3'-OH group of the last nucleotide in the growing strand.
During this reaction, the β- and γ-phosphates are cleaved off as a unit called pyrophosphate (PPi).
In second-generation DNA sequencing techniques, a cycle is established to resolve each nucleotide. Here is the cycle used in pyrosequencing:
After emulsion PCR is performed, each enriched bead is placed in one of the many picoliter-volume wells of the sequencing machine.
One of the four dNTPs is added. If the added dNTP (dXTP) is complementary to the next base on the template, it is incorporated into the growing strand and PPi is released.
ATP sulfurylase converts the PPi into ATP. Luciferase then uses this ATP to produce light. The flash of light is recorded by a camera; its intensity is proportional to the number of dXTPs incorporated.
Here is the chemical reaction that takes place to generate light.
Any remaining deoxynucleoside triphosphate (dXTP) and ATP are degraded by apyrase and washed away.
This process is repeated from step 2 until all bases are sequenced.
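The cycle above can be mimicked with a short simulation. This is a simplified sketch: the flow order `TACG` is an assumption, the template is written in the order the polymerase reads it, and the signal is taken to be exactly equal to the number of bases incorporated (no noise). The names `flowgram` and `call_bases` are invented for this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
FLOW_ORDER = "TACG"  # assumed order in which the four dNTPs are flowed


def flowgram(template: str, flow_order: str = FLOW_ORDER) -> list:
    """Return a list of (dNTP flowed, light signal) pairs.
    The signal is the number of bases incorporated during that flow."""
    flows = []
    position = 0
    while position < len(template):
        for nucleotide in flow_order:
            incorporated = 0
            # Keep extending while the flowed dNTP pairs with the template.
            while position < len(template) and COMPLEMENT[template[position]] == nucleotide:
                incorporated += 1
                position += 1
            flows.append((nucleotide, incorporated))
            if position == len(template):
                break
    return flows


def call_bases(flows: list) -> str:
    """Turn the flowgram back into the sequence of the synthesized strand."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = flowgram("AACGT")
print(flows[0])            # ('T', 2): two T's added in one flow, twice the light
print(call_bases(flows))   # TTGCA
```

Notice that a homopolymer such as AA is read in a single flow, so its length has to be inferred from the light intensity alone.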
The sequencing machines in industry that use pyrosequencing include those on Roche's 454 platform.
| |GS Junior+|GS FLX Titanium XL+|GS FLX Titanium XLR70|
|Reads per run|~100,000|~1,000,000 shotgun|~1,000,000 shotgun|
|Read Length|~700 bp|Up to 1,000 bp|Up to 600 bp|
|Mode Read Length|700 bp|700 bp|450 bp|
|Run time|18 hours|23 hours|10 hours|
Some good points of pyrosequencing are its long read lengths and fast run times. However, runs are expensive, and homopolymer errors are frequent because of low sensitivity.
Semiconductor sequencing is another sequencing-by-synthesis method, based on detection of the H+ ions released during the polymerization of DNA. With this technique, Life Technologies released the Personal Genome Machine in 2011 as "a rapid, compact and economical bench top machine."
Emulsion PCR allows enriched beads to be placed in micro-machined wells. Just underneath these microwells are pH sensors that can detect the most minuscule changes in pH.
Remember that pH is just a logarithmic scale that measures the amount of hydrogen ions (H+) in a solution. The lower the pH, the more hydrogen ions there are.
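As a quick refresher, the relationship pH = -log10([H+]) can be written as two small helper functions. The names are invented for this sketch, and the concentrations are in moles per liter.

```python
import math


def ph_from_concentration(hydrogen_molar: float) -> float:
    """pH of a solution given its H+ concentration in mol/L."""
    return -math.log10(hydrogen_molar)


def concentration_from_ph(ph: float) -> float:
    """H+ concentration in mol/L for a given pH."""
    return 10.0 ** (-ph)


print(ph_from_concentration(1e-7))    # 7.0 (neutral water)

# Dropping the pH by one unit means ten times more hydrogen ions.
print(concentration_from_ph(6.0) / concentration_from_ph(7.0))
```

Because the scale is logarithmic, the few H+ ions released by a single incorporation shift the pH only very slightly, which is why the sensors under each well must be so sensitive.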
A particular dNTP is released. If the growing sequence requires that particular dNTP, then an H+ ion and a pyrophosphate group are released.
The signal is picked up by the ISFET (ion-sensitive field-effect transistor) sensor and translated into a base call. Homopolymers (runs of the same base) release several H+ ions in a single flow and so produce a proportionally stronger signal.
Unattached dNTP molecules are washed out, and the cycle repeats with a new dNTP.
| |Ion Torrent Personal Genome Machine|Ion Proton System|
|Bases per run|1 Gb|Up to 10 Gb|
|Read Length|35-400 bp|200 bp|
|Run time|4.5 hours|2-4 hours|
Watch how the Ion Torrent system works.
A more detailed look into the Ion Proton Sequencer.
Bridge PCR is a PCR technique that amplifies DNA attached to a surface, producing clonal clusters. It is used by Illumina's HiSeq platform.
The DNA is fragmented (through sonication or any other method) and adapters are ligated to both ends.
The DNA is then denatured into single-stranded molecules. These fragments are then floated onto a flow cell, which has corresponding adapter sequences that permit binding.
When the DNA strands are placed onto the slide, they attach to their corresponding adapter sequences.
Add dNTPs and DNA polymerase to elongate the DNA strands.
Denature the newly formed DNA strands and repeat until dense clusters of dsDNA are generated in each channel of the flow cell.
The reverse strands are then cleaved and washed away.
After bridge PCR is conducted on the flow cell, Illumina uses fluorescently labeled dNTPs to detect each nucleotide base.
So for example, if a red fluorescent light goes off, then we know it's an A. If a blue light goes off, then we know it's a G, and so forth (colors here aren't accurate but you get the picture).
However, the signal produced by the incorporation of one dNTP on a single strand is not enough to be detected. This is why we need to amplify the DNA sequences, producing a dense cluster of identical sequences in each area of the flow cell.
We'll see how Illumina is able to sequence millions of these dense colonies in parallel.
Instead of bead-based emulsion PCR, Illumina uses bridge PCR, which we saw on the previous page. The sequencing is conducted on a flow cell using sequencing-by-synthesis methods with fluorescent labels. This requires the use of high-resolution optical devices.
The technology was originally developed by Shankar Balasubramanian and David Klenerman at the University of Cambridge. The two founded Solexa in 1998, commercializing their sequencing method. Illumina merged with Solexa in 2007 for $600m, together hoping to "reach and exceed the $100,000 genome." And reach and exceed they did.
Whole genomes are fragmented by nebulization or sonication. The randomly fragmented genomic DNA is then end-repaired by polymerase and exonuclease activity. The 5' ends are phosphorylated, while the 3' ends are adenylated. Size selection occurs through gel electrophoresis and PCR selection.
The DNA is then placed on a flow cell, a silica slide with eight lengthwise lanes. Flow cells are about the size of a microscope slide, and are sealed to minimize contamination and handling errors.
On the flow cell, the fragments are subjected to isothermal bridge amplification, creating clusters of up to 2000 molecules. The duplication of each genomic strand amplifies the signal generated upon sequencing.
Illumina sequencing devices incorporate fluorescent reversible terminators. Each dNTP has a corresponding fluorophore attached to it.
When polymerase elongates the strand with a fluorescently labeled dNTP, the clusters are excited by a light source and the color is recorded by an optical detector. After incorporation, the fluorophore is cleaved off, unblocking the strand so that the next nucleotide can be incorporated in the next cycle. Since each cycle permits the incorporation of only a single dNTP, homopolymers are determined precisely.
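A toy model of these cycles shows why homopolymers are not a problem here. The sketch assumes error-free chemistry and writes the template in the order the polymerase reads it. The dye colors are placeholders only: red for A and blue for G follow the earlier example, while green for C and yellow for T are made up.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Placeholder colors, not the real dyes.
DYE_COLOR = {"A": "red", "G": "blue", "C": "green", "T": "yellow"}


def sequence_by_synthesis(template: str) -> list:
    """Return one (cycle, base, color) observation per sequencing cycle.
    The reversible terminator guarantees exactly one base per cycle."""
    observations = []
    for cycle, template_base in enumerate(template, start=1):
        incorporated = COMPLEMENT[template_base]
        observations.append((cycle, incorporated, DYE_COLOR[incorporated]))
        # In the instrument, the dye and the 3' block are cleaved here,
        # which lets the next cycle add the following base.
    return observations


def read_from(observations: list) -> str:
    """Concatenate the base calls into a read."""
    return "".join(base for _cycle, base, _color in observations)


observations = sequence_by_synthesis("AAACG")
print(len(observations))          # 5 cycles for 5 bases
print(read_from(observations))    # TTTGC
```

The run of three A's in the template takes three separate cycles, each with its own image, so its length is counted rather than estimated from signal strength.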
To effectively lengthen our reads, we may also sequence each fragment starting from the other end. This is helpful for de novo assemblies and for detecting insertions, deletions and other genomic mutations.
Illumina machines generate output in FASTQ format, which stores each base call together with a quality score that encodes the probability of the base call being incorrect.
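Here is a minimal sketch of how one FASTQ record can be decoded. It assumes the common four-line record layout and the Phred+33 quality encoding, where the quality score Q relates to the error probability P by P = 10^(-Q/10). The record itself is made up, and the function names are hypothetical.

```python
def phred_to_error_probability(quality: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-quality / 10)


def parse_fastq_record(lines: list) -> tuple:
    """Parse one four-line FASTQ record into (name, sequence, quality scores).
    Assumes Phred+33 encoding: score = ASCII code of the character minus 33."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(character) - 33 for character in quality]
    return header[1:], sequence, scores


record = ["@read_1", "GATTACA", "+", "IIII5I#"]  # made-up example read
name, sequence, scores = parse_fastq_record(record)

print(scores)                              # [40, 40, 40, 40, 20, 40, 2]
for base, score in zip(sequence, scores):
    print(base, score, phred_to_error_probability(score))
```

A score of 40 means a 1 in 10,000 chance that the call is wrong, while a score of 20 means a 1 in 100 chance.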
Two videos outlining the overview of Illumina's Sequencing technology.