Pairwise alignment is the process of aligning two DNA, RNA or protein sequences such that the regions of similarity are maximized. This is often performed to find functional, structural or evolutionary commonalities.
In most cases, scientists use two protein sequences to quantitatively find relatedness (aka homology). With this, they are able to identify common domains and motifs, and sequence ancestry.
Domains are parts of a DNA or amino acid strand that code for a physiochemically similar feature as found in other sequences and proteins. Domains refer to specific functionalities. For example, you could have a ATP-binding domain or polar domain.
Motifs are similar, but reference the structural characteristics rather than functional regions. Motifs are often found in domains, although that's not always the case.
Protein amino acid sequences are preferred over DNA sequences for a list of reasons.
However, there are some obvious instances when DNA alignments are needed.
Before we move on, let's take a quick review on some elementary biochemistry and notations.
We're all familiar with the four nucleotide bases - however, there are other symbols used for more ambiguous nucleotides.
|R||A or G||puRine|
|Y||C or T||pYrimidine|
|M||A or C||aMino|
|K||G or T||Keto|
|S||C or G||Strong interaction (3 bonds)|
|W||A or T||Weak interaction (2 bonds)|
|H||A, C or T (not G)||H is after G|
|B||C, G, or T (not A)||B is after A|
|V||A, C or G (not T)||V is after T and U|
|D||A, G or T (not C)||D is after C|
|N||A, C, G or T||aNything|
Amino acids can be represented with one or three letters. Take some time to review these.
|ZA||GlnA||Glutamic acid or glutamine|
A good tip to memorizing these is to play the amino acids license plate game! Keep a printout of the following table. When you and your cool friends are out for a drive, try to translate each license plate letter into amino acids. Sounds nerdy, but very effective in learning. Bonus points for knowing the properties and/or structures!
There are several ways to group amino acids, depending on their functionalities and biochemical properties.
With nonpolar (hydrophobic) side chains: alanine, valine, leucine, isoleucine, proline, methionine, phenylaline, tryptophan
With uncharged polar side chains: tyrosine, asparagine, glutamine, glycine, serine, threnine, cystein
With positively charged side chains: histidine, lysine, arginine
With negatively charged side chains: aspartic acid, glutamic acid
Richard Owen, an English biologist who lived from 1804 to 1892, introduced the term homology, stating that it is "the same organ in different animals under every variety of form and function."
Today, we use the term homology is used to characterize biological species that share a common evolutionary ancestory.
Note that homology is a binary qualitative measure - either two sequences are homologous or not. Homology cannot be quantified. Thus, it is incorrect to claim that "two sequences are 55% homologous" - we use percent identity or similarity to quote numbers, which we will talk about in the next lesson.
There are two subclasses of homology - orthologous and paralogous.
Paralogous is when gene duplication occurs, but both copies descend side-by-side during the history of the organism (para = in parallel). This phenomenon occurs within the species. For example, human alpha-1 globin is paralogous to alpha-2 globin because they resulted from a gene duplication that arose from a single organism. Paralogous genes are assumed to carry common functions.
When speciation occurs, and a gene is inherited in both species, these sequences are said to be orthologous (ortho = exact). For example, the human and rat myoglobins are orthologous - the sequence that codes for this protein comes from a common ancestral gene.
In simplified terms, orthology is the homology between species, while parology is the homology within species. If you see the figure below from Jensen et al., the ancestral gene has two copies of a particular gene (A and B). Relative to the ancestral genome, this is considered paralogs since they are within its own species.
However, once speciation occurs, two copies of genes A and B are created. Since the duplication of A and B genes occurred before speciation, A and B genes are still considered paralogs. However, genes A1 and A2 are orthologs, as are B1 and B2.
In the second case, gene duplication occurs after speciation. Thus, A2 and B2 are orthologs of A1, while A2 and B2 are paralogs of each other.
To assess the similarity between two proteins, we first perform pairwise alignments. Pairwise alignment algorithms find the optimal alignment between two sequences including gaps. There are several algorithms that perform this including BLAST, FASTA and LALIGN.
After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.
Identity defines the percentage of amino acids (or nucleotides) with a direct match in the alignment.
What about some residues that aren't quite exact, but very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity quantifier.
When one amino acid is mutated to a similar residue such that the physiochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. This is far more likely to be acceptable since the two residues are similar in property and won't compromise the translated protein.
Thus, percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements are dependent on the criteria of how two amino acid residues are to each other.
Keep in mind that on a BLAST search, similarity is also known as Positives!
In a BLAST search, part of your results will come out like so:
...ARFSGTWYAMAK...: .||||.:| ...QKVAGTWYSLAM...
From this diagram, we can see periods (.), colons (:) and a vertical pipe (|). The periods mean the residues are somewhat similar, while colon mean they are very similar. A vertical pipe signifies a direct match.
Another notations commonly encountered is using a + sign instead of :, and letter for the matching residue instead of |. For example.
Let's look at some a quick example to see how identity and similarity are calculated.
Say Sequence A has 320 AA, while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical.% identity is
We always use the smaller sequence length as the denominator.
Additionally, we see that there are 23 amino acids that are different by conservation substitution, meaning that their chemical properties are maintained.
To calculate similarity (a.k.a. positives),
Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is always greater than identity. Can you see why?
BLAST (Basic Local Alignment Search Tool) serves two purposes:
Without diving into too much details about BLAST (which we will cover in a later series), let's perform a simple query to get a feel for how to use it.
There are several types of BLAST that depend on what your query sequence is (DNA or protein) and what you want to match it to. For this run, let's stick with blastp, in which you enter a protein sequence and it matches to a similar protein sequence from a database.
You should get a window that looks like this:
We can search entire databases with a query. The query can be inputted with an accession number, gi (think of these as ID's for a specific protein sequence) or FASTA format.
FASTA format simply has the first line beginning with a
> that describes the sequence. Any following lines are the protein sequence itself. For example:
>gi|293651548|ref|NP_001170841.1| nuclear factor NF-kappa-B p100 subunit isoform b [Mus musculus] MDNCYDPGLDGIPEYDDFEFSPSIVEPKDPAPETADGPYLVIVEQPKQRGFRFRYGCEGPSHGGLPGASS EKGRKTYPTVKICNYEGPAKIEVDLVTHSDPPRAHAHSLVGKQCSELGVCAVSVGPKDMTAQFNNLGVLH VTKKNMMEIMIQKLQRQRLRSKPQGLTEAERRELEQEAKELKKVMDLSIVRLRFSAFLRASDGSFSLPLK PVISQPIHDSKSPGASNLKISRMDKTAGSVRGGDEVYLLCDKVQKDDIEVRFYEDDENGWQAFGDFSPTD VHKQYAIVFRTPPYHKMKIERPVTVFLQLKRKRGGDVSDSKQFTYYPLVEDKEEVQRKRRKALPTFSQPF GGGSHMGGGSGGSAGGYGGAGGGGSLGFFSSSLAYNPYQSGAAPMGCYPGGGGGAQMAGSRRDTDAGEGA EEPRTPPEAPQGEPQALDTLQRAREYNARLFGLAQRSARALLDYGVTADARALLAGQRHLLMAQDENGDT PLHLAIIHGQTGVIEQIAHVIYHAQYLGVINLTNHLHQTPLHLAVITGQTRVVSFLLQVGADPTLLDRHG DSALHLALRAGAAAPELLQALLRSGAHAVPQILHMPDFEGLYPVHLAVHARSPECLDLLVDCGAEVEAPE RQGGRTALHLATEMEELGLVTHLVTKLHANVNARTFAGNTPLHLAAGLGSPTLTRLLLKAGADIHAENEE PLCPLPSPSTSGSDSDSEGPERDTQRNFRGHTPLDLTCSTKVKTLLLNAAQNTTEPPLAPPSPAGPGLSL GDAALQNLEQLLDGPEAQGSWAELAERLGLRSLVDTYRKTPSPSGSLLRSYKLAGGDLVGLLEALSDMGL HEGVRLLKGPETRDKLPSTAEVKEDSAYGSQSVEQEAEKLCPPPEPPGGLCHGHPQPQVH
Try inputting the above FASTA sequence.
You'll notice that there are different types of BLAST you can perform - PSI-BLAST, PHI-BLAST and DELTA-BLAST. We'll cover these advanced BLAST variations in a later lesson.
There is also another window down at the bottom for Algorithm parameters, where you can fiddle with the scoring matrix, different gap penalties and more. But for now, click the big BLAST button to run your sequence!
After waiting for your query to be processed...Great! You just ran a BLAST search! Looks like you just found yourself the human ortholog of a mouse protein.
Scroll down to the bottom to the Descriptions panel, and you can see all the matches that are similar to your query.
You can scroll further down to see the actual alignments with the Identities and Similarities (called Positives) scores next to them.
The other use of BLAST is for pairwise comparisons. This means you aren't querying a database, but just inputting two sequences and seeing how well they match up. To switch to pairwise comparison mode, click the "Align two or more sequences" option.
For the two sequences here, let's use gi|293651548 and gi|158256336.
Click the big BLAST button once again and wait for your query to be processed. Then scroll down and check your results.
In the Descriptions section there is just one alignment...but why are there multiple in the Alignments section? This is simply because there are several ways that BLAST can align your sequences. The top-scoring alignments are found on the top, while lower-scoring ones are at the bottom. For the most part, you'll want to look at the top-most result.
Wondering how the scoring system goes? We'll see that in the next few pages!
We can see that BLAST is able to align two sequences - but how does it pick the best sequence? To answer this, we have to look at scoring matrices, which assign a score to each gap or residue alignment.
If we had the following alignment, what score would it have? How would it assign residues are similar vs. identical?
...ARFSGTWYAMAK... : .||||.:| ...QKVAGTWYSLAM...
One key point to notice is that the substitution for one amino acid can be more physiochemically accepted than another. For example, arginine mutating into lysine isn't that bad since both have electrically charged side chains. However, if arginine mutated to glutamic acid, the charge would be changed from +1 to -1. Such a drastic change may render the protein useless!
We also must account that some amino acids are more commonly available from DNA. For example, serine has four different codon sequences, while tryptophan only has one, making it statistically more probably for serines to show up.
There are two amino acid substitution matrices that help score alignments. The first is the PAM substitution matrix, which is based on the rate of divergence between species. BLOSUM, an alternative to PAM, is based on the conservation of domains in proteins.
Let's first discuss PAMs.
In 1978, Dayhoff and her group came up with the Accepted Point Mutations (pronounced PAM since it's easier to pronounce than APM). In short, a PAM is the muation of an amino acid residue that is accepted by natural selection.
To see which amino acids are accepted in protein evolution, Dayhoff et al. examined 1572 changes in 71 groups of closely related proteins (shared at least 85% identity), and observed all amino acid substitutions. Thus, this experiment was based solely on observation among closely related species.
The Dayhoff group calculated the relative mutability (Rij) per amino acid.
Here, Mij is the probability of residue j changing to i in a given evolutionary interval. The denominator fi represents the frequency of residue i occurring by chance.
Thus, the mutability index is an odds ratio, which is an indicator of how much more authentic the mutation is than it occuring by chance.
With this, Dayhoff et al., generated a table listing the relative mutabilities of amino acids, normalized to integers:
They found that Asn, Ser, and Asp were most likely to mutate, while Leu, Cys and Trp were least likely.
Dayhoff also tallied up the frequencies of AA's across all proteins. If all amino acids were equally probable in protein sequences, the would all have frequencies of 1.00/20 = 0.05, but they don't.
As you can see, mutation rates vary across the amino acids, and Dayhoff found an empirical way to normalize for variations in AA composition and mutation rate. From Dayhoff's experiment, we can see the characteristics of mutable and less mutable residues.
More mutable residues:
Less mutable residues:
With these results, Dayhoff built a 20 x 20 mutation probability matrix that tallied a score per each amino acid change. This was known as the PAM1 matrix, which showed the probabilities of proteins undergoing 1% change (1 accepted point mutation per 100 amino acid residues). The PAM1 matrix is based on the alignment of protein sequences that shared at least 85% identity.
One important assumption made in the generation of PAM matrices are that each change in amino acid is assumed to be independent of previous mutational events at that site. This type of assumption shows that PAM is a Markov model.
So why is this assumption important? From this, we can generate more PAM matrices to be used for sequences that are separated by longer periods of evolutionary history. For example, a PAM250 matrix is simply a PAM1 matrix multipled by itself 250 times. This matrix applies to alignments that share about 20% amino acid identity, which represents about 2500 million years of evolution.
When BLAST scores an alignment, it doesn't use the probability matrix as seen above. It converts the matrix elements into integer numbers and produces a log-odds scoring matrix.
Here, si,j is score for aligning any two residues. Simply put, if the value is high, that means it aligns well. The qi,j is the amount a certain residue i would mutate to residue j. The denominator is the probability of the residue mutation occurring by chance.
Let's take the residue M (methionine) and calculate its mutation to L (leucine). Both of amino acids have a hydrophobic side chain, so they should align well - we expect a positive score. We will use the PAM250 Mutation Matrix
This is the value found in the log-odds matrix for PAM250. With this matrix, we can now score our alignments! But wait - what about gaps? Let's see how to handle those in the next page!
We have already discussed amino acid substitutions, but what about the other two types mutations, insertions and deletions? Together, these insertions and deletions are known as gaps, and its scoring system varies.
Handling gap penalties vary depending on what you're looking for. If you are searching for similar domains, you shouldn't penalize much for any gaps that occur at the ends of your sequences (local alignment). However, if you need a more exact match (global alignment), then you would probably want any sequences that have long end gaps.
Gap penalties may be broken down into two parts:
An opening gap penalty is applied at the appearance of any gaps. For gaps that are longer than one residue, we can apply an extension gap penalty, in which penalization occurs for the addition of each residue length.
With this, there are several methodologies we may apply to a gap.
This is when there are no opening gap penalties, and a fixed negative score is given to every gap.
AB--C----DEF AB--C----DEF ABGHCFGHIDEF
Here, we'd have a score of -2n, where n is the score we take for each gap.
This depends on the gap length, so a penalty score is assigned for every gap residue.
AB--C----DEF AB--C----DEF ABGHCFGHIDEF
Here, we'd have a score of -6l, where l is the score we take for residue gap.
The affine type takes into account the gap opening penalty, as well as each length. This means that on top of a linear penalty type, there is another penalty score added that stands for gap opening.
AB--C----DEF AB--C----DEF ABGHCFGHIDEF
Here, we would have two opening gap penalties of -2n and residue gaps of -6l where n is the penalty per opening gap and l is the penalty per residue in each gap.
BLAST uses the affine type by default. The opening gap penalty is -11, while each additional residue gap is -1. You may change these settings in the "Algorithm parameters" section, just below the BLAST button.
One thing to notice here - what is the default scoring matrix set on? It's not any PAM matrix, but it's rather BLOSUM62. Let's see how BLOSUMs are constructed, and the difference between them and PAMs in the next lesson.
An alternative to the PAM matrix is BLOSUM (BLocks SUbstitution Matrix), which was derived by Henikoff and Henikoff in 1992. NCBI uses BLOSUM62 as its the default matrix for protein BLAST.
BLOSUM matrices are derived from comparisons of blocks of sequences from the Blocks database.
A block is an ungapped multiple alignments of highly conserved, short regions. Here is what a sample block looks like:
The blocks database contains multiple alignments of conserved regions in protein families.
The Henikoffs developed a database of "blocks" based on sequences with shared motifs. More than 2,000 blocks of aligned sequence segments were analyzed from more than 500 groups of related proteins. Within each block, they counted the relative frequencies of amino acids and their substitution probabilities
The Henikoffs used blocks due to several reasons:
A BLOSUM tells us the likelihood of occurrence of each pairwise substitution, and we can use these values to score a pairwise comparison.
Each scoring matrix is constructed based on how identical the ungapped multiple sequence alignments are. For example, BLOSUM62 is derived from blocks containing at most 62% identity in the ungapped sequence aligments.
Here we'll show you how to calculate a BLOSUM.
Before we start constructing a matrix BLOSUM r, we have to eliminate the sequences that are more than r% identical. This solves us from the bias we get from databases over-representing certain classes of proteins. To do this, we have two options:
ACD DCE DCE DCE BCE BCD ACB
Since most databases today have an over-representation of proteins, the extraneous DCE sequences should be eliminated in order to make our database more representative.
Thus, after elminating redundancies, we look at the first vertical column in our block:
A D B B A
Let's find out how many possible pairwise combinations we can see for each possible pair.
For the AA pair, we have 2 possible combinations, for AB or BA we have 4. For AD we have 2. We continue these calculations until the occurrence of all possible pairs are found.
|Pair||Column 1 score||Column 2 score||Column 3 score||Total|
|AB or BA||4||0||0||4|
|AD or DA||2||0||0||2|
|BD or DB||2||0||2||4|
|DE or ED||0||0||4||4|
Note that the total sum is 26, which we can use to normalize our matrix.
To obtain integer values for our scoring matrix, we need to find the score per cell. We can do this with the following equation:
Where qij is observed frequency and eij is the expected frequency.
cij is the cell value as calculated above.
To calculate the Total T:
Where w is the number of columns and n is the number of sequences. With T, we can calculate qij, which is the rate of change of residue i to residue j.
In our case, T = 30, so let's divide all our cells by 30.
Now pi can be found with the following equation:
The expected frequencies:
Notice how I didn't calculate cell values that had a value of 0 - you'll see that we don't need these values in the actual scoring matrix.
Now we have all we need! Just plug in values from the two matrices above into the equation below to obtain our scoring matrix.
To obtain scores, we multiple sij by two and round.
As we have seen, there are two different log-odds scoring matrices available. When performing pairwise sequence alignments, you must specify the exact matrix to use - this depends on the suspected degree of identity between the query and its matches.
Let's do a quick summary of these two matrices and see how they differ.
PAM (Accepted Point Mutations) matrices are:
Some limitations of PAM matrices include the somewhat inaccuracy of the assumption model and each position being context dependent.
Limitations are that you are limited to a subset of conserved domains.
The purpose of global alignment (aka optimal matching algorithm) is to align two sequences from start to end, and make as many matches as possible. Algorithms do this by inserting gaps within the letters of each sequence to maximize the number of matches. This technique is often used when comparing sequences of similar length.
The algorithm for global alignment was first published by Saul Needleman and Christian Wunsch (1970), then modified over the next few years by other scientists. It aligns sequences from beginning to end and finds the best alignment that maximizes the overall score. The score is calculated based on matches, mismatches, and gaps.
The algorithm is based on an algorithmic technique known as dynamic programming, where the optimal alignment is reduced into a series of smaller problem. The solution to these smaller problems are then used to calculate the solution to the larger (or overall) problem.
The best way to go through the algorithm is to show an example, so let's compare two sequences - IDENTITY and SIMILARITY.
The first step is to set up a 2d matrix with the first sequence on the y-axis, and the second sequence on the x-axis.
Now we need to fill this grid with numerical values, which we will later use to score the matches. Let's come up with a formula to find the score per cell, which we can do with pre-determined values.
The maximum of these values will be taken and placed as the current cell's score.
Start with filling a 0 at the top-left corner. Each cell along the row will be -2 since the only value it looks at is the cell to the left. Each cell along the first column will be -2 from its top, since that's the only value it has to look.
After coloring each matching letter green, you will have a table that looks like this:
We can see that for green cells (match), we take the maximum value of the left, top-left (diagonal), or top, and +1 to it. For any mismatches or gap insertions, we -2 from the largest value of the left, top-left (diagonal) and top cells.
A completed matrix gives the final score of the alignment on the bottom-right most cell. In this case, it's -7.
Before we find trace the optimal path, let's go over some conventions.
Our purpose is to find the optimal path from the top left going to the bottom right. A diagonal line denotes that the letter is aligned (can be match or mismatch). A perfect match will have a diagonal going from top left to bottom right. Mismatches can still generate this diagonal, but their scores will be different.
Horizontal and vertical lines denote the presence of a gap in the first and second sequence, respectively. Gaps may be of any length, and may occur within (internal) or at the ends (terminal) of sequences.
For our example, we start from the bottom-right cell, then move our way up to the top-left.
With this, we get the alignment:
There are other alignments that give the same score. Can you trace them out?
Depending on what kind of alignment you're looking for, you can vary the parameters. For example, some sequence alignment algorithms vary gap penalties in two subcategories. An opening gap would be given a more severe penalty than each continuing gap. In our example case, all gaps were treated equally with a penalty of -2.
Adding a harsher mismatch penalty can force more gaps to be created. However, this is non-ideal as it's good to have near-identical sequences with few deletions.
Additionally, global alignments usually use an end gap penalty. This is the distinguishing factor between global and local alignment, which we'll see next.
Local alignments like global alignments, but they generate "islands" of areas that have the greatest similarity. This is helpful when the query and sequence are dissimilar, but are suspected to contain domains or small regions of similarity. The BLAST algorithm uses local alignment.
Local alignments differ from global alignments in a few ways:
The Smith-Waterman algorithm is very much like the Needleman-Wunsch algorithm used in global alignments - the hallmark difference is in the scoring methodology.
Unlike global alignment, local alignments have no end gap penalties, allowing small interior alignments to rank higher when scored.
Let's take a quick look at the effects of end gap penalties. The following sequence is aligned globally, with high end gap penalties.
M - N A L S D R T M G S D R T T E T 6 -12 1 0 -3 1 0 -1 3 = -5
Now in this next sequence, we have a local alignment. Notice how the small region in the middle aligns quite nicely.
M N A L S D R T - - - - - M G S D R T T E T 0 0 -1 -4 2 4 6 3 0 0 0 = 10
Without the end gap penalty, the Smith-Waterman alignment algorithm is able to find the best locally matching sequence.
Let's compare two sequences - CGTTCTA and AACGTTGG.
Set up a 2d matrix, as we did earlier in the Needleman-Wunsch example.
We need separate scores for matches, mismatches and gaps.
Any cell that would have a negative value are given0instead.
We want to start with the first row and column and gives those a value of0. Then we want to mark the cells that indicates a match.
Now we fill the rest of our table out. Make sure to keep track of where each cell value came from, as we need this to trace back our optimal alignment.
Note that a mismatch or a match can only come from the cell diagonally up to the left of the current cell. Additionally, gaps may only come from the top or left of the current cell.
Now all we need to do is retrace our steps. First, find the cell with the highest score.
Now we trace back until we get to a cell with0. Thus, our optimal local alignment becomes:
Thus, we may say that for global alignments, where the sequences are connected along the entire length of their sequences, there is a higher % identity with many small interior gaps. For local alignments, which focus on the best matching regions, there is a lower % identity, but fewer interior gaps and longer end gaps.