Pairwise Alignment Introduction
What is Pairwise Alignment?
Pairwise alignment is the process of aligning two DNA, RNA or protein sequences such that the regions of similarity are maximized. This is often performed to find functional, structural or evolutionary commonalities.
In most cases, scientists use two protein sequences to quantitatively find relatedness (aka homology). With this, they are able to identify common domains and motifs, and sequence ancestry.
Domains and Motifs
Domains are parts of a DNA or amino acid strand that code for a physiochemically similar feature as found in other sequences and proteins. Domains refer to specific functionalities. For example, you could have a ATP-binding domain or polar domain.
Motifs are similar, but reference the structural characteristics rather than functional regions. Motifs are often found in domains, although that's not always the case.
Protein vs. DNA sequence alignment
Protein amino acid sequences are preferred over DNA sequences for a list of reasons.
- Protein residues are more informative - a change in DNA (especially the 3rd position) does not necessarily change the AA.
- The larger number of amino acids than nucleic acids makes it easier to find significance.
- Some amino acids share related biochemical properties, which can be accounted for when scoring multiple pairwise alignments.
- Protein sequence comparisons can link back to over a billion years ago, whereas DNA sequence comparisons can only go back up to 600 mya. Thus, protein sequences are far better for evolutionary studies.
However, there are some obvious instances when DNA alignments are needed.
- When confirming the identity of cDNA (forensic sequencing).
- When studying noncoding regions of DNA. These regions evolve at a faster rate than coding DNA, while mitochondrial noncoding DNA evolves even faster.
- When studying DNA mutations.
- When researching on very similar organisms such as Neanderthals and modern humans.
Biochemistry 101 Review
Before we move on, let's take a quick review on some elementary biochemistry and notations.
Nucleotide Codes
We're all familiar with the four nucleotide bases - however, there are other symbols used for more ambiguous nucleotides.
Symbol | Meaning | Explanation |
---|---|---|
A | A | Adenine |
C | C | Cytosine |
G | G | Guanine |
T | T | Thymine |
R | A or G | puRine |
Y | C or T | pYrimidine |
M | A or C | aMino |
K | G or T | Keto |
S | C or G | Strong interaction (3 bonds) |
W | A or T | Weak interaction (2 bonds) |
H | A, C or T (not G) | H is after G |
B | C, G, or T (not A) | B is after A |
V | A, C or G (not T) | V is after T and U |
D | A, G or T (not C) | D is after C |
N | A, C, G or T | aNything |
Amino Acid Residue Codes
Amino acids can be represented with one or three letters. Take some time to review these.
1-letterA | 3-lettersA | Amino Acid |
---|---|---|
AA | AlaA | Alanine |
CA | CysA | Cysteine |
DA | AspA | Aspartic Acid |
EA | GluA | Glutamic Acid |
FA | PheA | Phenylaline |
GA | GlyA | Glycine |
HA | HisA | Histidine |
IA | IleA | Isoleucine |
KA | LysA | Lysine |
LA | LeuA | Leucine |
MA | MetA | Methionine |
NA | AsnA | Asparagine |
OA | PylA | Pyrrolysine |
PA | ProA | Proline |
QA | GlnA | Glutamine |
RA | ArgA | Arginine |
SA | SerA | Serine |
TA | ThrA | Threonine |
UA | SecA | Selenocysteine |
VA | ValA | Valine |
WA | TrpA | Tryptophan |
XA | XaaA | Undetermined |
YA | TyrA | Tyrosine |
ZA | GlnA | Glutamic acid or glutamine |
Amino Acids license plate game
A good tip to memorizing these is to play the amino acids license plate game! Keep a printout of the following table. When you and your cool friends are out for a drive, try to translate each license plate letter into amino acids. Sounds nerdy, but very effective in learning. Bonus points for knowing the properties and/or structures!
Amino acids grouping
There are several ways to group amino acids, depending on their functionalities and biochemical properties.
With nonpolar (hydrophobic) side chains: alanine, valine, leucine, isoleucine, proline, methionine, phenylaline, tryptophan
With uncharged polar side chains: tyrosine, asparagine, glutamine, glycine, serine, threnine, cystein
With positively charged side chains: histidine, lysine, arginine
With negatively charged side chains: aspartic acid, glutamic acid
Homology - a qualitative measure
Richard Owen, an English biologist who lived from 1804 to 1892, introduced the term homology, stating that it is "the same organ in different animals under every variety of form and function."
Today, we use the term homology is used to characterize biological species that share a common evolutionary ancestory.
Strictly a qualitative measure!
Note that homology is a binary qualitative measure - either two sequences are homologous or not. Homology cannot be quantified. Thus, it is incorrect to claim that "two sequences are 55% homologous" - we use percent identity or similarity to quote numbers, which we will talk about in the next lesson.
Orthologous vs. Paralogous
There are two subclasses of homology - orthologous and paralogous.
Paralogous is when gene duplication occurs, but both copies descend side-by-side during the history of the organism (para = in parallel). This phenomenon occurs within the species. For example, human alpha-1 globin is paralogous to alpha-2 globin because they resulted from a gene duplication that arose from a single organism. Paralogous genes are assumed to carry common functions.
When speciation occurs, and a gene is inherited in both species, these sequences are said to be orthologous (ortho = exact). For example, the human and rat myoglobins are orthologous - the sequence that codes for this protein comes from a common ancestral gene.
Further explanation
In simplified terms, orthology is the homology between species, while parology is the homology within species. If you see the figure below from Jensen et al., the ancestral gene has two copies of a particular gene (A and B). Relative to the ancestral genome, this is considered paralogs since they are within its own species.

However, once speciation occurs, two copies of genes A and B are created. Since the duplication of A and B genes occurred before speciation, A and B genes are still considered paralogs. However, genes A1 and A2 are orthologs, as are B1 and B2.
In the second case, gene duplication occurs after speciation. Thus, A2 and B2 are orthologs of A1, while A2 and B2 are paralogs of each other.
References
R.A. Jensen, Orthologs and paralogs - we need to get it right, Genome Biology, 2(8), 2001Richard Owen Biography. UC Berkeley.Identity and Similarity - a quantitative measure
To assess the similarity between two proteins, we first perform pairwise alignments. Pairwise alignment algorithms find the optimal alignment between two sequences including gaps. There are several algorithms that perform this including BLAST, FASTA and LALIGN.
After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.
Identity
Identity defines the percentage of amino acids (or nucleotides) with a direct match in the alignment.
What about some residues that aren't quite exact, but very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity quantifier.
Similarity (aka Positives)
When one amino acid is mutated to a similar residue such that the physiochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. This is far more likely to be acceptable since the two residues are similar in property and won't compromise the translated protein.
Thus, percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements are dependent on the criteria of how two amino acid residues are to each other.
Similarity = Positives
Keep in mind that on a BLAST search, similarity is also known as Positives!
What it looks like in BLAST
In a BLAST search, part of your results will come out like so:
...ARFSGTWYAMAK...: .||||.:|
...QKVAGTWYSLAM...
From this diagram, we can see periods (.), colons (:) and a vertical pipe (|). The periods mean the residues are somewhat similar, while colon mean they are very similar. A vertical pipe signifies a direct match.
Another notations commonly encountered is using a + sign instead of :, and letter for the matching residue instead of |. For example.
Calculations
Let's look at some a quick example to see how identity and similarity are calculated.
Say Sequence A has 320 AA, while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical.% identity is
We always use the smaller sequence length as the denominator.
Additionally, we see that there are 23 amino acids that are different by conservation substitution, meaning that their chemical properties are maintained.
To calculate similarity (a.k.a. positives),
Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is always greater than identity. Can you see why?
A quick look at BLAST
BLAST (Basic Local Alignment Search Tool) serves two purposes:
- Align two sequences and look for homology
- Search a sequence in a database to find similar and related sequences.
Without diving into too much details about BLAST (which we will cover in a later series), let's perform a simple query to get a feel for how to use it.
There are several types of BLAST that depend on what your query sequence is (DNA or protein) and what you want to match it to. For this run, let's stick with blastp, in which you enter a protein sequence and it matches to a similar protein sequence from a database.
Using BLAST
The first thing to do is to go to the NCBI page for BLAST. From here, click protein blast (blastp), which is located under "basic BLAST."
You should get a window that looks like this:

1) Running a query against a database
We can search entire databases with a query. The query can be inputted with an accession number, gi (think of these as ID's for a specific protein sequence) or FASTA format.
What is FASTA?
FASTA format simply has the first line beginning with a >
that describes the sequence. Any following lines are the protein sequence itself. For example:
>gi|293651548|ref|NP_001170841.1| nuclear factor NF-kappa-B p100 subunit isoform b [Mus musculus]
MDNCYDPGLDGIPEYDDFEFSPSIVEPKDPAPETADGPYLVIVEQPKQRGFRFRYGCEGPSHGGLPGASS
EKGRKTYPTVKICNYEGPAKIEVDLVTHSDPPRAHAHSLVGKQCSELGVCAVSVGPKDMTAQFNNLGVLH
VTKKNMMEIMIQKLQRQRLRSKPQGLTEAERRELEQEAKELKKVMDLSIVRLRFSAFLRASDGSFSLPLK
PVISQPIHDSKSPGASNLKISRMDKTAGSVRGGDEVYLLCDKVQKDDIEVRFYEDDENGWQAFGDFSPTD
VHKQYAIVFRTPPYHKMKIERPVTVFLQLKRKRGGDVSDSKQFTYYPLVEDKEEVQRKRRKALPTFSQPF
GGGSHMGGGSGGSAGGYGGAGGGGSLGFFSSSLAYNPYQSGAAPMGCYPGGGGGAQMAGSRRDTDAGEGA
EEPRTPPEAPQGEPQALDTLQRAREYNARLFGLAQRSARALLDYGVTADARALLAGQRHLLMAQDENGDT
PLHLAIIHGQTGVIEQIAHVIYHAQYLGVINLTNHLHQTPLHLAVITGQTRVVSFLLQVGADPTLLDRHG
DSALHLALRAGAAAPELLQALLRSGAHAVPQILHMPDFEGLYPVHLAVHARSPECLDLLVDCGAEVEAPE
RQGGRTALHLATEMEELGLVTHLVTKLHANVNARTFAGNTPLHLAAGLGSPTLTRLLLKAGADIHAENEE
PLCPLPSPSTSGSDSDSEGPERDTQRNFRGHTPLDLTCSTKVKTLLLNAAQNTTEPPLAPPSPAGPGLSL
GDAALQNLEQLLDGPEAQGSWAELAERLGLRSLVDTYRKTPSPSGSLLRSYKLAGGDLVGLLEALSDMGL
HEGVRLLKGPETRDKLPSTAEVKEDSAYGSQSVEQEAEKLCPPPEPPGGLCHGHPQPQVH
Try inputting the above FASTA sequence.
- Do not check the "Align two or more sequences" options.
- Select the "non-redundant protein sequences (nr)" for the database.
- For the organism name, use "human (taxid:9606)".
You'll notice that there are different types of BLAST you can perform - PSI-BLAST, PHI-BLAST and DELTA-BLAST. We'll cover these advanced BLAST variations in a later lesson.
There is also another window down at the bottom for Algorithm parameters, where you can fiddle with the scoring matrix, different gap penalties and more. But for now, click the big BLAST button to run your sequence!

After waiting for your query to be processed...Great! You just ran a BLAST search! Looks like you just found yourself the human ortholog of a mouse protein.
Scroll down to the bottom to the Descriptions panel, and you can see all the matches that are similar to your query.

You can scroll further down to see the actual alignments with the Identities and Similarities (called Positives) scores next to them.

2) Running a pairwise comparison
The other use of BLAST is for pairwise comparisons. This means you aren't querying a database, but just inputting two sequences and seeing how well they match up. To switch to pairwise comparison mode, click the "Align two or more sequences" option.
For the two sequences here, let's use gi|293651548 and gi|158256336.

Click the big BLAST button once again and wait for your query to be processed. Then scroll down and check your results.
In the Descriptions section there is just one alignment...but why are there multiple in the Alignments section? This is simply because there are several ways that BLAST can align your sequences. The top-scoring alignments are found on the top, while lower-scoring ones are at the bottom. For the most part, you'll want to look at the top-most result.

Wondering how the scoring system goes? We'll see that in the next few pages!
Dayhoff Model & accepted point mutations (PAMs)
We can see that BLAST is able to align two sequences - but how does it pick the best sequence? To answer this, we have to look at scoring matrices, which assign a score to each gap or residue alignment.
If we had the following alignment, what score would it have? How would it assign residues are similar vs. identical?
...ARFSGTWYAMAK...
: .||||.:|
...QKVAGTWYSLAM...
Considerations for a scoring matrix
One key point to notice is that the substitution for one amino acid can be more physiochemically accepted than another. For example, arginine mutating into lysine isn't that bad since both have electrically charged side chains. However, if arginine mutated to glutamic acid, the charge would be changed from +1 to -1. Such a drastic change may render the protein useless!
We also must account that some amino acids are more commonly available from DNA. For example, serine has four different codon sequences, while tryptophan only has one, making it statistically more probably for serines to show up.
There are two amino acid substitution matrices that help score alignments. The first is the PAM substitution matrix, which is based on the rate of divergence between species. BLOSUM, an alternative to PAM, is based on the conservation of domains in proteins.
Let's first discuss PAMs.
Accepted Point Mutations (PAM)
In 1978, Dayhoff and her group came up with the Accepted Point Mutations (pronounced PAM since it's easier to pronounce than APM). In short, a PAM is the muation of an amino acid residue that is accepted by natural selection.
To see which amino acids are accepted in protein evolution, Dayhoff et al. examined 1572 changes in 71 groups of closely related proteins (shared at least 85% identity), and observed all amino acid substitutions. Thus, this experiment was based solely on observation among closely related species.
Relative Mutability
The Dayhoff group calculated the relative mutability (Rij) per amino acid.
Here, Mij is the probability of residue j changing to i in a given evolutionary interval. The denominator fi represents the frequency of residue i occurring by chance.
Thus, the mutability index is an odds ratio, which is an indicator of how much more authentic the mutation is than it occuring by chance.
With this, Dayhoff et al., generated a table listing the relative mutabilities of amino acids, normalized to integers:
- Asn
- 134
- Ser
- 120
- Asp
- 106
- ...
- ...
- Leu
- 40
- Cys
- 20
- Trp
- 18
They found that Asn, Ser, and Asp were most likely to mutate, while Leu, Cys and Trp were least likely.
Normalized frequencies of AA
Dayhoff also tallied up the frequencies of AA's across all proteins. If all amino acids were equally probable in protein sequences, the would all have frequencies of 1.00/20 = 0.05, but they don't.
- Gly
- 0.089
- Ala
- 0.087
- Leu
- 0.085
- ...
- ...
- Tyr
- 0.030
- Met
- 0.015
- Trp
- 0.010
Mutation rates vary
As you can see, mutation rates vary across the amino acids, and Dayhoff found an empirical way to normalize for variations in AA composition and mutation rate. From Dayhoff's experiment, we can see the characteristics of mutable and less mutable residues.
More mutable residues:
- are much easier to replace. Asparagine is part of the uncharged polar amino acids group, which contains six other amino acids with similar properties
Less mutable residues:
- serve important functions. If it has a particular charge, the protein will most likely need that charge to stay there.
- are difficult to replace. Take a look at Tryptophan's unique structure below and you'll see why it'd be so difficult to replace it!
PAM matrices
With these results, Dayhoff built a 20 x 20 mutation probability matrix that tallied a score per each amino acid change. This was known as the PAM1 matrix, which showed the probabilities of proteins undergoing 1% change (1 accepted point mutation per 100 amino acid residues). The PAM1 matrix is based on the alignment of protein sequences that shared at least 85% identity.

One important assumption made in the generation of PAM matrices are that each change in amino acid is assumed to be independent of previous mutational events at that site. This type of assumption shows that PAM is a Markov model.
So why is this assumption important? From this, we can generate more PAM matrices to be used for sequences that are separated by longer periods of evolutionary history. For example, a PAM250 matrix is simply a PAM1 matrix multipled by itself 250 times. This matrix applies to alignments that share about 20% amino acid identity, which represents about 2500 million years of evolution.

Probabilty to Log-Odds Scoring
When BLAST scores an alignment, it doesn't use the probability matrix as seen above. It converts the matrix elements into integer numbers and produces a log-odds scoring matrix.
Here, si,j is score for aligning any two residues. Simply put, if the value is high, that means it aligns well. The qi,j is the amount a certain residue i would mutate to residue j. The denominator is the probability of the residue mutation occurring by chance.
Let's take the residue M (methionine) and calculate its mutation to L (leucine). Both of amino acids have a hydrophobic side chain, so they should align well - we expect a positive score. We will use the PAM250 Mutation Matrix
sM,M = 10 * log (0.06 / 0.015)
sM,M = 6
This is the value found in the log-odds matrix for PAM250. With this matrix, we can now score our alignments! But wait - what about gaps? Let's see how to handle those in the next page!
References
Dayhoff et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 345–352. National Biomedical Research Foundation, Silver Spring, MD, 1978.Handling Gap penalties
We have already discussed amino acid substitutions, but what about the other two types mutations, insertions and deletions? Together, these insertions and deletions are known as gaps, and its scoring system varies.
Gap penalties for local vs. global alignment
Handling gap penalties vary depending on what you're looking for. If you are searching for similar domains, you shouldn't penalize much for any gaps that occur at the ends of your sequences (local alignment). However, if you need a more exact match (global alignment), then you would probably want any sequences that have long end gaps.
Opening and Extension gap penalties
Gap penalties may be broken down into two parts:
- An opening of a gap.
- The extension of a gap.
An opening gap penalty is applied at the appearance of any gaps. For gaps that are longer than one residue, we can apply an extension gap penalty, in which penalization occurs for the addition of each residue length.
With this, there are several methodologies we may apply to a gap.
Types of gap penalties
Constant
This is when there are no opening gap penalties, and a fixed negative score is given to every gap.
AB--C----DEF
AB--C----DEF
ABGHCFGHIDEF
Here, we'd have a score of -2n, where n is the score we take for each gap.
Linear
This depends on the gap length, so a penalty score is assigned for every gap residue.
AB--C----DEF
AB--C----DEF
ABGHCFGHIDEF
Here, we'd have a score of -6l, where l is the score we take for residue gap.
Affine
The affine type takes into account the gap opening penalty, as well as each length. This means that on top of a linear penalty type, there is another penalty score added that stands for gap opening.
AB--C----DEF
AB--C----DEF
ABGHCFGHIDEF
Here, we would have two opening gap penalties of -2n and residue gaps of -6l where n is the penalty per opening gap and l is the penalty per residue in each gap.
BLAST uses the affine type by default. The opening gap penalty is -11, while each additional residue gap is -1. You may change these settings in the "Algorithm parameters" section, just below the BLAST button.

One thing to notice here - what is the default scoring matrix set on? It's not any PAM matrix, but it's rather BLOSUM62. Let's see how BLOSUMs are constructed, and the difference between them and PAMs in the next lesson.
BLOSUM - BLOcks SUbstitution Matrix
An alternative to the PAM matrix is BLOSUM (BLocks SUbstitution Matrix), which was derived by Henikoff and Henikoff in 1992. NCBI uses BLOSUM62 as its the default matrix for protein BLAST.
Derivation of BLOSUM matrices
BLOSUM matrices are derived from comparisons of blocks of sequences from the Blocks database.
What are blocks and what is the blocks database?
A block is an ungapped multiple alignments of highly conserved, short regions. Here is what a sample block looks like:
The blocks database contains multiple alignments of conserved regions in protein families.
Derivation a BLOSUM matrix
The Henikoffs developed a database of "blocks" based on sequences with shared motifs. More than 2,000 blocks of aligned sequence segments were analyzed from more than 500 groups of related proteins. Within each block, they counted the relative frequencies of amino acids and their substitution probabilities
Why did they use Blocks?
The Henikoffs used blocks due to several reasons:
- Need to have multiple alignments and it's easier to align with similar sequences.
- They didn't want to complicate calculations with insertions/deletions.
- Wanted to focus on conserved regions for computing the scoring matrix.
What does a BLOSUM matrix tell us?
A BLOSUM tells us the likelihood of occurrence of each pairwise substitution, and we can use these values to score a pairwise comparison.
Each scoring matrix is constructed based on how identical the ungapped multiple sequence alignments are. For example, BLOSUM62 is derived from blocks containing at most 62% identity in the ungapped sequence aligments.
Calculating a BLOSUM
Here we'll show you how to calculate a BLOSUM.
Removing redundancies
Before we start constructing a matrix BLOSUM r, we have to eliminate the sequences that are more than r% identical. This solves us from the bias we get from databases over-representing certain classes of proteins. To do this, we have two options:
- Remove sequences from the block.
- Replace the similar sequences with a new sequence that represents the cluster.
ACD
DCE
DCE
DCE
BCE
BCD
ACB
Since most databases today have an over-representation of proteins, the extraneous DCE sequences should be eliminated in order to make our database more representative.
Calculating possible pairwise combinations
Thus, after elminating redundancies, we look at the first vertical column in our block:
A
D
B
B
A
Let's find out how many possible pairwise combinations we can see for each possible pair.
For the AA pair, we have 2 possible combinations, for AB or BA we have 4. For AD we have 2. We continue these calculations until the occurrence of all possible pairs are found.
Pair | Column 1 score | Column 2 score | Column 3 score | Total |
---|---|---|---|---|
AA | 1 | 0 | 0 | 1 |
AB or BA | 4 | 0 | 0 | 4 |
AD or DA | 2 | 0 | 0 | 2 |
BB | 1 | 0 | 0 | 1 |
BD or DB | 2 | 0 | 2 | 4 |
CC | 0 | 10 | 0 | 10 |
DD | 0 | 0 | 1 | 1 |
DE or ED | 0 | 0 | 4 | 4 |
EE | 0 | 0 | 1 | 1 |
Note that the total sum is 26, which we can use to normalize our matrix.
A | B | C | D | E | |
---|---|---|---|---|---|
A | 1 | ||||
B | 4 | 1 | |||
C | 0 | 0 | 10 | ||
D | 2 | 4 | 0 | 1 | |
E | 0 | 0 | 0 | 4 | 1 |
Calculate scores
To obtain integer values for our scoring matrix, we need to find the score per cell. We can do this with the following equation:
Where qij is observed frequency and eij is the expected frequency.
cij is the cell value as calculated above.
To calculate the Total T:
Where w is the number of columns and n is the number of sequences. With T, we can calculate qij, which is the rate of change of residue i to residue j.
In our case, T = 30, so let's divide all our cells by 30.
A | B | C | D | E | |
---|---|---|---|---|---|
A | 0.0333 | ||||
B | 0.133 | 0.0333 | |||
C | 0 | 0 | 0.333 | ||
D | 0.0667 | 0.133 | 0 | 0.0333 | |
E | 0 | 0 | 0 | 0.133 | 0.0333 |
Now pi can be found with the following equation:
pB = ( 1 + 8/2 ) / 30 = 0.167
pC = 10 / 30
pD = ( 1 + 10/2 ) /30 = 0.200
pE = ( 1 + 4/2 ) / 30 = 0.0133
The expected frequencies:
eij = 2pipj (i ≠ j)
A | B | C | D | E | |
---|---|---|---|---|---|
A | 0.0178 | ||||
B | 0.0444 | 0.0278 | |||
C | ? | ? | 0.111 | ||
D | 0.0533 | 0.0667 | ? | 0.04 | |
E | ? | ? | ? | 0.04 | 0.01 |
Notice how I didn't calculate cell values that had a value of 0 - you'll see that we don't need these values in the actual scoring matrix.
Now we have all we need! Just plug in values from the two matrices above into the equation below to obtain our scoring matrix.
To obtain scores, we multiple sij by two and round.
A | B | C | D | E | |
---|---|---|---|---|---|
A | 0.9 | ||||
B | 1.58 | 0.0278 | |||
C | 0 | 0 | 0.111 | ||
D | 0.0533 | 0.0667 | 0 | 0.04 | |
E | 0 | 0 | 0 | 0.04 | 0.01 |
References
What is a blocks database?
BLOSUM Matrices Lecture. IA State.
Columbia CS Department.
BLOSUM vs. PAM
As we have seen, there are two different log-odds scoring matrices available. When performing pairwise sequence alignments, you must specify the exact matrix to use - this depends on the suspected degree of identity between the query and its matches.
Let's do a quick summary of these two matrices and see how they differ.
PAM matrices summary
PAM (Accepted Point Mutations) matrices are:
- Extrapolated using comparisons among closely related proteins based on an evolutionary model.
- Based on data from alignments of closely related protein families.
- Because they use the Markov model, the substitution probabilties for highly related proteins can be extrapolated to probabilties of distantly related proteins.
Some limitations of PAM matrices include the somewhat inaccuracy of the assumption model and each position being context dependent.
BLOSUM summary
- Based on empirical observations of more distantly related protein alignments.
- Not based on an evolutionary model - just empirically derived.
- Sequence comparisons cover a broad range of divergences rather than just one subfamily of proteins.
- Based on observed alignments from conserved regions of multiple sequence alignments (blocks). These blocks are from large protein databases, which give a large sampling site.
Limitations are that you are limited to a subset of conserved domains.
Global Alignment Needleman-Wunsch
The purpose of global alignment (aka optimal matching algorithm) is to align two sequences from start to end, and make as many matches as possible. Algorithms do this by inserting gaps within the letters of each sequence to maximize the number of matches. This technique is often used when comparing sequences of similar length.
Algorithm
The algorithm for global alignment was first published by Saul Needleman and Christian Wunsch (1970), then modified over the next few years by other scientists. It aligns sequences from beginning to end and finds the best alignment that maximizes the overall score. The score is calculated based on matches, mismatches, and gaps.
Dynamic Programming
The algorithm is based on an algorithmic technique known as dynamic programming, where the optimal alignment is reduced into a series of smaller problem. The solution to these smaller problems are then used to calculate the solution to the larger (or overall) problem.
Example
The best way to go through the algorithm is to show an example, so let's compare two sequences - IDENTITY and SIMILARITY.
1) Set up a 2d matrix
The first step is to set up a 2d matrix with the first sequence on the y-axis, and the second sequence on the x-axis.
2) Decide on a scoring system
Now we need to fill this grid with numerical values, which we will later use to score the matches. Let's come up with a formula to find the score per cell, which we can do with pre-determined values.
- In the first option, if a match is present (cell will be colored green),s(xi, yi) will be given a score of +1. If it is a mismatch, then the score will be -2.
- In the second option, a gap will be inserted in the first sequence. Thus, we take the cell left of the current cell and -2 from it.
- In the second option, a gap will be inserted in the second sequence. We take the cell on top of the current cell and -2 from it.
The maximum of these values will be taken and placed as the current cell's score.
3) Fill out the primary values
Start with filling a 0 at the top-left corner. Each cell along the row will be -2 since the only value it looks at is the cell to the left. Each cell along the first column will be -2 from its top, since that's the only value it has to look.
After coloring each matching letter green, you will have a table that looks like this:
4) Fill the rest of the table
We can see that for green cells (match), we take the maximum value of the left, top-left (diagonal), or top, and +1 to it. For any mismatches or gap insertions, we -2 from the largest value of the left, top-left (diagonal) and top cells.
A completed matrix gives the final score of the alignment on the bottom-right most cell. In this case, it's -7.
5) Trace the optimal path
Before we find trace the optimal path, let's go over some conventions.
Our purpose is to find the optimal path from the top left going to the bottom right. A diagonal line denotes that the letter is aligned (can be match or mismatch). A perfect match will have a diagonal going from top left to bottom right. Mismatches can still generate this diagonal, but their scores will be different.
Horizontal and vertical lines denote the presence of a gap in the first and second sequence, respectively. Gaps may be of any length, and may occur within (internal) or at the ends (terminal) of sequences.
For our example, we start from the bottom-right cell, then move our way up to the top-left.
With this, we get the alignment:
There are other alignments that give the same score. Can you trace them out?
6) Varying parameters
Depending on what kind of alignment you're looking for, you can vary the parameters. For example, some sequence alignment algorithms vary gap penalties in two subcategories. An opening gap would be given a more severe penalty than each continuing gap. In our example case, all gaps were treated equally with a penalty of -2.
Adding a harsher mismatch penalty can force more gaps to be created. However, this is non-ideal as it's good to have near-identical sequences with few deletions.
Additionally, global alignments usually use an end gap penalty. This is the distinguishing factor between global and local alignment, which we'll see next.
Local Alignment Smith-Waterman
Local alignments like global alignments, but they generate "islands" of areas that have the greatest similarity. This is helpful when the query and sequence are dissimilar, but are suspected to contain domains or small regions of similarity. The BLAST algorithm uses local alignment.
Local alignments differ from global alignments in a few ways:
- No penalty for starting at an internal position.
- Does not necessarily extend to ends of sequences.
- No negative values on the matrix are allowed - zeros are used instead.
The power of end gap penalties
The Smith-Waterman algorithm is very much like the Needleman-Wunsch algorithm used in global alignments - the hallmark difference is in the scoring methodology.
Unlike global alignment, local alignments have no end gap penalties, allowing small interior alignments to rank higher when scored.
Let's take a quick look at the effects of end gap penalties. The following sequence is aligned globally, with high end gap penalties.
M - N A L S D R T
M G S D R T T E T
6 -12 1 0 -3 1 0 -1 3 = -5
Now in this next sequence, we have a local alignment. Notice how the small region in the middle aligns quite nicely.
M N A L S D R T - - -
- - M G S D R T T E T
0 0 -1 -4 2 4 6 3 0 0 0 = 10
Without the end gap penalty, the Smith-Waterman alignment algorithm is able to find the best locally matching sequence.
Example
Let's compare two sequences - CGTTCTA and AACGTTGG.
1) Set up a 2d matrix
Set up a 2d matrix, as we did earlier in the Needleman-Wunsch example.
2) Decide on a scoring system
We need separate scores for matches, mismatches and gaps.
Any cell that would have a negative value are given0instead.
3) Fill out primary values
We want to start with the first row and column and gives those a value of0. Then we want to mark the cells that indicates a match.
4) Fill out rest of table
Now we fill the rest of our table out. Make sure to keep track of where each cell value came from, as we need this to trace back our optimal alignment.
Note that a mismatch or a match can only come from the cell diagonally up to the left of the current cell. Additionally, gaps may only come from the top or left of the current cell.
5) Trace the optimal path
Now all we need to do is retrace our steps. First, find the cell with the highest score.
Now we trace back until we get to a cell with0. Thus, our optimal local alignment becomes:
--CGTTCTA
AACGTTGG-
Conclusion: Global vs. Local alignments
Thus, we may say that for global alignments, where the sequences are connected along the entire length of their sequences, there is a higher % identity with many small interior gaps. For local alignments, which focus on the best matching regions, there is a lower % identity, but fewer interior gaps and longer end gaps.