In bioinformatics,
a sequence alignment is a way of arranging the sequences of DNA, RNA, or
protein to identify regions of similarity that may be a consequence of
functional, structural, or evolutionary relationships between the sequences.
Aligned sequences of nucleotide or amino acid residues are typically
represented as rows within a matrix. Gaps are inserted between the residues so
that identical or similar characters are aligned in successive columns
When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed they are said to be aligned. Many bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as a traces.
In a symbolic sequence each base or residue monomer in each sequence is represented by a letter. The convention is to print the single-letter codes for the constituent monomers in order in a fixed font (from the N-most to C-most end of the protein sequence in question or from 5' to 3' of a nucleic acid molecule). This is based on the assumption that the combined monomers evenly spaced along the single dimension of the molecule's primary structure. From now on we will refer to an alignment of two protein sequences.
Every element in a trace is either a match or a gap. Where a residue in one of two aligned sequences is identical to its counterpart in the other the corresponding amino-acid letter codes in the two sequences are vertically aligned in the trace: a match. When a residue in one sequence seems to have been deleted since the assumed divergence of the sequence from its counterpart, its "absence" is labelled by a dash in the derived sequence. When a residue appears to have been inserted to produce a longer sequence a dash appears opposite in the unaugmented sequence. Since these dashes represent "gaps" in one or other sequence, the action of inserting such spacers is known as gapping.
A deletion in one sequence is symmetric with an insertion in the other. When one sequence is gapped relative to another a deletion in sequence a can be seen as an insertion in sequence b. Indeed, the two types of mutation are referred to together as indels. If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways divergence could occur (at that point).
Biological
interpretation of an alignment
A trace can represent a substitution:
For obvious reasons we do not represent a silent mutation.
Traces may represent recent genetic changes which obscure older changes. Here we have only represented point mutations for simplicity. Actual mutations often insert or delete several residues
Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally optimizing methods like dynamic programming, and efficient, but not as thorough heuristic algorithms or probabilistic methods designed for large-scale database search
Global and local alignments
Hybrid methods, known as semiglobal or "glocal" methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence. This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.
Pairwise alignment
Pairwise sequence alignment
methods are used to find the best-matching piecewise (local) or global
alignments of two query sequences. Pair wise alignments can only be used
between two sequences at a time, but they are efficient to calculate and are
often used for methods that do not require extreme precision (such as searching
a database for sequences with high homology to a query). The three primary
methods of producing pairwise alignments are dot-matrix methods, dynamic
programming, and word methods; however, multiple sequence alignment techniques
can also align pairs of sequences. Although each method has its individual
strengths and weaknesses, all three pair wise methods have difficulty with
highly repetitive sequences of low information content - especially where the
number of repetitions differ in the two sequences to be aligned. One way of
quantifying the utility of a given pairwise alignment is the 'maximum unique
match', or the longest subsequence that occurs in both query sequence. Longer
MUM sequences typically reflect closer relatedness.
Dot-matrix
methods
The dot-matrix approach, which
implicitly produces a family of alignments for individual sequence regions, is
qualitative and simple, though time-consuming to analyze on a large scale. It
is very easy to visually identify certain sequence features—such as insertions,
deletions, repeats, or inverted repeats—from a dot-matrix plot. To construct a
dot-matrix plot, the two sequences are written along the top row and leftmost
column of a two-dimensional matrix and a dot is placed at any point where the
characters in the appropriate columns match—this is a typical recurrence plot.
Some implementations vary the size or intensity of the dot depending on the
degree of similarity of the two characters, to accommodate conservative
substitutions. The dot plots of very closely related sequences will appear as a
single line along the matrix's main diagonal.
Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar structural domains.
Dynamic
programming
The technique of dynamic programming can be applied to produce
global alignments via the Needleman-Wunsch algorithm, and local alignments via
the Smith-Waterman algorithm. In typical
usage, protein alignments use a substitution matrix to assign scores to
amino-acid matches or mismatches, and a gap penalty
for matching an amino acid in one sequence to a gap in the other. DNA and RNA
alignments may use a scoring matrix, but in practice often simply assign a
positive match score, a negative mismatch score, and a negative gap penalty.
(In standard dynamic programming, the score of each amino acid position is
independent of the identity of its neighbors, and therefore base stacking
effects are not taken into account. However, it is possible to account for such
effects by modifying the algorithm.) A common extension to standard linear gap
costs, is the usage of two different gap penalties for opening a gap and for
extending a gap. Typically the former is much larger than the latter, e.g. -10
for gap open and -2 for gap extension. Thus, the number of gaps in an alignment
is usually reduced and residues and gaps are kept together, which typically
makes more biological sense. The Gotoh algorithm implements affine gap costs by
using three matrices.
Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Although the method is very slow, its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from both commercial sources, such as FrameSearch, distributed as part of the Accelrys GCG package, and Open Source software such as Genewise.
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of or extremely long sequences.
Word methods
Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family. Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.
Multiple sequence alignment
Two approaches to multiple sequence alignment
(MSA) include progressive and iterative MSAs. As the names imply, progressive
MSA starts with one sequence and progressively aligns the others, while
iterative MSA realigns the sequences during multiple iterations of the process.
Progressive
Steps:
One major disadvantage, however, is the reliance on a good alignment of the first two sequences. Errors there can propagate throughout the rest of the MSA. An alternative approach is iterative MSA (see below).
Iterative
For iterative MSA, the MSA is re-iterated,
starting with the pair-wise re-alignment of sequences within subgroups, and
then the re-alignment of the subgroups. The choice of subgroups can be made via
sequence relations on the guide tree, random selection, and so on.
At heart, iterative MSA is an optimization method and
may use machine learning approaches such as genetic algorithms and Hidden Markov Models. The disadvantages of
iterative MSA are inherited from optimization methods: the process can get
trapped in local minima and can be much slower.
Software
Multiple sequence alignment is an
extension of pairwise alignment to incorporate more than two sequences at a
time. Multiple alignment methods try to align all of the sequences in a given
query set. Multiple alignments are often used in identifying conserved sequence regions across a group
of sequences hypothesized to be evolutionarily related. Such conserved sequence
motifs can be used in conjunction with structural and mechanistic information to locate the catalytic
active sites
of enzymes.
Alignments are also used to aid in establishing evolutionary relationships by
constructing phylogenetic trees. Multiple sequence
alignments are computationally difficult to produce and most formulations of
the problem lead to NP-complete combinatorial optimization
problems. Nevertheless, the utility of these alignments in bioinformatics has
led to the development of a variety of methods suitable for aligning three or
more sequences.
Dynamic
programming
The technique of dynamic
programming is theoretically applicable to any number of sequences; however,
because it is computationally expensive in both time and memory,
it is rarely used for more than three or four sequences in its most basic form.
This method requires constructing the n-dimensional equivalent of the
sequence matrix formed from two sequences, where n is the number of
sequences in the query. Standard dynamic programming is first used on all pairs
of query sequences and then the "alignment space" is filled in by
considering possible matches or gaps at intermediate positions, eventually
constructing an alignment essentially between each two-sequence alignment. Although
this technique is computationally expensive, its guarantee of a global optimum
solution is useful in cases where only a few sequences need to be aligned
accurately. One method for reducing the computational demands of dynamic
programming, which relies on the "sum of pairs" objective function, has been implemented in the
MSA
software package.
Progressive
methods
Progressive, hierarchical, or
tree methods generate a multiple sequence alignment by first aligning the most
similar sequences and then adding successively less related sequences or groups
to the alignment until the entire query set has been incorporated into the
solution. The initial tree describing the sequence relatedness is based on
pairwise comparisons that may include heuristic pairwise alignment methods
similar to FASTA.
Progressive alignment results are dependent on the choice of "most
related" sequences and thus can be sensitive to inaccuracies in the
initial pairwise alignments. Most progressive multiple sequence alignment
methods additionally weight the sequences in the query set according to their
relatedness, which reduces the likelihood of making a poor choice of initial
sequences and thus improves alignment accuracy.
Many variations of the Clustal progressive implementation are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee; implementations can be found at ClustalW and T-Coffee.
Iterative
methods
Iterative methods attempt to
improve on the weak point of the progressive methods, the heavy dependence on
the accuracy of the initial pairwise alignments. Iterative methods optimize an objective function based on a selected
alignment scoring method by assigning an initial global alignment and then
realigning sequence subsets. The realigned subsets are then themselves aligned
to produce the next iteration's multiple sequence alignment. Various ways of
selecting the sequence subgroups and objective function are reviewed in.
Motif
finding
Motif finding, also known as
profile analysis, constructs global multiple sequence alignments that attempt
to align short conserved sequence motifs among the sequences in the
query set. This is usually done by first constructing a general global multiple
sequence alignment, after which the highly conserved regions are isolated and used to
construct a set of profile matrices. The profile matrix for each conserved
region is arranged like a scoring matrix but its frequency counts for each
amino acid or nucleotide at each position are derived from the conserved
region's character distribution rather than from a more general empirical
distribution. The profile matrices are then used to search other sequences for
occurrences of the motif they characterize. In cases where the original data set
contained a small number of sequences, or only highly related sequences, pseudocounts
are added to normalize the character distributions represented in the motif.
Techniques
inspired by computer science
A variety of general optimization algorithms commonly used in
computer science have also been applied to the multiple sequence alignment
problem. Hidden Markov models have been used to produce
probability scores for a family of possible multiple sequence alignments for a
given query set. They are especially effective in detecting remotely related
sequences because they are less susceptible to noise created by conservative or
semiconservative substitutions. Genetic
algorithms and simulated annealing have also been used in
optimizing multiple sequence alignment scores as judged by a scoring function
like the sum-of-pairs method. More complete details and software packages can
be found in the main article multiple sequence alignment.
Structural alignment
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence, structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure predictionbecause they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.
DALI
The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences. It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at EBI DALI and the FSSP is located at The Dali Database.
SSAP
SSAP (sequential structure
alignment program) is a dynamic programming-based method of structural
alignment that uses atom-to-atom vectors in structure space as comparison
points. It has been extended since its original description to include multiple
as well as pairwise alignments, and has been used in the construction of the CATH (Class, Architecture,
Topology, Homology) hierarchical database classification of protein folds. The
CATH database can be accessed at CATH Protein
Structure Classification.
Combinatorial
extension
The combinatorial extension
method of structural alignment generates a pairwise structural alignment by
using local geometry to align short fragments of the two proteins being
analyzed and then assembles these fragments into a larger alignment. Based on
measures such as rigid-body root mean square distance,
residue distances, local secondary structure, and surrounding environmental
features such as residue neighbor hydrophobicity,
local alignments called "aligned fragment pairs" are generated and
used to build a similarity matrix representing all possible structural
alignments within predefined cutoff criteria. A path from one protein structure
state to the other is then traced through the matrix by extending the growing
alignment one fragment at a time. The optimal such path defines the
combinatorial-extension alignment. A web-based server implementing the method
and providing a database of pairwise alignments of structures in the Protein
Data Bank is located at the Combinatorial
Extension website.
Phylogenetic analysis
Phylogenetics and sequence
alignment are closely related fields due to the shared necessity of evaluating
sequence relatedness. The field of phylogenetics
makes extensive use of sequence alignments in the construction and
interpretation of phylogenetic trees, which are used to classify
the evolutionary relationships between homologous genes represented in the genomes of
divergent species. The degree to which sequences in a query set differ is
qualitatively related to the sequences' evolutionary distance from one another.
Roughly speaking, high sequence identity suggests that the sequences in
question have a comparatively young most recent common ancestor, while low
identity suggests that the divergence is more ancient. This approximation,
which reflects the "molecular clock" hypothesis that a roughly
constant rate of evolutionary change can be used to extrapolate the elapsed
time since two genes first diverged (that is, the coalescence time), assumes that the
effects of mutation and selection are constant across sequence
lineages. Therefore it does not account for possible difference among organisms
or species in the rates of DNA repair or the possible functional
conservation of specific regions in a sequence. (In the case of nucleotide
sequences, the molecular clock hypothesis in its most basic form also discounts
the difference in acceptance rates between silent
mutations that do not alter the meaning of a given codon and other mutations
that result in a different amino acid being incorporated into the
protein.) More statistically accurate methods allow the evolutionary rate on
each branch of the phylogenetic tree to vary, thus producing better estimates
of coalescence times for genes.
Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is NP-hard.
Assessment of
significance
Sequence alignments are useful in
bioinformatics for identifying sequence similarity, producing phylogenetic
trees, and developing homology models of protein structures. However, the
biological relevance of sequence alignments is not always clear. Alignments are
often assumed to reflect a degree of evolutionary change between sequences
descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce
apparent similarity between proteins that are evolutionarily unrelated but
perform similar functions and have similar structures.
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.
It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
Other
biological uses
When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed they are said to be aligned. Many bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as a traces.
In a symbolic sequence each base or residue monomer in each sequence is represented by a letter. The convention is to print the single-letter codes for the constituent monomers in order in a fixed font (from the N-most to C-most end of the protein sequence in question or from 5' to 3' of a nucleic acid molecule). This is based on the assumption that the combined monomers evenly spaced along the single dimension of the molecule's primary structure. From now on we will refer to an alignment of two protein sequences.
Every element in a trace is either a match or a gap. Where a residue in one of two aligned sequences is identical to its counterpart in the other the corresponding amino-acid letter codes in the two sequences are vertically aligned in the trace: a match. When a residue in one sequence seems to have been deleted since the assumed divergence of the sequence from its counterpart, its "absence" is labelled by a dash in the derived sequence. When a residue appears to have been inserted to produce a longer sequence a dash appears opposite in the unaugmented sequence. Since these dashes represent "gaps" in one or other sequence, the action of inserting such spacers is known as gapping.
A deletion in one sequence is symmetric with an insertion in the other. When one sequence is gapped relative to another a deletion in sequence a can be seen as an insertion in sequence b. Indeed, the two types of mutation are referred to together as indels. If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways divergence could occur (at that point).
Biological
interpretation of an alignment
A trace can represent a substitution:
AKVAIL
AKIAIL
A trace can represent a deletion: VCGMD
VCG-D
A trace can represent a insertion: GS-K
GSGK
For obvious reasons we do not represent a silent mutation.
Traces may represent recent genetic changes which obscure older changes. Here we have only represented point mutations for simplicity. Actual mutations often insert or delete several residues
Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally optimizing methods like dynamic programming, and efficient, but not as thorough heuristic algorithms or probabilistic methods designed for large-scale database search
Sequence
alignments can be stored in a wide variety of text-based file formats, many of
which were originally developed in conjunction with a specific alignment
program or implementation. Most web-based tools allow a limited number of input
and output formats, such as FASTA format and GenBank format and the output is
not easily editable. Several conversion programs are available READSEQ or EMBOSS
having a graphical interfaces or command line interfaces, while several
programming packages like BioPerl, BioRuby provide functions to do this.
Global and local alignments
Illustration of
global and local alignments demonstrating the 'gappy' quality of global
alignments that can occur if sequences are insufficiently similar Global alignments, which attempt
to align every residue in every sequence, are most useful when the sequences in
the query set are similar and of roughly equal size. (This does not mean global
alignments cannot end in gaps.) A general global alignment technique is the Needleman-Wunsch
algorithm, which is based on dynamic programming. Local alignments are more
useful for dissimilar sequences that are suspected to contain regions of
similarity or similar sequence motifs within their larger sequence context. The
Smith-Waterman algorithm is a general local alignment method also based on
dynamic programming. With sufficiently similar sequences, there is no
difference between local and global alignments.
Hybrid methods, known as semiglobal or "glocal" methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence. This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.
Pairwise alignment
Pairwise sequence alignment
methods are used to find the best-matching piecewise (local) or global
alignments of two query sequences. Pair wise alignments can only be used
between two sequences at a time, but they are efficient to calculate and are
often used for methods that do not require extreme precision (such as searching
a database for sequences with high homology to a query). The three primary
methods of producing pairwise alignments are dot-matrix methods, dynamic
programming, and word methods; however, multiple sequence alignment techniques
can also align pairs of sequences. Although each method has its individual
strengths and weaknesses, all three pair wise methods have difficulty with
highly repetitive sequences of low information content - especially where the
number of repetitions differ in the two sequences to be aligned. One way of
quantifying the utility of a given pairwise alignment is the 'maximum unique
match', or the longest subsequence that occurs in both query sequence. Longer
MUM sequences typically reflect closer relatedness.
Dot-matrix
methods
The dot-matrix approach, which
implicitly produces a family of alignments for individual sequence regions, is
qualitative and simple, though time-consuming to analyze on a large scale. It
is very easy to visually identify certain sequence features—such as insertions,
deletions, repeats, or inverted repeats—from a dot-matrix plot. To construct a
dot-matrix plot, the two sequences are written along the top row and leftmost
column of a two-dimensional matrix and a dot is placed at any point where the
characters in the appropriate columns match—this is a typical recurrence plot.
Some implementations vary the size or intensity of the dot depending on the
degree of similarity of the two characters, to accommodate conservative
substitutions. The dot plots of very closely related sequences will appear as a
single line along the matrix's main diagonal.Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar structural domains.
Dynamic
programming
The technique of dynamic programming can be applied to produce
global alignments via the Needleman-Wunsch algorithm, and local alignments via
the Smith-Waterman algorithm. In typical
usage, protein alignments use a substitution matrix to assign scores to
amino-acid matches or mismatches, and a gap penalty
for matching an amino acid in one sequence to a gap in the other. DNA and RNA
alignments may use a scoring matrix, but in practice often simply assign a
positive match score, a negative mismatch score, and a negative gap penalty.
(In standard dynamic programming, the score of each amino acid position is
independent of the identity of its neighbors, and therefore base stacking
effects are not taken into account. However, it is possible to account for such
effects by modifying the algorithm.) A common extension to standard linear gap
costs, is the usage of two different gap penalties for opening a gap and for
extending a gap. Typically the former is much larger than the latter, e.g. -10
for gap open and -2 for gap extension. Thus, the number of gaps in an alignment
is usually reduced and residues and gaps are kept together, which typically
makes more biological sense. The Gotoh algorithm implements affine gap costs by
using three matrices.Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Although the method is very slow, its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from both commercial sources, such as FrameSearch, distributed as part of the Accelrys GCG package, and Open Source software such as Genewise.
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of or extremely long sequences.
Word methods
Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family. Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.
Multiple sequence alignment
Two approaches to multiple sequence alignment
(MSA) include progressive and iterative MSAs. As the names imply, progressive
MSA starts with one sequence and progressively aligns the others, while
iterative MSA realigns the sequences during multiple iterations of the process.
Progressive
Steps: - Start with the most similar sequence.
- Align the new sequence to each of the
previous sequences.
- Create a distance matrix/function
for each sequence pair.
- Create a phylogenetic “guide tree” from the
matrices, placing the sequences at the terminal nodes.
- Use the guide tree to determine the next
sequence to be added to the alignment.
- Preserve gaps.
- Go back to step 1.
One major disadvantage, however, is the reliance on a good alignment of the first two sequences. Errors there can propagate throughout the rest of the MSA. An alternative approach is iterative MSA (see below).
Iterative
For iterative MSA, the MSA is re-iterated,
starting with the pair-wise re-alignment of sequences within subgroups, and
then the re-alignment of the subgroups. The choice of subgroups can be made via
sequence relations on the guide tree, random selection, and so on.
At heart, iterative MSA is an optimization method and
may use machine learning approaches such as genetic algorithms and Hidden Markov Models. The disadvantages of
iterative MSA are inherited from optimization methods: the process can get
trapped in local minima and can be much slower.
Software
- Chimera
- excellent molecular graphics package with support for a wide range of
operations
- Clustal-W
- the famous Clustal-W multiple alignment program
- Clustal-X
- provides a window-based user interface to the Clustal-W multiple
alignment program
- DCSE
- a multiple alignment editor
- Friend
- an Integrated Front-end Application for Bioinformatics
- Jalview
- a Java multiple alignment editor
- Mauve
- a multiple genome alignment and visualization package that considers
large-scale rearrangements in addition to nucleotide substitution and
indels
- ModView
- a program to visualize and analyze multiple biomolecule structures
and/or sequence alignments.
- Musca
- multiple sequence alignment of amino acid or nucleotide sequences;
uses pattern discovery
- MUSCLE
- more accurate than T-Coffee, faster than
Clustal-W.
- SeaView
- a graphical multiple sequence alignment editor
- ShadyBox
- the first GUI based WYSIWYG multiple sequence alignment drawing program
for major Unix platforms
- UGENE
- contains multiple alignment editor with MUSCLE alignment algorithm
integrated.
Dynamic
programming
The technique of dynamic
programming is theoretically applicable to any number of sequences; however,
because it is computationally expensive in both time and memory,
it is rarely used for more than three or four sequences in its most basic form.
This method requires constructing the n-dimensional equivalent of the
sequence matrix formed from two sequences, where n is the number of
sequences in the query. Standard dynamic programming is first used on all pairs
of query sequences and then the "alignment space" is filled in by
considering possible matches or gaps at intermediate positions, eventually
constructing an alignment essentially between each two-sequence alignment. Although
this technique is computationally expensive, its guarantee of a global optimum
solution is useful in cases where only a few sequences need to be aligned
accurately. One method for reducing the computational demands of dynamic
programming, which relies on the "sum of pairs" objective function, has been implemented in the
MSA
software package.
Progressive
methods
Progressive, hierarchical, or
tree methods generate a multiple sequence alignment by first aligning the most
similar sequences and then adding successively less related sequences or groups
to the alignment until the entire query set has been incorporated into the
solution. The initial tree describing the sequence relatedness is based on
pairwise comparisons that may include heuristic pairwise alignment methods
similar to FASTA.
Progressive alignment results are dependent on the choice of "most
related" sequences and thus can be sensitive to inaccuracies in the
initial pairwise alignments. Most progressive multiple sequence alignment
methods additionally weight the sequences in the query set according to their
relatedness, which reduces the likelihood of making a poor choice of initial
sequences and thus improves alignment accuracy.Many variations of the Clustal progressive implementation are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee; implementations can be found at ClustalW and T-Coffee.
Iterative
methods
Iterative methods attempt to
improve on the weak point of the progressive methods, the heavy dependence on
the accuracy of the initial pairwise alignments. Iterative methods optimize an objective function based on a selected
alignment scoring method by assigning an initial global alignment and then
realigning sequence subsets. The realigned subsets are then themselves aligned
to produce the next iteration's multiple sequence alignment. Various ways of
selecting the sequence subgroups and objective function are reviewed in.
Motif
finding
Motif finding, also known as
profile analysis, constructs global multiple sequence alignments that attempt
to align short conserved sequence motifs among the sequences in the
query set. This is usually done by first constructing a general global multiple
sequence alignment, after which the highly conserved regions are isolated and used to
construct a set of profile matrices. The profile matrix for each conserved
region is arranged like a scoring matrix but its frequency counts for each
amino acid or nucleotide at each position are derived from the conserved
region's character distribution rather than from a more general empirical
distribution. The profile matrices are then used to search other sequences for
occurrences of the motif they characterize. In cases where the original data set
contained a small number of sequences, or only highly related sequences, pseudocounts
are added to normalize the character distributions represented in the motif.
Techniques
inspired by computer science
A variety of general optimization algorithms commonly used in
computer science have also been applied to the multiple sequence alignment
problem. Hidden Markov models have been used to produce
probability scores for a family of possible multiple sequence alignments for a
given query set. They are especially effective in detecting remotely related
sequences because they are less susceptible to noise created by conservative or
semiconservative substitutions. Genetic
algorithms and simulated annealing have also been used in
optimizing multiple sequence alignment scores as judged by a scoring function
like the sum-of-pairs method. More complete details and software packages can
be found in the main article multiple sequence alignment.Structural alignment
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence, structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure predictionbecause they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.
DALI
The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences. It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at EBI DALI and the FSSP is located at The Dali Database.
SSAP
SSAP (sequential structure
alignment program) is a dynamic programming-based method of structural
alignment that uses atom-to-atom vectors in structure space as comparison
points. It has been extended since its original description to include multiple
as well as pairwise alignments, and has been used in the construction of the CATH (Class, Architecture,
Topology, Homology) hierarchical database classification of protein folds. The
CATH database can be accessed at CATH Protein
Structure Classification.
Combinatorial
extension
The combinatorial extension
method of structural alignment generates a pairwise structural alignment by
using local geometry to align short fragments of the two proteins being
analyzed and then assembles these fragments into a larger alignment. Based on
measures such as rigid-body root mean square distance,
residue distances, local secondary structure, and surrounding environmental
features such as residue neighbor hydrophobicity,
local alignments called "aligned fragment pairs" are generated and
used to build a similarity matrix representing all possible structural
alignments within predefined cutoff criteria. A path from one protein structure
state to the other is then traced through the matrix by extending the growing
alignment one fragment at a time. The optimal such path defines the
combinatorial-extension alignment. A web-based server implementing the method
and providing a database of pairwise alignments of structures in the Protein
Data Bank is located at the Combinatorial
Extension website.
Phylogenetic analysis
Phylogenetics and sequence
alignment are closely related fields due to the shared necessity of evaluating
sequence relatedness. The field of phylogenetics
makes extensive use of sequence alignments in the construction and
interpretation of phylogenetic trees, which are used to classify
the evolutionary relationships between homologous genes represented in the genomes of
divergent species. The degree to which sequences in a query set differ is
qualitatively related to the sequences' evolutionary distance from one another.
Roughly speaking, high sequence identity suggests that the sequences in
question have a comparatively young most recent common ancestor, while low
identity suggests that the divergence is more ancient. This approximation,
which reflects the "molecular clock" hypothesis that a roughly
constant rate of evolutionary change can be used to extrapolate the elapsed
time since two genes first diverged (that is, the coalescence time), assumes that the
effects of mutation and selection are constant across sequence
lineages. Therefore it does not account for possible difference among organisms
or species in the rates of DNA repair or the possible functional
conservation of specific regions in a sequence. (In the case of nucleotide
sequences, the molecular clock hypothesis in its most basic form also discounts
the difference in acceptance rates between silent
mutations that do not alter the meaning of a given codon and other mutations
that result in a different amino acid being incorporated into the
protein.) More statistically accurate methods allow the evolutionary rate on
each branch of the phylogenetic tree to vary, thus producing better estimates
of coalescence times for genes.Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is NP-hard.
Assessment of
significance
Sequence alignments are useful in
bioinformatics for identifying sequence similarity, producing phylogenetic
trees, and developing homology models of protein structures. However, the
biological relevance of sequence alignments is not always clear. Alignments are
often assumed to reflect a degree of evolutionary change between sequences
descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce
apparent similarity between proteins that are evolutionarily unrelated but
perform similar functions and have similar structures.In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.
Scoring functions
The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
Other
biological uses
Sequenced RNA,
such as expressed sequence tags and full-length
mRNAs, can be aligned to a sequenced genome to find where there are genes and
get information about alternative splicing
and RNA editing.
Sequence alignment is also a part of genome
assembly, where sequences are aligned to find overlap so that contigs
(long stretches of sequence) can be formed. Another use is SNP analysis, where
sequences from different individuals are aligned to find single basepairs that
are often different in a population.