Saturday, May 26, 2012

Microarray Technology

DNA microarray technology makes it possible to analyze the expression of many genes in a single reaction, quickly and efficiently. It has helped researchers study the fundamental processes underlying the growth and development of life, and explore the genetic causes of anomalies in the functioning of the human body.


A theoretical overview of the steps in a microarray experiment:
1. Extract the mRNA
2. Make labelled cDNA through reverse transcription
3. Mix the samples and hybridize them to the cDNA microarray
4. Wash to remove non-specific binding
5. Spin and scan
6. Analyze


"A typical microarray experiment involves the hybridization of an mRNA molecule to the DNA template from which it originated. Many DNA samples are used to construct an array. The amount of mRNA bound to each site on the array indicates the expression level of the various genes". The number of genes may run into the thousands. All the data are collected and a gene expression profile is generated for the cell.
A DNA “chip” or microarray is prepared on a small solid base, such as a piece of glass, divided into a grid of tiny squares. To each square is attached a different and specific piece of DNA, typically a short DNA sequence that can act as a probe for a particular gene. DNA corresponding to thousands of different genes can be accommodated on a single array no bigger than a microscope slide.

A single stranded DNA sample of interest is cut up and then washed over the chip. Any sequence in the sample that matches a sequence on the chip will hybridise to it and, if the sample is suitably labeled (usually with a fluorescent tag) the pattern of matches can be visualised and analysed by computer, giving a read-out of the presence or expression level of hundreds of different sequences simultaneously.

Microarray Technique
An array is an orderly arrangement of samples in which known and unknown DNA samples are matched according to base-pairing rules. An array experiment makes use of common assay systems such as microplates or standard blotting membranes. The sample spots are typically less than 200 microns in diameter, and an array usually contains thousands of spots.

Thousands of spotted samples known as probes (with known identity) are immobilized on a solid support (a glass microscope slide, a silicon chip or a nylon membrane). The spots can be DNA, cDNA, or oligonucleotides. They are used to detect complementary binding of the unknown sequences, thus allowing parallel analysis for gene expression and gene discovery. An experiment with a single DNA chip can provide information on thousands of genes simultaneously. An orderly arrangement of the probes on the support is important because the location of each spot on the array is used to identify a gene.

Types of Microarrays
Depending on the kind of immobilized sample used to construct the array and the information obtained, microarray experiments can be categorized in three ways:
1. Microarray expression analysis: In this experimental setup, the cDNA derived from the mRNA of known genes is immobilized. The sample contains genes from both normal and diseased tissues. If a gene is overexpressed in the diseased condition, its spot shows greater intensity for the diseased tissue. This expression pattern is then compared with the expression pattern of a gene responsible for a disease.
2. Microarray for mutation analysis: For this analysis, researchers use gDNA. The genes might differ from each other by as little as a single nucleotide base. A single base difference between two sequences is known as a Single Nucleotide Polymorphism (SNP), and detecting such differences is known as SNP detection.
3. Comparative Genomic Hybridization: It is used to identify increases or decreases in important chromosomal fragments harboring genes involved in a disease.
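The comparison of spot intensities described under expression analysis can be sketched in a few lines. This is only an illustration, not a standard analysis pipeline: the gene names, the intensity values, the pseudocount and the two-fold threshold are all assumptions chosen for the example, and real experiments also need background correction and normalization.

```python
import math

def log2_ratios(diseased, normal, pseudocount=1.0):
    """Return gene -> log2(diseased / normal) for genes measured in both samples.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for spots with no detectable signal.
    """
    ratios = {}
    for gene, d_intensity in diseased.items():
        if gene in normal:
            ratio = (d_intensity + pseudocount) / (normal[gene] + pseudocount)
            ratios[gene] = math.log2(ratio)
    return ratios

def overexpressed(ratios, threshold=1.0):
    """Genes whose log2 ratio meets the threshold (1.0 means two-fold or more)."""
    return sorted(gene for gene, r in ratios.items() if r >= threshold)

# Hypothetical spot intensities for three genes.
diseased = {"geneA": 1599.0, "geneB": 499.0, "geneC": 99.0}
normal = {"geneA": 399.0, "geneB": 499.0, "geneC": 399.0}

ratios = log2_ratios(diseased, normal)
candidates = overexpressed(ratios)
```

A positive log ratio marks a gene whose spot is brighter in the diseased sample, a negative one a gene that is brighter in the normal sample.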

Microarrays are useful when one wants to survey a large number of genes quickly or when the sample to be studied is small. Microarrays may be used to assay gene expression within a single sample or to compare gene expression in two different cell types or tissue samples, such as in healthy and diseased tissue. Because a microarray can be used to examine the expression of hundreds or thousands of genes at once, it promises to revolutionize the way scientists examine gene expression. This technology is still considered to be in its infancy; therefore, many initial studies using microarrays have represented simple surveys of gene expression profiles in a variety of cell types. Nevertheless, these studies represent an important and necessary first step in our understanding and cataloging of the human genome.

With new advances, researchers will be able to infer probable functions of new genes based on similarities in expression patterns with those of known genes. Ultimately, these studies promise to expand the size of existing gene families, reveal new patterns of coordinated gene expression across gene families, and uncover entirely new categories of genes. Furthermore, because the product of any one gene usually interacts with those of many others, our understanding of how these genes coordinate will become clearer through such analyses, and precise knowledge of these inter-relationships will emerge. The use of microarrays may also speed the identification of genes involved in the development of various diseases by enabling scientists to examine a much larger number of genes. This technology will also aid the examination of the integration of gene expression and function at the cellular level, revealing how multiple gene products work together to produce physical and chemical responses to both static and changing cellular needs

Tuesday, April 10, 2012

University of Rajasthan Paper - M.Sc. Biotechnology

Important Note:
1. Question 1 is compulsory and is to be answered at one place in the answer book.
2. Attempt five questions out of nine.
Animal cell science and Technology
2012
1.A. Define the following:
(a) Primary culture
(b) Simple growth medium
(c) Cell cloning
(d) Histotypic culture
(e) Somatic culture

B. Write the full form of:
(a) ES
(b) RCCS
(c) HEPA
(d) FBS
(e) CHO

C. Fill in the blanks:
(a) Serum contains..........which promotes cell proliferation.
(b) Freshly isolated cell cultures are called........................
(c) The process of ovary culture is known as ..................
(d) ..............are pluripotent cultured cells derived from early pre-implantation embryos.
(e) During cryopreservation, the temperature of liquid nitrogen remains at....................

D. Write True or False:-
(a) The temperature and pressure of an autoclave are maintained at 121 °C and 15 lb/in².
(b) Precursor cells are not derived from stem cells.
(c) A vaccine is a preparation containing a pathogen either in attenuated or inactivated state.
(d) Protein free media contain protein constituent necessary for culture of the cells.
(e) Plasma clot is a natural media.

2. Explain the process of animal tissue culture and the equipment required.

3. Describe the basic cell technique to culture a mammalian cell in vitro. Also illustrate the process of cell separation and maintenance of cell culture.

4. What is meant by cell cloning? Explain types of cloning and micromanipulation.

5. Write an account of stem cell culture and embryonic stem cell culture with their applications.

6. Write an essay on "Programmed cell death".

7. What do you mean by primary cell culture? State the various stages of primary cell culture.

8. Define a balanced salt solution and explain the role of carbon dioxide, serum, and supplements in it.

9. Briefly describe:
(a) Measurement of cell death
(b) Genomic library
(c) Cell transformation
(d) Cell synchronization

Plant Biotechnology
1.A. Fill in the blanks:
(a) The variations observed during tissue culture of some plants are known as...........
(b) A technique of using small metal particles coated with desired genes in gene transfer is called.........
(c) DNA fingerprinting is based on.........
(d) The technique used to generate 'golden rice' is known as........
(e) During media preparation the thermolabile chemicals are sterilized by..........
(f) The Flavr Savr tomato was produced by.........
(g) The beta-gal reporter gene is encoded by.........
(h) The CaMV 35S promoter.........

B. Define the following:
(i) Cybrid
(ii) Aseptic culture
(iii) Bt gene
(iv) Electroporation
(v) DNA vaccine
(vi) Dimethyl sulfoxide
(vii) PR protein
(viii) ACC Synthase
(ix) Glyphosate
(x) Cellulase
(xi) Interferon
(xii) How does the Bt gene kill insects?

2. Describe the various applications of plant tissue culture techniques.

3. What do you mean by somatic hybridization? Discuss the techniques and achievements of somatic hybrids.

4. What are binary vectors? Discuss the mechanism of T-DNA transformation.

5. What do you mean by promoter clearance? Discuss various types of eukaryotic promoters.

6. Discuss briefly all the tools and techniques used in r-DNA technology.

7. What do you mean by post-harvest losses? Discuss how biotechnology and genetic engineering can prevent post-harvest losses.

8. Discuss the principles and methodologies of metabolic engineering.

9. Discuss the advantages and disadvantages of various genetic markers.

Bio-process Engineering and Technology
1. Answer briefly:-
(a) Shear rate in bioreactors
(b) Cell disruption
(c) Fermented food
(d) Antifoams
(e) Enrichment cultures
(f) Quality control of preserved stock cultures
(g) Air sterilization
(h) Gluconic acid
(i) Fed-batch culture
(j) Single cell protein

2. Write short notes on:
(i) Bioreactors types and classifications
(ii) Media sterilization

3. Differentiate between:
(i) Batch and continuous bioreactors
(ii) Sterilization and pasteurization
(iii) Foam separation and precipitation
(iv) Drying and crystallization

4. Describe various steps in the industrial production of citric acid.

5. Explain the process of whole cell immobilization and mention its industrial applications.

6. Write short notes on:-
(i) Configuration and application of photobioreactors
(ii) Food packing

7. Give an account of the various steps involved in downstream processing.

8. Write short notes on:-
(i) Canning
(ii) Technology for cheese production

9. Describe in detail the use of microbes in mineral beneficiation.

Bio-resources and Environmental Biotechnology


1.I. Fill in the blanks:
a) Name the common toxic biocide ..........
b) The Indian Agricultural Research Institute is situated at ..........
c) The concept of the center of origin of crop plants was given by ..........
d) Name a GM food crop ..........
e) An Indian patent is valid for a period of ..........
f) Any product, by-product or residue that cannot be used profitably is called ..........
g) The compounds that resist biodegradation and thereby persist in the environment are called ..........
h) Many xenobiotic compounds are .......... in nature.
i) Rotating discs are used in biological treatment of ..........
j) Euphorbia lathyris produces hydrocarbons which can be converted into ..........

II. Write True/False for the following:-
a) CGIAR is a world-famous germplasm bank.
b) Breeder’s seeds are certified seeds.
c) Plastics are biodegradable.
d) The solid wastes in water are called sludge.
e) A patent is a right granted to an inventor by the government.

III. Write short answers:
a) What are oxidation ponds?
b) Write the names of two pesticides.
c) Write the full form of CGIAR.
d) What are GM food crops?
e) What do you mean by a biosafety measure?

2. What is biodiversity? Write about methods of conservation of biodiversity.

3. Write notes on a) Aerobic filters b) Bioremediation of contaminated soils and waste land.

4. Discuss some environmental problems of global significance along with solutions.

5. Write notes on a) Bioresources b) Sources of water pollution.

6. Give an account of some institutions and their contributions to agricultural research and development.

7. What are molecular markers? Give their types and applications in plant breeding.

8. Describe common methods of classical plant breeding.

9. Write notes on a) Seed Biology b) Seed Banks.

Thursday, January 26, 2012

Molecular markers


MOLECULAR MARKER = BIOSIGNATURE = BIOMARKER = GENETIC MARKER = DETECTOR MARKER

A marker is a piece of DNA associated with a certain trait of an organism. Molecular markers are used to monitor DNA sequence variation among species and to help introduce new favorable traits into related species. A molecular marker consists of molecules that show easily detectable differences among different strains of a species or among different species. Thus “a DNA sequence on the genome which can be located and identified” is known as a molecular marker.

In recent years, molecular marker approaches have been used in plant breeding to:
1. Study varieties of transgenic crops for the genes they contain
2. Study linkage to numerous traits of economic importance.

Molecular markers reveal sites of variation at the DNA sequence level, known as “neutral sites”.
A map of the locations of identifiable landmarks on a chromosome is known as a “genetic map”, and a gene or DNA sequence having a known location on a chromosome and associated with a particular trait or gene is known as a “genetic marker”.

Saturday, January 14, 2012

Blotting










The transfer of nucleic acid fragments or proteins from a gel to a suitable solid support is known as “blotting”.
Alternatively, it is a biochemical technique in which macromolecules separated on a gel are transferred to a nylon membrane or sheet of paper, thereby immobilizing them for further analysis.



Electro-Blotting: An important method for transferring mainly proteins from a polyacrylamide gel onto nitrocellulose or another carrier membrane; more generally, a technique for the electrophoretic transfer of DNA, RNA or protein to a suitable membrane. The method most commonly used for the electrotransfer of proteins to nitrocellulose is that reported by Towbin et al. (1979). This technique was patented in 1989 by William J. Littlehales under the title "Electroblotting technique for transferring specimens from a polyacrylamide electrophoresis or like gel onto a membrane." Transfer of the proteins can be carried out by several methods, such as vacuum, capillary action or an electric field. Electroblotting is by far the most widespread technique and uses either vertical buffer tanks or semi-dry blotting.

Tank-Blotting: The gel and blotting membrane are clamped in a frame between filter paper and fiber pads. Transfer typically runs overnight. In semi-dry blotting, by contrast, only a little transfer buffer is required and the transfer time is very short.


Capillary-Blotting: Traditional method for the transfer of nucleic acids
  • Membrane size up to 20 x 25 cm
  • Overnight transfer
  • Easy to use
A method for the transfer of nucleic acids (especially RNA) from agarose gels onto a membrane. The procedure usually takes up to 12 hours (overnight) to complete.
The assembly is easy to set up and works without a power or vacuum source.
The method is based on the movement of buffer from a reservoir through the gel and the blotting membrane to a stack of dry blotting paper by capillary force. The molecules are carried to the blotting membrane, on which they are adsorbed.

Vacuum-Blotting:

Dot & Slot Blotting:

Some of the commercially available blotting products are:
Fastblot B33/34; B43/44; B64
Tankblot Tankbot Ecomini
Vacuum-Blot
Capillary Blotting
Dot Blot 96/ System
Hybrit Slot
The convertible.


A blot is a spot or stain made on a membrane. Blotting techniques are of various types, depending on the target to be detected in the hybridization experiment. The common ones are:
Dot Blotting [For DNA/RNA]
Northern Blotting [For RNA]
Southern Blotting [For DNA]
Western Blotting [For Protein]


Dot blotting
A modification of the Southern and Northern blotting techniques in which nucleic acids are spotted directly onto the filters and are not subjected to electrophoresis. The hybridization procedure is the same as in the original blotting techniques.

Sample DNA from several tissues or individuals can be tested in a single run. Dot blots are useful for detecting the presence of a transferred sequence in a number of suspected transgenic individuals or in different tissues of a single individual.
Steps involved:
Sample DNA/RNA from different individuals or tissues is transferred onto a nitrocellulose filter in the form of dots;
The DNA is denatured and the filter is baked at 80 °C to fix the DNA firmly onto the filter and to help prevent non-specific binding of the probe to the filter;
The filter is then treated with the appropriate radioactive ssDNA probe under conditions favoring hybridization;
The filter is washed repeatedly to remove the free probe;
Dots having the appropriate DNA/RNA sequence will hybridize with the radioactive probe;
The hybridized probe is detected by autoradiography; this identifies the individual or tissue in which the DNA/RNA sequence corresponding to the probe is present.


Northern Blotting
The technique for the specific identification of RNA molecules is known as Northern blotting. It was developed in 1977 by Alwine et al. at Stanford University.


RNA is separated by gel electrophoresis, transferred to a membrane, hybridized with a probe and finally detected by autoradiography. RNA molecules do not easily bind to nitrocellulose paper or nylon membrane. Blot transfer of RNA is therefore carried out using a chemically reactive paper, prepared by diazotization of aminobenzyloxymethyl paper to create diazobenzyloxymethyl (DBM) paper. The RNA can be covalently bound to DBM paper.
Steps involve:
Sample target ;
RNA isolation;
Gel electrophoresis of RNA;
Transfer to membrane;
Probe preparation;
Cross-linking of RNA to membrane;
Pre-hybridization;
Hybridization;
Post-hybridization;
Signal detection.


Advantages:
Widely accepted & well regarded method;
A straightforward method;
Often used as a confirmation or check;
Versatile protocol, as it allows the use of many types of probes, including radiolabeled and non-radiolabeled probes, in vitro transcribed RNA and even oligonucleotides such as primers.


Disadvantages:
Radioactivity is often used, which complicates handling and disposal;
The whole process takes a long time;
If the RNA samples are even slightly degraded by RNases, the quality of the data and the quantitation of expression are quite negatively affected.


Application:
Detection of mRNA transcript size;
Study of RNA degradation;
Study of RNA splicing, including detection of alternatively spliced transcripts;
Study of internal ribosomal entry sites;
Often used to confirm and check transgenics.


Southern Blotting
“Southern blotting” is the technique in which DNA is separated according to size by gel electrophoresis, denatured in the gel into single-stranded molecules by treatment with alkali, neutralized, and transferred to a hybridization membrane using a high-salt buffer. The DNA is then irreversibly bound to the membrane by either heat treatment or UV cross-linking. The single-stranded target DNA molecules are thus available on the filter for hybridization with a labeled ssDNA probe.


This technique, devised by Ed Southern in 1975, is a commonly used method for the identification of DNA fragments that are complementary to a known DNA sequence. Southern hybridisation, also called Southern blotting, allows a comparison between the genome of a particular organism and that of an available gene or gene fragment (the probe). It can tell us whether an organism contains a particular gene, and provide information about the organisation and restriction map of that gene.

In Southern blotting, chromosomal DNA is isolated from the organism of interest, and digested to completion with a restriction endonuclease enzyme. The restriction fragments are then subjected to electrophoresis on an agarose gel, which separates the fragments on the basis of size.
DNA fragments in the gel are denatured (i.e. separated into single strands) using an alkaline solution. The next step is to transfer fragments from the gel onto nitrocellulose filter or nylon membrane. This can be performed by electrotransfer (electrophoresing the DNA out of the gel and onto a nitrocellulose filter), but is more typically performed by simple capillary action.

In this system, the denatured gel is placed onto sheet(s) of moist filter paper and immersed in a buffer reservoir. A nitrocellulose membrane is laid over the gel, and a number of dry filter papers are placed on top of the membrane. By capillary action, buffer moves up through the gel, drawn by the dry filter paper. It carries the single-stranded DNA with it, and when the DNA reaches the nitrocellulose it binds to it and is immobilised in the same position relative to where it had migrated in the gel.
The DNA is bound irreversibly to the filter/membrane by baking at high temperature (nitrocellulose) or cross-linking through exposure to UV light (nylon).


Southern blotting has become a routine technique in the analysis of gene organization and in the identification and cloning of specific sequences.
Steps involved:
DNA digestion and electrophoresis;
DNA blotting;
Hybridization;
Post-hybridization;
Autoradiography.
Advantages:
An invaluable method in gene analysis;
Used for confirmation of DNA cloning results;
Applied in forensics to detect minute quantities of DNA.


Western Blotting
The technique identifies proteins by transferring electrophoresed protein bands from a polyacrylamide gel to a nylon or nitrocellulose membrane, where they are detected by a specific protein-ligand interaction (e.g. antibody or lectin).





A Western blot identifies the location of a specific protein after it has been separated by SDS-PAGE. First, a primary antibody that recognizes one epitope binds to the protein of interest. The location of the primary antibody is then visualized by adding a secondary antibody conjugated to a detection system.


Steps involved:
• Separate the proteins by gel electrophoresis;
• Transfer the proteins to a nitrocellulose membrane so that they retain the same pattern of separation they had on the gel;
• Incubate the blot with a generic protein that binds to any remaining sticky places on the nitrocellulose;
• Photograph the location of the antibody, where the enzyme attached to it has converted a colorless substrate into a colored product.

In this technique it does not matter whether the protein has been synthesized in vivo or in vitro. The technique reveals the amount of protein accumulated in cells, while the rate of synthesis is determined by a radioimmunoprecipitation assay. During transfer, a negative charge is established on the gel side and a positive charge on the nitrocellulose membrane side. The protein binds better to the nitrocellulose membrane at low pH, and there must be no air bubbles between the nitrocellulose membrane and the gel. The primary antibody, specific to the protein of interest, forms an antibody-protein complex with it. The protein gel can be stained with KCl, which forms a precipitate with the SDS in an SDS-PAGE gel.

The primary antibody is obtained from rabbit antiserum diluted with non-fat dry instant milk. The secondary antibody is goat anti-rabbit, an antibody against the first antibody. The reaction usually runs for about 1 hour.

Friday, January 13, 2012

Sequence Analysis

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.


When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed, they are said to be aligned. Many bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as traces.


In a symbolic sequence, each base or residue monomer in each sequence is represented by a letter. The convention is to print the single-letter codes for the constituent monomers in order in a fixed font (from the N-most to C-most end of the protein sequence in question, or from 5' to 3' of a nucleic acid molecule). This is based on the assumption that the combined monomers are evenly spaced along the single dimension of the molecule's primary structure. From now on we will refer to an alignment of two protein sequences.


Every element in a trace is either a match or a gap. Where a residue in one of two aligned sequences is identical to its counterpart in the other the corresponding amino-acid letter codes in the two sequences are vertically aligned in the trace: a match. When a residue in one sequence seems to have been deleted since the assumed divergence of the sequence from its counterpart, its "absence" is labelled by a dash in the derived sequence. When a residue appears to have been inserted to produce a longer sequence a dash appears opposite in the unaugmented sequence. Since these dashes represent "gaps" in one or other sequence, the action of inserting such spacers is known as gapping.


A deletion in one sequence is symmetric with an insertion in the other. When one sequence is gapped relative to another a deletion in sequence a can be seen as an insertion in sequence b. Indeed, the two types of mutation are referred to together as indels. If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways divergence could occur (at that point).


Biological interpretation of an alignment

A trace can represent a substitution:
 AKVAIL
 AKIAIL
A trace can represent a deletion:
 VCGMD
 VCG-D
A trace can represent an insertion:
 GS-K
 GSGK

For obvious reasons we do not represent a silent mutation.

Traces may represent recent genetic changes which obscure older changes. Here we have only represented point mutations for simplicity. Actual mutations often insert or delete several residues.
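The three kinds of trace shown above can be read off column by column. The following is a minimal sketch of that reading, assuming the first row is treated as the ancestral sequence and the second as the derived one; the function name `classify_trace` is made up for this example.

```python
def classify_trace(ancestral, derived):
    """Label each column of a two-row trace.

    Assumption of this sketch: the first row is the ancestral sequence and
    the second is the derived one, so a dash in the derived row is a deletion
    and a dash in the ancestral row is an insertion.
    """
    if len(ancestral) != len(derived):
        raise ValueError("rows of a trace must have the same length")
    events = []
    for a, d in zip(ancestral, derived):
        if a == "-" and d == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == d:
            events.append("match")
        elif d == "-":
            events.append("deletion")
        elif a == "-":
            events.append("insertion")
        else:
            events.append("substitution")
    return events

# The three traces from the text.
substitution_trace = classify_trace("AKVAIL", "AKIAIL")
deletion_trace = classify_trace("VCGMD", "VCG-D")
insertion_trace = classify_trace("GS-K", "GSGK")
```

Swapping the two rows turns every deletion into an insertion and vice versa, which is the symmetry that the term "indel" captures.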


Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally optimizing methods like dynamic programming, and efficient, but not as thorough heuristic algorithms or probabilistic methods designed for large-scale database search

Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as FASTA format and GenBank format, and the output is not easily editable. Several conversion programs, such as READSEQ or EMBOSS, are available with graphical or command-line interfaces, while several programming packages, such as BioPerl and BioRuby, provide functions to do this.

Global and local alignments

[Figure: illustration of global and local alignments, demonstrating the 'gappy' quality of global alignments that can occur if sequences are insufficiently similar.]

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot end in gaps.) A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method also based on dynamic programming. With sufficiently similar sequences, there is no difference between local and global alignments.
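To make the idea of a local alignment concrete, here is a small sketch of Smith-Waterman scoring. It returns only the best local score, not the aligned region, and the scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions for the illustration rather than recommended parameters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Each cell holds the best score of an alignment ending at that pair of
    positions; scores are never allowed to drop below zero, which is what
    lets an alignment start and stop anywhere inside the sequences.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Two otherwise dissimilar sequences that share the motif "ACGT".
shared_motif_score = smith_waterman_score("TTTTACGTGGGG", "CCACGTCC")
```

Only two rows of the scoring matrix are kept here because the sketch does not trace back the alignment; recovering the aligned region itself requires the full matrix.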


Hybrid methods, known as semiglobal or "glocal" methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence. This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.

Pairwise alignment

Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high homology to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods; however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low information content, especially where the number of repetitions differs in the two sequences to be aligned. One way of quantifying the utility of a given pairwise alignment is the 'maximum unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness.


Dot-matrix methods

The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and simple, though time-consuming to analyze on a large scale. It is very easy to visually identify certain sequence features—such as insertions, deletions, repeats, or inverted repeats—from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional matrix and a dot is placed at any point where the characters in the appropriate columns match—this is a typical recurrence plot. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.
Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar structural domains.
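The construction of a dot-matrix plot described above is simple enough to sketch directly. The version below uses exact character matches only (no windowing or similarity weighting), and the text rendering with `*` and `.` is just a convenience chosen for this example.

```python
def dot_matrix(seq_top, seq_left):
    """Boolean matrix with one row per character of seq_left and one column
    per character of seq_top; a cell is True where the characters match."""
    return [[top == left for top in seq_top] for left in seq_left]

def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )

# Identical sequences give a single line along the main diagonal.
identical_plot = render(dot_matrix("ACG", "ACG"))

# A repetitive sequence plotted against itself shows off-diagonal lines.
self_plot = render(dot_matrix("ABAB", "ABAB"))
```

Printing `self_plot` shows the main diagonal plus parallel diagonals two positions away, which is how the repeat of "AB" appears in the plot.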


Dynamic programming

The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to amino-acid matches or mismatches, and a gap penalty for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.) A common extension to standard linear gap costs is the use of two different gap penalties, one for opening a gap and one for extending a gap. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. Thus, the number of gaps in an alignment is usually reduced and residues and gaps are kept together, which typically makes more biological sense. The Gotoh algorithm implements affine gap costs by using three matrices.
Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Although the method is very slow, its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from both commercial sources, such as FrameSearch, distributed as part of the Accelrys GCG package, and Open Source software such as Genewise.
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or for extremely long sequences.
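For the local case named earlier in this section (Smith-Waterman), the recurrence differs mainly in that cell scores are never allowed to drop below zero and the result is the best cell anywhere in the matrix. The sketch below computes only the optimal local score, keeping two rows in memory; the scoring values are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment start fresh at any cell.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    # The shared core "ACG" is found even though the flanks do not match.
    print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

The nested loops visit every pair of positions, so the work grows with the product of the two sequence lengths, which is why the method becomes slow for very long sequences.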


Word methods
Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family. Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
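The offset idea can be sketched in a few lines of Python. This is not the FASTA or BLAST implementation: it is a simplified illustration that indexes every overlapping word of length k (rather than nonoverlapping words) and counts how many word matches fall on each offset. The names words and offset_counts are made up for this example.

```python
from collections import Counter


def words(seq, k):
    """Map each word of length k to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def offset_counts(query, target, k=3):
    """Count word matches per offset (query position minus target position)."""
    target_index = words(target, k)
    counts = Counter()
    for i in range(len(query) - k + 1):
        for j in target_index.get(query[i:i + k], []):
            counts[i - j] += 1
    return counts


if __name__ == "__main__":
    counts = offset_counts("ACGTACGA", "TTACGTACGA", k=3)
    # An offset shared by many distinct words suggests a region worth aligning carefully.
    print(counts.most_common(3))
```

In the usage example the target is the query with two extra leading characters, so the offset -2 collects the most word matches.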
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.


Multiple sequence alignment

Two approaches to multiple sequence alignment (MSA) include progressive and iterative MSAs. As the names imply, progressive MSA starts with one sequence and progressively aligns the others, while iterative MSA realigns the sequences during multiple iterations of the process.

Progressive

Steps:
  1. Start with the most similar sequence.
  2. Align the new sequence to each of the previous sequences.
  3. Create a distance matrix/function for each sequence pair.
  4. Create a phylogenetic "guide tree" from the matrices, placing the sequences at the terminal nodes.
  5. Use the guide tree to determine the next sequence to be added to the alignment.
  6. Preserve gaps.
  7. Go back to step 1.
Progressive MSA is one of the fastest approaches, considerably faster than the adaptation of pair-wise alignments to multiple sequences, which can become a very slow process for more than a few sequences.
One major disadvantage, however, is the reliance on a good alignment of the first two sequences. Errors there can propagate throughout the rest of the MSA. An alternative approach is iterative MSA (see below).
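The distance-matrix and ordering steps can be illustrated with a deliberately simplified Python sketch. It does not build a real guide tree: it computes a crude alignment-free distance (Jaccard distance between k-mer sets, an assumption made here for brevity) and then returns the order in which sequences would be added, starting from the closest pair. The function names are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distance(a, b, k=2):
    """Jaccard distance between k-mer sets: 0 = same set, 1 = nothing shared."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def addition_order(seqs, k=2):
    """Order in which named sequences would join the alignment (needs >= 2 sequences)."""
    names = list(seqs)
    dist = {frozenset(p): distance(seqs[p[0]], seqs[p[1]], k)
            for p in combinations(names, 2)}
    first = min(dist, key=dist.get)          # the most similar pair seeds the alignment
    order = sorted(first)
    remaining = [n for n in names if n not in first]
    while remaining:
        # Next: the sequence closest to anything already in the alignment.
        nxt = min(remaining,
                  key=lambda n: min(dist[frozenset((n, o))] for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    print(addition_order({"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}))
```

Because the seed pair is fixed first and never revisited, a poor choice at that step carries through the rest of the order, which mirrors the disadvantage noted above.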


Iterative

For iterative MSA, the MSA is re-iterated, starting with the pair-wise re-alignment of sequences within subgroups, and then the re-alignment of the subgroups. The choice of subgroups can be made via sequence relations on the guide tree, random selection, and so on.

At heart, iterative MSA is an optimization method and may use machine learning approaches such as genetic algorithms and Hidden Markov Models. The disadvantages of iterative MSA are inherited from optimization methods: the process can get trapped in local minima and can be much slower.

Software

  • Chimera - excellent molecular graphics package with support for a wide range of operations
  • Clustal-W - the famous Clustal-W multiple alignment program
  • Clustal-X - provides a window-based user interface to the Clustal-W multiple alignment program
  • DCSE - a multiple alignment editor
  • Friend - an Integrated Front-end Application for Bioinformatics
  • Jalview - a Java multiple alignment editor
  • Mauve - a multiple genome alignment and visualization package that considers large-scale rearrangements in addition to nucleotide substitution and indels
  • ModView - a program to visualize and analyze multiple biomolecule structures and/or sequence alignments.
  • Musca - multiple sequence alignment of amino acid or nucleotide sequences; uses pattern discovery
  • MUSCLE - more accurate than T-Coffee, faster than Clustal-W.
  • SeaView - a graphical multiple sequence alignment editor
  • ShadyBox - the first GUI based WYSIWYG multiple sequence alignment drawing program for major Unix platforms
  • UGENE - contains multiple alignment editor with MUSCLE alignment algorithm integrated.

 

Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problems. Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

Dynamic programming

The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and memory, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" objective function, has been implemented in the MSA software package.
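The "sum of pairs" objective mentioned above can be shown with a short Python sketch that scores an existing multiple alignment column by column. The match, mismatch and gap values are illustrative assumptions, and scoring a gap paired with a gap as zero is one common convention rather than the only one.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of every column of a gapped multiple alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # gap against gap: scored as zero here
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["ACG", "ACG", "A-G"]))
```

Each column of n sequences contributes n(n-1)/2 pair scores, so evaluating a given alignment is cheap; the expense described above comes from searching over all possible alignments.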

Progressive methods

Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to FASTA. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.
Many variations of the Clustal progressive implementation are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee; implementations can be found at ClustalW and T-Coffee.

Iterative methods

Iterative methods attempt to improve on the weak point of the progressive methods, the heavy dependence on the accuracy of the initial pairwise alignments. Iterative methods optimize an objective function based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function have been reviewed in the literature.

Motif finding

Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved sequence motifs among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly conserved regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original data set contained a small number of sequences, or only highly related sequences, pseudocounts are added to normalize the character distributions represented in the motif.
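As a small illustration of a profile matrix with pseudocounts, the Python sketch below builds per-position character frequencies from a set of equal-length motif instances and slides the profile along a sequence to find the best-matching window. The alphabet, the pseudocount of 1.0, and the function names are assumptions made for this example; real tools typically use log-odds scores against a background distribution.

```python
def profile(motifs, alphabet="ACGT", pseudocount=1.0):
    """Per-position frequency table built from equal-length motif instances."""
    columns = []
    for pos in range(len(motifs[0])):
        # Start every count at the pseudocount so unseen characters keep
        # a small nonzero probability.
        counts = {ch: pseudocount for ch in alphabet}
        for motif in motifs:
            counts[motif[pos]] += 1
        total = sum(counts.values())
        columns.append({ch: c / total for ch, c in counts.items()})
    return columns


def best_match(seq, prof):
    """Return (start, probability) of the window in seq that best fits the profile."""
    width = len(prof)
    best = (None, 0.0)
    for start in range(len(seq) - width + 1):
        p = 1.0
        for offset, column in enumerate(prof):
            p *= column.get(seq[start + offset], 0.0)
        if p > best[1]:
            best = (start, p)
    return best


if __name__ == "__main__":
    prof = profile(["TATA", "TATA", "TACA"])
    print(best_match("GGGTATAGGG", prof))
```

With only three motif instances, the pseudocount keeps a character such as G, never observed in the motif, from receiving probability zero.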

Techniques inspired by computer science

A variety of general optimization algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. Hidden Markov models have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set. They are especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions. Genetic algorithms and simulated annealing have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method.


Structural alignment


Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence, structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure prediction because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.


DALI
The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences. It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at EBI DALI and the FSSP is located at The Dali Database.

SSAP

SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments, and has been used in the construction of the CATH (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds. The CATH database can be accessed at CATH Protein Structure Classification.

Combinatorial extension

The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment. Based on measures such as rigid-body root mean square distance, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor hydrophobicity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the Combinatorial Extension website.


Phylogenetic analysis

Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness. The field of phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and selection are constant across sequence lineages. Therefore it does not account for possible differences among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is NP-hard.

 

Assessment of significance

Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.


Scoring functions

The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.
It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.


Other biological uses

Sequenced RNA, such as expressed sequence tags and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about alternative splicing and RNA editing. Sequence alignment is also a part of genome assembly, where sequences are aligned to find overlap so that contigs (long stretches of sequence) can be formed. Another use is SNP analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.
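The SNP use case can be sketched in Python under the assumption that the sequences from different individuals have already been aligned to equal length, with "-" marking gaps. The function name variable_sites is made up for this example, and it simply reports columns where more than one base is observed; it does not estimate how common each variant is in a population.

```python
def variable_sites(aligned):
    """List (position, bases) for alignment columns where the bases differ.

    Gap characters are ignored, so an indel alone is not reported as a SNP.
    """
    sites = []
    for pos, column in enumerate(zip(*aligned)):
        bases = {b for b in column if b != "-"}
        if len(bases) > 1:
            sites.append((pos, "".join(sorted(bases))))
    return sites


if __name__ == "__main__":
    individuals = ["ACGT", "ACGT", "ATGT"]
    print(variable_sites(individuals))  # position 1 shows both C and T
```

Positions are zero-based column indices in the alignment, not coordinates in any single unaligned sequence.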