Hidden Markov models in biological sequence analysis
IBM Journal of Research and Development, May/Jul 2001 by Birney, E
The vast increase of data in biology has meant that many aspects of computational science have been drawn into the field. Two areas of crucial importance are large-scale data management and machine learning. The field between computational science and biology is variously described as "computational biology" or "bioinformatics." This paper reviews machine learning techniques based on the use of hidden Markov models (HMMs) for investigating biomolecular sequences. The approach is illustrated with brief descriptions of gene-prediction HMMs and protein family HMMs.
Introduction
There has been a revolution in molecular biology over the last decade due to a simple economic fact: The price of data gathering has fallen drastically. Nowhere is this better illustrated than in large-scale DNA sequencing. At current costs, it is economical to determine the DNA sequence of the entire genome of a species (the genome is all of the DNA sequence passed from one generation to the next), even for species with large genomes, such as humans.
The basic information of interest in bioinformatics pertains to DNA, RNA, and proteins. Molecules of DNA are usually designated by different sequences of the letters A, T, G, and C, representing their four different types of bases. RNA molecules are usually designated by similar sequences, but with the Ts replaced by Us, representing a different type of base. Proteins are represented by 20 letters, corresponding to the 20 amino acids of which they are composed. A one-to-one letter mapping occurs between a DNA molecule and its associated RNA molecule, and a three-to-one letter mapping occurs between the RNA molecule and its associated protein molecule. A protein sequence folds into a defined three-dimensional structure, for which, in a small number of cases, the coordinates are known. The defined structure is what actually provides the molecular function of the protein sequence.
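The two letter mappings described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a DNA string that is already the coding strand in the correct reading frame; the function names `transcribe` and `translate` and the example sequence are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, laid out in the conventional U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """One-to-one mapping: each DNA letter maps to one RNA letter (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Three-to-one mapping: each RNA triplet (codon) maps to one amino acid.

    Translation stops at the first stop codon; trailing bases that do not
    form a complete codon are ignored.
    """
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence, not taken from any real gene.
dna = "ATGGCCTGGTAA"
rna = transcribe(dna)
protein = translate(rna)
print(rna, protein)
```

Here twelve DNA letters give twelve RNA letters, which give three amino-acid letters followed by a stop signal.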
The basic paradigm of biology is shown graphically in Figure 1. Depicted in the figure is a region of DNA that produces a single RNA molecule, which subsequently produces a single protein having a well-defined biological function.
Roughly speaking, the time and cost of determining information increase from the top of the diagram to the bottom. Determining DNA and RNA sequences is relatively cheap; determining protein sequences and protein structures is far more expensive; many person-years can be spent trying to elucidate the function of a single protein.
A clear goal for bioinformatics is to provide a way to convert the cheaper information at the top to the more valuable information at the bottom. Two steps have proven to be difficult. The first arises because, for unknown reasons, large organisms process the RNA sequence that is derived from the DNA sequence by a method known as pre-mRNA splicing. This removes specific pieces of the RNA (called introns) and fuses the remaining pieces (called exons). The exons remain collinear with their original layout in the DNA sequence. The ratio of exon sequence to intron sequence is around 1:50 in human DNA, and the intron sequence appears to be extremely "random" in nature, making it difficult to discriminate exons from introns effectively. Despite this challenge, bioinformatics has developed a reasonably successful solution using HMMs (see below). The second problem is deducing protein structure from a linear protein sequence. This "folding problem" has resisted concerted attack from researchers over the last twenty years. Although there have been many exciting advances in the area of protein folding, it seems likely that there will not be a solution to this problem in the next five or more years.
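The splicing step itself is mechanically simple once the exon boundaries are known; the hard part, which the HMM methods address, is finding those boundaries. A minimal sketch of the removal-and-fusion step follows, assuming exon coordinates are supplied as half-open `(start, end)` intervals. The function name `splice`, the coordinates, and the toy sequence are hypothetical.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Remove introns and fuse exons, keeping exons in their original order.

    `exons` holds half-open (start, end) intervals that must be sorted and
    non-overlapping, which mirrors the collinearity of exons with the DNA.
    """
    pieces = []
    previous_end = 0
    for start, end in exons:
        if not (previous_end <= start < end <= len(pre_mrna)):
            raise ValueError(f"invalid or out-of-order exon: ({start}, {end})")
        pieces.append(pre_mrna[start:end])
        previous_end = end
    return "".join(pieces)


# Toy transcript: two exons (upper case) around one intron (lower case).
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UGGUAA"
exons = [(0, 6), (18, 24)]

mature_mrna = splice(pre_mrna, exons)
exon_length = sum(end - start for start, end in exons)
intron_length = len(pre_mrna) - exon_length
print(mature_mrna, exon_length, intron_length)
```

In this toy case the exon and intron lengths are equal; in human DNA the intron sequence is far longer than the exon sequence, which is what makes the boundaries hard to locate.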
Bioinformatics can thankfully sidestep both of these problems by using arguments of evolution. Imagine the proto-rodent that represents the common ancestor of mouse and human. This creature had a region of its DNA sequence which made a protein with a specific function (for example, catalyzing the oxidation of ethanol to acetaldehyde). At some point there was a speciation event which led eventually to man and mouse. In the two lineages, the DNA sequences were maintained from generation to generation, sometimes suffering a mutation that changed the DNA sequence. As long as the mutation did not disadvantage the individual, in general preserving the function of the protein, the mutation would be passed on to its descendants. In the extant species of man and mouse, one ends up with two similar but not identical regions of DNA sequence which form two similar proteins with similar structures and functions.
This argument of common ancestry, or homology, is illustrated pictorially in Figure 1 by the horizontal arrows. Arguments of homology are the bedrock of bioinformatics. It is relatively easy to find clearly homologous DNA sequences that are presumed to have existed in the first cellular organism and are observable in all living organisms; one example is the DNA sequence which produces the proteins found in the ribosome. This conservation in the face of potentially billions of random mutations in the DNA sequence shows how much selection (i.e., an individual with a deleterious mutation is unlikely to pass on this mutation) occurs in biology.
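The idea of two "similar but not identical" sequences can be made concrete with a simple measure such as percent identity. The sketch below assumes the two sequences have already been aligned to equal length with no gaps, which is a strong simplification; it is not the HMM approach reviewed in this paper. The function name `percent_identity` and both sequences are made up for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical protein fragments standing in for a human/mouse pair.
human_like = "MKTAYIAKQR"
mouse_like = "MKSAYLAKQR"
print(percent_identity(human_like, mouse_like))
```

A high identity between sequences from distant species is the kind of observation that supports an argument of homology, although real analyses must also handle insertions and deletions.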