The use of Hidden Markov Models in the Analysis of Biological Sequences
by Steve Hardies, Ph.D.
The University of Texas Health Science Center at San Antonio
Date: 10/15 M
Time: 4:00 pm - 5:15 pm
SB 3.02.02

In the genomic era, sequences of genes from organisms of all kinds are being acquired much faster than the functions of their encoded proteins can be experimentally characterized. The functional description of genomes is becoming more and more dependent on inferring similar function from the finding that an encoded protein sequence is similar to that of some characterized protein.

Detection of sequence similarity has been dominated by the Blast search algorithm and a related profile method called Psi-Blast. Profile methods, by definition, create a survey of which amino acids are observed at each position in an alignment of the more obviously related sequences, and then search for more distant relatives, making use of that extra training in which residues should be considered similar at each position. Blast and Psi-Blast are designed for speed and are served from international bioinformatics centers, thus relieving individual investigators from having to invest in computational facilities or know-how.

However, it has long been known that among proteins with established 3D structures there are many examples where two proteins are structurally related, but the presumptive underlying sequence similarity is too weak for Blast or Psi-Blast to detect confidently. The fundamental problem is that proteins not only experience residue substitutions during evolution, but also gain and lose residues. Detecting sequence similarity therefore requires aligning the compared sequences by adding gaps to account for these insertions and deletions.
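
As a rough illustration of why gaps are needed, the sketch below scores two sequences with and without gaps. It uses a textbook global alignment (Needleman-Wunsch) with a linear gap penalty, not the Blast or Psi-Blast algorithms. The sequences and the scoring values (match, mismatch, gap) are made-up assumptions chosen only for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (illustrative values)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def ungapped_score(a, b, match=1, mismatch=-1):
    """Score residues position by position, with no gaps allowed."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Hypothetical sequences: the second is the first with one residue (A) deleted.
ancestor = "MKTAYIAKQR"
descendant = "MKTYIAKQR"

gapped = global_alignment_score(ancestor, descendant)
ungapped = ungapped_score(ancestor, descendant)
print(gapped, ungapped)
```

With these assumed values, a single deletion throws every later position out of register in the ungapped comparison, while the gapped alignment recovers the matches at the cost of one gap penalty.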

Once the comparison algorithm is given the ability to add gaps to maximize the number of residue matches between two sequences, its freedom to make unrelated sequences appear similar also increases, and detection of faint similarities is overpowered by this noise. The Hidden Markov Model (HMM) provides an algorithmic and statistical basis for limiting the placement of gaps in a sequence alignment, based on where insertions and deletions are observed among the more closely related (and hence more easily aligned) sequences of the set. Hence comparative algorithms based on HMMs experience reduced noise and can detect fainter traces of similarity.

The HMM method is inherently Bayesian, meaning that it can combine information learned from the related sequences already aligned with "prior" information derived from theory about how proteins usually evolve. The prior information can also reflect facts learned by structural or functional investigation of one of the family members. Hence the HMM is a very powerful method for incorporating a variety of insights into the algorithmic part of the search.

The lecture will focus on the use of HMM methods to address a number of difficult similarity detection problems encountered by Dr. Hardies, and will mainly feature the Sequence Alignment and Modeling system (SAM) obtained from Richard Hughey at UCSC.

All about Blast and Psi-Blast:
• http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs

All about SAM:
• http://www.soe.ucsc.edu/research/compbio/sam.html

A paper showing a number of methods of inferring similarity at a distance:
• http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&TermToSearch=17673272
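
The abstract's idea of mixing observed residues with "prior" information can be sketched as a per-column pseudocount estimate. This is a deliberately simplified illustration and not the SAM implementation. The toy alignment, the uniform prior, the `prior_weight` parameter, and the function names are all assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, prior=None, prior_weight=1.0):
    """Blend observed residue counts in one alignment column with a prior.

    Gap characters ('-') are ignored. `prior` maps residue -> probability;
    a uniform prior is assumed when none is given (an illustrative choice).
    """
    if prior is None:
        prior = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(r for r in column if r in prior)
    total = sum(counts.values()) + prior_weight
    return {aa: (counts[aa] + prior_weight * prior[aa]) / total for aa in prior}


def build_profile(alignment, prior=None, prior_weight=1.0):
    """Return one residue distribution per alignment column."""
    return [column_probabilities(col, prior, prior_weight)
            for col in zip(*alignment)]


# Hypothetical alignment of three closely related sequences.
alignment = ["MKL-V", "MRLAV", "MKIAV"]
profile = build_profile(alignment)

# Second column: K seen twice, R once, every other residue only via the prior.
print(profile[1]["K"], profile[1]["R"], profile[1]["A"])
```

Residues never observed in a column still receive a small probability from the prior, so a distant relative carrying one of them is penalized rather than excluded outright. A larger `prior_weight` leans more on the prior when only a few sequences have been aligned.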