A structured approach to computational function prediction

T. Hawkins1 and D. Kihara1,2

1 Dept. of Biological Sciences, Purdue University, 2 Dept. of Computer Science, Purdue University, West Lafayette, IN, USA



For function prediction in CASP6, we used a multi-layered, multi-dimensional approach. The process of defining functions for uncharacterized protein targets involved three steps: (1) searching the primary target sequence against functional databases, (2) manually building on and refining the data from the primary searches, and (3) assigning numbered GO definitions to the predicted functions. This method was used to gather predictions for the GO Molecular Function, GO Biological Process, and GO Cellular Component categories. In the primary searches, BLAST and PSI-BLAST1 were used for sequence similarity; PROSITE2, PRINTS3 and Blocks4 were used for functional motif searching; Pfam and Pfam-FS5 were used for family alignments; PSORT6 was used for subcellular localization; and STRING7 was used for additional functional associations. Where the data from the primary searches were not sufficient to make a reasonable prediction of GO categories, information in the KEGG Pathway database8 and thorough literature searches were used to refine and build on those data. GoFigure9 and AmiGO10 were used to find GO definitions for the predicted functions.


To predict binding sites, multiple sequence alignments of BLAST and PSI-BLAST hits with an E-value below 0.01 (limited to 20 hits) were built using ClustalW. Conserved regions were determined manually and localized on predicted structures; regions containing clusters of conserved residues were predicted to be binding sites. If the predicted function of the protein indicated binding of a specific partner, that molecule or macromolecule was predicted to interact with the predicted binding region. If a conserved region consisted of 5 or more consecutive residues, we considered it to be a functional motif. Where other data were not sufficient to make a reasonable prediction, each of these motifs for a target sequence was searched individually against the NR protein database.
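The hit selection and motif rule described above can be illustrated with a minimal Python sketch. It is not the procedure used for CASP6, where conserved regions were determined manually; the function names, the choice to keep the 20 hits with the lowest E-values, and the definition of a conserved column as one where every aligned sequence shares the same residue are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def select_hits(hits: List[Tuple[str, float]],
                max_evalue: float = 0.01,
                max_hits: int = 20) -> List[Tuple[str, float]]:
    """Keep hits with E-value below max_evalue, at most max_hits of them.

    Assumption: when more than max_hits pass, the lowest E-values are kept.
    """
    passing = [hit for hit in hits if hit[1] < max_evalue]
    passing.sort(key=lambda hit: hit[1])
    return passing[:max_hits]


def conserved_columns(alignment: List[str],
                      min_identity: float = 1.0) -> List[bool]:
    """Flag alignment columns whose most common residue is not a gap and
    reaches the min_identity fraction (assumed stand-in for manual review)."""
    flags = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        flags.append(residue != "-" and count / len(column) >= min_identity)
    return flags


def conserved_runs(flags: List[bool],
                   min_length: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) half-open column ranges of consecutive conserved
    positions that are at least min_length long (the motif rule)."""
    runs = []
    start = None
    # A trailing False closes any run that reaches the end of the alignment.
    for index, conserved in enumerate(flags + [False]):
        if conserved and start is None:
            start = index
        elif not conserved and start is not None:
            if index - start >= min_length:
                runs.append((start, index))
            start = None
    return runs


if __name__ == "__main__":
    # Hypothetical toy alignment: columns 0-4 are identical, column 5 varies.
    toy_alignment = ["MKTAYIAKQR", "MKTAYLAKQR", "MKTAYIAKQR"]
    print(conserved_runs(conserved_columns(toy_alignment)))
```

In the toy alignment, only the first five columns form a run long enough to count as a motif; the four conserved columns at the end fall short of the 5-residue threshold.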


Using this method, reasonable predictions were made for each of the 76 valid protein targets in CASP6. Automation of this method, including substitution of rule-based algorithms for manual interpretation steps, is underway in preparation for function prediction in CASP7.


1. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. & Lipman,D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402. [http://www.ncbi.nih.gov/BLAST/].

2. Sigrist,C.J.A., Cerutti,L., Hulo,N., Gattiker,A., Falquet,L., Pagni,M., Bairoch,A. & Bucher,P. (2002). PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform. 3, 265-274. [http://www.expasy.org/prosite/].

3. Attwood,T.K., Bradley,P., Flower,D.R., Gaulton,A., Maudling,N., Mitchell,A.L., Moulton,G., Nordle,A., Paine,K., Taylor,P., Uddin,A. & Zygouri,C. (2003). PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 31, 400-402. [http://bioinf.man.ac.uk/dbbrowser/PRINTS/].

4. Henikoff,S., Henikoff,J.G. & Pietrokovski,S. (1999). Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics. 15, 471-479. [http://blocks.fhcrc.org/].

5. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L.L., Studholme,D.J., Yeats,C. & Eddy,S.R. (2004). The Pfam Protein Families Database. Nucleic Acids Res. 32, D138-D141. [http://www.sanger.ac.uk/Software/Pfam/].

6. Nakai,K. & Kanehisa,M. (1991). Expert system for predicting protein localization sites in Gram-negative bacteria. PROTEINS: Structure, Function, and Genetics. 11, 95-110. [http://psort.nibb.ac.jp/].

7. von Mering,C., Huynen,M., Jaeggi,D., Schmidt,S., Bork,P. & Snel,B. (2003). STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 31, 258-261. [http://string.embl.de/].

8. Kanehisa,M. & Goto,S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30. [http://www.genome.jp/kegg/pathway.html].

9. Khan,S., Situ,G., Decker,K. & Schmidt,C.J. (2003). GoFigure: automated Gene Ontology annotation. Bioinformatics. 19, 2484-2485. [http://udgenome.ags.udel.edu/gofigure/].

10. AmiGO. [http://www.godatabase.org/].