About PFP Automated Protein Function Prediction Server

Thank you for using PFP. The results are listed as the top ten most probable Gene Ontology annotations in the Biological Process, Molecular Function, and Cellular Component categories. Please note that these automated function predictions are not intended to be perfectly accurate, but only to represent the statistical probability that your sequence matches the listed function annotations according to known associations extracted from a variety of publicly available functional databases.

The sequence you submitted was queried with an iterative PSI-BLAST (3 iterations) against UniProt. These results were cross-referenced to the GO database version GO_200702 (available from www.geneontology.org) to create a list of primary annotations for your sequence. Database statistics are available below; the total count of available annotations is 11,898,312. Secondary annotations were identified and scored from known associations to add depth not available from direct sequence comparison. PFP is hosted by the Kihara Lab at Purdue University, West Lafayette, IN, USA. If you have any questions, concerns, comments or feedback, please email Troy Hawkins.

Thank you,

Troy Hawkins.
Meghana Chitale.
Daisuke Kihara.

Department of Biological Sciences, Department of Computer Science,
Purdue University,
West Lafayette, IN 47907




High-throughput techniques for experimental genomics and proteomics are driving rapid development of interpretative algorithms in bioinformatics, in particular methods that can predict protein function using sequence, structure, gene expression and protein-protein interaction data. Of these data, protein sequences are the most plentiful, reliable and readily available. The incredible volume of experimentally characterized sequences and sequence motifs has allowed creation of reliable models for function annotation with far greater coverage of protein functional space than can be generated from any other single source of biological data. Although several existing tools exploit the information sequences hold, they still remain a rich source of new information in bioinformatics.

PFP is a publicly available server for automated function prediction for a query sequence with Gene Ontology (GO) biological process, molecular function, and cellular component terms. The PFP algorithm has been shown to increase coverage of sequence-based function annotation more than fivefold by extending a PSI-BLAST search to extract and score GO terms individually and include information from distantly related sequences, and by applying a novel data mining tool, the Function Association Matrix (FAM), to score significantly associating pairs of annotations.

The current version of our prediction method is sequence-based. The aim is to utilize information from relatively weak hits in PSI-BLAST, which are not conventionally used. Typically, weak hits in PSI-BLAST are not perfect orthologs of the query sequence, but rather share a common functional domain with it. In addition to simply transferring the function of the common domain to the query sequence, our idea is to also consider those functions which are frequently associated with the annotated functions of the domain. To this end, we have built Function Association Matrices (FAMs) that quantify the co-occurrence of Gene Ontology (GO) annotations in sequences of the UniProt database (Figure 1). The GO is a controlled hierarchical vocabulary describing the function of genes in three categories: molecular function, biological process, and cellular component. Approximately two thirds of associated function pairs mined from UniProt bridge functions of different categories. Thus, using FAMs we can assign functions that cannot be retrieved directly from highly similar sequences or structures. Taking advantage of the hierarchical nature of the GO vocabulary, we have developed a series of FAMs at varying “resolutions”, i.e. depths of the functional association in the GO hierarchy. The structure of GO also allows us to define functional proximity as the coordinate distance of the annotation sets of two proteins on the GO tree.
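
As a rough illustration of what "quantifying co-occurrence" can mean, the sketch below counts how often each pair of GO terms annotates the same sequence. It is a minimal example under stated assumptions, not the PFP implementation: the function name build_cooccurrence, the toy input, and the use of raw pair counts are all chosen here for illustration, and the page does not say how the actual FAMs are normalized or scored.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(annotations):
    """Count how often each unordered pair of GO terms annotates the same sequence.

    `annotations` maps a sequence identifier to an iterable of GO term IDs.
    Returns a Counter keyed by (term_a, term_b) with term_a < term_b.
    """
    pair_counts = Counter()
    for terms in annotations.values():
        # Sorting gives every unordered pair a single canonical key.
        for term_a, term_b in combinations(sorted(set(terms)), 2):
            pair_counts[(term_a, term_b)] += 1
    return pair_counts


# Hypothetical toy input: sequence IDs and their annotation sets are invented.
toy_annotations = {
    "seqA": {"GO:0003677", "GO:0006355", "GO:0005634"},
    "seqB": {"GO:0003677", "GO:0006355"},
    "seqC": {"GO:0005634", "GO:0016301"},
}

fam = build_cooccurrence(toy_annotations)
for pair, count in sorted(fam.items()):
    print(pair, count)
```

In this toy input the pair (GO:0003677, GO:0006355) is counted twice because both terms annotate seqA and seqB; pairs such as this one, which link terms from different GO categories, are the kind of association the text describes.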



FIGURE 1. A visual representation of the UniProt FAM in four resolutions. From left to right, matrices include associations between direct annotations (+0), direct annotations and one parent generation (+1), direct annotations and two parent generations (+2), and direct annotations and three parent generations (+3). The color scale plots the log of the co-occurrence of 475 GO annotations extracted from a random set of sequences in the UniProt database, including 64 cellular component annotations, 194 molecular function annotations and 217 biological process annotations in that order [x- and y-axes are equal].
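
The four resolutions in Figure 1 differ in how many parent generations are added to each direct annotation before associations are counted. The sketch below shows one way such an expansion could be done. The function expand_with_parents, the parents mapping, and the placeholder term names are assumptions made for this example; they do not come from PFP or from the GO database.

```python
def expand_with_parents(terms, parents, generations):
    """Return `terms` plus all ancestors reachable within `generations` parent steps.

    `parents` maps a term to an iterable of its direct parent terms.
    generations=0 corresponds to direct annotations only (+0 in Figure 1).
    """
    expanded = set(terms)
    frontier = set(terms)
    for _ in range(generations):
        # Move one level up the hierarchy, skipping terms already collected.
        frontier = {p for t in frontier for p in parents.get(t, ())} - expanded
        expanded |= frontier
    return expanded


# Hypothetical four-level hierarchy with placeholder names, not real GO terms.
toy_parents = {
    "leaf": ["mid"],
    "mid": ["top"],
    "top": ["root"],
}

for k in range(4):
    print(f"+{k}:", sorted(expand_with_parents({"leaf"}, toy_parents, k)))
```

Applying an expansion like this to every sequence's annotation set before counting pairs would give progressively denser matrices from +0 to +3, which is consistent with the description in the caption.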


The following table lists the total number of annotations in each evidence code category.

Evidence Code    Annotation count
IEA              11270149
ISS              201276
ND               129765
RCA              79040
IDA              63208
IMP              54322
TAS              45068
NAS              25466
IPI              13436
IGI              7100
IC               4328
IEP              3866
NR               1288

Organism                                      Annotation count
Mus musculus [House mouse]                    225901
Rattus norvegicus [Brown Rat]                 72964
Arabidopsis thaliana [Thale cress]            105575
Escherichia coli                              69687
Caenorhabditis elegans                        60749
Drosophila melanogaster [Fruit fly]           69707
Danio rerio [zebra fish]                      63912
Homo sapiens [human]                          163437
Bos taurus [Cow]                              91820
Cannabis sativa                               95574
Hepatitis C virus                             80223
Tetraodon nigroviridis                        62712
Gallus gallus [chicken]                       56016
Xenopus laevis [African clawed frog]          50653
Paramecium tetraurelia                        41977
Saccharomyces cerevisiae [baker's yeast]      36092
Schizosaccharomyces pombe [fission yeast]     33002
Aedes aegypti [yellow fever mosquito]         32706
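
The evidence code counts above can be checked against the total of 11,898,312 annotations quoted earlier on this page. The short script below simply sums the thirteen values copied from the table; the variable names are chosen here for illustration.

```python
# Annotation counts per evidence code, copied from the table above.
evidence_counts = {
    "IEA": 11270149,
    "ISS": 201276,
    "ND": 129765,
    "RCA": 79040,
    "IDA": 63208,
    "IMP": 54322,
    "TAS": 45068,
    "NAS": 25466,
    "IPI": 13436,
    "IGI": 7100,
    "IC": 4328,
    "IEP": 3866,
    "NR": 1288,
}

total = sum(evidence_counts.values())
iea_share = evidence_counts["IEA"] / total

print(f"total annotations: {total:,}")
print(f"IEA share of total: {iea_share:.1%}")
```

The sum reproduces the stated total, and it shows that annotations with the IEA evidence code make up roughly 95% of the database.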



PFP is operated and maintained by the Kihara Lab at Purdue University, West Lafayette, IN, USA.

last updated December 21, 2007