About PFP Automated Protein Function Prediction Server
Thank you for using PFP.
The results are listed as the top ten most probable Gene Ontology
annotations in the Biological Process, Molecular Function, and Cellular
Component categories. Please note that these automated function
predictions are not intended to be perfectly accurate, but only to
represent the statistical probability that your sequence matches the
listed function annotations according to known associations extracted
from a variety of publicly available functional databases.
The sequence you submitted was queried with an iterative PSI-BLAST (3
iterations) against UniProt. These results were cross-referenced to the
GO database version GO_200702 (available from
www.geneontology.org) to create a list of primary annotations for your
sequence. Database statistics are available below; the total count of available annotations is 11,898,312. Secondary annotations were identified and scored from known
associations to add depth not available from direct sequence comparison.
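As a rough illustration of the search step described above, the following sketch assembles a three-iteration PSI-BLAST command line. It assumes the NCBI BLAST+ `psiblast` program; the server itself may use different tooling. The file names and the database name `uniprot` are hypothetical placeholders, and the tabular output format is an arbitrary choice for the example.

```python
def build_psiblast_command(query_fasta, database, iterations=3, out_path="hits.tsv"):
    """Return the argument list for an iterative PSI-BLAST search.

    Assumes the NCBI BLAST+ `psiblast` binary; nothing is executed here.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return [
        "psiblast",
        "-query", query_fasta,               # FASTA file with the query sequence
        "-db", database,                     # name of a formatted protein database
        "-num_iterations", str(iterations),  # 3 iterations, as stated above
        "-outfmt", "6",                      # tabular output, one hit per line
        "-out", out_path,
    ]


# Hypothetical file and database names, for illustration only.
command = build_psiblast_command("query.fasta", "uniprot")
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ and a formatted database are installed.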
PFP is hosted by the Kihara Lab at Purdue University, West Lafayette, IN,
USA. If you have any questions, concerns, comments or feedback, please
email Troy Hawkins.
Thank you,
Troy Hawkins.
Meghana Chitale.
Daisuke Kihara.
Department of Biological Sciences, Department of Computer Science,
Purdue University,
West Lafayette, IN 47907
High-throughput techniques for experimental genomics and proteomics are driving rapid development of interpretative algorithms in bioinformatics, in particular methods that predict protein function from sequence, structure, gene expression and protein-protein interaction data. Of these data, protein sequences are the most plentiful, reliable and readily available. The large volume of experimentally characterized sequences and sequence motifs has allowed the creation of reliable models for function annotation with far greater coverage of protein functional space than can be generated from any other single source of biological data. Although several existing tools exploit the information that sequences hold, sequences still remain a rich source of new information in bioinformatics.
PFP is a publicly available server that automatically predicts the function of a query sequence as Gene Ontology (GO) biological process, molecular function, and cellular component terms. The PFP algorithm has been shown to increase coverage of sequence-based function annotation more than fivefold in two ways: by extending a PSI-BLAST search to extract and score GO terms individually, including information from distantly related sequences, and by applying a novel data mining tool, the Function Association Matrix (FAM), to score significantly associated pairs of annotations.
The current version of PFP is sequence-based. The aim is to utilize information from relatively weak hits in PSI-BLAST, which are not conventionally used. Typically, weak PSI-BLAST hits are not perfect orthologs of the query sequence, but rather share a common functional domain with it. In addition to simply transferring the function of the common domain to the query sequence, our idea is to also consider functions that are frequently associated with the annotated functions of the domain. To this end, we have built Function Association Matrices (FAMs) that quantify the co-occurrence of Gene Ontology (GO) annotations in sequences of the UniProt database (Figure 1). The GO is a controlled hierarchical vocabulary describing the function of genes in three categories: molecular function, biological process, and cellular component. Approximately two thirds of the associated function pairs mined from UniProt bridge functions of different categories. Thus, FAMs allow us to assign functions that cannot be retrieved directly from highly similar sequences or structures. Taking advantage of the hierarchical nature of the GO vocabulary, we have developed a series of FAMs at varying “resolutions”, i.e. depths of functional association in the GO hierarchy. The structure of GO also allows us to define functional proximity as the coordinate distance between the annotation sets of two proteins on the GO tree.
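To make the idea of a co-occurrence matrix concrete, here is a minimal sketch that counts how often pairs of GO terms annotate the same sequence and turns the counts into a simple conditional frequency. It is an illustration under stated assumptions, not the scoring used by PFP: the three annotated sequences are invented toy data, and the function names and the plain ratio estimate are choices made for this example.

```python
from collections import Counter
from itertools import combinations


def build_fam(annotated_sequences):
    """Count single-term and pairwise co-occurrences of GO terms.

    `annotated_sequences` is an iterable of GO-term lists, one per sequence.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotated_sequences:
        unique = sorted(set(terms))  # ignore duplicate terms on one sequence
        term_counts.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1  # keep the matrix symmetric
    return term_counts, pair_counts


def association(term_counts, pair_counts, given, target):
    """Fraction of sequences annotated with `given` that also carry `target`."""
    if term_counts[given] == 0:
        return 0.0
    return pair_counts[(target, given)] / term_counts[given]


# Toy data (invented): each inner list is the GO annotation set of one sequence.
toy_sequences = [
    ["GO:0004672", "GO:0005524", "GO:0006468"],
    ["GO:0004672", "GO:0006468"],
    ["GO:0005524", "GO:0005634"],
]
term_counts, pair_counts = build_fam(toy_sequences)
print(association(term_counts, pair_counts, "GO:0004672", "GO:0006468"))
print(association(term_counts, pair_counts, "GO:0005524", "GO:0006468"))
```

In this toy set, a molecular function term and a biological process term co-occur on the same sequences, which is the kind of cross-category association the paragraph above describes.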

FIGURE 1. A visual representation of the UniProt FAM in four resolutions. From left to right, matrices include associations between direct annotations (+0), direct annotations and one parent generation (+1), direct annotations and two parent generations (+2), and direct annotations and three parent generations (+3). The color scale plots the log of the co-occurrence of 475 GO annotations extracted from a random set of sequences in the UniProt database, including 64 cellular component annotations, 194 molecular function annotations and 217 biological process annotations in that order [x- and y-axes are equal].
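The four resolutions in the caption (+0 to +3) can be pictured as expanding each sequence's direct annotations by a fixed number of parent generations before counting associations. The sketch below shows one way to do that expansion. The term names and the parent map are a hypothetical toy hierarchy, not real GO terms, and the function name is an assumption for this example.

```python
def expand_annotations(direct_terms, parents, generations):
    """Return the direct terms plus up to `generations` levels of parent terms.

    `parents` maps each term to an iterable of its immediate parents.
    """
    expanded = set(direct_terms)
    frontier = set(direct_terms)
    for _ in range(generations):
        # Collect the parents of the current level that were not seen before.
        frontier = {p for t in frontier for p in parents.get(t, ())} - expanded
        expanded |= frontier
    return expanded


# Hypothetical toy hierarchy: "leaf" has two parents that share one ancestor.
toy_parents = {
    "leaf": ["mid_a", "mid_b"],
    "mid_a": ["top"],
    "mid_b": ["top"],
    "top": ["root"],
}

for generations in range(4):
    terms = expand_annotations({"leaf"}, toy_parents, generations)
    print(f"+{generations}:", sorted(terms))
```

Feeding the expanded sets into a co-occurrence count would give one matrix per resolution, with coarser, more general terms appearing as more parent generations are included.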
The following table lists the total number of annotations in each evidence code category.
| Evidence Code | Annotation count |
| IEA | 11270149 |
| ISS | 201276 |
| ND | 129765 |
| RCA | 79040 |
| IDA | 63208 |
| IMP | 54322 |
| TAS | 45068 |
| NAS | 25466 |
| IPI | 13436 |
| IGI | 7100 |
| IC | 4328 |
| IEP | 3866 |
| NR | 1288 |
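As a quick consistency check, the evidence code counts above can be summed and compared with the total of 11,898,312 annotations quoted earlier on this page. The short sketch below does that and also computes the share of IEA (Inferred from Electronic Annotation) entries. The variable names are chosen for this example.

```python
# Annotation counts per evidence code, copied from the table above.
evidence_counts = {
    "IEA": 11270149,
    "ISS": 201276,
    "ND": 129765,
    "RCA": 79040,
    "IDA": 63208,
    "IMP": 54322,
    "TAS": 45068,
    "NAS": 25466,
    "IPI": 13436,
    "IGI": 7100,
    "IC": 4328,
    "IEP": 3866,
    "NR": 1288,
}

total = sum(evidence_counts.values())
iea_share = evidence_counts["IEA"] / total

print(f"Total annotations: {total:,}")
print(f"IEA share of all annotations: {iea_share:.1%}")
```

The sum matches the quoted total, and it shows that the large majority of available annotations are electronically inferred rather than experimentally derived.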
The following table lists the number of annotations for selected organisms.
| Organism | Annotation count |
| Mus musculus [house mouse] | 225901 |
| Rattus norvegicus [brown rat] | 72964 |
| Arabidopsis thaliana [thale cress] | 105575 |
| Escherichia coli | 69687 |
| Caenorhabditis elegans | 60749 |
| Drosophila melanogaster [fruit fly] | 69707 |
| Danio rerio [zebrafish] | 63912 |
| Homo sapiens [human] | 163437 |
| Bos taurus [cow] | 91820 |
| Cannabis sativa | 95574 |
| Hepatitis C virus | 80223 |
| Tetraodon nigroviridis | 62712 |
| Gallus gallus [chicken] | 56016 |
| Xenopus laevis [African clawed frog] | 50653 |
| Paramecium tetraurelia | 41977 |
| Saccharomyces cerevisiae [baker's yeast] | 36092 |
| Schizosaccharomyces pombe [fission yeast] | 33002 |
| Aedes aegypti [yellow fever mosquito] | 32706 |
PFP is operated and maintained by the Kihara Lab at Purdue University, West Lafayette, IN, USA.
Last updated December 21, 2007