B213-207Dom

Domain boundaries prediction using a multi-layered neural network

E. Tapia¹, Y.H. Tan¹ and D. Kihara²,¹

¹ – Dept. of Computer Science, Purdue University, ² – Dept. of Biological Sciences, Purdue University, West Lafayette, IN, USA

dkihara@purdue.edu

Understanding the domain organization of a protein is crucial for the structural determination of large proteins using techniques with an inherent size limitation. To predict domain boundaries in CASP6 targets, we implemented a multi-layered artificial neural network that essentially uses only the sequence information of the target. We also compared the results with the outputs of other prediction servers⁶,⁷. The architecture of our approach can be divided into three levels: (1) a multilayered neural network¹⁻³ that predicts domain boundaries from sequence information, (2) a second neural network that refines the output of the first one, and (3) a statistical approach that combines the outputs from different networks and different databases.

A fully connected neural network with 11 input groups (optimal window size 11) was designed for the first level of our architecture. Each input group consists of 24 units for each residue in the window and two extra units that store values relative to the whole window. This neural network was trained using several types of information derived from the sequence: (1) a multiple sequence alignment obtained by running PSI-BLAST⁹ on the target sequence; (2) a secondary structure prediction obtained from the PSIPRED⁸ prediction server; (3) the average Kyte-Doolittle hydrophobicity index of a window; and (4) the domain delineation index⁴, which identifies regions with a high concentration of N- and C-termini of aligned homologous sequences in the multiple sequence alignment. For training and testing, we extracted 600 proteins from the SCOP database⁵ with a uniform distribution among families and subfamilies. We randomly distributed these sequences into 10 different databases and trained several networks for each database. The final level of our architecture merges the information from the different networks and databases to obtain an optimal domain boundary prediction.
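As an illustration of one of the window-based inputs described above, the following is a minimal sketch of an average Kyte-Doolittle hydrophobicity computed over a sliding window of 11 residues. The function name, the truncation of windows at the termini, and the skipping of non-standard residues are assumptions made for this example; the abstract does not specify how these cases were handled.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydrophobicity(sequence, window=11):
    """Return the mean Kyte-Doolittle value of a window centered on each residue.

    Assumptions (not taken from the abstract): windows are truncated at the
    sequence termini, and residues without a table entry are skipped.
    """
    half = window // 2
    averages = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        values = [KYTE_DOOLITTLE[aa] for aa in sequence[start:end]
                  if aa in KYTE_DOOLITTLE]
        # An empty window (only unknown residues) is scored as neutral.
        averages.append(sum(values) / len(values) if values else 0.0)
    return averages
```

Calling `window_hydrophobicity(seq)` yields one value per residue, which could then be supplied as a whole-window input alongside the per-residue units.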

The results of our work show that using a multilayer network improves performance compared with a single network. The method is especially effective when different databases are used, several networks are run for each database, and the results are combined. Further improvement could be expected by incorporating additional information, such as the average number of domains with respect to the sequence length or the average distance of the domain boundaries from the N- and C-termini.
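The abstract does not detail the statistical procedure used to combine network outputs. Purely as a hedged illustration of the general idea, the sketch below averages per-residue boundary scores from several networks and then selects well-separated high-scoring positions. Simple averaging, the function names, and the `threshold` and `min_separation` parameters are all hypothetical choices for this example, not the authors' method.

```python
def combine_predictions(per_network_scores):
    """Average per-residue boundary scores over several networks.

    per_network_scores: list of equal-length lists, one per network.
    Plain averaging is an assumption; the abstract only mentions
    "a statistical approach".
    """
    n_networks = len(per_network_scores)
    length = len(per_network_scores[0])
    return [
        sum(scores[i] for scores in per_network_scores) / n_networks
        for i in range(length)
    ]


def call_boundaries(scores, threshold=0.5, min_separation=20):
    """Pick positions scoring at least `threshold`, highest first,
    keeping only those at least `min_separation` residues apart.

    Both parameters are hypothetical values chosen for illustration.
    """
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= threshold),
        key=lambda i: -scores[i],
    )
    chosen = []
    for i in candidates:
        if all(abs(i - j) >= min_separation for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

For example, the averaged scores from `combine_predictions` can be passed directly to `call_boundaries` to obtain a list of predicted boundary positions (0-based residue indices).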

1. Baldi P. & Brunak S. (2001) Bioinformatics: The Machine Learning Approach, 2nd edition, MIT Press.

2. Krogh A. & Vedelsby J. (1995) Neural network ensembles, cross validation, and active learning. Tesauro, G., Touretzky, D. & Leen, T., (eds.) NIPS 7. The MIT Press, pp. 231–238.

3. Baldi, P., Brunak, S., Chauvin, Y. & Nielsen, H. (1999) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412-424.

4. George, R.A. & Heringa, J. (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins, 48, 672–681.

5. Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.

6. Marsden, R.L., McGuffin, L.J. & Jones, D.T. (2002) Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science, 11, 2814-2824.

7. Suyama M. & Ohara O. (2003) DomCut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics, 19, 673-674.

8. McGuffin, L.J., Bryson, K. & Jones, D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics, 16, 404-405.

9. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.