e-PROTEIN: A Distributed Pipeline for Structure-based Proteome Annotation using Grid Technology

Keiran Fleming¹, Liam McGuffin², Stefano Street, Andreas Kahari, Tim Massingham, Steven Newhouse, James Cuff, Ewan Birney, Soren Sorenson, Christine Orengo, John Darlington, David Jones, Janet Thornton, Michael Sternberg
¹k.fleming@imperial.ac.uk, Imperial College London; ²l.mcguffin@cs.ucl.ac.uk, University College London

With the recent acceleration in the number of completely sequenced genomes, it is becoming increasingly important to organise and exploit the information they contain. It is therefore essential that we have structure-based annotation of the encoded proteins in terms of their 3-D conformations and, putatively, their functions. It is envisaged that the elucidation of protein structures and functions will eventually aid the identification of new drug targets, may reveal evolutionary relationships, and will increase our understanding of the basic biology of cells. Annotating whole proteomes in this way is computationally intensive, given the extent of the data, and therefore requires the sharing of bioinformatics resources across multiple sites.

The e-Protein project aims to provide a structure-based annotation of the proteins in the major genomes by linking resources via Grid technology at three sites: Imperial College London, University College London, and the European Bioinformatics Institute (EBI). We will establish a local MySQL database at each site, containing structural and functional annotations that reflect the strengths and interests of the researchers at that institution. The annotations will be generated via the transparent sharing of high performance computing resources between sites using Grid technology, with GLOBUS as the software component of the Grid in our case. A custom-built web-based client, which works with protein sequences and uses the DAS (Distributed Annotation System) annotation procedure, will provide a common front end to the three databases. DAS is already in use at Ensembl and is established as a protocol that is easily understood and used by the molecular biology community.
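As an illustration of how a single client could sit in front of several annotation servers, the following minimal Python sketch builds DAS-style "features" requests and merges the XML responses from more than one site. It assumes the general DAS 1 conventions (a `features` command with a `segment` parameter, and a `DASGFF` response containing `SEGMENT` and `FEATURE` elements); the server URLs, the protein identifier, the feature types, and the sample responses are all hypothetical placeholders and do not describe the project's actual servers or data.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical DAS source URLs, one per site (placeholders, not real endpoints).
SOURCES = {
    "imperial": "http://das.imperial.example/das/eprotein",
    "ucl": "http://das.ucl.example/das/eprotein",
    "ebi": "http://das.ebi.example/das/eprotein",
}


def features_url(source_url, segment_id, start=None, stop=None):
    """Build a DAS-style 'features' request URL for one protein segment."""
    segment = segment_id if start is None else f"{segment_id}:{start},{stop}"
    # Keep ':' and ',' unescaped so the segment reads as ID:start,stop.
    return f"{source_url}/features?{urlencode({'segment': segment}, safe=':,')}"


def parse_features(xml_text):
    """Extract features from a DASGFF-style XML document as a list of dicts."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.findall("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


def merge_annotations(responses):
    """Combine per-site responses into one list ordered by segment and start."""
    merged = []
    for site, xml_text in responses.items():
        for feature in parse_features(xml_text):
            merged.append({"site": site, **feature})
    return sorted(merged, key=lambda f: (f["segment"], f["start"]))


# Made-up responses standing in for what two sites might return.
SAMPLE_IMPERIAL = """<DASGFF><GFF version="1.0">
  <SEGMENT id="P12345" start="1" stop="200">
    <FEATURE id="ic-1"><TYPE id="predicted_domain">predicted_domain</TYPE>
      <START>10</START><END>95</END></FEATURE>
  </SEGMENT></GFF></DASGFF>"""

SAMPLE_UCL = """<DASGFF><GFF version="1.0">
  <SEGMENT id="P12345" start="1" stop="200">
    <FEATURE id="ucl-1"><TYPE id="predicted_fold">predicted_fold</TYPE>
      <START>5</START><END>120</END></FEATURE>
  </SEGMENT></GFF></DASGFF>"""

if __name__ == "__main__":
    for site, url in SOURCES.items():
        print(site, features_url(url, "P12345", 1, 200))
    responses = {"imperial": SAMPLE_IMPERIAL, "ucl": SAMPLE_UCL}
    for row in merge_annotations(responses):
        print(row)
```

In a real client the sample strings would be replaced by the bodies of HTTP responses fetched from each URL; the network step is left out here so that the sketch runs without any server.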

At the end of the first six months we have developed a prototype “Protein DAS” interface that can access annotations at two of the sites, and have used GLOBUS to perform annotations on resources shared between the different institutions, demonstrating our first steps towards fulfilling the Grid aspects of the project.