UniProt: Universal Protein Databases for Protein Sequences and Function
Allyson Williams (1), Maria Jesus Martin (2), Claire O'Donovan, Daniel Barrell, Alexander Fedotov, Rolf Apweiler
(1) allyson@ebi.ac.uk, EMBL-EBI; (2) martin@ebi.ac.uk, EMBL-EBI
The UniProt consortium (European Bioinformatics Institute, Swiss Institute of Bioinformatics, and the Protein Information Resource) was created to merge Swiss-Prot, TrEMBL and PIR database activities into a new resource capable of providing a stable, comprehensive view of protein sequences and function. This resource is composed of three layers: the UniProt protein sequence archive, the UniProt protein knowledgebase, and the UniProt non-redundant reference (NREF) databases.
The UniProt protein sequence archive provides complete coverage of all protein sequences by importing new and revised sequences from many sources. It forms the base from which the UniProt knowledgebase is drawn, and is used for performing computational analyses and for retrieving historic data via a versioning server.
The UniProt protein knowledgebase is a merger of the Swiss-Prot, TrEMBL and PIR-PSD databases, and has two parts: Swiss-Prot, containing fully manually annotated records, and TrEMBL, containing records that are mainly computationally analysed. Most redundancy is removed from the knowledgebase automatically, but these automatic procedures are limited so that similar sequences from different proteins remain separate entries.
A completely non-redundant view is nevertheless available through the UniProt NREF databases NREF100, NREF90 and NREF50. NREF100 (100% sequence identity) contains all records from the UniProt knowledgebase, but with identical sequences and sub-fragments from the same source organism presented as a single entry. NREF90 and NREF50 are built from NREF100 by merging all records with a mutual sequence identity of more than 90% and more than 50%, respectively.
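The NREF merging rules can be illustrated with a minimal Python sketch. It is not the UniProt production pipeline: the record layout, the function names (`build_nref100`, `cluster_by_identity`, `positional_identity`), the greedy longest-first strategy, and the toy identity measure are all assumptions made for illustration. A real build would compute identity from sequence alignments.

```python
from collections import defaultdict


def build_nref100(records):
    """Merge identical sequences and sub-fragments from the same organism.

    `records` is an iterable of (accession, organism, sequence) tuples
    (a hypothetical layout). Returns a list of dicts, one per merged entry.
    """
    by_organism = defaultdict(list)
    for accession, organism, sequence in records:
        by_organism[organism].append((accession, sequence))

    clusters = []
    for organism, entries in by_organism.items():
        # Longest first, so a fragment always meets its containing sequence.
        entries.sort(key=lambda e: (-len(e[1]), e[0]))
        organism_clusters = []
        for accession, sequence in entries:
            for cluster in organism_clusters:
                # Identical sequences and sub-fragments are both substrings.
                if sequence in cluster["representative"]:
                    cluster["members"].append(accession)
                    break
            else:
                organism_clusters.append({
                    "organism": organism,
                    "representative": sequence,
                    "members": [accession],
                })
        clusters.extend(organism_clusters)
    return clusters


def positional_identity(a, b):
    """Toy identity: matching positions divided by the longer length.

    A stand-in for alignment-based identity, used only to keep the
    example self-contained.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster_by_identity(sequences, threshold, identity=positional_identity):
    """Greedily group sequences whose identity to a cluster's first
    (longest) member exceeds `threshold`, e.g. 0.9 or 0.5."""
    clusters = []
    for sequence in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(sequence, cluster[0]) > threshold:
                cluster.append(sequence)
                break
        else:
            clusters.append([sequence])
    return clusters
```

In this sketch the NREF90-style and NREF50-style groupings would be obtained by passing the NREF100 representatives to `cluster_by_identity` with thresholds of 0.9 and 0.5. Comparing each sequence only against a cluster's first member is a simplification of "mutual" identity, chosen to keep the example short.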
The main access points to UniProt are through the Consortium members' web sites and through http://www.uniprot.org. Complex as well as simple queries will be supported, including tools for extracting and downloading large datasets. Timely releases will be made available in different formats, including XML.