CleanBank is a database that documents suspected artifacts found in sequences (e.g. vector
contamination) and/or their annotation (e.g. erroneous species assignment) in
the international sequence databases (INSD).
INSD has an obligation to ensure the completeness of the sequence record and
to credit all the original authors of a sequence (Brunak et al., Science
2002, 298:1333). However, as a result,
researchers who identify errors in sequences have no way of publishing their
findings in the original database. The
artifacts cause two major problems: inexperienced
users of bioinformatics often misinterpret the results, and experienced users
still find that performing high-throughput research (e.g. EST assembly into
transcripts) requires intensive cleaning of the sequences.
To overcome these problems, and
yet maintain the integrity of the original data, we have established a parallel
database, CleanBank.
In CleanBank, artifacts are either reported by researchers, or identified
by curated algorithms. Current
algorithms detect E. coli contamination (using BLAT), and vector contamination
(using BLAST and a novel method based on restriction site identification).
Confidence levels are assigned both to the reliability of each curated method
and to individual results. Single
entries can be explored, or a cleaned version of the INSD can be produced
according to the confidence level chosen by the user.
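
As an illustration only, the confidence-based cleaning described above might be sketched as follows. Everything in this sketch is an assumption rather than the CleanBank implementation: the record type `ArtifactReport`, the functions `flagged_accessions` and `cleaned_accessions`, the numeric 0 to 1 confidence scale, the rule that a report counts only when both its method-level and result-level confidence reach the user's threshold, and the choice to drop a flagged entry entirely rather than mask the affected region.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ArtifactReport:
    """Hypothetical record of one suspected artifact (not the CleanBank schema)."""
    accession: str            # accession of the affected INSD entry
    kind: str                 # e.g. "vector" or "E. coli"
    method_confidence: float  # assumed 0..1 reliability of the detection method
    result_confidence: float  # assumed 0..1 confidence in this individual result


def flagged_accessions(reports: Iterable[ArtifactReport], threshold: float) -> Set[str]:
    """Return accessions with at least one report at or above the threshold.

    Assumption: a report qualifies only if both confidence values reach
    the threshold, i.e. the weaker of the two decides.
    """
    return {
        r.accession
        for r in reports
        if min(r.method_confidence, r.result_confidence) >= threshold
    }


def cleaned_accessions(
    accessions: Iterable[str],
    reports: Iterable[ArtifactReport],
    threshold: float,
) -> List[str]:
    """Return the accessions that survive cleaning, in their original order."""
    flagged = flagged_accessions(reports, threshold)
    return [a for a in accessions if a not in flagged]
```

Under these assumptions, a stricter (higher) threshold removes fewer entries, because only well-supported reports are acted on, while a lower threshold yields a smaller but more conservatively cleaned set.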
For a more detailed description of the proposed database, and a preview of the data, see http://bip.weizmann.ac.il/MIW/CleanBank/index.html