NetOGlyc 3.0: Prediction of mucin type O-glycosylation sites from sequence and sequence-derived features.
Karin Julenius1, Ramneek Gupta, Kristoffer Rapacki, Lars Juhl Jensen, Søren Brunak
1kj@cbs.dtu.dk, Center for Biological Sequence Analysis, BioCentrum-DTU
NetOGlyc 2.0 is a predictor of mucin type O-glycosylation sites (Hansen et al.,
Glycoconj. J., 1998, 15:115-130). This predictor is widely used, but
experimental data published since 1998 have shown that it does not reach the
expected performance on these newer proteins, i.e. examples that were not
included at all in the training of the predictor. Therefore, we are in the
process of training a new predictor, using better and more stringent methods to
minimize sequence similarity between proteins in the training and test sets.
NetOGlyc 3.0 uses a neural network trained with the back-propagation learning
algorithm, the same as for NetOGlyc 2.0. In addition to the sequence window around
the site (used in the earlier approach), a number of features predicted or
derived from the sequence are supplied as input to the network: i) secondary
structure, ii) surface accessibility, iii) PSI-BLAST profile, iv) distance
constraint matrix and v) amino acid composition. These features, together with
the sequence itself, are now being tested in different combinations to find the
optimal network architecture. We are using all presently known, experimentally
verified in vivo GalNAc glycosylation sites from mammalian proteins to train the
publicly available predictor, but in parallel we also train on a smaller set
(the same as for the earlier version) in order to test the generalization
behaviour (the ability of the network to predict correctly for completely new
examples). The Matthews correlation coefficient for the network is currently
well above 0.6, and the generalization behaviour is comparable.
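
As a rough illustration of two ideas mentioned above, the sketch below shows
one way a sequence window around a candidate Ser/Thr site could be turned into
a sparse network input, and how a Matthews correlation coefficient is computed
from a confusion matrix. It is not the NetOGlyc implementation: the function
names, the 21-letter alphabet with a padding slot, and the window half-width
are assumptions chosen only for the example.

```python
import math

# Assumption: 20 standard amino acids plus one extra slot for padding/unknown.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def candidate_sites(sequence):
    """Return 0-based positions of Ser/Thr residues (possible O-glycosylation sites)."""
    return [i for i, aa in enumerate(sequence) if aa in "ST"]


def encode_window(sequence, site, half_width=3):
    """Sparse (one-hot) encoding of the residues around `site`.

    `half_width` is a hypothetical parameter; positions that fall outside
    the sequence, or hold a non-standard residue, use the padding slot.
    """
    vector = []
    for pos in range(site - half_width, site + half_width + 1):
        one_hot = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= pos < len(sequence) and sequence[pos] in AMINO_ACIDS:
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            one_hot[-1] = 1  # padding / unknown residue
        vector.extend(one_hot)
    return vector


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    seq = "APTSAPTS"  # made-up toy sequence, not real data
    for site in candidate_sites(seq):
        print(site, sum(encode_window(seq, site)))
    print(matthews_cc(tp=40, tn=45, fp=5, fn=10))
```

The coefficient ranges from -1 to 1, with 0 corresponding to random prediction,
which is why it is a convenient single figure when glycosylated sites are much
rarer than non-glycosylated ones.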
See www.cbs.dtu.dk/services/NetOGlyc/, where NetOGlyc 3.0 will be available in
time for the conference.