NetOGlyc 3.0: Prediction of mucin-type O-glycosylation sites from sequence and sequence-derived features.

Karin Julenius1, Ramneek Gupta, Kristoffer Rapacki, Lars Juhl Jensen, Søren Brunak
1kj@cbs.dtu.dk, Center for Biological Sequence Analysis, BioCentrum-DTU

NetOGlyc 2.0 is a predictor of mucin-type O-glycosylation sites (Hansen et al., Glycoconj. J., 1998, 15:115-130). This predictor is widely used, but experimental data published since 1998 have shown that it does not perform as well as expected on the newly characterized proteins, i.e. on examples that were not included in the training of the predictor. We are therefore training a new predictor, using better and more stringent methods to minimize sequence similarity between proteins in the training and test sets. NetOGlyc 3.0 uses a neural network with a back-propagation learning algorithm, the same as NetOGlyc 2.0. In addition to the sequence window around the site (used in the earlier approach), a number of features predicted from the sequence are supplied as input to the network: (i) secondary structure, (ii) surface accessibility, (iii) PSI-BLAST profile, (iv) distance constraints matrix and (v) amino acid composition. These features, together with the sequence itself, are now being tested in different combinations to find the optimal network architecture. To train the publicly available predictor, we are using all presently known, experimentally verified in vivo GalNAc glycosylation sites from mammalian proteins. In parallel, we also train on a smaller set (the same as for the earlier version) in order to test the generalization behaviour, i.e. the ability of the network to predict correctly for completely new examples. The Matthews correlation coefficient for the network is currently well above 0.6, and the generalization behaviour is comparable.
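For readers unfamiliar with the performance measure quoted above, the following is a minimal sketch of how a Matthews correlation coefficient can be computed from binary site predictions. It is an illustration only, not the NetOGlyc evaluation code: the function names `confusion_counts` and `matthews_cc`, the 0/1 label encoding (1 = glycosylated site) and the example labels are assumptions made for this sketch.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for 0/1 labels; 1 = glycosylated site (assumed encoding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when the denominator is zero (a common convention,
    e.g. when every site is predicted as the same class).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical labels for six candidate Ser/Thr positions (not real data).
    observed = [1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 1, 0]
    counts = confusion_counts(observed, predicted)
    print(counts, round(matthews_cc(*counts), 3))
```

Unlike plain accuracy, this coefficient stays informative when non-glycosylated positions greatly outnumber glycosylated ones, because all four cells of the confusion matrix enter the formula.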

See www.cbs.dtu.dk/services/NetOGlyc/, where NetOGlyc 3.0 will be available in time for the conference.