A Novel Bayesian Network Model for the Study of Genetic Regulatory Networks
Y. Zeng¹, J. Garcia-Frias²
¹zeng@eecis.udel.edu, University of Delaware; ²jgarcia@eecis.udel.edu, University of Delaware
The regulation of protein synthesis is crucial to the life of the cell and determines the expression levels of different genes at different stages. With the emergence of microarray techniques, it has become practical to measure the expression levels of thousands of genes simultaneously. A major challenge in computational biology is to uncover, from such measurements, gene/protein interactions and key biological features of cellular systems.
Bayesian networks (BNs) provide an effective method to model phenomena through the joint distribution of a large number of random variables. The robustness and efficiency of this approach suggest its application to constructing genetic networks from microarray gene expression data. Our previous work on constructing BN models with discretized expression values was reported at ISMB 2002. This poster presents an extension of that work to the case of continuous expression data.
Although the discrete BN model is often used because of its simplicity, it has clear weaknesses. Because microarray data are continuous valued, they must be quantized before model construction. Friedman et al. pointed out that discretization is likely to lose information contained in the data. Moreover, the resulting network depends strongly on the choice of quantization scales. For these reasons, BN models with continuous-valued variables have received increasing attention.
Most of the research on continuous BNs has focused on defining the relationships among variables with regression models. A commonly used criterion in these regression models is the minimization of the mean square error (MSE), which implies a Gaussian assumption for the distribution of the variables (G-model). Although the Gaussian assumption can be derived from a maximum entropy approach, the main justification for its use is practicality. In fact, in spite of its common use, it is far from clear that the Gaussian assumption is a good choice for practical problems, especially when dealing with outliers. Outliers are errors that occur with low probability and are not produced by the regulatory process that we try to uncover. The general problem is that a few outliers (possibly even one) with large deviations are sufficient to throw the standard Gaussian error estimators completely off track. The resulting bias degrades the inference power of the BN model.
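The sensitivity of the MSE criterion to a single outlier can be illustrated with a small numerical sketch. The data below are hypothetical (one regulator x and one target y = 2x + 1, not taken from any microarray experiment), and the helper `least_squares_line` is an illustrative name, not part of the authors' Matlab implementation.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]

# Hypothetical, noise-free relationship between a regulator and its target.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Corrupt a single measurement with a large deviation (one outlier).
y_outlier = y_clean.copy()
y_outlier[-1] += 50.0

slope_clean, intercept_clean = least_squares_line(x, y_clean)
slope_outlier, intercept_outlier = least_squares_line(x, y_outlier)

print(f"clean data:  slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"one outlier: slope={slope_outlier:.2f}, intercept={intercept_outlier:.2f}")
```

A single corrupted point out of ten is enough to move the fitted slope far from its true value of 2, which is the behavior that motivates a heavier-tailed error model.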
In this poster, we propose the use of a Student distribution (T-model) to characterize the variables in the BN model. The Student distribution has two free parameters: the degrees of freedom n and a width parameter σ². An attractive feature of this distribution is that as n approaches infinity we recover the Gaussian model, whereas for n<∞ we obtain distributions with heavier tails than the Gaussian distribution. When n=1, we have the Cauchy model, which is commonly used for robust regression. The proposed model therefore has more flexibility to characterize the interactions among variables. By introducing an additional parameter, our network gains in robustness and accuracy when learning from noisy microarray data. The extra parameter is trained with the EM algorithm, at polynomial computational cost.
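As a rough illustration of how an EM iteration down-weights outliers under a Student error model, the following sketch fits a line with Student-distributed residuals. It is a minimal sketch under stated assumptions, not the authors' algorithm: the degrees of freedom are held fixed (`dof=3.0`) rather than trained, the data are synthetic, and the function name `fit_student_t_line` is hypothetical. In the E-step each point receives the weight (n + 1) / (n + r²/σ²), where r is its residual; the M-step is a weighted least-squares fit followed by an update of σ².

```python
import numpy as np

def fit_student_t_line(x, y, dof=3.0, n_iter=200):
    """Fit y = slope * x + intercept with Student-t residuals via EM.

    Assumption: the degrees of freedom `dof` are fixed, not learned.
    Returns (slope, intercept, scale2), where scale2 is the width parameter.
    """
    design = np.column_stack([x, np.ones_like(x)])
    # Start from the ordinary least-squares (Gaussian) solution.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    scale2 = max(np.mean(resid ** 2), 1e-12)
    for _ in range(n_iter):
        # E-step: points with large residuals get small weights.
        weights = (dof + 1.0) / (dof + resid ** 2 / scale2)
        # M-step: weighted least squares, then update the width parameter.
        root_w = np.sqrt(weights)
        coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
        resid = y - design @ coef
        scale2 = max(np.sum(weights * resid ** 2) / len(y), 1e-12)
    return coef[0], coef[1], scale2

# Synthetic data: y = 2x + 1 with small noise and one large outlier.
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)
y[-1] += 50.0

# Gaussian (least-squares) fit for comparison.
design = np.column_stack([x, np.ones_like(x)])
(slope_g, intercept_g), *_ = np.linalg.lstsq(design, y, rcond=None)

slope_t, intercept_t, scale2_t = fit_student_t_line(x, y)

print(f"G-model fit: slope={slope_g:.2f}, intercept={intercept_g:.2f}")
print(f"T-model fit: slope={slope_t:.2f}, intercept={intercept_t:.2f}")
```

With a very large `dof` all weights approach 1 and the procedure reduces to ordinary least squares, mirroring the Gaussian limit described above.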
We implemented the BNs with continuous variables in Matlab and evaluated both the G-model and the T-model in simulation. Experimental results show that both continuous models perform better than the discrete model. Moreover, the T-model significantly outperforms the G-model in all tests, demonstrating its robustness in modeling genetic regulatory networks.