Our approach is to analyse, on a genome-wide scale, the upstream regions of human genes, with an emphasis on transcription factor binding sites. Using existing databases such as TRANSFAC to retrieve binding site data, we will recode each transcription factor binding site as a distinct number, so that each upstream region is identified by its own subset of numbers. Using these recoded vectors, the following analyses will be carried out: multiple alignment, multivariate analysis, neural network analysis and expression analysis using microarray data.
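The recoding step can be illustrated with a minimal sketch. The gene names, the binding site names and the helper functions `build_site_codes` and `recode` are hypothetical and chosen only for illustration; they are not taken from TRANSFAC, and the numbering scheme (integers assigned in order of first appearance) is an assumption rather than the scheme the project will necessarily use.

```python
# Hypothetical input: gene -> binding-site names found in its upstream region.
# Names are illustrative only and were not retrieved from TRANSFAC.
upstream_sites = {
    "geneA": ["SP1", "AP1", "NFKB"],
    "geneB": ["SP1", "NFKB"],
    "geneC": ["CREB", "AP1"],
}


def build_site_codes(regions):
    """Give each distinct site name a unique integer, in order of first appearance."""
    codes = {}
    for sites in regions.values():
        for site in sites:
            if site not in codes:
                codes[site] = len(codes) + 1
    return codes


def recode(regions, codes):
    """Replace every site name with its integer code, keeping the original order."""
    return {gene: [codes[site] for site in sites] for gene, sites in regions.items()}


site_codes = build_site_codes(upstream_sites)
recoded = recode(upstream_sites, site_codes)

for gene, vector in recoded.items():
    print(gene, vector)
```

Each upstream region is then a short vector of integers, a form that alignment, multivariate and neural network methods can take as input.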
These approaches should make it possible to identify homologous upstream regions as well as divergent ones. The biological significance of this work lies in determining whether two genes with similar upstream regions have a similar function and are expressed at the same level and, conversely, whether two genes with very divergent upstream sequences also differ in function and expression level. The ultimate goal is to infer expression pattern from sequence.
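One simple way to score how similar or divergent two recoded upstream regions are is the Jaccard index of their binding site sets. This is a sketch under stated assumptions: the Jaccard index is a stand-in chosen for brevity, not the alignment or multivariate method the project will use, and the integer vectors and the `jaccard` helper below are hypothetical.

```python
from itertools import combinations

# Hypothetical recoded upstream regions: gene -> integer codes of its binding sites.
recoded_regions = {
    "geneA": [1, 2, 3],
    "geneB": [1, 3],
    "geneC": [4, 2],
}


def jaccard(a, b):
    """Shared site codes divided by all site codes seen in either region."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty regions are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Score every pair of genes; values near 1 suggest similar regions, near 0 divergent ones.
scores = {
    (g1, g2): jaccard(recoded_regions[g1], recoded_regions[g2])
    for g1, g2 in combinations(recoded_regions, 2)
}

for pair, score in scores.items():
    print(pair, round(score, 2))
```

A set-based score ignores the order and spacing of sites, which is why alignment of the recoded vectors remains a separate analysis.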
Future work will include a comparative approach, analysing the mouse genome in a similar manner and comparing transcription factor binding site information with expression profiles of both human and mouse genes.