Remove false positive linkages derived from experimental noise and drift in otherwise unchanging baseline expression

Left uncorrected, this bias distorts learning, because reference linkages from the dominant GO term "protein biosynthesis" have a strong effect. Second, physical protein interaction and genetic interaction data can be assigned scores that provide fine-grained, continuous-valued confidence measures on a per-interaction basis. The score that we employ, based on the hypergeometric probability, is simple and robust and works across a variety of experimental techniques; it would therefore be appropriate even as a final confidence score reported directly from large-scale experimental assays. Introducing this score significantly improves the performance of these data in deriving the probabilistic gene network. Third, introducing two additional parameters into the analysis of mRNA co-expression linkages significantly decreases the number of false positive linkages while also decreasing the variance in the quality of the derived linkages. Incorporating each of these optimizations into YeastNet v. 2 significantly improves the quality of the model, improving precision and recall on independent test sets and making the model more general across diverse cellular systems. We expect that the protocol we present for calculating the network is general and could be applied to other organisms essentially as described. We describe applications of the gene network to functional prediction and to the prediction of essential genes. To enable similar analyses with YeastNet v. 2, we have established a web site where the network can be downloaded in full. We anticipate posting future updates of the network to this site as new data sets become available.

To benchmark the functional linkages assigned in this study, we used three different reference sets.
As the major reference set for benchmarking, we used the Gene Ontology (GO) annotation, downloaded from the Saccharomyces cerevisiae Genome Database in March 2005. The GO schema comprises three hierarchies, describing "biological process", "molecular function", and "cellular component". For training the network, we used the Saccharomyces cerevisiae GO "biological process" annotation, which contains up to 14 levels of information below the term "biological process" in the hierarchy. We used terms belonging to levels 2 through 10. We also excluded the term "protein biosynthesis" because it annotates so many genes that it significantly biases the benchmarking. To construct the reference set of linkages, we considered a pair of genes to be functionally linked if the two genes shared an annotation from this set of GO terms. These pairs comprised our positive reference set for training network models. Negative examples were constructed as pairs of annotated genes not sharing any annotation terms, i.e., all other links among this annotated set of genes.

We introduced two additional parameters to improve coexpression inferences: R, a threshold for the minimum observed fold change in mRNA levels across the set of array experiments, and M, a threshold for the minimum number of microarray experiments in which a gene's change in expression reaches R. Thus, only genes that are differentially expressed by at least R-fold on at least M microarrays in the given data set are considered for co-expression linkages. These parameters considerably reduce the false positive rate among linkages by removing genes that do not vary across the set of arrays being analyzed, on the premise that genes expressed at a constant level across the tested conditions are unlikely to be relevant to the conditions of the experiments or to participate in strong coexpression relationships.
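
The R and M thresholds described above can be illustrated with a minimal sketch. It assumes that expression values are stored as log2 ratios, that both induction and repression count toward the fold-change criterion, and that missing measurements are ignored; none of these details are specified in the text. The function names, the gene labels, and the example values R = 2 and M = 3 are hypothetical and chosen only for illustration.

```python
import math


def passes_expression_filter(log2_ratios, r_fold, m_arrays):
    """Return True if a gene changes by at least r_fold on at least m_arrays arrays.

    log2_ratios: per-array log2 expression ratios (None marks a missing value).
    Assumption: up- and down-regulation both count toward the threshold.
    """
    threshold = math.log2(r_fold)
    n_changed = sum(
        1 for x in log2_ratios if x is not None and abs(x) >= threshold
    )
    return n_changed >= m_arrays


def filter_genes(expression, r_fold, m_arrays):
    """Keep only the genes that pass the R-fold / M-array filter."""
    return {
        gene: values
        for gene, values in expression.items()
        if passes_expression_filter(values, r_fold, m_arrays)
    }


# Hypothetical toy data: gene label -> log2 ratios across four arrays.
expression = {
    "geneA": [1.5, -1.2, 0.1, 2.0],    # three arrays at or beyond 2-fold
    "geneB": [0.1, -0.2, 0.05, 0.3],   # essentially constant expression
    "geneC": [1.1, 0.2, None, -0.4],   # only one array beyond 2-fold
}

kept = filter_genes(expression, r_fold=2.0, m_arrays=3)
```

With these toy values only `geneA` survives the filter, so only it would go on to the pairwise co-expression calculation. In practice R and M would presumably be tuned per data set rather than fixed in advance.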