It is also advantageous to employ an ensemble approach for prediction. Genes associated with similar disease phenotypes are likely to share similar biological functions.

Given a phenotype ph_i, we can infer its potential disease genes from the disease genes associated with similar phenotypes ph_j. A number of methods have thus been proposed to prioritize candidate genes based on different kinds of biological data, such as gene sequence data, gene expression profiles, evolutionary features, functional annotation data, and protein-protein interaction (PPI) data. Adie et al. employed a decision tree algorithm based on a variety of genomic sequence and evolutionary features, such as coding sequence length, evolutionary conservation, and the presence and closeness of paralogs in the human genome. Topological information from the PPI network has also been shown to be useful for disease gene prediction. Smalter et al. applied a support vector machine (SVM) classifier using PPI topological features in addition to sequence-derived and evolutionary features, while Radivojac et al. built three individual SVM classifiers using three types of features (PPI network, protein sequence, and protein functional information) and then built a final classifier to combine the predictions of the three individual classifiers for candidate gene prediction.

The work mentioned above employed classical machine learning methods to build a binary classifier in which the confirmed disease genes are used as the positive training set P and the unknown genes as the negative training set N. However, these techniques are unlikely to perform as well as they could, because the negative set N contains unconfirmed disease genes. In light of this limitation, positive-unlabeled (PU) learning methods have recently been proposed for the task; they build a classification model in which the unknown genes are more appropriately treated as an unlabeled set U. For example, Mordelet et al. proposed a bagging method, ProDiGe, for disease gene prediction. It iteratively chose random subsets RS from U and then trained multiple classifiers using a biased SVM to discriminate P from each subset RS.
The multiple classifiers were subsequently aggregated to generate the final classifier. Given that each random subset RS was likely to contain less noise than the original set U, ProDiGe was able to perform better than classical binary classification models that inappropriately used U as negative training data. More recently, Yang et al. designed a multi-level PU learning algorithm, PUDI, to build a better-performing classifier for disease gene identification; in PUDI, the unlabeled set U was partitioned into multiple positive and negative sets with confidence scores for building the classifier. These prior works have clearly shown that integrating various biological data sources is not only desirable but also essential for robust disease gene prediction, since using only a single source of data for prediction is susceptible to incompleteness and noise in the genomic data.
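
The bagging PU scheme described above for ProDiGe (draw random subsets RS from U, train one classifier per subset to separate P from RS, then aggregate) can be illustrated with a minimal sketch. This is not the authors' implementation: the biased SVM is replaced here by a class-weighted logistic regression written in plain Python, which penalizes errors on P more heavily than errors on RS. The function names (`train_weighted_logreg`, `train_bagging_pu`, `ensemble_score`), the weights, the learning-rate settings, and the one-feature toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_weighted_logreg(X, y, w_pos=1.0, w_unl=0.5, lr=0.1, epochs=200):
    """Class-weighted logistic regression (assumed stand-in for a biased SVM).

    Errors on positives (y == 1) cost w_pos; errors on unlabeled examples
    (y == 0) cost w_unl, reflecting that some of them may be true positives.
    """
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, xi)))
            err = (w_pos if yi == 1 else w_unl) * (p - yi)
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def train_bagging_pu(P, U, n_rounds=15, subset_size=8, seed=0):
    """Train one classifier per random subset RS of U, each separating P from RS."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_rounds):
        RS = rng.sample(U, subset_size)      # random subset of the unlabeled set
        X = P + RS
        y = [1] * len(P) + [0] * len(RS)     # RS is only provisionally negative
        models.append(train_weighted_logreg(X, y))
    return models


def ensemble_score(models, x):
    """Aggregate the classifiers by averaging their decision values."""
    total = 0.0
    for weights, bias in models:
        total += bias + sum(w * v for w, v in zip(weights, x))
    return total / len(models)


# Toy data (hypothetical): one feature per gene.
P = [[1.5 + 0.1 * i] for i in range(10)]          # confirmed disease genes
hidden = [[1.7], [2.0], [2.3]]                    # unconfirmed disease genes in U
others = [[-1.5 - 0.1 * i] for i in range(12)]    # non-disease genes in U
U = hidden + others

models = train_bagging_pu(P, U)
scores = [ensemble_score(models, x) for x in U]
ranking = sorted(range(len(U)), key=lambda i: scores[i], reverse=True)
print("Top-ranked unlabeled genes (indices into U):", ranking[:3])
```

In this toy setting the three hidden disease genes, which occupy the first three positions of `U`, receive higher aggregated scores than the remaining unlabeled genes, even though some random subsets label them as negatives during training. Averaging over many subsets is what dilutes the effect of such mislabeled examples.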