The applicability domain of these models covers a large part of properties derived from their primary sequences

Random Forest (RF), a powerful ensemble-based method, was adopted to construct the models because it is more robust against overfitting and more efficient on large-scale data sets than traditional statistical methods such as Linear Discriminant Analysis, Partial Least Squares and Artificial Neural Networks. The performance of the RF algorithm was compared with that of the Support Vector Machine (SVM) method to validate the reliability of the obtained models. The validated models were then used to systematically predict known and unknown drugs or targets among enzymes, ion channels, GPCRs, nuclear receptors and other protein families. In particular, using the RF method we successfully identified unrelated target proteins of chemical compounds and effectively distinguished novel scaffold-hopping ligands of the receptors, which should significantly facilitate drug-target discovery.

To broaden the scope of application of these predictors, we developed a set of in silico models based on large-scale heterogeneous biological data. Our models concatenate the chemical structural and physicochemical properties with the protein structural and physicochemical properties to discriminate binding patterns from non-binding patterns. It is generally difficult to assess the performance of a chemical and protein feature encoding method directly. However, if the encodings are biologically meaningful and capture information relevant to receptor-ligand recognition, one would expect them to generalize well. This can be evaluated using the internal five-fold cross-validation and external independent validation schemes described in the Materials and Methods section.
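The pair encoding and the internal five-fold cross-validation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `encode_pair` and `five_fold_splits`, the toy descriptor values, and the shuffling seed are all assumptions made for illustration, and the actual descriptor sets and classifiers (RF, SVM) are not reproduced here.

```python
import random


def encode_pair(drug_descriptors, protein_descriptors):
    """Concatenate drug and protein descriptors into one feature vector.

    Hypothetical helper: the real descriptor sets are defined in the
    Materials and Methods section, not here.
    """
    return list(drug_descriptors) + list(protein_descriptors)


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Toy, made-up descriptor values for one drug-target pair.
drug = [0.12, 3.4, 1.0]
protein = [0.55, 0.08]
pair = encode_pair(drug, protein)

# Five train/test partitions over ten hypothetical drug-target pairs.
splits = list(five_fold_splits(10))
```

In each round, a classifier would be trained on the `train` indices and scored on the held-out `test` indices, so that every sample is used for testing exactly once.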
In the following section, we first assess the performance of the obtained models based on these two methods, and then carry out systematic drug-target interaction predictions to further verify the usefulness of the models in comprehensive prediction. The consistency in performance between the two methods further indicates that these models are robust and reliable for predicting multiple drug-target interactions. Based on this, we conclude that the conserved binding patterns common to protein families such as GPCRs, nuclear receptors, ion channels and enzymes can be effectively detected by our proposed approach. It is worth noting that all these models identify the negative samples with high specificity, ranging from 83.22% to 93.62% across all datasets, even though the negative samples were initially generated at random. From a statistical point of view, this demonstrates that drug-receptor recognition is quite specific, so finding a new drug by chance should be extremely difficult.

In principle, the applicability domain of a classification model is defined by the range of the individual samples in the training set, where the minimum and maximum values of each feature are obtained by considering all samples in the set. In this work, principal component analysis (PCA) is applied to the current datasets to analyze the applicability domain of the obtained models, in order to reduce the dimensionality of the descriptor pool, eliminate correlations among variables and retain as much of the information contained in the dataset as possible. The distribution of all samples of Model I over the first three principal components (PCs) is shown in Figure 4. It can be seen that the training and test sets are well distributed in the “chemical-biological” space.
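The range-based definition of the applicability domain given above can be sketched as a simple per-feature bounding box. This only illustrates the general principle, not the PCA-based analysis used in this work; the function names and the toy feature values are hypothetical. The same check could in principle be applied to principal component scores rather than raw descriptors.

```python
def fit_feature_ranges(training_set):
    """Return a (min, max) pair for each feature over all training samples."""
    columns = list(zip(*training_set))
    return [(min(col), max(col)) for col in columns]


def in_applicability_domain(sample, ranges):
    """True if every feature value of the sample lies within the training range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(sample, ranges))


# Toy, made-up training samples with three features each.
training_set = [
    [0.1, 5.0, 2.0],
    [0.4, 3.0, 2.5],
    [0.3, 4.2, 1.0],
]
ranges = fit_feature_ranges(training_set)

# The first query lies inside every feature range; the second exceeds
# the maximum of the second feature (6.0 > 5.0).
inside = in_applicability_domain([0.2, 4.0, 1.5], ranges)
outside = in_applicability_domain([0.2, 6.0, 1.5], ranges)
```

A prediction for a sample that falls outside these ranges would be an extrapolation and should be treated with more caution than one for a sample inside them.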