Identification of Essential Proteins Using A Novel Multi-Objective Optimization Method

Using graph theory to identify essential proteins is currently a hot topic. Such methods are called network-based methods. However, the generalization ability of most network-based methods is unsatisfactory. Hence, in this paper, we treat the identification of essential proteins as a multi-objective optimization problem and use a novel multi-objective optimization method to solve it. The optimization result is a set of Pareto solutions. Every solution in this set is a vector that contains a certain number of essential protein candidates and is regarded as an independent predictor, or voter. We use a voting strategy to combine the results of these predictors. To validate our method, we apply it to the protein-protein interaction (PPI) datasets of two species (Yeast and Escherichia coli). The experimental results show that our method outperforms state-of-the-art methods in terms of sensitivity, specificity, F-measure, accuracy, and generalization ability.


INTRODUCTION
Proteins are indispensable components of cellular life in living organisms. They perform varied functions such as catalyzing metabolic reactions, replicating DNA, and transporting molecules [1]. Some of them are called essential proteins: an organism that lacks them dies or becomes infertile [2]. Some essential proteins have been found to be related to human disease genes, so the identification of essential proteins is an important research problem [3].
With the development of high-throughput technologies, a large number of protein-protein interactions (PPI) have been obtained, which makes it possible to study the identification of essential proteins with computational methods [4]. In general, protein-protein interactions are assembled into an undirected network called a protein interaction network (PIN). Network-based methods have been used successfully to identify essential proteins. According to whether or not they integrate biological information, they can be divided into two classes: (1) topological-characteristics-based methods; (2) integration methods.
*Corresponding author, e-mail: chongwu2-c@my.cityu.edu.hk.
Topological-characteristics-based methods use features of the nodes or edges of the network to find vital nodes, and they are also widely used in the field of complex networks. Degree centrality (DC) is the best known and simplest one applied to the identification of essential proteins. Some studies have confirmed that proteins with high degree tend to be essential [5]. Besides DC, other node-aided methods have also been applied to the identification of essential proteins, such as eigenvector centrality (EC) [6], betweenness centrality (BC) [7], and closeness centrality (CC) [8]. Additionally, a few edge-aided methods [9,10,11] have been proposed to identify essential proteins from the PIN; a typical one is the edge clustering coefficient (ECC) [10]. A centrality method based on ECC, called the new centrality method (NC), has been proposed to identify essential proteins from the PIN [11]. Besides these node-aided and edge-aided methods, some researchers proposed a centrality method that combines the node and edge characteristics of the network (NEC) [12].
However, the PPI data obtained by high-throughput technologies contain many false positives, and topological-characteristics-based methods are very sensitive to the reliability of the PIN built from these data. Hence, the performance of these methods is limited. To address this problem, some researchers integrate biological information into topological-characteristics-based methods to reduce the effect of false positives in the PPI data and improve the prediction accuracy of essential proteins. For example, PeC combines gene expression data with NC and achieves a higher prediction accuracy than NC [13].
However, the generalization ability of the methods above is not good. In this paper, to improve the generalization ability, we treat the identification of essential proteins as a multi-objective optimization problem and use the adaptive multi-objective black hole algorithm (AMOBH) [14] to solve it. The new method is called IMAMOBH. After the optimization, we obtain a Pareto solution set. Each solution in this set is regarded as a predictor, or voter, and each predictor gives a list of essential protein candidates. We then use a voting mechanism to combine them into a final list of essential protein candidates. To validate our method, we select the PPI datasets of two species (Yeast and Escherichia coli) and apply IMAMOBH to them. In the comparison experiments, our method achieves better results than state-of-the-art methods such as BC, CC, DC, EC, LAC [15], NC, NEC, PageRank [16], and PeC. The contributions of this paper are summarized as follows:
• This is the first attempt to apply multi-objective optimization to the identification of essential proteins.
• A method with satisfactory generalization ability is proposed for the identification of essential proteins.

MATERIALS
The PPI data of Saccharomyces cerevisiae (Yeast) and Escherichia coli are downloaded from the DIP [17] database. The PPI dataset of Yeast contains 4,979 proteins and 22,061 interactions. The PPI dataset of Escherichia coli contains 2,528 proteins and 11,496 interactions.
The essential gene lists of Yeast and Escherichia coli are collected from OGEE [18]. The Yeast network consists of 1,209 essential proteins, 3,322 nonessential proteins, and 448 unknown proteins. The Escherichia coli network consists of 444 essential proteins, 1,403 nonessential proteins, and 681 unknown proteins.
The gene expression datasets of Yeast and Escherichia coli are downloaded from GEO [19]. We use the Pearson correlation coefficient (PCC) to evaluate the gene expression similarity (GES) of two interacting proteins [13]. The gene ontology (GO) data used in this paper are taken from [9]. GO semantic similarity is based on the biological characteristics of genes and is used to represent the functional similarity of genes [20]. Using the biological process category of GO, the gene functional similarity (GFS) between two proteins can be calculated by the algorithm proposed in [21].
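The GES computation can be illustrated with a minimal sketch. It assumes that each protein has an expression profile measured over the same sequence of samples; the function names `pearson` and `ges`, the handling of constant profiles, and the skipping of proteins without expression data are illustrative choices, not details taken from [13].

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        # Assumption: a constant profile carries no co-expression signal.
        return 0.0
    return cov / (sx * sy)

def ges(edges, expression):
    """Map each interacting pair (u, v) to the PCC of its two profiles.

    Pairs with a protein that has no expression profile are skipped
    (an assumption made for this sketch).
    """
    return {(u, v): pearson(expression[u], expression[v])
            for u, v in edges
            if u in expression and v in expression}
```

With this sketch, two proteins whose profiles rise and fall together receive a GES close to 1, and two proteins with opposite trends receive a GES close to -1.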

Identification of Essential Proteins using Adaptive Multi-objective Black Hole Algorithm
The identification of essential proteins can be considered as a multi-objective problem (MOP) with two objectives, one based on gene expression similarity (GES) and the other on gene functional similarity (GFS). In the objective functions, n is the number of elements in one solution, and NTE(i) is the number of triangles that contain an edge incident to protein i. The objectives of our method thus combine the network topological feature NTE with the biological information GES and GFS. We choose NTE because it is highly correlated with GES and GFS, which motivates objective functions of this type.
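A plausible form of the two objectives is sketched below. It is a reconstruction under two assumptions: that each objective sums, over the n proteins of a solution, the per-protein products NTE × GES and NTE × GFS mentioned in the Computational Complexity section, and that both objectives are maximized. The symbols S, s_i, f_1, and f_2 are introduced here for illustration only.

```latex
\begin{aligned}
\max\; f_1(S) &= \sum_{i=1}^{n} NTE(s_i)\cdot GES(s_i),\\
\max\; f_2(S) &= \sum_{i=1}^{n} NTE(s_i)\cdot GFS(s_i),
\end{aligned}
\qquad S = (s_1, s_2, \ldots, s_n).
```

Here GES(s_i) and GFS(s_i) stand for per-protein values aggregated from the similarities of the interactions of protein s_i; how that aggregation is done is left open in this sketch.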
To solve the above MOP, we use the adaptive multi-objective black hole algorithm (AMOBH) [14], which has several advantages over state-of-the-art methods: lower computational complexity, a faster convergence rate, and better population diversity. The Pareto solution set of the above MOP corresponds to a set of differently weighted combinations of the two objectives, and the optimization method guarantees the diversity of the solutions. Hence, we avoid a subjective selection of the weights of the two objectives. To combine the results of the solutions in the Pareto solution set, we build a voting system that selects a certain number of proteins to form the final essential protein candidates. Every solution in the Pareto solution set is regarded as a voter. The more votes a protein obtains, the more likely it is to be chosen as a final essential protein candidate.
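The notion of a Pareto solution set used above can be made concrete with a small sketch. It assumes that both objectives are maximized and represents each solution only by its pair of objective values; `dominates` and `pareto_front` are hypothetical helper names, and this naive filter is not the archive update used inside AMOBH.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximization assumed)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Return the non-dominated objective vectors, preserving input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]
```

Each vector that survives the filter trades one objective against the other in a different way, which is why the set can stand in for many weightings of the two objectives without choosing any weight by hand.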
Brief pseudocode of AMOBH is given in Algorithm 1.
The original AMOBH is designed for continuous MOPs. In its solution update formula, Pop(i) represents solution i, t + 1 denotes the current iteration, t denotes the previous iteration, and Bh(j) is one of the black holes (elite solutions) from the black hole set. The problem considered here, however, is a discrete MOP, so the update rule needs to be changed.
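For reference, the position update of the standard single-objective black hole algorithm, which moves a solution toward a black hole by a random fraction of their distance, can be written in the notation above as follows. Whether AMOBH [14] uses exactly this expression is an assumption, and r is a uniform random number introduced here for illustration.

```latex
Pop(i)^{t+1} = Pop(i)^{t} + r \cdot \left( Bh(j) - Pop(i)^{t} \right), \qquad r \sim U(0, 1).
```

The subtraction and scaling in this form require real-valued vectors, which is why a solution made of protein identifiers needs a different rule.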
The new solution update rule is shown in Fig 1. A solution is a vector that moves toward a certain black hole. First, we find the elements in which the two vectors (e.g., solution i and black hole j) differ, and call them PartA (the elements from black hole j) and PartB (the elements from solution i). Then we randomly select several elements from PartA and use them to replace the same number of elements in PartB. The number of selected elements is at most the size of PartA. After that, we obtain a new solution i that is more similar to black hole j, as Fig 1 shows, since the two vectors now share more elements. Note that the order of the elements in a solution vector is not important in this problem.
After using AMOBH to solve the above MOP, we obtain a Pareto solution set Ar. Every Pareto solution represents a possible essential protein candidate list produced by a certain weighted combination of the two objectives, and it is regarded as a voter. To combine the results of the different voters and maintain good generalization ability, we adopt a voting strategy. If a protein i is in Pareto solution j, it gets a vote from j. The more votes a protein obtains, the more likely it is to be selected as an essential protein candidate.
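The discrete update rule (PartA and PartB) and the voting strategy described above can be sketched together. The sketch represents a solution as a set of protein identifiers, since element order does not matter. The function names `move_toward` and `vote`, the choice of at least one replacement per update, and the alphabetical tie-breaking are assumptions made for illustration.

```python
import random
from collections import Counter

def move_toward(solution, black_hole, rng=random):
    """Discrete update: copy some proteins of the black hole into the solution."""
    sol, bh = set(solution), set(black_hole)
    part_a = sorted(bh - sol)   # in the black hole, missing from the solution
    part_b = sorted(sol - bh)   # in the solution, missing from the black hole
    limit = min(len(part_a), len(part_b))
    if limit == 0:
        return sol              # nothing can be exchanged
    # Assumption: between 1 and |PartA| elements are exchanged per update.
    k = rng.randint(1, limit)
    incoming = rng.sample(part_a, k)
    outgoing = rng.sample(part_b, k)
    return (sol - set(outgoing)) | set(incoming)

def vote(pareto_set, top_k):
    """Each Pareto solution casts one vote for every protein it contains."""
    votes = Counter()
    for solution in pareto_set:
        votes.update(set(solution))
    # Ties are broken alphabetically (an assumption; the text does not specify).
    ranked = sorted(votes, key=lambda p: (-votes[p], p))
    return ranked[:top_k]
```

A short usage example with hypothetical protein identifiers:

```python
ar = [{"P1", "P2", "P3"}, {"P2", "P3", "P4"}, {"P2", "P3", "P5"}]
print(vote(ar, 2))  # P2 and P3 each receive three votes
```

Because the update swaps k elements out and k elements in, the solution keeps its size while its overlap with the black hole grows by k.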

Computational Complexity
The computational complexity of the core AMOBH algorithm is O(MN^2), where M is the number of objectives and N is the size of the archive. The values of NTE × GES and NTE × GFS of all proteins are calculated before the optimization, and the maximum cost of this step is O(K^2), where K is the total number of proteins. However, because of the small-world property of the network, the cost of this step is much smaller than O(K^2). Thus, the computation of the core AMOBH algorithm dominates the computation of IMAMOBH.
Fig. 2. Comparison of the number of true essential proteins identified from the Yeast PPI dataset (different colors denote different intervals of top-ranked proteins, from TOP 100 to TOP 1200).
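The precomputation cost discussed in this section can be illustrated with a sketch of the triangle counts. It assumes that NTE(i) is the sum, over the edges incident to protein i, of the number of triangles through each edge; `build_adjacency` and `nte` are hypothetical names. The loop visits only existing interactions instead of all K^2 protein pairs, which is why the cost stays well below the worst case on a sparse network.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets; self-interactions are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def nte(adj):
    """Triangle count per protein, accumulated over its incident edges.

    The number of triangles through edge (u, v) equals the number of
    common neighbors of u and v.
    """
    scores = {}
    for u, neighbors in adj.items():
        scores[u] = sum(len(neighbors & adj[v]) for v in neighbors)
    return scores
```

Multiplying each per-edge triangle count by the GES or GFS of that edge before summing would be one way to obtain the NTE × GES and NTE × GFS values, although that particular weighting is not specified in the text.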

Validation Metrics
To verify the proposed method, we select several of the most frequently used validation metrics: sensitivity, specificity, F-measure, and accuracy [4].
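These metrics can be computed from a candidate list as in the sketch below, which uses their standard confusion-matrix definitions. It assumes that proteins of unknown essentiality are excluded from the counts and that the F-measure is the harmonic mean of precision and sensitivity; the function names are hypothetical, and the exact conventions of [4] may differ.

```python
def confusion(candidates, essential, nonessential):
    """Count TP, FP, FN, TN for a list of predicted essential proteins."""
    cand = set(candidates)
    tp = len(cand & essential)        # essential and predicted
    fp = len(cand & nonessential)     # nonessential but predicted
    fn = len(essential - cand)        # essential but not predicted
    tn = len(nonessential - cand)     # nonessential and not predicted
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, F-measure, and accuracy (0.0 if undefined)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "f_measure": f_measure, "accuracy": accuracy}
```

In a comparison of ranking methods, `candidates` would be the top-ranked proteins of one method at a given cutoff, so the same two functions can be applied to every method and every cutoff.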

Performance Analysis
We applied IMAMOBH to the PPI datasets of Yeast and Escherichia coli and compared its performance with several state-of-the-art methods: BC, CC, DC, EC, LAC, NC, NEC, PR (PageRank), and PeC. All methods use their default parameters, and all experiments were run on a personal computer with Windows 10, an Intel Core i7 2.3 GHz CPU, and 8 GB of memory. As in most validation procedures for the identification of essential proteins, we ranked all proteins with each identification method and selected a certain number of top-ranked proteins as the essential protein candidates. Considering the number of true essential proteins in the PPI data of Yeast and Escherichia coli, we set the range of essential protein candidates from the top 1% to the top 24% for Yeast and from the top 1% to the top 18% for Escherichia coli.
Fig 2 shows the comparison of the number of true essential proteins identified from the Yeast PPI dataset. In the results of the four evaluation metrics obtained by all identification methods on the PPI network of Yeast, IMAMOBH outperforms the other methods on every metric at every cutoff of top-ranked proteins.
Fig 4 shows the comparison of the number of true essential proteins identified from the Escherichia coli PPI dataset using BC, CC, DC, EC, LAC, NC, NEC, PR, PeC, and IMAMOBH. Our method clearly identifies more true essential proteins than the other methods at every candidate cutoff. Fig 3 shows the results of the four evaluation metrics obtained by all identification methods on the PPI network of Escherichia coli. IMAMOBH again achieves the best results on every metric at every cutoff of top-ranked proteins. Moreover, some methods, such as NC, LAC, and NEC, achieve good results on the Yeast PPI dataset, but their performance degrades considerably on the Escherichia coli PPI dataset. This indicates that our method generalizes better than the other state-of-the-art methods used in this paper.

CONCLUSION
In this paper, we treat the identification of essential proteins as a multi-objective optimization problem and use the AMOBH algorithm to solve it. We call this identification method IMAMOBH. Our method avoids the subjective selection of weights and achieves better generalization ability. The validation experiments on the PPI data of Yeast and Escherichia coli show that our method achieves better performance in terms of sensitivity, specificity, F-measure, and accuracy than several state-of-the-art methods. In the future, we will validate our method on the PPI datasets of more species and seek to introduce many-objective optimization methods to the identification of essential proteins.