Amino Acids: The Forum for Amino Acid, Peptide and Protein Research (v.35, #3)
Protein function prediction with high-throughput data
by Xing-Ming Zhao; Luonan Chen; Kazuyuki Aihara (pp. 517-530).
Protein function prediction is one of the main challenges in the post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework on protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Keywords: High-throughput data; Machine learning; Protein function prediction; Semi-supervised learning; Supervised learning; Unsupervised learning
Solution structure of NPr, a bacterial signal-transducing protein that controls the phosphorylation state of the potassium transporter-regulating protein IIANtr
by Xia Li; Alan Peterkofsky; Guangshun Wang (pp. 531-539).
A nitrogen-related signal transduction pathway, consisting of the three phosphotransfer proteins EINtr, NPr, and IIANtr, was discovered recently to regulate the uptake of K+ in Escherichia coli. In particular, dephosphorylated IIANtr inhibits the activity of the K+ transporter TrkA. Since the phosphorylation state of IIANtr is partially determined by its reversible phosphorylation by NPr, we have determined the three-dimensional structure of NPr by solution NMR spectroscopy. In total, we obtained 973 NOE-derived distance restraints, 112 chemical shift-derived backbone angle restraints, and 35 hydrogen-bond restraints derived from temperature coefficients (wave). We propose that temperature wave is useful for identifying exposed beta-strands and assists in establishing protein folds based on chemical shifts. The deduced structure of NPr contains three α-helices and four β-strands with the three helices all packed on the same face of the β-sheet. The active site residue His16 of NPr for phosphoryl transfer was found to be neutral and in the Nε2-H tautomeric state. There appears to be increased motion in the active site region of NPr compared to HPr, a homologous protein involved in the uptake and regulation of carbohydrate utilization.
Keywords: IIANtr; NMR; NPr; Phosphorylation; Signal transduction; TrkA
Analysis of 3D structural differences in the IgG-binding domains based on the interresidue average-distance statistics
by Takeshi Kikuchi (pp. 541-549).
It is well known that the IgG-binding domain from staphylococcal protein A folds into a 3α helix bundle structure, while the IgG-binding domain of streptococcal protein G forms an (α + β) structure. Recently, He et al. (Biochemistry 44:14055–14061, 2005) made mutants of these proteins from the wild types of protein A and protein G strains. These mutants are referred to as protein A219 and protein G311, and it was shown that these two mutants have different 3D structures, i.e., the 3α helix bundle structure and the (α + β) structure, respectively, despite the high sequence identity (59%). The purpose of our study was to clarify how such 3D structural differences are coded in the sequences with high homology. To address this problem, we introduce a predicted contact map constructed based on the interresidue average-distance statistics for prediction of folding properties of a protein. We refer to this map as an average distance map (ADM). Furthermore, the statistics of interresidue distances can be converted to an effective interresidue potential. We calculated the contact frequency of each residue of a protein in random conformations with this effective interresidue potential, and then we obtained values similar to ϕ values. We refer to this contact frequency of each residue as a p(μ) value. The comparison of the p(μ) values to the ϕ values for a protein suggests that p(μ) values reveal the information on the folding initiation site. Using these techniques, we try to extract the information on the difference in the 3D structures of protein A219 and protein G311 coded in their amino acid sequences in the present work. The results show that the ADM analyses and the p(μ) value analyses predict the information of folding initiation sites, which can be used to detect the 3D difference in both proteins.
Keywords: Average distance map; IgG-binding domain; Protein structure; Folding initiation site
Secondary structure-based assignment of the protein structural classes
by Lukasz A. Kurgan; Tuo Zhang; Hua Zhang; Shiyi Shen; Jishou Ruan (pp. 551-564).
Structural class categorizes proteins based on the amount and arrangement of the constituent secondary structures. The knowledge of structural classes is applied in numerous important predictive tasks that address structural and functional features of proteins. We propose novel structural class assignment methods that use one-dimensional (1D) secondary structure as the input. The methods are designed based on a large set of low-identity sequences for which secondary structure is predicted from their sequence (PSSAsc model) or assigned based on their tertiary structure (SSAsc). The secondary structure is encoded using a comprehensive set of features describing count, content, and size of secondary structure segments, which are fed into a small decision tree that uses ten features to perform the assignment. The proposed models were compared against seven secondary structure-based and ten sequence-based structural class predictors. Using the 1D secondary structure, SSAsc and PSSAsc can assign proteins to the four main structural classes, while the existing secondary structure-based assignment methods can predict only three classes. Empirical evaluation shows that the proposed models are quite promising. Using the structure-based assignment performed in SCOP (structural classification of proteins) as the gold standard, the accuracy of SSAsc and PSSAsc equals 76 and 75%, respectively. We show that the use of the secondary structure predicted from the sequence as an input does not have a detrimental effect on the quality of structural class assignment when compared with using secondary structure derived from tertiary structure. Therefore, PSSAsc can be used to perform the automated assignment of structural classes based on the sequences.
Keywords: SCOP; Structural class; Structural class prediction; Structural classification of proteins; Secondary protein structure
Multi-agent-based bio-network for systems biology: protein–protein interaction network as an example
by Li-Hong Ren; Yong-Sheng Ding; Yi-Zhen Shen; Xiang-Feng Zhang (pp. 565-572).
Recently, a collective effort from multiple research areas has been made to understand biological systems at the system level. This research requires the ability to simulate particular biological systems such as cells, organs, organisms, and communities. In this paper, a novel bio-network simulation platform is proposed for systems biology studies by combining agent approaches. We consider a biological system as a set of active computational components interacting with each other and with an external environment. Then, we propose a bio-network platform for simulating the behaviors of biological systems and modelling them in terms of bio-entities and society-entities. As a demonstration, we discuss how a protein–protein interaction (PPI) network can be seen as a society of autonomous interactive components. From interactions among small PPI networks, a large PPI network can emerge that has a remarkable ability to accomplish a complex function or task. We also simulate the evolution of the PPI networks by using the bio-operators of the bio-entities. Based on the proposed approach, various simulators with different functions can be embedded in the simulation platform, and further research can be done from design to development, including complexity validation of the biological system.
Keywords: Bio-network simulation platform; Multi-agent systems; Bio-entities; Emergent computation; Systems biology; Protein–protein interaction (PPI) network
An ensemble of support vector machines for predicting the membrane protein type directly from the amino acid sequence
by Loris Nanni; Alessandra Lumini (pp. 573-580).
Given a particular membrane protein, it is very important to know which membrane type it belongs to because this kind of information can provide clues for better understanding its function. In this work, we propose a system for predicting the membrane protein type directly from the amino acid sequence. The feature extraction step is based on an encoding technique that combines the physicochemical amino acid properties with the residue couple model. The residue couple model is a method inspired by Chou’s quasi-sequence-order model that extracts the features by utilizing the sequence order effect indirectly. A set of support vector machines, each trained using a different physicochemical amino acid property combined with the residue couple model, is combined by vote rule. The success rate obtained by our system on a difficult dataset, where the sequences in a given membrane type have a low sequence identity to any other proteins of the same membrane type, is quite high, indicating that the proposed method, where the features are extracted directly from the amino acid sequence, is a feasible system for predicting the membrane protein type.
Keywords: Membrane type prediction; Residue couple model; Chou’s pseudo–amino acid composition; Ensemble of classifiers; Physicochemical properties; Support vector machine
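The vote rule mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each classifier in the ensemble has already produced one label per sequence. The function name `majority_vote`, the label values, and the tie-breaking behaviour (the label seen first among the tied ones wins) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label lists from several classifiers into one label per sample.

    `predictions` is a list with one entry per classifier; each entry is a
    list of predicted labels, all of the same length (one label per sample).
    """
    combined = []
    # zip(*predictions) groups the votes cast for the same sample together.
    for votes in zip(*predictions):
        counts = Counter(votes)
        # most_common keeps first-seen order among equal counts, so ties go
        # to the classifier listed earliest (an assumption of this sketch).
        combined.append(counts.most_common(1)[0][0])
    return combined


# Hypothetical labels from three classifiers for two sequences.
votes = [["type I", "type II"],
         ["type I", "GPI-anchored"],
         ["type II", "GPI-anchored"]]
print(majority_vote(votes))
```

A weighted vote or a sum of decision values would be equally plausible readings of "vote rule"; plain majority voting is shown only because it is the simplest.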
Prediction of protein structure class by coupling improved genetic algorithm and support vector machine
by Z.-C. Li; X.-B. Zhou; Y.-R. Lin; X.-Y. Zou (pp. 581-590).
Structural class characterizes the overall folding type of a protein or its domain. Most of the existing methods for determining the structural class of a protein are based on a group of features that possesses only one kind of discriminative information for the prediction of protein structure class. However, different types of discriminative information associated with primary sequence have been completely missed, which undoubtedly has reduced the success rate of prediction. We present a novel method for the prediction of protein structure class by coupling the improved genetic algorithm (GA) with the support vector machine (SVM). This improved GA was applied to the selection of an optimized feature subset and the optimization of SVM parameters. Jackknife tests on the working datasets indicated that the prediction accuracies for the different classes were in the range of 97.8–100% with an overall accuracy of 99.5%. The results indicate that the approach has a high potential to become a useful tool in bioinformatics.
Keywords: Feature selection; Genetic algorithm; Protein structure class; Support vector machine
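The jackknife test referred to in the abstract above is commonly understood as leave-one-out validation: each sample is held out once and predicted by a model trained on all the others. The sketch below shows that loop only. The names `jackknife_accuracy` and `nearest_neighbour` are hypothetical, and the 1-nearest-neighbour classifier is a stand-in chosen to keep the example self-contained; it is not the GA-optimized SVM described in the paper.

```python
def jackknife_accuracy(samples, labels, train_and_predict):
    """Leave-one-out accuracy: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier (assumption): label of the closest training point."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[dists.index(min(dists))]


# Toy one-dimensional feature vectors with made-up class labels.
toy_x = [[0.0], [0.1], [1.0], [1.1]]
toy_y = ["alpha", "alpha", "beta", "beta"]
print(jackknife_accuracy(toy_x, toy_y, nearest_neighbour))
```

Any classifier with the same three-argument shape can be passed in place of `nearest_neighbour`.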
Using Chou’s pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach
by Shao-Wu Zhang; Wei Chen; Feng Yang; Quan Pan (pp. 591-598).
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, which associate through noncovalent interactions and, occasionally, disulfide bonds to form protein quaternary structures. It has long been known that the functions of proteins are closely related to their quaternary structures; some examples include enzymes, hemoglobin, DNA polymerase, and ion channels. However, it is extremely labor-intensive, and often impossible, to quickly determine the structures of hundreds of thousands of protein sequences solely from experiments. Since the number of protein sequences entering databanks is increasing rapidly, it is highly desirable to develop computational methods for classifying the quaternary structures of proteins from their primary sequences. Since the concept of Chou’s pseudo amino acid composition (PseAAC) was introduced, a variety of approaches, such as residue conservation scores, von Neumann entropy, multiscale energy, autocorrelation function, moment descriptors, and cellular automata, have been utilized to formulate the PseAAC for predicting different attributes of proteins. Here, in a different approach, a sequence-segmented PseAAC is introduced to represent protein samples. Meanwhile, multiclass SVM classifier modules were adopted to classify protein quaternary structures. As a demonstration, the dataset constructed by Chou and Cai [(2003) Proteins 53:282–289] was adopted as a benchmark dataset. The overall jackknife success rates thus obtained were 88.2–89.1%, indicating that the new approach is quite promising for predicting protein quaternary structure.
Keywords: Sequence-segmented PseAAC; Residue conservation; Von Neumann entropy; Multiscale energy; Moment descriptor; Support vector machine
DPROT: prediction of disordered proteins using evolutionary information
by Deepti Sethi; Aarti Garg; G. P. S. Raghava (pp. 599-605).
The association of structurally disordered proteins with a number of diseases has engendered enormous interest and therefore demands a prediction method that would facilitate their expeditious study at the molecular level. The present study describes the development of a computational method for predicting disordered proteins using sequence and profile compositions as input features for the training of SVM models. First, we developed SVM modules based on amino acid and dipeptide compositions, which yielded sensitivities of 75.6 and 73.2% along with Matthews correlation coefficient (MCC) values of 0.75 and 0.60, respectively. In addition, the use of predicted secondary structure content (coil, sheet and helices) in the form of composition values attained a sensitivity of 76.8% and an MCC value of 0.77. Finally, the training of SVM models using evolutionary information hidden in the multiple sequence alignment profile improved the prediction performance by achieving a sensitivity value of 78% and an MCC of 0.78. Furthermore, when evaluated on an independent dataset of partially disordered proteins, the same SVM module provided a correct prediction rate of 86.6%. Based on the above study, a web server (“DPROT”) was developed for the prediction of disordered proteins, which is available at http://www.imtech.res.in/raghava/dprot/.
Keywords: Disorder proteins; Position-specific scoring matrices; Support vector machines; Amino acid composition; Web server
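The input features and the evaluation measure named in the DPROT abstract can be made concrete with a short sketch. It uses the usual definitions: amino acid composition as 20 residue fractions, dipeptide composition as 400 fractions of adjacent residue pairs, and the standard Matthews correlation coefficient formula on a binary confusion matrix. The function names, the residue ordering, and the handling of empty sequences and non-standard residues are assumptions of this sketch and may differ from what the authors did.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; order is arbitrary


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 features)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    # Non-standard letters count toward the length but have no feature here.
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered pair of adjacent residues (400 features)."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return [0.0] * (len(AMINO_ACIDS) ** 2)
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0.0 when a margin is empty is a common convention.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def sensitivity(tp, fn):
    """Fraction of true positives that were recovered."""
    return tp / (tp + fn)
```

The feature vectors would then be passed to an SVM library of the reader's choice; that step is omitted here because the abstract does not specify the kernel or parameters.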
Use of tetrapeptide signals for protein secondary-structure prediction
by Yonge Feng; Liaofu Luo (pp. 607-614).
This paper develops a novel sequence-based method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), for protein secondary-structure prediction. The proposed TPIDQD method is based on tetra-peptide signals and is used to predict the structure of the central residue of a sequence fragment. The three-state overall per-residue accuracy (Q3) is about 80% in the threefold cross-validated test for 21-residue fragments in the CB513 dataset. The accuracy can be further improved by taking long-range sequence information (fragments of more than 21 residues) into account in prediction. The results show that the tetra-peptide signals can indeed reflect some relationship between an amino acid sequence and its secondary structure, indicating the importance of tetra-peptide signals as the protein folding code in protein structure prediction.
Keywords: Protein secondary-structure prediction; Tetra-peptide structural words; Increment of diversity; Quadratic discriminant analysis; Boundary correction; Long-range interaction
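The three-state per-residue accuracy quoted in the abstract above is conventionally the percentage of residues whose predicted state (helix, strand, or coil) matches the observed one. A minimal sketch follows; the function name `q3_accuracy` and the single-letter state strings (H, E, C) are assumptions for illustration.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state.

    Both arguments are equal-length strings over a three-letter alphabet,
    here assumed to be H (helix), E (strand) and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up four-residue example: three of four states agree.
print(q3_accuracy("HHEC", "HHCC"))
```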
Incorporating the amino acid properties to predict the significance of missense mutations
by Tze-Chuen Lee; Ann S. G. Lee; Kuo-Bin Li (pp. 615-626).
Determining if missense mutations are deleterious is critical for the analysis of genes implicated in disease. However, the mutational effects of many missense mutations in databases like the Breast Cancer Information Core are unclassified. Several approaches have emerged recently to determine such mutational effects but none have utilized amino acid property indices. We modified a previously described phylogenetic approach by first classifying benign substitutions based on the assumption that missense mutations that are maintained in orthologs are unlikely to affect function. A consensus conservation score based on 16 amino acid properties was used to characterize the remaining substitutions. This approach was evaluated with experimentally verified T4 lysozyme missense mutations and is shown to be able to sieve out putative biochemically and structurally important residues. The use of amino acid properties can enhance the prediction of biochemically and structurally important residues and thus also predict the significance of missense mutations.
Keywords: Missense mutation; T4 lysozyme; BRCA1; Non-synonymous SNP; Physico-chemical properties
Bridging protein local structures and protein functions
by Zhi-Ping Liu; Ling-Yun Wu; Yong Wang; Xiang-Sun Zhang; Luonan Chen (pp. 627-650).
One of the major goals of molecular and evolutionary biology is to understand the functions of proteins by extracting functional information from protein sequences, structures and interactions. In this review, we summarize the repertoire of methods currently being applied and report recent progress in the field of in silico annotation of protein function based on the accumulation of vast amounts of sequence and structure data. In particular, we emphasize the newly developed structure-based methods, which are able to identify local structural motifs and reveal their relationship with protein functions. These methods include computational tools to identify the structural motifs and reveal the strong relationship between these pre-computed local structures and protein functions. We also discuss remaining problems and possible directions for this exciting and challenging area.
Keywords: Functional genomics; Functional motifs; Local structures; Protein function prediction