
Sunday, December 16, 2018

Protein structure prediction

From Wikipedia, the free encyclopedia

Constituent amino-acids can be analyzed to predict secondary, tertiary and quaternary protein structure.

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its folding and its secondary and tertiary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes). Every two years, the performance of current methods is assessed in the CASP experiment (Critical Assessment of Techniques for Protein Structure Prediction). A continuous evaluation of protein structure prediction web servers is performed by the community project CAMEO3D.

Protein structure and terminology

Proteins are chains of amino acids joined together by peptide bonds. Many conformations of this chain are possible due to the rotation of the chain about each Cα atom. It is these conformational changes that are responsible for differences in the three-dimensional structure of proteins. Each amino acid in the chain is polar, i.e., it has separated positive and negative charged regions with a free carbonyl group, which can act as a hydrogen bond acceptor, and an NH group, which can act as a hydrogen bond donor. These groups can therefore interact in the protein structure. The 20 amino acids can be classified according to the chemistry of the side chain, which also plays an important structural role. Glycine takes on a special position, as it has the smallest side chain, only one hydrogen atom, and therefore can increase the local flexibility in the protein structure. Cysteine, on the other hand, can react with another cysteine residue and thereby form a cross-link stabilizing the whole structure. 

The protein structure can be considered as a sequence of secondary structure elements, such as α helices and β sheets, which together constitute the overall three-dimensional configuration of the protein chain. In these secondary structures regular patterns of H bonds are formed between neighboring amino acids, and the amino acids have similar Φ and Ψ angles. 

Bond angles for ψ and ω

The formation of these structures neutralizes the polar groups on each amino acid. The secondary structures are tightly packed in the protein core in a hydrophobic environment. Each amino acid side group has a limited volume to occupy and a limited number of possible interactions with other nearby side chains, a situation that must be taken into account in molecular modeling and alignments. 

α Helix

The α helix is the most abundant type of secondary structure in proteins. The α helix has 3.6 amino acids per turn with an H bond formed between every fourth residue; the average length is 10 amino acids (3 turns) or 10 Å but varies from 5 to 40 (1.5 to 11 turns). The alignment of the H bonds creates a dipole moment for the helix with a resulting partial positive charge at the amino end of the helix. Because this region has free NH2 groups, it will interact with negatively charged groups such as phosphates. The most common location of α helices is at the surface of protein cores, where they provide an interface with the aqueous environment. The inner-facing side of the helix tends to have hydrophobic amino acids and the outer-facing side hydrophilic amino acids. Thus, every third or fourth amino acid along the chain will tend to be hydrophobic, a pattern that can be quite readily detected. In the leucine zipper motif, a repeating pattern of leucines on the facing sides of two adjacent helices is highly predictive of the motif. A helical-wheel plot can be used to show this repeated pattern. Other α helices buried in the protein core or in cellular membranes have a higher and more regular distribution of hydrophobic amino acids, and are highly predictive of such structures. Helices exposed on the surface have a lower proportion of hydrophobic amino acids. Amino acid content can be predictive of an α-helical region. Regions richer in alanine (A), glutamic acid (E), leucine (L), and methionine (M) and poorer in proline (P), glycine (G), tyrosine (Y), and serine (S) tend to form an α helix. Proline destabilizes or breaks an α helix but can be present in longer helices, forming a bend. 

An alpha-helix with hydrogen bonds (yellow dots)
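
As an illustration of the periodicity described above, the following Python sketch places each residue of a peptide on a helical wheel (100° per residue, since 360° / 3.6 ≈ 100°) and computes an Eisenberg-style hydrophobic moment; a large moment suggests that hydrophobic residues cluster on one face of the helix. The hydropathy values follow the Kyte-Doolittle scale, and the test peptide is a made-up example.

# Sketch: place residues on a helical wheel (100 deg per residue) and
# compute a crude hydrophobic moment as an indicator of an amphipathic helix.
# Hydropathy values follow the Kyte-Doolittle scale (assumed here).
import math

KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def helical_wheel(sequence):
    """Return (residue, angle in degrees) pairs for a helical-wheel plot."""
    return [(aa, (i * 100.0) % 360.0) for i, aa in enumerate(sequence)]

def hydrophobic_moment(sequence):
    """Eisenberg-style hydrophobic moment: large values suggest that
    hydrophobic residues cluster on one face of the helix."""
    x = y = 0.0
    for i, aa in enumerate(sequence):
        h = KYTE_DOOLITTLE.get(aa, 0.0)
        angle = math.radians(i * 100.0)
        x += h * math.cos(angle)
        y += h * math.sin(angle)
    return math.hypot(x, y) / len(sequence)

if __name__ == "__main__":
    peptide = "LKKLLKLLKKLLKL"   # hypothetical amphipathic test sequence
    for aa, angle in helical_wheel(peptide):
        print(f"{aa}  {angle:6.1f} deg")
    print("hydrophobic moment:", round(hydrophobic_moment(peptide), 2))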

β sheet

β sheets are formed by H bonds between an average of 5–10 consecutive amino acids in one portion of the chain with another 5–10 farther down the chain. The interacting regions may be adjacent, with a short loop in between, or far apart, with other structures in between. Every chain may run in the same direction to form a parallel sheet, every other chain may run in the reverse chemical direction to form an antiparallel sheet, or the chains may be parallel and antiparallel to form a mixed sheet. The pattern of H bonding is different in the parallel and antiparallel configurations. Each amino acid in the interior strands of the sheet forms two H bonds with neighboring amino acids, whereas each amino acid on the outside strands forms only one bond with an interior strand. Looking across the sheet at right angles to the strands, more distant strands are rotated slightly counterclockwise to form a left-handed twist. The Cα atoms alternate above and below the sheet in a pleated structure, and the R side groups of the amino acids alternate above and below the pleats. The Φ and Ψ angles of the amino acids in sheets vary considerably in one region of the Ramachandran plot. It is more difficult to predict the location of β sheets than of α helices. The situation improves somewhat when the amino acid variation in multiple sequence alignments is taken into account.

Loop

Loops are regions of a protein chain that are 1) between α helices and β sheets, 2) of various lengths and three-dimensional configurations, and 3) on the surface of the structure. 

Hairpin loops that represent a complete turn in the polypeptide chain joining two antiparallel β strands may be as short as two amino acids in length. Loops interact with the surrounding aqueous environment and other proteins. Because amino acids in loops are not constrained by space and environment as are amino acids in the core region, and do not have an effect on the arrangement of secondary structures in the core, more substitutions, insertions, and deletions may occur. Thus, in a sequence alignment, the presence of these features may be an indication of a loop. The positions of introns in genomic DNA sometimes correspond to the locations of loops in the encoded protein. Loops also tend to have charged and polar amino acids and are frequently a component of active sites. A detailed examination of loop structures has shown that they fall into distinct families.

Coils

A region of secondary structure that is not an α helix, a β sheet, or a recognizable turn is commonly referred to as a coil.

Protein classification

Proteins may be classified according to both structural and sequence similarity. For structural classification, the sizes and spatial arrangements of secondary structures described in the above paragraph are compared in known three-dimensional structures. Classification based on sequence similarity was historically the first to be used. Initially, similarity based on alignments of whole sequences was performed. Later, proteins were classified on the basis of the occurrence of conserved amino acid patterns. Databases that classify proteins by one or more of these schemes are available. In considering protein classification schemes, it is important to keep several observations in mind. First, two entirely different protein sequences from different evolutionary origins may fold into a similar structure. Conversely, the sequence of an ancient gene for a given structure may have diverged considerably in different species while at the same time maintaining the same basic structural features. Recognizing any remaining sequence similarity in such cases may be a very difficult task. Second, two proteins that share a significant degree of sequence similarity either with each other or with a third sequence also share an evolutionary origin and should share some structural features also. However, gene duplication and genetic rearrangements during evolution may give rise to new gene copies, which can then evolve into proteins with new function and structure.

Terms used for classifying protein structures and sequences

The more commonly used terms for evolutionary and structural relationships among proteins are listed below. Many additional terms are used for various kinds of structural features found in proteins. Descriptions of such terms may be found at the CATH Web site, the Structural Classification of Proteins (SCOP) Web site, and a Glaxo-Wellcome tutorial on the Swiss bioinformatics Expasy Web site. 

Active site is a localized combination of amino acid side groups within the tertiary (three-dimensional) or quaternary (protein subunit) structure that can interact with a chemically specific substrate and that provides the protein with biological activity. Proteins of very different amino acid sequences may fold into a structure that produces the same active site. 

Architecture is the relative orientations of secondary structures in a three-dimensional structure without regard to whether or not they share a similar loop structure. 

Fold is a type of architecture that also has a conserved loop structure. 

Blocks is a conserved amino acid sequence pattern in a family of proteins. The pattern includes a series of possible matches at each position in the represented sequences, but there are not any inserted or deleted positions in the pattern or in the sequences. By way of contrast, sequence profiles are a type of scoring matrix that represents a similar set of patterns that includes insertions and deletions. 

Class is a term used to classify protein domains according to their secondary structural content and organization. Four classes were originally recognized by Levitt and Chothia (1976), and several others have been added in the SCOP database. Three classes are given in the CATH database: mainly-α, mainly-β, and α–β, with the α–β class including both alternating α/β and α+β structures. 

Core is the portion of a folded protein molecule that comprises the hydrophobic interior of α-helices and β-sheets. The compact structure brings together side groups of amino acids into close enough proximity so that they can interact. When comparing protein structures, as in the SCOP database, core is the region common to most of the structures that share a common fold or that are in the same superfamily. In structure prediction, core is sometimes defined as the arrangement of secondary structures that is likely to be conserved during evolutionary change. 

Domain (sequence context) is a segment of a polypeptide chain that can fold into a three-dimensional structure irrespective of the presence of other segments of the chain. The separate domains of a given protein may interact extensively or may be joined only by a length of polypeptide chain. A protein with several domains may use these domains for functional interactions with different molecules. 

Family (sequence context) is a group of proteins of similar biochemical function that are more than 50% identical when aligned. This same cutoff is still used by the Protein Information Resource (PIR). A protein family comprises proteins with the same function in different organisms (orthologous sequences) but may also include proteins in the same organism (paralogous sequences) derived from gene duplication and rearrangements. If a multiple sequence alignment of a protein family reveals a common level of similarity throughout the lengths of the proteins, PIR refers to the family as a homeomorphic family. The aligned region is referred to as a homeomorphic domain, and this region may comprise several smaller homology domains that are shared with other families. Families may be further subdivided into subfamilies or grouped into superfamilies based on respective higher or lower levels of sequence similarity. The SCOP database reports 1296 families and the CATH database (version 1.7 beta) reports 1846 families. 

When the sequences of proteins with the same function are examined in greater detail, some are found to share high sequence similarity. They are obviously members of the same family by the above criteria. However, others are found that have very little, or even insignificant, sequence similarity with other family members. In such cases, the family relationship between two distant family members A and C can often be demonstrated by finding an additional family member B that shares significant similarity with both A and C. Thus, B provides a connecting link between A and C. Another approach is to examine distant alignments for highly conserved matches. 

At a level of identity of 50%, proteins are likely to have the same three-dimensional structure, and the identical atoms in the sequence alignment will also superimpose within approximately 1 Å in the structural model. Thus, if the structure of one member of a family is known, a reliable prediction may be made for a second member of the family, and the higher the identity level, the more reliable the prediction. Protein structural modeling can be performed by examining how well the amino acid substitutions fit into the core of the three-dimensional structure. 

Family (structural context), as used in the FSSP database (Families of structurally similar proteins) and the DALI/FSSP Web site, refers to two structures that have a significant level of structural similarity but not necessarily significant sequence similarity. 

Fold is similar to a structural motif but includes a larger combination of secondary structural units in the same configuration. Thus, proteins sharing the same fold have the same combination of secondary structures that are connected by similar loops. An example is the Rossmann fold comprising several alternating α helices and parallel β strands. In the SCOP, CATH, and FSSP databases, the known protein structures have been classified into hierarchical levels of structural complexity with the fold as a basic level of classification. 

Homologous domain (sequence context) is an extended sequence pattern, generally found by sequence alignment methods, that indicates a common evolutionary origin among the aligned sequences. A homology domain is generally longer than motifs. The domain may include all of a given protein sequence or only a portion of the sequence. Some domains are complex and made up of several smaller homology domains that became joined to form a larger one during evolution. A domain that covers an entire sequence is called the homeomorphic domain by PIR (Protein Information Resource). 

Module is a region of conserved amino acid patterns comprising one or more motifs and considered to be a fundamental unit of structure or function. The presence of a module has also been used to classify proteins into families. 

Motif (sequence context) is a conserved pattern of amino acids that is found in two or more proteins. In the Prosite catalog, a motif is an amino acid pattern that is found in a group of proteins that have a similar biochemical activity, and that often is near the active site of the protein. Examples of sequence motif databases are the Prosite catalog and the Stanford Motifs Database.

Motif (structural context) is a combination of several secondary structural elements produced by the folding of adjacent sections of the polypeptide chain into a specific three-dimensional configuration. An example is the helix-loop-helix motif. Structural motifs are also referred to as supersecondary structures and folds. 

Position-specific scoring matrix (sequence context, also known as weight or scoring matrix) represents a conserved region in a multiple sequence alignment with no gaps. Each matrix column represents the variation found in one column of the multiple sequence alignment. 
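
As an illustration of this idea, the following sketch builds a small position-specific scoring matrix from an ungapped alignment block and scores candidate windows against it. The log-odds scoring against a uniform background with simple pseudocounts is an assumption chosen for brevity; real profile tools use background amino acid frequencies and sequence weighting.

# Sketch: build a simple position-specific scoring matrix (PSSM) from an
# ungapped alignment block and score query windows against it.
# Uses log-odds against a uniform background with +1 pseudocounts;
# real profile tools use background frequencies and sequence weighting.
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block):
    """block: list of equal-length, gap-free aligned sequences."""
    length = len(block[0])
    pssm = []
    for col in range(length):
        counts = {aa: 1 for aa in AMINO_ACIDS}          # pseudocounts
        for seq in block:
            counts[seq[col]] += 1
        total = sum(counts.values())
        background = 1.0 / len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score_window(pssm, window):
    """Sum the per-column scores of a window of the same length as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, window))

if __name__ == "__main__":
    block = ["ACDKL", "ACEKL", "SCDKL", "ACDRL"]        # toy alignment block
    pssm = build_pssm(block)
    print(round(score_window(pssm, "ACDKL"), 2))        # should score high
    print(round(score_window(pssm, "WWWWW"), 2))        # should score low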

Position-specific scoring matrix—3D (structural context) represents the amino acid variation found in an alignment of proteins that fall into the same structural class. Matrix columns represent the amino acid variation found at one amino acid position in the aligned structures. 

Primary structure is the linear amino acid sequence of a protein, which chemically is a polypeptide chain composed of amino acids joined by peptide bonds. 

Profile (sequence context) is a scoring matrix that represents a multiple sequence alignment of a protein family. The profile is usually obtained from a well-conserved region in a multiple sequence alignment. The profile is in the form of a matrix with each column representing a position in the alignment and each row one of the amino acids. Matrix values give the likelihood of each amino acid at the corresponding position in the alignment. The profile is moved along the target sequence to locate the best scoring regions by a dynamic programming algorithm. Gaps are allowed during matching and a gap penalty is included in this case as a negative score when no amino acid is matched. A sequence profile may also be represented by a hidden Markov model, referred to as a profile HMM. 

Profile (structural context) is a scoring matrix that represents which amino acids should fit well and which should fit poorly at sequential positions in a known protein structure. Profile columns represent sequential positions in the structure, and profile rows represent the 20 amino acids. As with a sequence profile, the structural profile is moved along a target sequence to find the highest possible alignment score by a dynamic programming algorithm. Gaps may be included and receive a penalty. The resulting score provides an indication as to whether or not the target protein might adopt such a structure. 

Quaternary structure is the three-dimensional configuration of a protein molecule comprising several independent polypeptide chains.

Secondary structure is the interactions that occur between the C, O, and NH groups on amino acids in a polypeptide chain to form α-helices, β-sheets, turns, loops, and other forms, and that facilitate the folding into a three-dimensional structure. 

Superfamily is a group of protein families of the same or different lengths that are related by distant yet detectable sequence similarity. Members of a given superfamily thus have a common evolutionary origin. Originally, Dayhoff defined the cutoff for superfamily status as being the chance that the sequences are not related of 10⁻⁶, on the basis of an alignment score (Dayhoff et al. 1978). Proteins with few identities in an alignment of the sequences but with a convincingly common number of structural and functional features are placed in the same superfamily. At the level of three-dimensional structure, superfamily proteins will share common structural features such as a common fold, but there may also be differences in the number and arrangement of secondary structures. The PIR resource uses the term homeomorphic superfamilies to refer to superfamilies that are composed of sequences that can be aligned from end to end, representing a sharing of single sequence homology domain, a region of similarity that extends throughout the alignment. This domain may also comprise smaller homology domains that are shared with other protein families and superfamilies. Although a given protein sequence may contain domains found in several superfamilies, thus indicating a complex evolutionary history, sequences will be assigned to only one homeomorphic superfamily based on the presence of similarity throughout a multiple sequence alignment. The superfamily alignment may also include regions that do not align either within or at the ends of the alignment. In contrast, sequences in the same family align well throughout the alignment. 

Supersecondary structure is a term with similar meaning to a structural motif.

Tertiary structure is the three-dimensional or globular structure formed by the packing together or folding of secondary structures of a polypeptide chain.

Secondary structure

Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins based only on knowledge of their amino acid sequence. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm (or similar e.g. STRIDE) applied to the crystal structure of the protein. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins.

The best modern methods of secondary structure prediction in proteins reach about 80% accuracy; this high accuracy allows the use of the predictions as a feature improving fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
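
The per-residue accuracy figure quoted above is usually the Q3 score: the fraction of residues whose predicted three-state label (helix, strand, coil) matches the DSSP assignment reduced to three states. The sketch below uses one common eight-to-three-state mapping, which is an assumption; other conventions exist.

# Sketch: compute Q3 accuracy of a three-state secondary structure
# prediction against a DSSP assignment. DSSP's eight states are reduced
# to helix (H), strand (E), and coil (C) using one common convention
# (assumed here; other mappings exist).
DSSP_TO_3STATE = {
    'H': 'H', 'G': 'H', 'I': 'H',   # helices
    'E': 'E', 'B': 'E',             # strands / bridges
    'T': 'C', 'S': 'C', '-': 'C',   # turns, bends, and the rest -> coil
}

def q3(predicted, dssp):
    """predicted: string over H/E/C; dssp: string over DSSP's 8 states."""
    reference = [DSSP_TO_3STATE.get(s, 'C') for s in dssp]
    assert len(predicted) == len(reference)
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

if __name__ == "__main__":
    pred = "CCHHHHHHCCEEEECC"      # hypothetical prediction
    dssp = "-SHHHHGT-BEEEES-"      # hypothetical DSSP assignment
    print(f"Q3 = {q3(pred, dssp):.2%}")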

Background

Early methods of secondary structure prediction, introduced in the 1960s and early 1970s, focused on identifying likely alpha helices and were based mainly on helix-coil transition models. Significantly more accurate predictions that included beta sheets were introduced in the 1970s and relied on statistical assessments based on probability parameters derived from known solved structures. These methods, applied to a single sequence, are typically at most about 60-65% accurate, and often underpredict beta sheets. The evolutionary conservation of secondary structures can be exploited by simultaneously assessing many homologous sequences in a multiple sequence alignment, by calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern machine learning methods such as neural nets and support vector machines, these methods can achieve up to 80% overall accuracy in globular proteins. The theoretical upper limit of accuracy is around 90%, partly due to idiosyncrasies in DSSP assignment near the ends of secondary structures, where local conformations vary under native conditions but may be forced to assume a single conformation in crystals due to packing constraints. Limitations are also imposed by secondary structure prediction's inability to account for tertiary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic conformational changes related to the protein's function or environment can also alter local secondary structure.

Historical perspective

To date, over 20 different secondary structure prediction methods have been developed. One of the first algorithms was the Chou-Fasman method, which relies predominantly on probability parameters determined from relative frequencies of each amino acid's appearance in each type of secondary structure. The original Chou-Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou-Fasman method is roughly 50-60% accurate in predicting secondary structures.
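
The following sketch illustrates the spirit of such a propensity-based scan: each residue carries a helix propensity, and windows whose average propensity exceeds a threshold are flagged as helical. The propensity values are illustrative approximations of the published parameters, and the real method's separate nucleation and extension rules (as well as its strand and turn propensities) are omitted.

# Sketch: a simplified Chou-Fasman-style scan. Each residue has a helix
# propensity; a window whose average propensity exceeds a threshold is
# flagged as helical. Values below are illustrative, chosen to resemble
# (not reproduce) the published parameters.
HELIX_PROPENSITY = {
    'A': 1.42, 'E': 1.51, 'L': 1.21, 'M': 1.45, 'Q': 1.11, 'K': 1.16,
    'F': 1.13, 'W': 1.08, 'I': 1.08, 'V': 1.06, 'D': 1.01, 'H': 1.00,
    'R': 0.98, 'T': 0.83, 'S': 0.77, 'C': 0.70, 'Y': 0.69, 'N': 0.67,
    'P': 0.57, 'G': 0.57,
}

def predict_helix(sequence, window=6, threshold=1.03):
    """Return a string with 'H' where the window average suggests helix."""
    labels = ['C'] * len(sequence)
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        avg = sum(HELIX_PROPENSITY.get(aa, 1.0) for aa in chunk) / window
        if avg >= threshold:
            for i in range(start, start + window):
                labels[i] = 'H'
    return ''.join(labels)

if __name__ == "__main__":
    seq = "MAELKKLLEEAGPGGSNNTT"   # hypothetical test sequence
    print(seq)
    print(predict_helix(seq))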

The next notable method was the GOR method, named for the three scientists who developed it: Garnier, Osguthorpe, and Robson. It is an information-theory-based method that uses the more powerful probabilistic technique of Bayesian inference. The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the conditional probability of the amino acid assuming each structure given the contributions of its neighbors (it does not assume that the neighbors have that same structure). The approach is both more sensitive and more accurate than that of Chou and Fasman because amino acid structural propensities are only strong for a small number of amino acids such as proline and glycine; weak contributions from each of many neighbors can add up to strong effects overall. The original GOR method was roughly 65% accurate and is dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicted as loops or disorganized regions.

Another significant step forward was the use of machine learning methods. Artificial neural network methods were applied first, using solved structures as training sets to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of the hydrogen bonding patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet. PSIPRED and JPred are among the best-known neural-network-based programs for protein secondary structure prediction. Later, support vector machines proved particularly useful for predicting the locations of turns, which are difficult to identify with statistical methods.
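
A minimal sketch of the sliding-window idea behind such neural-network predictors is shown below, using scikit-learn. Residues in a window are one-hot encoded and the central residue's state is classified; the toy training data, window size, and encoding are assumptions made to keep the example self-contained, whereas real methods train on thousands of solved structures and use evolutionary profiles (PSSM columns) as input.

# Sketch: a sliding-window neural network secondary-structure classifier
# in the spirit of PSIPRED/JPred. Residues in a window are one-hot encoded
# and the central residue's state (H/E/C) is predicted. Toy training data;
# real methods train on large sets of solved structures and use PSSMs.
import numpy as np
from sklearn.neural_network import MLPClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"   # X = padding/unknown
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW = 7   # residues per window (central residue +/- 3)

def encode_windows(sequence):
    """One-hot encode every WINDOW-residue window centered on each position."""
    pad = "X" * (WINDOW // 2)
    padded = pad + sequence + pad
    rows = []
    for i in range(len(sequence)):
        window = padded[i:i + WINDOW]
        onehot = np.zeros((WINDOW, len(AMINO_ACIDS)))
        for j, aa in enumerate(window):
            onehot[j, AA_INDEX.get(aa, AA_INDEX['X'])] = 1.0
        rows.append(onehot.ravel())
    return np.array(rows)

if __name__ == "__main__":
    # Toy "training set": sequences with hand-assigned H/E/C labels.
    train = [("AAEELLKKLAAE", "HHHHHHHHHHHH"),
             ("VTVTVIVSVTVT", "EEEEEEEEEEEE"),
             ("GGPSGNGSPGGD", "CCCCCCCCCCCC")]
    X = np.vstack([encode_windows(s) for s, _ in train])
    y = np.array([lab for _, labels in train for lab in labels])
    clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                        random_state=0).fit(X, y)
    query = "AAELLKKVTVTVGGPSGG"
    print(''.join(clf.predict(encode_windows(query))))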

Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as backbone dihedral angles in unassigned regions. Both SVMs and neural networks have been applied to this problem. More recently, real-valued torsion angles have been accurately predicted by SPINE-X and successfully employed for ab initio structure prediction.

Other improvements

Secondary structure formation is reported to depend on factors beyond the protein sequence alone. For example, secondary structure tendencies are reported to depend on the local environment, solvent accessibility of residues, protein structural class, and even the organism from which the proteins are obtained. Based on such observations, some studies have shown that secondary structure prediction can be improved by adding information about protein structural class, residue accessible surface area, and contact number.

Tertiary structure

The practical role of protein structure prediction is now more important than ever. Massive amounts of protein sequence data are produced by modern large-scale DNA sequencing efforts such as the Human Genome Project. Despite community-wide efforts in structural genomics, the output of experimentally determined protein structures—typically by time-consuming and relatively expensive X-ray crystallography or NMR spectroscopy—is lagging far behind the output of protein sequences. 

Protein structure prediction remains an extremely difficult and unresolved undertaking. The two main problems are the calculation of protein free energy and finding the global minimum of this energy. A protein structure prediction method must explore the space of possible protein structures, which is astronomically large. These problems can be partially bypassed in "comparative" or homology modeling and fold recognition methods, in which the search space is pruned by the assumption that the protein in question adopts a structure that is close to the experimentally determined structure of another homologous protein. De novo or ab initio protein structure prediction methods, on the other hand, must explicitly resolve these problems. The progress and challenges in protein structure prediction have been reviewed by Zhang (2008).

Ab initio protein modelling

Energy- and fragment-based methods

Ab initio or de novo protein modelling methods seek to build three-dimensional protein models "from scratch", i.e., based on physical principles rather than (directly) on previously solved structures. There are many possible procedures that either attempt to mimic protein folding or apply some stochastic method to search possible solutions (i.e., global optimization of a suitable energy function). These procedures tend to require vast computational resources, and have thus only been carried out for tiny proteins. To predict protein structure de novo for larger proteins will require better algorithms and larger computational resources like those afforded by either powerful supercomputers (such as Blue Gene or MDGRAPE-3) or distributed computing (such as Folding@home, the Human Proteome Folding Project and Rosetta@Home). Although these computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) make ab initio structure prediction an active research field.
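
The stochastic-search skeleton common to many such procedures can be illustrated with simulated annealing on a toy model, as in the sketch below. The "conformation" here is just a list of torsion-like angles and the "energy" is an arbitrary smooth function invented for the example; real methods optimize physics-based or knowledge-based energy functions over full atomic coordinates.

# Sketch: the stochastic-search skeleton behind many ab initio protocols,
# shown as simulated annealing on a toy model. The "energy" below is made up
# and only rewards angles near -60 degrees (a helix-like value).
import math
import random

def toy_energy(angles):
    """Made-up energy function over a list of torsion-like angles."""
    return sum((math.cos(math.radians(a + 60.0)) - 1.0) ** 2 for a in angles)

def anneal(n_residues=20, steps=20000, t_start=5.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    angles = [rng.uniform(-180, 180) for _ in range(n_residues)]
    energy = toy_energy(angles)
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)  # cooling schedule
        i = rng.randrange(n_residues)
        trial = angles[:]
        trial[i] += rng.gauss(0, 20.0)                     # perturb one angle
        e_trial = toy_energy(trial)
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if e_trial < energy or rng.random() < math.exp((energy - e_trial) / t):
            angles, energy = trial, e_trial
    return angles, energy

if __name__ == "__main__":
    final_angles, final_energy = anneal()
    print("final energy:", round(final_energy, 4))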

As of 2009, a 50-residue protein could be simulated atom-by-atom on a supercomputer for 1 millisecond. As of 2012, comparable stable-state sampling could be done on a standard desktop with a new graphics card and more sophisticated algorithms. Much longer simulation timescales can be achieved using coarse-grained modeling.

Evolutionary covariation to predict 3D contacts

As sequencing became more commonplace in the 1990s, several groups used protein sequence alignments to predict correlated mutations, and it was hoped that these coevolved residues could be used to predict tertiary structure (using the analogy to distance constraints from experimental procedures such as NMR). The assumption is that when single-residue mutations are slightly deleterious, compensatory mutations may occur to restabilize residue-residue interactions. This early work used what are known as local methods to calculate correlated mutations from protein sequences, but it suffered from indirect false correlations, which result from treating each pair of residues as independent of all other pairs.
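
The simplest such local score is the mutual information between two alignment columns: column pairs with high mutual information are candidate contacts, but, as noted, this statistic also picks up indirect correlations. The sketch below computes it for a made-up toy alignment.

# Sketch: mutual information (MI) between columns of a multiple sequence
# alignment, the classic "local" correlated-mutation score. High-MI column
# pairs are candidate 3D contacts, but this score also captures indirect
# correlations; global statistical models were introduced to remove them.
import math
from collections import Counter
from itertools import combinations

def column_entropy(column):
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(col_i, col_j):
    joint = Counter(zip(col_i, col_j))
    n = len(col_i)
    joint_entropy = -sum((c / n) * math.log2(c / n) for c in joint.values())
    return column_entropy(col_i) + column_entropy(col_j) - joint_entropy

def top_pairs(alignment, k=3):
    """Rank column pairs of the alignment by mutual information."""
    columns = list(zip(*alignment))
    scores = {(i, j): mutual_information(columns[i], columns[j])
              for i, j in combinations(range(len(columns)), 2)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

if __name__ == "__main__":
    msa = ["ARNDKE", "ARNEKD", "GRNDKE", "ARNEKD", "ASNDRE", "ASNERD"]  # toy MSA
    for (i, j), mi in top_pairs(msa):
        print(f"columns {i}-{j}: MI = {mi:.2f}")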

In 2011, a different, and this time global, statistical approach demonstrated that predicted coevolved residues were sufficient to predict the 3D fold of a protein, provided there are enough sequences available (>1,000 homologous sequences are needed). The method, EVfold, uses no homology modeling, threading or 3D structure fragments and can be run on a standard personal computer even for proteins with hundreds of residues. The accuracy of the contacts predicted using this and related approaches has now been demonstrated on many known structures and contact maps, including the prediction of experimentally unsolved transmembrane proteins.

Comparative protein modeling

Comparative protein modelling uses previously solved structures as starting points, or templates. This is effective because it appears that although the number of actual proteins is vast, there is a limited set of tertiary structural motifs to which most proteins belong. It has been suggested that there are only around 2,000 distinct protein folds in nature, though there are many millions of different proteins. 

These methods may also be split into two groups: Homology modeling is based on the reasonable assumption that two homologous proteins will share very similar structures. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through sequence alignment. It has been suggested that the primary bottleneck in comparative modelling arises from difficulties in alignment rather than from errors in structure prediction given a known-good alignment. Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences. 

Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a scoring function is used to assess the compatibility of the sequence to the structure, thus yielding possible three-dimensional models. This type of method is also known as 3D-1D fold recognition due to its compatibility analysis between three-dimensional structures and linear protein sequences. This method has also given rise to methods performing an inverse folding search by evaluating the compatibility of a given structure with a large database of sequences, thus predicting which sequences have the potential to produce a given fold.
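
Conceptually, threading assigns each query residue to a template position and sums an empirical compatibility score. The sketch below reduces this to a toy example: each template position is labeled with a structural environment (buried or exposed) taken from the solved structure, an ungapped alignment is assumed, and the scoring values are made up; real threaders use statistical potentials, pairwise terms, and alignment algorithms that allow gaps.

# Sketch: a toy "fold recognition" score. Each template position carries a
# structural environment (buried or exposed, from the solved structure), and
# the query sequence is scored for how well its residues fit those
# environments. All values and the ungapped alignment are simplifications.
HYDROPHOBIC = set("AVLIMFWC")

def env_score(residue, environment):
    """+1 for a hydrophobic residue in a buried site or a polar residue in
    an exposed site, -1 otherwise (illustrative values)."""
    fits = (residue in HYDROPHOBIC) == (environment == "buried")
    return 1.0 if fits else -1.0

def thread(query, template_envs):
    """Score an ungapped threading of the query onto a template's environments."""
    assert len(query) == len(template_envs)
    return sum(env_score(aa, env) for aa, env in zip(query, template_envs))

if __name__ == "__main__":
    # Hypothetical templates: lists of per-position environments.
    templates = {
        "template_A": ["buried", "buried", "exposed", "buried",
                       "exposed", "exposed", "buried", "exposed"],
        "template_B": ["exposed"] * 8,
    }
    query = "LVKDRSIE"
    for name, envs in templates.items():
        print(name, thread(query, envs))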

Side-chain geometry prediction

Accurate packing of the amino acid side chains represents a separate problem in protein structure prediction. Methods that specifically address the problem of predicting side-chain geometry include dead-end elimination and the self-consistent mean field methods. Low-energy side chain conformations are usually determined on a rigid polypeptide backbone using a set of discrete side chain conformations known as "rotamers." The methods attempt to identify the set of rotamers that minimize the model's overall energy.
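
In its simplest form the task is a discrete optimization: choose one rotamer per residue so that the sum of self energies and pairwise (clash) energies is minimal. The brute-force sketch below illustrates that objective with made-up energies; dead-end elimination and self-consistent mean field methods exist precisely because such enumeration explodes for real proteins.

# Sketch: side-chain packing as discrete optimization over rotamers.
# Each residue has candidate rotamers with a "self" energy; pairs of rotamers
# at contacting residues add a pairwise (clash) energy. Brute force is shown
# for clarity only; all numbers below are made up.
from itertools import product

# self_energy[residue][rotamer]
self_energy = [
    {"r1": 0.0, "r2": 0.5},
    {"r1": 0.2, "r2": 0.0, "r3": 0.8},
    {"r1": 0.1, "r2": 0.3},
]
# pair_energy[(residue_i, residue_j)][(rotamer_i, rotamer_j)] for contacts
pair_energy = {
    (0, 1): {("r1", "r1"): 2.0, ("r1", "r2"): 0.0, ("r1", "r3"): 0.5,
             ("r2", "r1"): 0.0, ("r2", "r2"): 0.1, ("r2", "r3"): 1.5},
    (1, 2): {("r1", "r1"): 0.0, ("r1", "r2"): 0.4, ("r2", "r1"): 1.0,
             ("r2", "r2"): 0.0, ("r3", "r1"): 0.2, ("r3", "r2"): 0.2},
}

def total_energy(assignment):
    """Sum self energies and pairwise energies for one rotamer per residue."""
    e = sum(self_energy[i][r] for i, r in enumerate(assignment))
    for (i, j), table in pair_energy.items():
        e += table[(assignment[i], assignment[j])]
    return e

if __name__ == "__main__":
    choices = [sorted(r.keys()) for r in self_energy]
    best = min(product(*choices), key=total_energy)
    print("best rotamer assignment:", best, "energy:", total_energy(best))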

These methods use rotamer libraries, which are collections of favorable conformations for each residue type in proteins. Rotamer libraries may contain information about the conformation, its frequency, and the standard deviations about mean dihedral angles, which can be used in sampling. Rotamer libraries are derived from structural bioinformatics or other statistical analysis of side-chain conformations in known experimental structures of proteins, such as by clustering the observed conformations for tetrahedral carbons near the staggered (60°, 180°, -60°) values. 

Rotamer libraries can be backbone-independent, secondary-structure-dependent, or backbone-dependent. Backbone-independent rotamer libraries make no reference to backbone conformation, and are calculated from all available side chains of a certain type (for instance, the first example of a rotamer library, done by Ponder and Richards at Yale in 1987). Secondary-structure-dependent libraries present different dihedral angles and/or rotamer frequencies for α-helix, β-sheet, or coil secondary structures. Backbone-dependent rotamer libraries present conformations and/or frequencies dependent on the local backbone conformation as defined by the backbone dihedral angles φ and ψ, regardless of secondary structure.

The modern versions of these libraries as used in most software are presented as multidimensional distributions of probability or frequency, where the peaks correspond to the dihedral-angle conformations considered as individual rotamers in the lists. Some versions are based on very carefully curated data and are used primarily for structure validation, while others emphasize relative frequencies in much larger data sets and are the form used primarily for structure prediction, such as the Dunbrack rotamer libraries.

Side-chain packing methods are most useful for analyzing the protein's hydrophobic core, where side chains are more closely packed; they have more difficulty addressing the looser constraints and higher flexibility of surface residues, which often occupy multiple rotamer conformations rather than just one.

Prediction of structural classes

Statistical methods have been developed for predicting structural classes of proteins based on their amino acid composition, pseudo amino acid composition and functional domain composition.

Quaternary structure

In the case of complexes of two or more proteins, where the structures of the proteins are known or can be predicted with high accuracy, protein–protein docking methods can be used to predict the structure of the complex. Information on the effect of mutations at specific sites on the affinity of the complex helps in understanding the complex structure and guiding docking methods.

Software

A great number of software tools for protein structure prediction exist. Approaches include homology modeling, protein threading, ab initio methods, secondary structure prediction, and transmembrane helix and signal peptide prediction. Some recent successful methods based on the CASP experiments include I-TASSER and HHpred. For a complete list, see the list of protein structure prediction software.

Evaluation of automatic structure prediction servers

CASP, which stands for Critical Assessment of Techniques for Protein Structure Prediction, is a community-wide experiment for protein structure prediction that has taken place every two years since 1994. CASP provides an opportunity to assess the quality of available human, non-automated methodology (human category) and automatic servers for protein structure prediction (server category, introduced in CASP7).

The CAMEO3D Continuous Automated Model EvaluatiOn Server evaluates automated protein structure prediction servers on a weekly basis using blind predictions for newly released protein structures. CAMEO publishes the results on its website.

Computational immunology

From Wikipedia, the free encyclopedia
 
In academia, computational immunology is a field of science that encompasses high-throughput genomic and bioinformatics approaches to immunology. The field's main aim is to convert immunological data into computational problems, solve these problems using mathematical and computational approaches and then convert these results into immunologically meaningful interpretations.

Introduction

The immune system is a complex system of the human body, and understanding it is one of the most challenging topics in biology. Immunology research is important for understanding the mechanisms underlying the defense of the human body, for developing drugs for immunological diseases, and for maintaining health. Recent advances in genomic and proteomic technologies have transformed immunology research drastically. Sequencing of the human and other model organism genomes has produced increasingly large volumes of data relevant to immunology research, and at the same time huge amounts of functional and clinical data are being reported in the scientific literature and stored in clinical records. Recent advances in bioinformatics or computational biology have helped to understand and organize these large-scale data and have given rise to a new area called computational immunology, or immunoinformatics.

Computational immunology is a branch of bioinformatics and is based on similar concepts and tools, such as sequence alignment and protein structure prediction tools. Immunomics is a discipline like genomics and proteomics; it specifically combines immunology with computer science, mathematics, chemistry, and biochemistry for the large-scale analysis of immune system functions. It aims to study the complex protein–protein interactions and networks, and allows a better understanding of immune responses and their role during normal, diseased, and reconstitution states. Computational immunology is the part of immunomics that is focused on analyzing large-scale experimental data.

History

Computational immunology began over 90 years ago with the theoretic modeling of malaria epidemiology. At that time, the emphasis was on the use of mathematics to guide the study of disease transmission. Since then, the field has expanded to cover all other aspects of immune system processes and diseases.

Immunological database

Following recent advances in sequencing and proteomics technology, there has been a manyfold increase in the generation of molecular and immunological data. The data are so diverse that they can be categorized in different databases according to their use in research. To date, a total of 31 different immunological databases are noted in the Nucleic Acids Research (NAR) Database Collection; these are given in the following table, together with some other immune-related databases. The information given in the table is taken from the database descriptions in the NAR Database Collection.

Database Description
ALPSbase Autoimmune lymphoproliferative syndrome database
AntigenDB Sequence, structure, and other data on pathogen antigens.
AntiJen Quantitative binding data for peptides and proteins of immunological interest.
BCIpep This database stores information on all experimentally determined B-cell epitopes of antigenic proteins. This is a curated database where detailed information about the epitopes is collected and compiled from published literature and existing databases. It covers a wide range of pathogenic organisms such as viruses, bacteria, protozoa and fungi. Each entry in the database provides full information about a B-cell epitope, including amino acid sequence, source of the antigenic protein, immunogenicity, model organism and antibody generation/neutralization test.
dbMHC dbMHC provides access to HLA sequences, tools to support genetic testing of HLA loci, HLA allele and haplotype frequencies of over 90 populations worldwide, as well as clinical datasets on hematopoietic stem cell transplantation, and insulin dependent diabetes mellitus (IDDM), Rheumatoid Arthritis (RA), Narcolepsy and Spondyloarthropathy. For more information go to this link http://www.oxfordjournals.org/nar/database/summary/604
DIGIT Database of ImmunoGlobulin sequences and Integrated Tools.
FIMM FIMM is an integrated database of functional molecular immunology that focuses on the T-cell response to disease-specific antigens. FIMM provides fully referenced information integrated with data retrieval and sequence analysis tools on HLA, peptides, T-cell epitopes, antigens and diseases, and constitutes one backbone of future computational immunology research. Antigen protein data have been enriched with more than 27,000 sequences derived from the non-redundant SwissProt-TREMBL-TREMBL_NEW (SPTR) database of antigens similar or related to FIMM antigens across various species to facilitate a comprehensive analysis of conserved or variable T-cell epitopes.
GPX-Macrophage Expression Atlas The GPX Macrophage Expression Atlas (GPX-MEA) is an online resource for expression based studies of a range of macrophage cell types following treatment with pathogens and immune modulators. GPX Macrophage Expression Atlas (GPX-MEA) follows the MIAME standard and includes an objective quality score with each experiment. It places special emphasis on rigorously capturing the experimental design and enables the statistical analysis of expression data from different micro-array experiments. This is the first example of a focussed macrophage gene expression database that allows efficient identification of transcriptional patterns, which provide novel insights into biology of this cell system.
HaptenDB It is a comprehensive database of hapten molecules. This is a curated database where information is collected and compiled from published literature and web resources. Presently the database has more than 1700 entries, where each entry provides comprehensive detail about a hapten molecule, including: i) nature of the hapten; ii) methods of anti-hapten antibody production; iii) information about the carrier protein; iv) coupling method; v) assay method (used for characterization); and vi) specificities of antibodies. HaptenDB covers a wide array of haptens ranging from antibiotics of biomedical importance to pesticides. This database will be very useful for studying serological reactions and the production of antibodies.
HPTAA HPTAA is a database of potential tumor-associated antigens that uses expression data from various expression platforms, including carefully chosen publicly available microarray expression data, GEO SAGE data and Unigene expression data.
IEDB-3D Structural data within the Immune Epitope Database.
IL2Rgbase X-linked severe combined immunodeficiency mutations.
IMGT IMGT is an integrated knowledge resource specialized in IG, TR, MHC, IG superfamily, MHC superfamily and related proteins of the immune system of human and other vertebrate species. IMGT comprises 6 databases, 15 on-line tools for sequence, gene and 3D structure analysis, and more than 10,000 pages of Web resources. Data standardization, based on IMGT-ONTOLOGY, has been approved by WHO/IUIS.
IMGT_GENE-DB IMGT/GENE-DB is the IMGT® comprehensive genome database for immunoglobulins (IG) and T cell receptors (TR) genes from human and mouse, and, in development, from other vertebrate species (e.g. rat). IMGT/GENE-DB is part of IMGT®, the international ImMunoGeneTics information system®, the high-quality integrated knowledge resource specialized in IG, TR, major histocompatibility complex (MHC) of human and other vertebrate species, and related proteins of the immune system (RPI) that belong to the immunoglobulin superfamily (IgSF) and to the MHC superfamily (MhcSF).
IMGT/HLA There are currently over 1600 officially recognised HLA alleles and these sequences are made available to the scientific community through the IMGT/HLA database. In 1998, the IMGT/HLA database was publicly released. Since this time, the database has grown and is the primary source of information for the study of sequences of the human major histocompatibility complex. The initial release of the database contained allele reports, alignment tools, submission tools as well as detailed descriptions of the source cells. The database is updated quarterly with all the new and confirmatory sequences submitted to the WHO Nomenclature Committee and on average an additional 75 new and confirmatory sequences are included in each quarterly release. The IMGT/HLA database provides a centralized resource for everybody interested, either centrally or peripherally, in the HLA system.
IMGT/LIGM-DB IMGT/LIGM-DB is the IMGT® comprehensive database of immunoglobulin (IG) and T cell receptor (TR) nucleotide sequences, from human and other vertebrate species, with translation for fully annotated sequences, created in 1989 by LIGM (http://www.imgt.org/textes/IMGTinformation/LIGM.html), Montpellier, France, on the Web since July 1995. IMGT/LIGM-DB is the first and the largest database of IMGT®, the international ImMunoGeneTics information system®, the high-quality integrated knowledge resource specialized in IG, TR, major histocompatibility complex (MHC) of human and other vertebrate species, and related proteins of the immune system (RPI) that belong to the immunoglobulin superfamily (IgSF) and to the MHC superfamily (MhcSF). IMGT/LIGM-DB sequence data are identified by the EMBL/GenBank/DDBJ accession number. The unique source of data for IMGT/LIGM-DB is EMBL, which shares data with GenBank and DDBJ.
Interferon Stimulated Gene Database Interferons (IFN) are a family of multifunctional cytokines that activate transcription of a subset of genes. The gene products induced by IFN are responsible for the antiviral, antiproliferative and immunomodulatory properties of this cytokine. In order to obtain a more comprehensive understanding of the genes regulated by IFNs we have used different microarray formats to identify over 400 interferon stimulated genes (ISG). To facilitate the dissemination of this data we have compiled a database comprising the ISGs assigned into functional categories. The database is fully searchable and contains links to sequence and Unigene information. The database and the array data are accessible via the World Wide Web at (http://www.lerner.ccf.org/labs/williams/ ). We intend to add published ISG-sequences and those discovered by further transcript profiling to the database to eventually compile a complete list of ISGs.
IPD-ESTDAB The Immuno Polymorphism Database (IPD) is a set of specialist databases related to the study of polymorphic genes in the immune system. IPD-ESTDAB is a database of immunologically characterised melanoma cell lines. The database works in conjunction with the European Searchable Tumour Cell Line Database (ESTDAB) cell bank, which is housed in Tübingen, Germany, and provides immunologically characterised tumour cells.
IPD-HPA - Human Platelet Antigens Human platelet antigens are alloantigens expressed only on platelets, specifically on platelet membrane glycoproteins. These platelet-specific antigens are immunogenic and can result in pathological reactions to transfusion therapy. The IPD-HPA section contains nomenclature information and additional background material about human platelet antigens. The different genes in the HPA system have not been sequenced to the same level as some of the other projects, so currently only single nucleotide polymorphisms (SNPs) are used to determine alleles. This information is presented in a grid of SNPs for each gene. The IPD and HPA nomenclature committee hopes to expand this to provide full sequence alignments when possible.
IPD-KIR - Killer-cell Immunoglobulin-like Receptors The Killer-cell Immunoglobulin-like Receptors (KIR) are members of the immunoglobulin super family (IgSF) formerly called Killer-cell Inhibitory Receptors. KIRs have been shown to be highly polymorphic both at the allelic and haplotypic levels. They are composed of two or three Ig-domains, a transmembrane region and cytoplasmic tail, which can in turn be short (activatory) or long (inhibitory). The Leukocyte Receptor Complex (LRC), which encodes KIR genes, has been shown to be polymorphic, polygenic and complex in a manner similar to the MHC. The IPD-KIR Sequence Database contains the most up to date nomenclature and sequence alignments.
IPD-MHC The MHC sequences of many different species have been reported, along with different nomenclature systems used in the naming and identification of new genes and alleles in each species. The sequences of the major histocompatibility complex from a number of different species are highly conserved between species. By bringing together the work of different nomenclature committees and the sequences of different species, it is hoped to provide a central resource that will facilitate further research on the MHC of each species and on their comparison. The first release of the IPD-MHC database involved the work of groups specialising in non-human primates, canines (DLA) and felines (FLA) and incorporated all data previously available in the IMGT/MHC database. This release included data from five species of ape, sixteen species of new world monkey, seventeen species of old world monkey, as well as data on different canines and felines. Since the first release, sequences from cattle (BoLA), swine (SLA), and rats (RT1) have been added, and the work to include MHC sequences from chickens and horses (ELA) is still ongoing.
MHCBN MHCBN is a comprehensive database comprising over 23000 peptide sequences whose binding affinity with MHC or TAP molecules has been assayed experimentally. It is a curated database where entries are compiled from published literature and public databases. Each entry of the database provides full information (sequence, MHC or TAP binding specificity, source protein) about a peptide whose binding affinity (IC50) and T-cell activity have been experimentally determined. MHCBN has a number of web-based tools for the analysis and retrieval of information. All database entries are hyperlinked to major databases like SWISS-PROT, PDB, IMGT/HLA-DB, PubMed and OMIM to provide information beyond the scope of MHCBN. The current version of MHCBN contains 1053 entries of TAP-binding peptides. Information about the diseases associated with various MHC alleles is also included in this version.
MHCPEP This database contains a list of MHC-binding peptides.
MPID-T2 MPID-T2 (https://web.archive.org/web/20120902154345/http://biolinfo.org/mpid-t2/) is a highly curated database for sequence-structure-function information on MHC-peptide interactions. It contains all structures of major histocompatibility complex proteins (MHC) containing bound peptides, with emphasis on the structural characterization of these complexes. Database entries have been grouped into fully referenced redundant and non-redundant categories. The MHC-peptide interactions have been presented in terms of a set of sequence and structural parameters representative of molecular recognition. MPID will facilitate the development of algorithms to predict whether a query peptide sequence will bind to a specific MHC allele. MPID data has been sorted primarily on the basis of MHC Class, followed by organism (MHC source), next by allele type and finally by the length of peptide in the binding groove (peptide residues within 5 Å of the MHC). Data on inter-molecular hydrogen bonds, gap volume and gap index available in MPID are pre-computed and the interface area due to complex formation is calculated based on accessible surface area calculations. The available MHC-peptide databases have addressed sequence information as well as binding (or the lack thereof) of peptide sequences.
MUGEN Mouse Database Murine models of immune processes and immunological diseases.
Protegen Protective antigen database and analysis system.
SuperHapten SuperHapten is a manually curated hapten database integrating information from literature and web resources. The current version of the database compiles 2D/3D structures, physicochemical properties and references for about 7,500 haptens and 25,000 synonyms. The commercial availability is documented for about 6,300 haptens and 450 related antibodies, enabling experimental approaches on cross-reactivity. The haptens are classified regarding their origin: pesticides, herbicides, insecticides, drugs, natural compounds, etc. Queries allow identification of haptens and associated antibodies according to functional class, carrier protein, chemical scaffold, composition or structural similarity.
The Immune Epitope Database (IEDB) The Immune Epitope Database (IEDB, www.iedb.org), provides a catalog of experimentally characterized B and T cell epitopes, as well as data on MHC binding and MHC ligand elution experiments. The database represents the molecular structures recognized by adaptive immune receptors and the experimental contexts in which these molecules were determined to be immune epitopes. Epitopes recognized in humans, non-human primates, rodents, pigs, cats and all other tested species are included. Both positive and negative experimental results are captured. Over the course of four years, the data from 180,978 experiments were curated manually from the literature, covering about 99% of all publicly available information on peptide epitopes mapped in infectious agents (excluding HIV) and 93% of those mapped in allergens.
TmaDB To analyse TMA output a relational database (known as TmaDB) has been developed to collate all aspects of information relating to TMAs. These data include the TMA construction protocol, experimental protocol and results from the various immunocytological and histochemical staining experiments including the scanned images for each of the TMA cores. Furthermore, the database contains pathological information associated with each of the specimens on the TMA slide, the location of the various TMAs and the individual specimen blocks (from which cores were taken) in the laboratory and their current status. TmaDB has been designed to incorporate and extend many of the published common data elements and the XML format for TMA experiments and is therefore compatible with the TMA data exchange specifications developed by the Association for Pathology Informatics community.
VBASE2 VBASE2 is an integrative database of germ-line V genes from the immunoglobulin loci of human and mouse. It presents V gene sequences from the EMBL database and Ensembl together with the corresponding links to the source data. The VBASE2 dataset is generated in an automatic process based on a BLAST search of V genes against EMBL and the Ensembl dataset. The BLAST hits are evaluated with the DNAPLOT program, which allows immunoglobulin sequence alignment and comparison, RSS recognition and analysis of the V(D)J-rearrangements. As a result of the BLAST hit evaluation, the VBASE2 entries are classified into 3 different classes: class 1 holds sequences for which a genomic reference and a rearranged sequence is known. Class 2 contains sequences, which have not been found in a rearrangement, thus lacking evidence of functionality. Class 3 contains sequences which have been found in different V(D)J rearrangements but lack a genomic reference. All VBASE2 sequences are compared with the datasets from the VBASE-, IMGT- and KABAT-databases (latest published versions), and the respective references are provided in each VBASE2 sequence entry. The VBASE2 database can be accessed by either a text based query form or by a sequence alignment with the DNAPLOT program. A DAS-server shows the VBASE2 dataset within the Ensembl Genome Browser and links to the database.
Epitome Epitome is a database of all known antigenic residues and the antibodies that interact with them, including a detailed description of the residues involved in the interaction and their sequence/structure environments. Each entry in the database describes one interaction between a residue on an antigenic protein and a residue on an antibody chain. Every interaction is described using the following parameters: PDB identifier, antigen chain ID, PDB position of the antigenic residue, type of antigenic residue and its sequence environment, antigen residue secondary structure state, antigen residue solvent accessibility, antibody chain ID, type of antibody chain (heavy or light), CDR number, PDB position of the antibody residue, and type of antibody residue and its sequence environment. Additionally, interactions can be visualized using an interface to Jmol.
ImmGen The Immunological Genome consortium database includes expression profiles for more than 250 mouse immune cell types, and several data browsers to study the dataset.

Online resources for allergy information are also available at http://www.allergen.org. Such data are valuable for investigating cross-reactivity between known allergens and analysing potential allergenicity in proteins. The Structural Database of Allergen Proteins (SDAP) stores information on allergenic proteins. The Food Allergy Research and Resource Program (FARRP) Protein Allergen-Online Database contains sequences of known and putative allergens derived from the scientific literature and public databases. Allergome emphasizes the annotation of allergens that cause IgE-mediated disease.

Tools

A variety of computational, mathematical and statistical methods have been reported. These tools are helpful for the collection, analysis, and interpretation of immunological data. They include text mining, information management, sequence analysis, analysis of molecular interactions, and mathematical models that enable advanced simulations of the immune system and immunological processes. Attempts are being made to extract interesting and complex patterns from unstructured text documents in the immunological domain, such as the categorization of allergen cross-reactivity information, the identification of cancer-associated gene variants, and the classification of immune epitopes.

Immunoinformatics uses basic bioinformatics tools such as ClustalW, BLAST, and TreeView, as well as specialized immunoinformatics tools such as EpiMatrix, IMGT/V-QUEST for IG and TR sequence analysis, and IMGT/Collier-de-Perles and IMGT/StructuralQuery for IG variable domain structure analysis. Methods that rely on sequence comparison are diverse and have been applied to analyze HLA sequence conservation, help verify the origins of human immunodeficiency virus (HIV) sequences, and construct homology models for the analysis of hepatitis B virus polymerase resistance to lamivudine and emtricitabine.
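
Sequence comparison of this kind can be carried out with standard libraries. The snippet below is a minimal sketch using Biopython's pairwise aligner to score the similarity of two short protein fragments, of the sort one might compare when assessing HLA sequence conservation; the sequences are illustrative placeholders only loosely modelled on HLA class I, not real allele data.

```python
# Minimal sketch of pairwise protein sequence comparison with Biopython.
# The two sequences are short hypothetical placeholders, not real HLA alleles.
from Bio.Align import PairwiseAligner, substitution_matrices

aligner = PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10    # typical protein gap-open penalty
aligner.extend_gap_score = -0.5

seq_a = "GSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQR"
seq_b = "GSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASPR"

alignment = aligner.align(seq_a, seq_b)[0]

# Simple ungapped identity, valid here because both fragments are the same length.
identity = sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)
print(alignment)
print(f"Score: {alignment.score:.1f}, identity: {identity:.1%}")
```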

There are also computational models that focus on protein–protein interactions and networks, as well as tools for T and B cell epitope mapping, proteasomal cleavage site prediction, and TAP–peptide binding prediction. Experimental data are essential for designing and validating the models that predict these various molecular targets; computational immunology thus depends on a constant interplay between experimental data and mathematically designed computational tools.

Applications

Allergies

Allergies, while a critical subject of immunology, vary considerably among individuals and sometimes even among genetically similar individuals. The assessment of protein allergenic potential focuses on three main aspects: (i) immunogenicity; (ii) cross-reactivity; and (iii) clinical symptoms. Immunogenicity arises from the response of IgE antibody-producing B cells and/or T cells to a particular allergen; immunogenicity studies therefore focus mainly on identifying the B-cell and T-cell recognition sites of allergens. The three-dimensional structural properties of allergens control their allergenicity.

Immunoinformatics tools can be used to predict protein allergenicity and will become increasingly important in the screening of novel foods before their wide-scale release for human use. Thus, major efforts are under way to build reliable, broad-based allergy databases and to combine them with well-validated prediction tools in order to identify potential allergens in genetically modified drugs and foods. Although these developments are at an early stage, the World Health Organization and the Food and Agriculture Organization have proposed guidelines for evaluating the allergenicity of genetically modified foods. According to the Codex Alimentarius, a protein is potentially allergenic if it shares an identity of ≥6 contiguous amino acids or ≥35% sequence similarity over an 80-amino-acid window with a known allergen. Although such rules exist, their inherent limitations have become apparent and exceptions have been well documented.
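
The two Codex criteria are simple enough to sketch in code. The example below is a deliberately simplified illustration: it assumes an ungapped sliding-window comparison in place of the local alignment (e.g. FASTA) used in real assessments, and the sequences are made up for demonstration.

```python
# Simplified sketch of the two FAO/WHO Codex Alimentarius criteria described
# above: (1) an identical stretch of >= 6 contiguous amino acids shared with a
# known allergen, or (2) >= 35% identity over an 80-amino-acid sliding window.
# Real assessments use local alignment tools (e.g. FASTA); the ungapped window
# comparison here is a deliberate simplification for illustration only.

def shares_6mer(query: str, allergen: str) -> bool:
    """Criterion 1: any identical 6-mer shared between query and allergen."""
    kmers = {allergen[i:i + 6] for i in range(len(allergen) - 5)}
    return any(query[i:i + 6] in kmers for i in range(len(query) - 5))

def max_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Criterion 2 (simplified): best ungapped identity over an 80-aa window."""
    best = 0.0
    for i in range(max(1, len(query) - window + 1)):
        q = query[i:i + window]
        for j in range(max(1, len(allergen) - window + 1)):
            a = allergen[j:j + window]
            n = min(len(q), len(a))
            if n == 0:
                continue
            identity = sum(x == y for x, y in zip(q[:n], a[:n])) / n
            best = max(best, identity)
    return best

def potentially_allergenic(query: str, allergen: str) -> bool:
    """Flag the query if either Codex criterion is met against the allergen."""
    return shares_6mer(query, allergen) or max_window_identity(query, allergen) >= 0.35

# Toy example with made-up sequences (they share one identical stretch):
known_allergen = "MKTLLLTLVVVTIVCLDLGYTRICFNQHSSQPQTTKTCSPGESSCYNKQWSDFRGTIIERGCG"
candidate      = "MAAQTTKTCSPGESSCYNKQWLLDEFGATRRNDQSVLPKWEQALMNHHGGTTVSSAAYYLLKK"
print(potentially_allergenic(candidate, known_allergen))
```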

Infectious diseases and host responses

In the study of infectious diseases and host responses, mathematical and computational models are a great help. These models have been very useful in characterizing the behavior and spread of infectious diseases, in understanding the dynamics of the pathogen within the host, and in identifying the host factors that allow pathogen persistence. Examples include Plasmodium falciparum infection and nematode infection in ruminants.

Much has been done to understand immune responses to various pathogens by integrating genomics and proteomics with bioinformatics strategies. Many exciting developments in the large-scale screening of pathogens are currently taking place. The National Institute of Allergy and Infectious Diseases (NIAID) has initiated an effort to systematically map the B and T cell epitopes of category A–C pathogens. These pathogens include Bacillus anthracis (anthrax), Clostridium botulinum toxin (botulism), Variola major (smallpox), Francisella tularensis (tularemia), viral hemorrhagic fevers, Burkholderia pseudomallei, staphylococcal enterotoxin B, yellow fever, influenza, rabies, and Chikungunya virus. Rule-based systems have been reported for the automated extraction and curation of influenza A records.

This work is expected to yield algorithms that identify conserved regions of pathogen sequences, which in turn are useful for vaccine development and for limiting the spread of infectious disease. Examples include a method for identifying vaccine targets from protein regions of conserved HLA binding and the computational assessment of the cross-reactivity of broadly neutralizing antibodies against viral pathogens. These examples illustrate the power of immunoinformatics applications in helping to solve complex problems in public health. Immunoinformatics could accelerate the discovery process dramatically and potentially shorten the time required for vaccine development. Immunoinformatics tools have been used to design vaccines against Dengue virus and Leishmania.
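
As a rough illustration of how conserved regions might be flagged, the sketch below scores per-column conservation in a pre-computed multiple sequence alignment and reports highly conserved windows. The aligned sequences, window size and threshold are hypothetical choices for demonstration, not those of any published method; a real analysis would start from an alignment of actual strain sequences (e.g. produced with ClustalW).

```python
# Minimal sketch: scoring column conservation in a (pre-computed) multiple
# sequence alignment to flag conserved regions of a pathogen protein that
# could be prioritised as vaccine targets. The aligned strings below are
# hypothetical placeholders.
from collections import Counter

alignment = [
    "MKTIIALSYIFCLVFA",
    "MKTIIALSYIFCLAFA",
    "MKTIIVLSYIFCLVFA",
    "MKAIIALSYIFCLVFA",
]

def column_conservation(msa):
    """Fraction of sequences carrying the most common residue at each column."""
    scores = []
    for col in zip(*msa):
        counts = Counter(c for c in col if c != "-")   # ignore gap characters
        scores.append(counts.most_common(1)[0][1] / len(col) if counts else 0.0)
    return scores

def conserved_windows(scores, window=5, threshold=0.9):
    """Report windows whose average conservation exceeds the threshold."""
    hits = []
    for i in range(len(scores) - window + 1):
        avg = sum(scores[i:i + window]) / window
        if avg >= threshold:
            hits.append((i, i + window, avg))
    return hits

scores = column_conservation(alignment)
for start, end, avg in conserved_windows(scores):
    print(f"columns {start}-{end - 1}: mean conservation {avg:.2f}")
```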

Immune system function

Using these techniques, it is possible to model how the immune system functions. Such models have been used to describe T-cell-mediated suppression, peripheral lymphocyte migration, T-cell memory, tolerance, thymic function, and antibody networks. Models also help predict the dynamics of pathogen toxicity and of T-cell memory in response to different stimuli, and several models aid understanding of the nature of specificity in immune networks and of immunogenicity.

For example, modeling was useful for examining the functional relationship between TAP peptide transport and HLA class I antigen presentation. TAP is a transmembrane protein responsible for transporting antigenic peptides into the endoplasmic reticulum, where MHC class I molecules can bind them and present them to T cells. Because TAP does not bind all peptides equally, TAP-binding affinity could influence the ability of a particular peptide to gain access to the MHC class I pathway. An artificial neural network (ANN) was used to study peptide binding to human TAP and its relationship with MHC class I binding. Using this method, the affinity of HLA-binding peptides for TAP was found to differ according to the HLA supertype concerned. This research could have important implications for the design of peptide-based immunotherapeutic drugs and vaccines, and it shows the power of the modeling approach for understanding complex immune interactions.
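
The following sketch illustrates only the general idea of such a model, not the published ANN: it one-hot encodes 9-mer peptides and fits a small multilayer perceptron with scikit-learn, using randomly generated peptides and synthetic affinity values as stand-ins for measured TAP-binding data.

```python
# Illustrative sketch (not the published model): training a small neural
# network to map 9-mer peptides to a TAP-binding score. Peptides and scores
# here are randomly generated stand-ins; a real study would use measured
# TAP affinities. Requires numpy and scikit-learn.
import numpy as np
from sklearn.neural_network import MLPRegressor

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AA)}

def one_hot(peptide: str) -> np.ndarray:
    """Encode a 9-mer peptide as a flat 9 x 20 one-hot vector."""
    vec = np.zeros((len(peptide), len(AA)))
    for pos, aa in enumerate(peptide):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

rng = np.random.default_rng(0)
peptides = ["".join(rng.choice(list(AA), size=9)) for _ in range(500)]
# Synthetic "affinity" values standing in for experimentally measured data.
affinity = rng.normal(size=len(peptides))

X = np.array([one_hot(p) for p in peptides])
model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
model.fit(X, affinity)

print("Predicted score for a test peptide:", model.predict([one_hot("SIINFEKLM")])[0])
```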

There are also methods that integrate peptide prediction tools with computer simulations, providing detailed information on the dynamics of the immune response to a given pathogen's peptides.

Cancer Informatics

Cancer results from somatic mutations that provide cancer cells with a selective growth advantage, so identifying novel mutations has become increasingly important. Genomics and proteomics techniques are used worldwide to identify mutations related to each specific cancer and to its treatment. Computational tools are used to predict the growth and surface antigens of cancerous cells. Publications have described targeted approaches for assessing mutations and cancer risk; for example, the algorithm CanPredict was used to indicate how closely a specific gene resembles known cancer-causing genes. Cancer immunology has received so much attention that the data related to it are growing rapidly. Protein–protein interaction networks provide valuable information on tumorigenesis in humans: cancer proteins exhibit a network topology in the human interactome that differs from that of normal proteins. Immunoinformatics has also been useful in increasing the success of tumour vaccination. Recently, pioneering work has analysed the dynamics of the host immune system in response to artificial immunity induced by vaccination strategies. Other simulation tools use predicted cancer peptides to forecast immune-specific anticancer responses that depend on the specified HLA. These resources are likely to grow significantly in the near future, and immunoinformatics will be a major growth area in this domain.
