Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins.
These proteins are usually ones that are poorly studied or predicted
based on genomic sequence data. These predictions are often driven by
data-intensive computational procedures. Information may come from
nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining
of publications, phylogenetic profiles, phenotypic profiles, and
protein-protein interaction. Protein function is a broad term: the roles
of proteins range from catalysis of biochemical reactions to transport
to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.
Generally, function can be thought of as "anything that happens to or through a protein". The Gene Ontology Consortium
provides a useful classification of functions, based on a dictionary of
well-defined terms divided into three main categories of molecular function, biological process and cellular component. Researchers can query this database with a protein name or accession number to retrieve associated Gene Ontology (GO) terms or annotations based on computational or experimental evidence.
While techniques such as microarray analysis, RNA interference, and the yeast two-hybrid system
can be used to experimentally demonstrate the function of a protein,
advances in sequencing technologies have made the rate at which proteins
can be experimentally characterized much slower than the rate at which
new sequences become available. Thus, new sequences are mostly annotated by prediction using computational methods, as such annotation can often
be done quickly and for many genes or proteins at once. The first such
methods inferred function based on homologous proteins with known functions (homology-based function prediction).
The development of context-based and structure-based methods has expanded what information can be predicted, and a combination of methods
can now be used to get a picture of complete cellular pathways based on
sequence data.
The importance and prevalence of computational prediction of gene
function is underlined by an analysis of 'evidence codes' used by the GO
database: as of 2010, 98% of annotations were listed under the code IEA
(inferred from electronic annotation) while only 0.6% were based on
experimental evidence.
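These proportions can be checked directly against a Gene Ontology annotation file in GAF format, where the evidence code is the seventh tab-separated column. The sketch below is illustrative only: the file name and the particular set of codes treated as "experimental" (the commonly cited EXP family) are assumptions, not part of the analysis cited above.

```python
from collections import Counter

# Evidence codes treated as "experimental" here; this grouping is an
# assumption for illustration (it follows the commonly cited EXP family).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def evidence_code_summary(gaf_path):
    """Count GO evidence codes in a GAF 2.x file (7th column)."""
    counts = Counter()
    with open(gaf_path) as handle:
        for line in handle:
            if line.startswith("!"):      # skip GAF header/comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 6:
                counts[fields[6]] += 1
    total = sum(counts.values())
    iea = counts.get("IEA", 0)
    exp = sum(counts[c] for c in EXPERIMENTAL)
    return total, iea / total, exp / total

# Example usage (hypothetical file name):
# total, iea_frac, exp_frac = evidence_code_summary("goa_human.gaf")
# print(f"IEA: {iea_frac:.1%}, experimental: {exp_frac:.1%}")
```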
Function prediction methods
Homology-based methods
A part of a multiple sequence alignment of four different hemoglobin protein sequences. Similar protein sequences usually indicate shared functions.
Proteins of similar sequence are usually homologous and thus have a similar function. Hence proteins in a newly sequenced genome are routinely annotated using the sequences of similar proteins in related genomes.
However, closely related proteins do not always share the same function. For example, the yeast Gal1 and Gal3 proteins are paralogs (73% identity and 92% similarity) that have evolved very different functions with Gal1 being a galactokinase and Gal3 being a transcriptional inducer.
There is no hard sequence-similarity threshold for "safe"
function prediction; many proteins of barely detectable sequence
similarity have the same function while others (such as Gal1 and Gal3)
are highly similar but have evolved different functions. As a rule of
thumb, sequences that are more than 30-40% identical are usually considered to have the same or a very similar function.
For enzymes, predictions of specific functions are especially difficult, because catalytic activity depends on only a few key residues in the active site, so very different sequences can have very similar activities. By contrast, even at sequence identities of 70% or greater, about 10% of enzyme pairs have different substrates, and differences in the actual enzymatic reactions are not uncommon at around 50% sequence identity.
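The identity thresholds discussed above presuppose a way of measuring identity. A minimal sketch, assuming the two sequences have already been aligned by an external tool (the alignment step itself is not shown), computes percent identity over aligned, non-gap positions and applies the rule of thumb:

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned positions (both sequences include gaps)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned_cols = [(a, b) for a, b in zip(aligned_a, aligned_b)
                    if a != "-" and b != "-"]
    matches = sum(1 for a, b in aligned_cols if a == b)
    return 100.0 * matches / len(aligned_cols)

# Toy example with two short aligned fragments (illustrative only).
a = "MV-LSPADKTNVKAAWGKVGA"
b = "MVHLTPEEKSAVTALWGKV-A"
pid = percent_identity(a, b)
# Rough heuristic from the text: >30-40% identity suggests similar function,
# but exceptions such as Gal1/Gal3 show this is not a guarantee.
print(f"{pid:.1f}% identity -> {'likely similar function' if pid > 40 else 'uncertain'}")
```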
Sequence motif-based methods
The development of protein domain databases such as Pfam (Protein Families Database) allows known domains to be found within a query sequence, providing evidence for likely functions. The dcGO website contains annotations to both individual domains and supra-domains (i.e., combinations of two or more successive domains), and its dcGO Predictor thus allows function prediction in a more realistic manner. Within protein domains, shorter signatures known as 'motifs' are associated with particular functions, and motif databases such as PROSITE ('database of protein domains, families and functional sites') can be searched using a query sequence.
Motifs can, for example, be used to predict subcellular localization
of a protein (where in the cell the protein is sent after synthesis).
Short signal peptides direct certain proteins to a particular location such as the mitochondria, and various tools exist for predicting these signals in a protein sequence, for example SignalP, which has been updated several times as methods have improved.
Thus, aspects of a protein's function can be predicted without comparison to other full-length homologous protein sequences.
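As an illustration of motif-based prediction, a PROSITE-style pattern can be converted into a regular expression and scanned against a query sequence. The converter below handles only the simplest pattern syntax, and the query sequence and the use of the N-glycosylation pattern N-{P}-[ST]-{P} are illustrative assumptions:

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (e.g. 'N-{P}-[ST]-{P}') to a regex."""
    parts = pattern.split("-")
    out = []
    for p in parts:
        if p.startswith("{"):            # {P} means "any residue except P"
            out.append("[^" + p[1:-1] + "]")
        elif p.startswith("["):          # [ST] means "S or T"
            out.append(p)
        elif "(" in p:                   # e.g. x(2) means two arbitrary residues
            res, n = p.split("(")
            res = "." if res == "x" else res
            out.append(res + "{" + n.rstrip(")") + "}")
        else:
            out.append("." if p == "x" else p)
    return "".join(out)

# Illustrative query sequence and the classic N-glycosylation pattern.
query = "MKANLSTGQWLVNVSAPK"
regex = prosite_to_regex("N-{P}-[ST]-{P}")
for m in re.finditer(regex, query):
    print(f"motif at positions {m.start() + 1}-{m.end()}: {m.group()}")
```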
Structure-based methods
An alignment of the toxic proteins ricin and abrin. Structural alignments may be used to determine if two proteins have similar functions even when their sequences differ.
Because 3D protein structure
is generally better conserved than protein sequence, structural similarity is a good indicator of similar function in two or more proteins. Many programs have been developed to screen an unknown protein structure against the Protein Data Bank and report similar structures, for example FATCAT (Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists), CE (combinatorial extension), and DeepAlign (protein structure alignment beyond spatial proximity). Because many protein sequences have no solved structures, function prediction servers such as RaptorX have also been developed that first predict the 3D model of a sequence and then use structure-based methods to predict functions from the predicted model. In many cases, instead of the whole protein
structure, the 3D structure of a particular motif representing an active site or binding site can be targeted. The Structurally Aligned Local Sites of Activity (SALSA) method, developed by Mary Jo Ondrechen
and students, utilizes computed chemical properties of the individual
amino acids to identify local biochemically active sites. Databases such
as Catalytic Site Atlas have been developed that can be searched using novel protein sequences to predict specific functional sites.
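Most structural comparison ultimately reduces to superposing matched atoms and measuring the RMSD between them. The sketch below implements the Kabsch superposition step with NumPy, assuming the residue-to-residue correspondence is already known (finding that correspondence is the hard part solved by the alignment programs named above):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)            # center both coordinate sets
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                       # covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # correct for possible reflection
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                # optimal rotation matrix
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))

# Toy example: a fragment and a rotated copy should superpose with RMSD ~0.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
theta = 0.5
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
Q = P @ Rz.T
print(f"RMSD after superposition: {kabsch_rmsd(P, Q):.4f}")
```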
Genomic context-based methods
Many
of the newer methods for protein function prediction are not based on
comparison of sequence or structure as above, but on some type of
correlation between novel genes/proteins and those that already have
annotations. Also known as phylogenomic profiling, these genomic context
based methods are based on the observation that two or more proteins
with the same pattern of presence or absence in many different genomes
most likely have a functional link.
Whereas homology-based methods can often be used to identify molecular
functions of a protein, context-based approaches can be used to predict
cellular function, or the biological process in which a protein acts.
For example, proteins involved in the same signal transduction pathway
are likely to share a genomic context across all species.
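A phylogenetic profile is just a presence/absence vector across a set of genomes, so functional links can be suggested by profile similarity. A minimal sketch with made-up profiles, using the Jaccard index as the similarity measure (other measures, such as Hamming distance or mutual information, are equally common):

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) profiles across eight genomes.
profiles = {
    "ProteinA": [1, 1, 0, 1, 0, 1, 1, 0],
    "ProteinB": [1, 1, 0, 1, 0, 1, 1, 0],   # identical profile to A
    "ProteinC": [0, 1, 1, 0, 1, 0, 0, 1],
}

def jaccard(p, q):
    """Jaccard similarity of two binary presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return both / either if either else 0.0

for x, y in combinations(profiles, 2):
    sim = jaccard(profiles[x], profiles[y])
    print(f"{x} vs {y}: Jaccard = {sim:.2f}"
          + ("  <- candidate functional link" if sim > 0.8 else ""))
```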
Gene fusion
Gene fusion
occurs when two or more genes encode two or more proteins in one
organism and have, through evolution, combined to become a single gene
in another organism (or vice versa for gene fission).
This concept has been used, for example, to search all E. coli
protein sequences for homology in other genomes and find over 6000
pairs of sequences with shared homology to single proteins in another
genome, indicating potential interaction between each of the pairs.
Because the two sequences in each protein pair are non-homologous,
these interactions could not be predicted using homology-based methods.
Co-location/co-expression
In prokaryotes,
clusters of genes that are physically close together in the genome are often conserved together through evolution, and tend to encode proteins that interact or are part of the same operon. Thus chromosomal proximity, also called the gene neighbour method, can be used to predict functional similarity between proteins, at least
in prokaryotes. Chromosomal proximity has also been seen to apply for
some pathways in selected eukaryotic genomes, including Homo sapiens, and with further development gene neighbor methods may be valuable for studying protein interactions in eukaryotes.
Genes involved in similar functions are also often
co-transcribed, so that an unannotated protein can often be predicted to
have a related function to proteins with which it co-expresses. The guilt by association algorithms
developed based on this approach can be used to analyze large amounts
of sequence data and identify genes with expression patterns similar to
those of known genes. Often, a guilt by association study compares a group of candidate genes
(unknown function) to a target group (for example, a group of genes
known to be associated with a particular disease), and ranks the
candidate genes by their likelihood of belonging to the target group
based on the data.
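A toy version of such a ranking, with hypothetical expression profiles and Pearson correlation as the association measure (real studies use much larger matrices and more sophisticated scores), might look like this:

```python
import numpy as np

# Hypothetical expression profiles (one vector of 12 conditions per gene).
rng = np.random.default_rng(1)
base = rng.normal(size=12)
expression = {
    "disease_gene_1": base + rng.normal(scale=0.2, size=12),
    "disease_gene_2": base + rng.normal(scale=0.2, size=12),
    "candidate_X":    base + rng.normal(scale=0.3, size=12),   # co-expressed
    "candidate_Y":    rng.normal(size=12),                      # unrelated
}
target_group = ["disease_gene_1", "disease_gene_2"]
candidates = ["candidate_X", "candidate_Y"]

def mean_correlation(gene, group):
    """Average Pearson correlation of one gene with a group of target genes."""
    return float(np.mean([np.corrcoef(expression[gene], expression[t])[0, 1]
                          for t in group]))

# Rank candidate genes by their association with the target group.
ranked = sorted(candidates, key=lambda g: mean_correlation(g, target_group),
                reverse=True)
for g in ranked:
    print(f"{g}: mean r = {mean_correlation(g, target_group):.2f}")
```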
Based on recent studies, however, it has been suggested that some
problems exist with this type of analysis. For example, because many
proteins are multifunctional, the genes encoding them may belong to
several target groups. It is argued that such genes are more likely to
be identified in guilt by association studies, and thus predictions are
not specific.
With the accumulation of RNA-seq data that are capable of
estimating expression profiles for alternatively spliced isoforms,
machine learning algorithms have also been developed for predicting and
differentiating functions at the isoform level.
This represents an emerging research area in function prediction, which
integrates large-scale, heterogeneous genomic data to infer functions
at the isoform level.
Computational solvent mapping
Computational
solvent mapping of AMA1 protein using fragment-based computational
solvent mapping (FTMAP) by computationally scanning the surface of AMA1
with 16 probes (small organic molecules) and defining the locations
where the probes cluster (marked as colorful regions on the protein
surface)
One of the challenges involved in protein function prediction is
discovery of the active site. This is complicated by certain active sites not being formed – essentially not existing – until the protein undergoes conformational changes brought on by the binding of small
molecules. Most protein structures have been determined by X-ray crystallography which requires a purified protein crystal.
As a result, existing structural models are generally of a purified
protein and as such lack the conformational changes that are created
when the protein interacts with small molecules.
Computational solvent mapping utilizes probes (small organic
molecules) that are computationally 'moved' over the surface of the
protein searching for sites where they tend to cluster. Multiple
different probes are generally applied with the goal being to obtain a
large number of different protein-probe conformations. The generated
clusters are then ranked based on the cluster's average free energy.
After computationally mapping multiple probes, the site of the protein
where relatively large numbers of clusters form typically corresponds to
an active site on the protein.
This technique is a computational adaptation of 'wet lab' work
from 1996. It was discovered that ascertaining the structure of a
protein while it is suspended in different solvents and then
superimposing those structures on one another produces data where the
organic solvent molecules (that the proteins were suspended in)
typically cluster at the protein's active site. This work was carried
out as a response to realizing that water molecules are visible in the
electron density maps produced by X-ray crystallography.
The water molecules are interacting with the protein and tend to
cluster at the protein's polar regions. This led to the idea of
immersing the purified protein crystal in other solvents (e.g. ethanol, isopropanol,
etc.) to determine where these molecules cluster on the protein. The
solvents can be chosen based on what they approximate, that is, what
molecule this protein may interact with (e.g. ethanol can probe for interactions with the amino acid serine, isopropanol for threonine, etc.). It is vital that the protein crystal maintains its tertiary structure
in each solvent. This process is repeated for multiple solvents and
then this data can be used to try to determine potential active sites on
the protein. Ten years later this technique was developed into an algorithm by Clodfelter et al.
Network-based methods
An example protein interaction network, produced through the STRING web resource. Patterns of protein interactions within networks are used to infer function. Here, products of the bacterial trp genes coding for tryptophan synthase are shown to interact with themselves and other, related proteins.
Guilt
by association type algorithms may be used to produce a functional
association network for a given target group of genes or proteins. These
networks serve as a representation of the evidence for shared/similar
function within a group of genes, where nodes represent genes/proteins and are linked to each other by edges representing evidence of shared function.
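A minimal neighbour-voting sketch over such a network, using networkx and entirely hypothetical genes, annotations, and edge weights, scores an unannotated node by the weighted fraction of its annotated neighbours that carry a given function:

```python
import networkx as nx

# Toy functional association network; edge weights represent evidence strength.
G = nx.Graph()
G.add_weighted_edges_from([
    ("trpA", "trpB", 0.9), ("trpB", "trpC", 0.8),
    ("trpA", "unknown1", 0.7), ("trpC", "unknown1", 0.6),
    ("unknown1", "metA", 0.2),
])
# Known annotations (hypothetical): 1 = involved in tryptophan biosynthesis.
labels = {"trpA": 1, "trpB": 1, "trpC": 1, "metA": 0}

def neighbor_vote(graph, node, labels):
    """Weighted fraction of annotated neighbours that carry the label."""
    num = den = 0.0
    for nbr in graph.neighbors(node):
        if nbr in labels:
            w = graph[node][nbr]["weight"]
            num += w * labels[nbr]
            den += w
    return num / den if den else 0.0

score = neighbor_vote(G, "unknown1", labels)
print(f"unknown1 tryptophan-biosynthesis score: {score:.2f}")
```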
Integrated networks
Several
networks based on different data sources can be combined into a
composite network, which can then be used by a prediction algorithm to
annotate candidate genes or proteins. For example, the developers of the bioPIXIE system used a wide variety of Saccharomyces cerevisiae (yeast) genomic data to produce a composite functional network for that species.
This resource allows the visualization of known networks representing
biological processes, as well as the prediction of novel components of
those networks. Many algorithms have been developed to predict function
based on the integration of several data sources (e.g. genomic,
proteomic, protein interaction, etc.), and testing on previously
annotated genes indicates a high level of accuracy.
Disadvantages of some function prediction algorithms have included a
lack of accessibility, and the time required for analysis. Faster, more
accurate algorithms such as GeneMANIA (multiple association network integration algorithm) have however been developed in recent years and are publicly available on the web, indicating the future direction of function prediction.
Tools and databases for protein function prediction
STRING: web tool that integrates various data sources for function prediction.
VisANT: Visual analysis of networks and integrative visual data-mining.
Gene prediction
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
In its earliest days, "gene finding" was based on painstaking
experimentation on living cells and organisms. Statistical analysis of
the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map
specifying the rough location of known genes relative to each other.
Today, with comprehensive genome sequence and powerful computational
resources at the disposal of the research community, gene finding has
been redefined as a largely computational problem.
Determining that a sequence is functional should be distinguished from determining the function
of the gene or its product. Predicting the function of a gene and
confirming that the gene prediction is accurate still demands in vivo experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.
Gene prediction is one of the key steps in genome annotation, following sequence assembly, the filtering of non-coding regions and repeat masking.
In
empirical (similarity, homology or evidence-based) gene finding
systems, the target genome is searched for sequences that are similar to
extrinsic evidence in the form of the known expressed sequence tags, messenger RNA (mRNA), protein
products, and homologous or orthologous sequences. Given an mRNA
sequence, it is trivial to derive a unique genomic DNA sequence from
which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code.
Once candidate DNA sequences have been determined, it is a relatively
straightforward algorithmic problem to efficiently search a target
genome for matches, complete or partial, and exact or inexact. Given a sequence, local alignment algorithms such as BLAST, FASTA and Smith-Waterman look for regions of similarity between the target sequence and possible candidate matches. The success of this approach is limited by the contents and accuracy of the sequence database.
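As a small illustration of the reverse-translation idea, a peptide can be expanded into a degenerate DNA pattern and matched against a target sequence. The codon table below is deliberately partial and the "genome" is a toy string; real searches use BLAST-style alignment rather than exact regular-expression matching:

```python
import re

# Partial reverse codon table (standard genetic code); only the amino acids
# used in the example peptide are included.
REVERSE_CODONS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}

def peptide_to_dna_regex(peptide):
    """Build a regex matching every coding DNA sequence for the peptide."""
    parts = []
    for aa in peptide:
        codons = REVERSE_CODONS[aa]
        parts.append(codons[0] if len(codons) == 1 else "(?:" + "|".join(codons) + ")")
    return "".join(parts)

# Toy target "genome" containing one coding match for the peptide MWKF.
genome = "CCGTATGTGGAAGTTTACCGA"
pattern = peptide_to_dna_regex("MWKF")
for m in re.finditer(pattern, genome):
    print(f"possible coding match at {m.start()}: {m.group()}")
```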
A high degree of similarity to a known messenger RNA or protein
product is strong evidence that a region of a target genome is a
protein-coding gene. However, to apply this approach systematically
requires extensive sequencing of mRNA and protein products. Not only is
this expensive, but in complex organisms, only a subset of all genes in
the organism's genome are expressed at any given time, meaning that
extrinsic evidence for many genes is not readily accessible in any
single cell culture. Thus, to collect extrinsic evidence for most or all
of the genes in a complex organism requires the study of many hundreds
or thousands of cell types,
which presents further difficulties. For example, some human genes may
be expressed only during development as an embryo or fetus, which might
be difficult to study for ethical reasons.
Despite these difficulties, extensive transcript and protein
sequence databases have been generated for human as well as other
important model organisms in biology, such as mice and yeast. For
example, the RefSeq database contains transcript and protein sequence from many different species, and the Ensembl
system comprehensively maps this evidence to human and several other
genomes. It is, however, likely that these databases are both incomplete
and contain small but significant amounts of erroneous data.
New high-throughput transcriptome sequencing technologies such as RNA-Seq and ChIP-sequencing
open opportunities for incorporating additional extrinsic evidence into
gene prediction and validation, and allow a structurally rich and more accurate alternative to previous methods of measuring gene expression such as expressed sequence tags or DNA microarrays.
Major challenges involved in gene prediction involve dealing with
sequencing errors in raw DNA data, dependence on the quality of the sequence assembly, handling short reads, frameshift mutations, overlapping genes and incomplete genes.
In prokaryotes it is essential to consider horizontal gene transfer
when searching for gene sequence homology. An additional important
factor underused in current gene detection tools is the existence of gene
clusters—operons
in both prokaryotes and eukaryotes. Most popular gene detectors treat
each gene in isolation, independent of others, which is not biologically
accurate.
Ab initio methods
Ab initio gene prediction is an intrinsic method based on gene content and signal detection. Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which the genomic DNA sequence
alone is systematically searched for certain tell-tale signs of
protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of the protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.
This picture shows how open reading frames (ORFs) can be used for gene prediction. Gene prediction is the process of determining where a coding gene might be in a genomic sequence. Functional proteins must begin with a start codon (where translation begins) and end with a stop codon (where translation ends). By looking at where those codons might fall in a DNA sequence, one can see where a functional protein might be located. This is important in gene prediction because it can reveal where coding genes are in an entire genomic sequence. In this example, a functional protein can be discovered using ORF3 because it begins with a start codon, encodes multiple amino acids, and then ends with a stop codon, all within the same reading frame.
In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons
are such that even finding an open reading frame of this length is a
fairly informative sign. (Since 3 of the 64 possible codons in the
genetic code are stop codons, one would expect a stop codon
approximately every 20–25 codons, or 60–75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities
and other statistical properties that are easy to detect in sequence of
this length. These characteristics make prokaryotic gene finding
relatively straightforward, and well-designed systems are able to
achieve high levels of accuracy.
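A bare-bones ORF scanner illustrates the content signal described above; it only looks at the three forward reading frames, uses an assumed minimum length of 30 codons, and ignores everything a real prokaryotic gene finder would add (reverse strand, ribosome binding sites, codon usage statistics):

```python
STOP = {"TAA", "TAG", "TGA"}
START = "ATG"

def find_orfs(seq, min_codons=30):
    """Yield (start, end, frame) of ORFs on the forward strand only.

    An ORF here runs from an ATG to the first in-frame stop codon; the
    min_codons cutoff reflects that long ORFs are unlikely by chance
    (with 3 of 64 codons being stops, a random stop is expected roughly
    every 21 codons).
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOP:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3, frame
                start = None

# Toy sequence; a realistic genome would be read from a FASTA file.
toy = "ATG" + "GCT" * 35 + "TAA" + "CCCTTAGG"
for start, end, frame in find_orfs(toy):
    print(f"ORF in frame {frame}: {start}-{end} ({(end - start) // 3} codons)")
```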
Ab initio gene finding in eukaryotes,
especially complex organisms like humans, is considerably more
challenging for several reasons. First, the promoter and other
regulatory signals in these genomes are more complex and less
well-understood than in prokaryotes, making them more difficult to
reliably recognize. Two classic examples of signals identified by
eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail.
Second, splicing
mechanisms employed by eukaryotic cells mean that a particular
protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns).
(Splice sites are themselves another signal that eukaryotic gene
finders are often designed to identify.) A typical protein-coding gene
in humans might be divided into a dozen exons, each less than two
hundred base pairs in length, and some as short as twenty to thirty. It
is therefore much more difficult to detect periodicities and other known
content properties of protein-coding DNA in eukaryotes.
Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models (HMMs) to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid
programs. The SNAP gene finder is HMM-based like Genscan, and attempts
to be more adaptable to different organisms, addressing problems related
to using a gene finder on a genome sequence that it was not trained
against. A few recent approaches like mSplicer, CONTRAST, or mGene also use machine learning techniques like support vector machines for successful gene prediction. They build a discriminative model using hidden Markov support vector machines or conditional random fields to learn an accurate gene prediction scoring function.
Ab initio methods have been benchmarked, with some approaching 100% sensitivity; however, as sensitivity increases, accuracy suffers as a result of increased false positives.
It has been suggested that signals other than those directly
detectable in sequences may improve gene prediction. For example, the
role of secondary structure in the identification of regulatory motifs has been reported. In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.
Neural networks
Artificial neural networks are computational models that excel at machine learning and pattern recognition. Neural networks must be trained
with example data before being able to generalise for experimental
data, and tested against benchmark data. Neural networks are able to
come up with approximate solutions to problems that are hard to solve
by algorithms, provided there is sufficient training data. When
applied to gene prediction, neural networks can be used alongside other ab initio methods to predict or identify biological features such as splice sites. One approach
involves using a sliding window, which traverses the sequence data in
an overlapping manner. The output at each position is a score based on
whether the network thinks the window contains a donor splice site or an
acceptor splice site. Larger windows offer more accuracy but also
require more computational power. A neural network is an example of a
signal sensor as its goal is to identify a functional site in the
genome.
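A minimal sketch of the sliding-window idea, using one-hot encoded windows and scikit-learn's MLPClassifier trained on synthetic examples in which donor sites are marked only by the canonical GT dinucleotide (real predictors are trained on large curated splice-site datasets):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

BASES = "ACGT"
WIN = 8

def one_hot(window):
    """Flatten a DNA window into a one-hot feature vector."""
    vec = np.zeros(len(window) * 4)
    for i, b in enumerate(window):
        vec[i * 4 + BASES.index(b)] = 1.0
    return vec

# Made-up training windows centred on true/false donor sites.
rng = np.random.default_rng(0)
def random_window(donor):
    w = [rng.choice(list(BASES)) for _ in range(WIN)]
    if donor:
        w[3], w[4] = "G", "T"        # canonical GT dinucleotide at the centre
    return "".join(w)

X = np.array([one_hot(random_window(d)) for d in [True, False] * 200])
y = np.array([1, 0] * 200)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0).fit(X, y)

# Slide the window along a new sequence and score every position.
seq = "CCAAGGTAAGTCCATGCAAGGTCAT"
for i in range(len(seq) - WIN + 1):
    window = seq[i:i + WIN]
    p = clf.predict_proba([one_hot(window)])[0, 1]
    if p > 0.8:
        print(f"possible donor splice site near position {i + 4}: {window} ({p:.2f})")
```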
Combined approaches
Programs such as Maker combine extrinsic and ab initio approaches by mapping protein and EST data to the genome to validate ab initio predictions. Augustus,
which may be used as part of the Maker pipeline, can also incorporate
hints in the form of EST alignments or protein profiles to increase the
accuracy of the gene prediction.
Comparative genomics approaches
As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach.
This is based on the principle that the forces of natural selection
cause genes and other functional elements to undergo mutation at a
slower rate than the rest of the genome, since mutations in functional
elements are more likely to negatively impact the organism than
mutations elsewhere. Genes can thus be detected by comparing the
genomes of related species to detect this evolutionary pressure for
conservation. This approach was first applied to the mouse and human
genomes, using programs such as SLAM, SGP and TWINSCAN/N-SCAN and
CONTRAST.
Multiple informants
TWINSCAN
examined only human-mouse synteny to look for orthologous genes.
Programs such as N-SCAN and CONTRAST allowed the incorporation of
alignments from multiple organisms, or in the case of N-SCAN, a single
alternate organism from the target. The use of multiple informants can
lead to significant improvements in accuracy.
CONTRAST is composed of two elements. The first is a smaller
classifier, identifying donor splice sites and acceptor splice sites as
well as start and stop codons. The second element involves constructing a
full model using machine learning. Breaking the problem into two means
that smaller targeted data sets can be used to train the classifiers,
and the classifiers can operate independently and be trained with
smaller windows. The full model can use the independent classifier, and
not have to waste computational time or model complexity re-classifying
intron-exon boundaries. The paper in which CONTRAST is introduced proposes that their method (and those of TWINSCAN, etc.) be classified as de novo gene assembly, using alternate genomes, and identifies it as distinct from ab initio methods, which use only the target genome.
Comparative gene finding can also be used to project high quality
annotations from one genome to another. Notable examples include
Projector, GeneWise, GeneMapper and GeMoMa. Such techniques now play a
central role in the annotation of all genomes.
Pseudogene prediction
Pseudogenes are close relatives of genes, sharing very high sequence homology, but being unable to code for the same protein product. Once dismissed as byproducts of gene sequencing, they are increasingly becoming prediction targets in their own right as regulatory roles are uncovered.
Pseudogene prediction utilizes existing sequence similarity and ab
initio methods, whilst adding additional filtering and methods of
identifying pseudogene characteristics.
Sequence similarity methods can be customized for pseudogene
prediction using additional filtering to find candidate pseudogenes.
This could use disablement detection, which looks for nonsense or
frameshift mutations that would truncate or collapse an otherwise
functional coding sequence. Additionally, translating DNA into protein sequences can be more effective than just straight DNA homology.
Content sensors can be filtered according to the differences in
statistical properties between pseudogenes and genes, such as a reduced
count of CpG islands in pseudogenes, or the differences in G-C content
between pseudogenes and their neighbours. Signal sensors also can be
honed to pseudogenes, looking for the absence of introns or polyadenine
tails.
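A simple disablement check of the kind described above might translate a candidate region in the reading frame of its parent gene and flag premature stop codons or a length that is no longer a multiple of three; the sequences below are hypothetical:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disablements(candidate_cds):
    """Report simple pseudogene-like disablements in a candidate coding sequence."""
    problems = []
    if len(candidate_cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    codons = [candidate_cds[i:i + 3] for i in range(0, len(candidate_cds) - 2, 3)]
    for idx, codon in enumerate(codons[:-1]):        # ignore the terminal stop
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {idx + 1}")
    return problems

# Hypothetical parent gene and a degraded copy with an internal stop codon.
parent = "ATGGCTGAAACTGGCTAA"
pseudo = "ATGGCTTAAACTGGCTAA"      # GAA -> TAA introduces a premature stop
print("parent:", disablements(parent) or "no disablements found")
print("pseudo:", disablements(pseudo))
```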
Metagenomic gene prediction
Metagenomics
is the study of genetic material recovered from the environment,
resulting in sequence information from a pool of organisms. Predicting
genes is useful for comparative metagenomics.
Metagenomics tools also fall into the basic categories of using either sequence similarity approaches (MEGAN4) or ab initio techniques (GLIMMER-MG).
Glimmer-MG is an extension to GLIMMER that relies mostly on an ab initio approach for gene finding, using training sets from related organisms. The prediction strategy is augmented by classification and clustering of gene data sets prior to applying ab initio gene prediction methods. The data is clustered by species. This classification method leverages techniques from metagenomic phylogenetic classification. An example of software for this purpose is Phymm, which uses interpolated Markov models, and PhymmBL, which integrates BLAST into the classification routines.
MEGAN4
uses a sequence similarity approach, using local alignment against
databases of known sequences, but also attempts to classify using
additional information on functional roles, biological pathways and
enzymes. As in single organism gene prediction, sequence similarity
approaches are limited by the size of the database.
FragGeneScan and MetaGeneAnnotator are popular gene prediction programs based on hidden Markov models. These predictors account for sequencing errors and partial genes, and work for short reads.
Another fast and accurate tool for gene prediction in metagenomes is MetaGeneMark. This tool is used by the DOE Joint Genome Institute to annotate IMG/M, the largest metagenome collection to date.
Protein structure prediction
Constituent amino acids can be analyzed to predict secondary, tertiary and quaternary protein structure.
Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its folding and its secondary and tertiary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes). Every two years, the performance of current methods is assessed in the CASP
experiment (Critical Assessment of Techniques for Protein Structure
Prediction). A continuous evaluation of protein structure prediction web
servers is performed by the community project CAMEO3D.
Protein structure and terminology
Proteins
are chains of amino acids joined together by peptide bonds. Many
conformations of this chain are possible due to the rotation of the
chain about each Cα atom. It is these conformational changes that are
responsible for differences in the three dimensional structure of
proteins. Each amino acid in the chain is polar, i.e. it has separated positively and negatively charged regions with a free carbonyl group, which can act as a hydrogen bond acceptor, and an NH group, which can act as a hydrogen bond donor. These groups can therefore interact in the
protein structure. The 20 amino acids can be classified according to
the chemistry of the side chain which also plays an important structural
role. Glycine
takes on a special position, as it has the smallest side chain, only
one hydrogen atom, and therefore can increase the local flexibility in
the protein structure. Cysteine on the other hand can react with another cysteine residue and thereby form a cross link stabilizing the whole structure.
The protein structure can be considered as a sequence of
secondary structure elements, such as α helices and β sheets, which
together constitute the overall three-dimensional configuration of the
protein chain. In these secondary structures regular patterns of H bonds
are formed between neighboring amino acids, and the amino acids have
similar Φ and Ψ angles.
Bond angles for ψ and ω
The formation of these structures neutralizes the polar groups on
each amino acid. The secondary structures are tightly packed in the
protein core in a hydrophobic environment. Each amino acid side group
has a limited volume to occupy and a limited number of possible
interactions with other nearby side chains, a situation that must be
taken into account in molecular modeling and alignments.
α Helix
The α helix is the most abundant type of secondary structure in
proteins. The α helix has 3.6 amino acids per turn with an H bond formed
between every fourth residue; the average length is 10 amino acids (3
turns) or 10 Å but varies from 5 to 40 (1.5 to 11 turns). The alignment
of the H bonds creates a dipole moment for the helix with a resulting
partial positive charge at the amino end of the helix. Because this
region has free NH2 groups, it will interact with
negatively charged groups such as phosphates. The most common location
of α helices is at the surface of protein cores, where they provide an
interface with the aqueous environment. The inner-facing side of the
helix tends to have hydrophobic amino acids and the outer-facing side
hydrophilic amino acids. Thus, every third or fourth amino acid along the
chain will tend to be hydrophobic, a pattern that can be quite readily
detected. In the leucine zipper motif, a repeating pattern of leucines
on the facing sides of two adjacent helices is highly predictive of the
motif. A helical-wheel plot can be used to show this repeated pattern.
Other α helices buried in the protein core or in cellular
membranes have a higher and more regular distribution of
hydrophobic amino acids, and are highly predictive of such structures.
Helices exposed on the surface have a lower proportion of hydrophobic
amino acids. Amino acid content can be predictive of an α -helical
region. Regions richer in alanine (A), glutamic acid (E), leucine (L), and methionine (M) and poorer in proline (P), glycine (G), tyrosine (Y), and serine
(S) tend to form an α helix. Proline destabilizes or breaks an α
helix but can be present in longer helices, forming a bend.
An alpha-helix with hydrogen bonds (yellow dots)
β sheet
β sheets are formed by H bonds between an average of 5–10 consecutive
amino acids in one portion of the chain with another 5–10 farther down
the chain. The interacting regions may be adjacent, with a short loop in
between, or far apart, with other structures in between. Every chain
may run in the same direction to form a parallel sheet, every other chain may run in the reverse chemical direction to form an antiparallel sheet, or the chains may be parallel and antiparallel to form a mixed sheet. The pattern of H bonding is different in the parallel and antiparallel configurations. Each amino acid in the interior strands of the
sheet forms two H bonds with neighboring amino acids, whereas each amino
acid on the outside strands forms only one bond with an interior
strand. Looking across the sheet at right angles to the strands, more
distant strands are rotated slightly counterclockwise to form a
left-handed twist. The Cα atoms alternate above and below the sheet in a
pleated structure, and the R side groups of the amino acids alternate
above and below the pleats. The Φ and Ψ angles of the amino acids in
sheets vary considerably in one region of the Ramachandran plot.
It is more difficult to predict the location of β sheets than of α
helices. The situation improves somewhat when the amino acid variation
in multiple sequence alignments is taken into account.
Loop
Loops are
regions of a protein chain that are 1) between α helices and β
sheets, 2) of various lengths and three-dimensional configurations, and
3) on the surface of the structure.
Hairpin loops that represent a complete turn in the polypeptide
chain joining two antiparallel β strands may be as short as two amino
acids in length. Loops interact with the surrounding aqueous environment
and other proteins. Because amino acids in loops are not constrained by
space and environment as are amino acids in the core region, and do not
have an effect on the arrangement of secondary structures in the core,
more substitutions, insertions, and deletions may occur. Thus, in a
sequence alignment, the presence of these features may be an indication
of a loop. The positions of introns in genomic DNA sometimes correspond to the locations of loops in the encoded protein.
Loops also tend to have charged and polar amino acids and are
frequently a component of active sites. A detailed examination of loop
structures has shown that they fall into distinct families.
Coils
A region of secondary structure that is not an α helix, a β sheet, or a recognizable turn is commonly referred to as a coil.
Protein classification
Proteins
may be classified according to both structural and sequence similarity.
For structural classification, the sizes and spatial arrangements of
secondary structures described in the above paragraph are compared in
known three-dimensional structures. Classification based on sequence
similarity was historically the first to be used. Initially, similarity
based on alignments of whole sequences was performed. Later, proteins
were classified on the basis of the occurrence of conserved amino acid
patterns. Databases
that classify proteins by one or more of these schemes are available.
In considering protein classification schemes, it is important to keep
several observations in mind. First, two entirely different protein
sequences from different evolutionary origins may fold into a similar
structure. Conversely, the sequence of an ancient gene for a given
structure may have diverged considerably in different species while at
the same time maintaining the same basic structural features.
Recognizing any remaining sequence similarity in such cases may be a
very difficult task. Second, two proteins that share a significant
degree of sequence similarity either with each other or with a third
sequence also share an evolutionary origin and should share some
structural features also. However, gene duplication and genetic
rearrangements during evolution may give rise to new gene copies, which
can then evolve into proteins with new function and structure.
Terms used for classifying protein structures and sequences
The
more commonly used terms for evolutionary and structural relationships
among proteins are listed below. Many additional terms are used for
various kinds of structural features found in proteins. Descriptions of
such terms may be found at the CATH Web site, the Structural Classification of Proteins (SCOP) Web site, and a Glaxo-Wellcome tutorial on the Swiss bioinformatics Expasy Web site.
Active site
is a localized combination of amino acid side groups within the
tertiary (three-dimensional) or quaternary (protein subunit) structure
that can interact with a chemically specific substrate and that provides
the protein with biological activity. Proteins of very different amino
acid sequences may fold into a structure that produces the same active
site.
Architecture is the relative orientations of secondary structures
in a three-dimensional structure without regard to whether or not they
share a similar loop structure.
Fold is a type of architecture that also has a conserved loop structure.
Blocks is a conserved amino acid sequence pattern in a family of
proteins. The pattern includes a series of possible matches at each
position in the represented sequences, but there are not any inserted or
deleted positions in the pattern or in the sequences. By way of
contrast, sequence profiles are a type of scoring matrix that represents
a similar set of patterns that includes insertions and deletions.
Class is a term used to classify protein domains according to their secondary structural content and organization. Four classes
were originally recognized by Levitt and Chothia (1976), and several
others have been added in the SCOP database. Three classes are given in
the CATH database: mainly-α, mainly-β, and α–β, with the α–β class
including both alternating α/β and α+β structures.
Core is the portion of a folded protein molecule that comprises
the hydrophobic interior of α-helices and β-sheets. The compact
structure brings together side groups of amino acids into close enough
proximity so that they can interact. When comparing protein structures,
as in the SCOP database, core is the region common to most of the
structures that share a common fold or that are in the same superfamily.
In structure prediction, core is sometimes defined as the arrangement
of secondary structures that is likely to be conserved during
evolutionary change.
Domain
(sequence context) is a segment of a polypeptide chain that can fold
into a three-dimensional structure irrespective of the presence of other
segments of the chain. The separate domains of a given protein may
interact extensively or may be joined only by a length of polypeptide
chain. A protein with several domains may use these domains for
functional interactions with different molecules.
Family
(sequence context) is a group of proteins of similar biochemical
function that are more than 50% identical when aligned. This same cutoff
is still used by the Protein Information Resource
(PIR). A protein family comprises proteins with the same function in
different organisms (orthologous sequences) but may also include
proteins in the same organism (paralogous sequences) derived from gene
duplication and rearrangements. If a multiple sequence alignment of a
protein family reveals a common level of similarity throughout the
lengths of the proteins, PIR refers to the family as a homeomorphic
family. The aligned region is referred to as a homeomorphic domain, and
this region may comprise several smaller homology domains that are
shared with other families. Families may be further subdivided into
subfamilies or grouped into superfamilies based on respective higher or
lower levels of sequence similarity. The SCOP database reports 1296
families and the CATH database (version 1.7 beta), reports 1846
families.
When the sequences of proteins with the same function are
examined in greater detail, some are found to share high sequence
similarity. They are obviously members of the same family by the above
criteria. However, others are found that have very little, or even
insignificant, sequence similarity with other family members. In such
cases, the family relationship between two distant family members A and C
can often be demonstrated by finding an additional family member B that
shares significant similarity with both A and C. Thus, B provides a
connecting link between A and C. Another approach is to examine distant
alignments for highly conserved matches.
At a level of identity of 50%, proteins are likely to have the
same three-dimensional structure, and the identical atoms in the
sequence alignment will also superimpose within approximately 1 Å in the
structural model. Thus, if the structure of one member of a family is
known, a reliable prediction may be made for a second member of the
family, and the higher the identity level, the more reliable the
prediction. Protein structural modeling can be performed by examining
how well the amino acid substitutions fit into the core of the
three-dimensional structure.
Family (structural context), as used in the FSSP database (Families of structurally similar proteins) and the DALI/FSSP Web site, refers to two structures that have a significant level of structural similarity but not necessarily significant sequence similarity.
Fold is similar to a structural motif, but includes a larger combination of secondary structural units in the same configuration.
Thus, proteins sharing the same fold have the same combination of
secondary structures that are connected by similar loops. An example is
the Rossman fold comprising several alternating α helices and parallel
β strands. In the SCOP, CATH, and FSSP databases, the known protein
structures have been classified into hierarchical levels of structural
complexity with the fold as a basic level of classification.
Homologous domain (sequence context) is an extended sequence
pattern, generally found by sequence alignment methods, that indicates a
common evolutionary origin among the aligned sequences. A homology
domain is generally longer than motifs. The domain may include all of a
given protein sequence or only a portion of the sequence. Some domains
are complex and made up of several smaller homology domains that became
joined to form a larger one during evolution. A domain that covers an
entire sequence is called the homeomorphic domain by PIR (Protein Information Resource).
Module is a region of conserved amino acid patterns comprising
one or more motifs and considered to be a fundamental unit of structure
or function. The presence of a module has also been used to classify
proteins into families.
Motif (sequence context) is a conserved pattern of amino acids that is found in two or more proteins. In the Prosite
catalog, a motif is an amino acid pattern that is found in a group of
proteins that have a similar biochemical activity, and that often is
near the active site of the protein. Examples of sequence motif
databases are the Prosite catalog and the Stanford Motifs Database.
Motif (structural context) is a combination of several secondary
structural elements produced by the folding of adjacent sections of the
polypeptide chain into a specific three-dimensional configuration. An
example is the helix-loop-helix motif. Structural motifs are also
referred to as supersecondary structures and folds.
Position-specific scoring matrix (sequence context, also known as
weight or scoring matrix) represents a conserved region in a
multiple sequence alignment with no gaps. Each matrix column represents
the variation found in one column of the multiple sequence alignment.
Position-specific scoring matrix—3D (structural context)
represents the amino acid variation found in an alignment of proteins
that fall into the same structural class. Matrix columns represent the
amino acid variation found at one amino acid position in the aligned
structures.
Primary structure
is the linear amino acid sequence of a protein, which chemically is a
polypeptide chain composed of amino acids joined by peptide bonds.
Profile (sequence context) is a scoring matrix that represents a
multiple sequence alignment of a protein family. The profile is usually
obtained from a well-conserved region in a multiple sequence alignment.
The profile is in the form of a matrix with each column representing a
position in the alignment and each row one of the amino acids. Matrix
values give the likelihood of each amino acid at the corresponding
position in the alignment. The profile is moved along the target
sequence to locate the best scoring regions by a dynamic programming
algorithm. Gaps are allowed during matching and a gap penalty is
included in this case as a negative score when no amino acid is matched.
A sequence profile may also be represented by a hidden Markov model, referred to as a profile HMM.
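A gapless version of this scanning step is easy to sketch: the profile is slid along the target and each window is scored by summing the column scores. The tiny matrix below is invented for illustration and covers only four residues, whereas a real profile has 20 rows and, when gaps are allowed, is matched by dynamic programming:

```python
import math

# Tiny made-up profile for a 4-column motif; rows are amino acids, values are
# log-odds scores (a real profile would have 20 rows derived from an MSA).
profile = {
    "G": [ 1.5, -0.5, -1.0,  0.2],
    "S": [ 0.3,  1.2, -0.8, -0.5],
    "K": [-1.0,  0.4,  1.6, -0.7],
    "L": [-0.8, -1.0, -0.5,  1.4],
}
DEFAULT = -1.0   # score for residues not listed in this toy profile

def scan(sequence, profile, length=4):
    """Slide the gapless profile along the sequence and score each window."""
    best = (None, -math.inf)
    for i in range(len(sequence) - length + 1):
        score = sum(profile.get(sequence[i + j], [DEFAULT] * length)[j]
                    for j in range(length))
        if score > best[1]:
            best = (i, score)
    return best

pos, score = scan("MAGSKLVVGSKLT", profile)
print(f"best-scoring window starts at {pos} with score {score:.2f}")
```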
Profile (structural context) is a scoring matrix that represents
which amino acids should fit well and which should fit poorly at
sequential positions in a known protein structure. Profile columns
represent sequential positions in the structure, and profile rows
represent the 20 amino acids. As with a sequence profile, the structural
profile is moved along a target sequence to find the highest possible
alignment score by a dynamic programming algorithm. Gaps may be included
and receive a penalty. The resulting score provides an indication as to
whether or not the target protein might adopt such a structure.
Quaternary structure is the three-dimensional configuration of a protein molecule comprising several independent polypeptide chains.
Secondary structure
is the interactions that occur between the C, O, and NH groups on amino
acids in a polypeptide chain to form α-helices, β-sheets, turns, loops,
and other forms, and that facilitate the folding into a
three-dimensional structure.
Superfamily
is a group of protein families of the same or different lengths that
are related by distant yet detectable sequence similarity. Members of a
given superfamily
thus have a common evolutionary origin. Originally, Dayhoff defined the cutoff for superfamily status as a chance of 10⁻⁶ that the sequences are not related, on the basis of an alignment score (Dayhoff et al. 1978). Proteins with few identities in an alignment of the sequences
but with a convincingly common number of structural and functional
features are placed in the same superfamily. At the level of
three-dimensional structure, superfamily proteins will share common
structural features such as a common fold, but there may also be
differences in the number and arrangement of secondary structures. The
PIR resource uses the term homeomorphic superfamilies to refer to
superfamilies that are composed of sequences that can be aligned from
end to end, representing a sharing of single sequence homology domain, a
region of similarity that extends throughout the alignment. This domain
may also comprise smaller homology domains that are shared with other
protein families and superfamilies. Although a given protein sequence
may contain domains found in several superfamilies, thus indicating a
complex evolutionary history, sequences will be assigned to only one
homeomorphic superfamily based on the presence of similarity throughout
a multiple sequence alignment. The superfamily alignment may also
include regions that do not align either within or at the ends of the
alignment. In contrast, sequences in the same family align well
throughout the alignment.
Supersecondary structure
is a term with similar meaning to a structural motif.
Tertiary structure
is the three-dimensional or globular structure formed by the packing together or folding of secondary structures of a polypeptide chain.
Secondary structure
Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins based only on knowledge of their amino acid sequence. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm (or similar, e.g. STRIDE) applied to the crystal structure of the protein. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins.
The best modern methods of secondary structure prediction in proteins reach about 80% accuracy; this high accuracy allows the use of the predictions as a feature improving fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
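The comparison against DSSP mentioned above is usually summarized as a Q3 score, the fraction of residues whose predicted three-state assignment (helix, strand, or coil) matches the assignment derived from the structure. A minimal sketch with made-up strings:

```python
def q3(predicted, observed):
    """Fraction of residues with matching three-state (H/E/C) assignments."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must be the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Made-up example: a prediction versus a DSSP-derived three-state string.
dssp_states = "CCHHHHHHCCEEEEECCHHHC"
prediction  = "CCHHHHHCCCEEEECCCHHHC"
print(f"Q3 = {q3(prediction, dssp_states):.1%}")
```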
Background
Early methods of secondary structure prediction, introduced in the 1960s and early 1970s, focused on identifying likely alpha helices and were based mainly on helix-coil transition models.
Significantly more accurate predictions that included beta sheets were
introduced in the 1970s and relied on statistical assessments based on
probability parameters derived from known solved structures. These
methods, applied to a single sequence, are typically at most about
60-65% accurate, and often underpredict beta sheets. The evolutionary conservation of secondary structures can be exploited by simultaneously assessing many homologous sequences in a multiple sequence alignment,
by calculating the net secondary structure propensity of an aligned
column of amino acids. In concert with larger databases of known protein
structures and modern machine learning methods such as neural nets and support vector machines, these methods can achieve up to 80% overall accuracy in globular proteins. The theoretical upper limit of accuracy is around 90%,
partly due to idiosyncrasies in DSSP assignment near the ends of
secondary structures, where local conformations vary under native
conditions but may be forced to assume a single conformation in crystals
due to packing constraints. Limitations are also imposed by secondary
structure prediction's inability to account for tertiary structure;
for example, a sequence predicted as a likely helix may still be able
to adopt a beta-strand conformation if it is located within a beta-sheet
region of the protein and its side chains pack well with their
neighbors. Dramatic conformational changes related to the protein's
function or environment can also alter local secondary structure.
Historical perspective
To date, over 20 different secondary structure prediction methods have been developed. One of the first algorithms was the Chou-Fasman method,
which relies predominantly on probability parameters determined from
relative frequencies of each amino acid's appearance in each type of
secondary structure.
The original Chou-Fasman parameters, determined from the small sample
of structures solved in the mid-1970s, produce poor results compared to
modern methods, though the parameterization has been updated since it
was first published. The Chou-Fasman method is roughly 50-60% accurate
in predicting secondary structures.
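A Chou-Fasman-style calculation can be sketched as a sliding-window average of per-residue helix propensities, flagging windows above a threshold as possible helix nucleation sites; the propensity values, window size, and threshold below are illustrative rather than the published parameters:

```python
# Illustrative helix propensities in the spirit of Chou-Fasman (not the
# published parameter set); values > 1 favour helix, values < 1 disfavour it.
HELIX_PROPENSITY = {
    "A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45, "Q": 1.11, "K": 1.16,
    "F": 1.13, "I": 1.08, "W": 1.08, "V": 1.06, "D": 1.01, "H": 1.00,
    "R": 0.98, "T": 0.83, "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67,
    "P": 0.57, "G": 0.57,
}

def helix_windows(sequence, window=6, threshold=1.1):
    """Flag windows whose mean helix propensity exceeds the threshold."""
    hits = []
    for i in range(len(sequence) - window + 1):
        mean_p = sum(HELIX_PROPENSITY[aa] for aa in sequence[i:i + window]) / window
        if mean_p >= threshold:
            hits.append((i, sequence[i:i + window], round(mean_p, 2)))
    return hits

# Toy sequence: a helix-favouring stretch followed by a Gly/Pro-rich stretch.
for start, win, score in helix_windows("MAELKAEAAKGPGGSPGG"):
    print(f"potential helix nucleation at {start}: {win} (mean propensity {score})")
```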
The next notable method was the GOR method, named for the three scientists who developed it (Garnier, Osguthorpe, and Robson), an information theory-based method that uses the more powerful probabilistic technique of Bayesian inference.
The GOR method takes into account not only the probability of each
amino acid having a particular secondary structure, but also the conditional probability
of the amino acid assuming each structure given the contributions of
its neighbors (it does not assume that the neighbors have that same
structure). The approach is both more sensitive and more accurate than
that of Chou and Fasman because amino acid structural propensities are
only strong for a small number of amino acids such as proline and glycine.
Weak contributions from each of many neighbors can add up to strong
effects overall. The original GOR method was roughly 65% accurate and is
dramatically more successful in predicting alpha helices than beta
sheets, which it frequently mispredicted as loops or disorganized
regions.
Another big step forward was the use of machine learning methods, the first of which were artificial neural networks. These methods use solved structures as training sets to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their
predictions, although beta strands are still often underpredicted due to
the lack of three-dimensional structural information that would allow
assessment of hydrogen bonding patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet. PSIPRED and JPRED are some of the most known programs based on neural networks for protein secondary structure prediction. Next, support vector machines have proven particularly useful for predicting the locations of turns, which are difficult to identify with statistical methods.
Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as backbone dihedral angles in unassigned regions. Both SVMs and neural networks have been applied to this problem.
More recently, real-valued torsion angles can be predicted accurately by
SPINE-X and have been successfully employed for ab initio structure prediction.
Other improvements
Secondary structure formation is reported to depend on factors beyond
the protein sequence alone. For example, secondary structure tendencies
also depend on local environment, solvent accessibility of residues, protein structural class, and even the organism from which the protein is obtained.
Based on such observations, some studies have shown that secondary
structure prediction can be improved by adding information about
protein structural class, residue accessible surface area, and contact number.
Tertiary structure
The
practical role of protein structure prediction is now more important
than ever. Massive amounts of protein sequence data are produced by
modern large-scale DNA sequencing efforts such as the Human Genome Project. Despite community-wide efforts in structural genomics, the output of experimentally determined protein structures—typically by time-consuming and relatively expensive X-ray crystallography or NMR spectroscopy—is lagging far behind the output of protein sequences.
Protein structure prediction remains an extremely difficult
and unresolved undertaking. The two main problems are calculation of protein free energy and finding the global minimum of this energy. A protein structure prediction method must explore the space of possible protein structures which is astronomically large. These problems can be partially bypassed in "comparative" or homology modeling and fold recognition
methods, in which the search space is pruned by the assumption that the
protein in question adopts a structure that is close to the
experimentally determined structure of another homologous protein. On
the other hand, the de novo or ab initio protein structure prediction
methods must explicitly resolve these problems. The progress and
challenges in protein structure prediction have been reviewed in Zhang
2008.
Ab initio protein modelling
Energy- and fragment-based methods
Ab initio or de novo
protein modelling methods seek to build three-dimensional protein
models "from scratch", i.e., based on physical principles rather than
(directly) on previously solved structures. There are many possible
procedures that either attempt to mimic protein folding or apply some stochastic method to search possible solutions (i.e., global optimization
of a suitable energy function). These procedures tend to require vast
computational resources, and have thus only been carried out for tiny
proteins. To predict protein structure de novo for larger
proteins will require better algorithms and larger computational
resources like those afforded by either powerful supercomputers (such as
Blue Gene or MDGRAPE-3) or distributed computing (such as Folding@home, the Human Proteome Folding Project and Rosetta@Home).
Although these computational barriers are vast, the potential benefits
of structural genomics (by predicted or experimental methods) make ab initio structure prediction an active research field.
As of 2009, a 50-residue protein could be simulated atom-by-atom on a supercomputer for 1 millisecond.
As of 2012, comparable stable-state sampling could be done on a
standard desktop with a new graphics card and more sophisticated
algorithms. Much larger simulation timescales can be achieved using coarse-grained modeling.
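The stochastic-search idea behind energy-based ab initio methods can be illustrated with a toy Metropolis Monte Carlo walk over backbone torsion angles; the energy function below is a placeholder chosen only to make the example self-contained, not a real force field, and real methods use physics- or knowledge-based potentials and fragment moves.

```python
# Toy Metropolis Monte Carlo search over torsion angles, minimising a
# placeholder energy function (real methods use proper force fields).

import math
import random

def toy_energy(torsions):
    """Placeholder energy that favours angles near -60 degrees."""
    return sum((t + 60.0) ** 2 / 1000.0 for t in torsions)

def metropolis_search(n_residues=10, steps=5000, temperature=1.0, seed=0):
    rng = random.Random(seed)
    torsions = [rng.uniform(-180, 180) for _ in range(n_residues)]
    energy = toy_energy(torsions)
    for _ in range(steps):
        i = rng.randrange(n_residues)
        old = torsions[i]
        torsions[i] = max(-180.0, min(180.0, old + rng.gauss(0, 10)))
        new_energy = toy_energy(torsions)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy      # accept the move
        else:
            torsions[i] = old        # reject and revert
    return torsions, energy

angles, e = metropolis_search()
print(round(e, 3), [round(a) for a in angles])
```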
Evolutionary covariation to predict 3D contacts
As sequencing became more commonplace in the 1990s, several groups used protein sequence alignments to predict correlated mutations,
and it was hoped that these coevolved residues could be used to predict
tertiary structure (using the analogy to distance constraints from
experimental procedures such as NMR).
The assumption is that when single-residue mutations are slightly
deleterious, compensatory mutations may occur to restabilize
residue-residue interactions.
This early work used what are known as local methods to calculate
correlated mutations from protein sequences, but suffered from indirect
false correlations which result from treating each pair of residues as
independent of all other pairs.
In 2011, a different, global statistical
approach demonstrated that predicted coevolved residues were sufficient
to predict the 3D fold of a protein, provided there are enough
sequences available (>1,000 homologous sequences are needed). The method, EVfold,
uses no homology modeling, threading or 3D structure fragments and can
be run on a standard personal computer even for proteins with hundreds
of residues. The accuracy of the contacts predicted using this and
related approaches has now been demonstrated on many known structures
and contact maps, including the prediction of experimentally unsolved transmembrane proteins.
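A minimal version of the early "local" approach is the mutual information between two alignment columns, sketched below on a fabricated toy alignment; global methods such as EVfold go further and disentangle direct couplings from the indirect correlations that this simple score cannot separate.

```python
# Mutual information between two columns of a multiple sequence alignment,
# a simple "local" correlated-mutation score. The alignment is a toy example.

import math
from collections import Counter

def mutual_information(col_i, col_j):
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    # MI = sum over observed pairs of p(a,b) * log( p(a,b) / (p(a) p(b)) )
    return sum((c / n) * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

# Toy alignment: columns 2 and 5 co-vary (E pairs with K, D pairs with R).
alignment = ["AEGLK", "ADGLR", "AEALK", "TDGIR", "AEGMK"]
col = lambda k: [seq[k] for seq in alignment]
print(round(mutual_information(col(1), col(4)), 3))   # high: candidate contact
print(round(mutual_information(col(0), col(4)), 3))   # lower
```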
Comparative protein modeling
Comparative
protein modelling uses previously solved structures as starting points,
or templates. This is effective because it appears that although the
number of actual proteins is vast, there is a limited set of tertiary structural motifs
to which most proteins belong. It has been suggested that there are
only around 2,000 distinct protein folds in nature, though there are
many millions of different proteins.
These methods may also be split into two groups:
Homology modeling is based on the reasonable assumption that two homologous
proteins will share very similar structures. Because a protein's fold
is more evolutionarily conserved than its amino acid sequence, a target
sequence can be modeled with reasonable accuracy on a very distantly
related template, provided that the relationship between target and
template can be discerned through sequence alignment.
It has been suggested that the primary bottleneck in comparative
modelling arises from difficulties in alignment rather than from errors
in structure prediction given a known-good alignment. Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences.
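The coordinate-transfer step at the heart of homology modelling can be sketched as follows; the alignment and template coordinates are invented toy data, and real pipelines additionally rebuild insertions and loops, place side chains, and refine the model.

```python
# Transfer of template C-alpha coordinates to aligned target residues.
# Alignment and coordinates below are invented toy data.

def transfer_coordinates(target_aln, template_aln, template_ca):
    """template_ca holds one (x, y, z) tuple per non-gap template residue."""
    model = {}                                   # target index -> coords or None
    t_idx = tpl_idx = 0
    for a, b in zip(target_aln, template_aln):
        if a != "-" and b != "-":
            model[t_idx] = template_ca[tpl_idx]  # aligned: inherit coordinates
        elif a != "-":
            model[t_idx] = None                  # insertion: must be rebuilt
        if a != "-":
            t_idx += 1
        if b != "-":
            tpl_idx += 1
    return model

target_aln   = "MKV-LSE"
template_aln = "MKVALS-"
template_ca  = [(0.0, 0, 0), (3.8, 0, 0), (7.6, 0, 0),
                (11.4, 0, 0), (15.2, 0, 0), (19.0, 0, 0)]
print(transfer_coordinates(target_aln, template_aln, template_ca))
```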
Protein threading scans the amino acid sequence of an unknown structure against a
database of solved structures. In each case, a scoring function is used
to assess the compatibility of the sequence to the structure, thus
yielding possible three-dimensional models. This type of method is also
known as 3D-1D fold recognition due to its compatibility analysis
between three-dimensional structures and linear protein sequences. This
method has also given rise to methods performing an inverse folding search
by evaluating the compatibility of a given structure with a large
database of sequences, thus predicting which sequences have the
potential to produce a given fold.
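The 3D-1D compatibility scoring behind threading can be sketched with a toy example in which each template position is described only as buried or exposed; the compatibility scores, environments, and fold names are invented placeholders, and real threading potentials are far richer.

```python
# Toy 3D-1D fold recognition: score a query sequence against template
# environment strings. All scores and templates are invented placeholders.

HYDROPHOBIC = set("AVLIMFWC")

def compatibility(aa, env):
    """Hypothetical score: hydrophobic residues fit buried positions."""
    if env == "buried":
        return 1.0 if aa in HYDROPHOBIC else -1.0
    return 1.0 if aa not in HYDROPHOBIC else -0.5

def thread(sequence, template_envs):
    return sum(compatibility(aa, env) for aa, env in zip(sequence, template_envs))

templates = {
    "toy_fold_A": ["buried", "buried", "exposed", "buried", "exposed"],
    "toy_fold_B": ["exposed", "exposed", "exposed", "buried", "buried"],
}
query = "VLKIE"
scores = {name: thread(query, envs) for name, envs in templates.items()}
print(max(scores, key=scores.get), scores)   # best-scoring fold for the query
```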
Side-chain geometry prediction
Accurate packing of the amino acid side chains
represents a separate problem in protein structure prediction. Methods
that specifically address the problem of predicting side-chain geometry
include dead-end elimination and the self-consistent mean field
methods. Low-energy side chain conformations are usually
determined on a rigid polypeptide backbone using a set of discrete
side chain conformations known as "rotamers." The methods attempt to identify the set of rotamers that minimizes the model's overall energy.
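The classic dead-end elimination criterion can be sketched as below: a rotamer at a position is discarded when some alternative rotamer at the same position has a lower worst-case energy than the first rotamer's best-case energy. The energy tables are invented toy numbers, not a real force field.

```python
# Dead-end elimination sketch on precomputed self and pairwise rotamer energies.
# All energies below are invented toy values.

def dee_eliminate(self_e, pair_e, positions, rotamers):
    """self_e[(i, r)] and pair_e[(i, r, j, s)] are precomputed energies."""
    eliminated = set()
    for i in positions:
        for r in rotamers[i]:
            for t in rotamers[i]:
                if t == r:
                    continue
                best_r = self_e[(i, r)] + sum(
                    min(pair_e[(i, r, j, s)] for s in rotamers[j])
                    for j in positions if j != i)
                worst_t = self_e[(i, t)] + sum(
                    max(pair_e[(i, t, j, s)] for s in rotamers[j])
                    for j in positions if j != i)
                if best_r > worst_t:
                    eliminated.add((i, r))   # r can never appear in the optimum
                    break
    return eliminated

# Toy system: two positions with two rotamers each.
positions = [0, 1]
rotamers = {0: ["a", "b"], 1: ["a", "b"]}
self_e = {(0, "a"): 0.0, (0, "b"): 5.0, (1, "a"): 0.0, (1, "b"): 0.0}
pair_e = {(i, r, j, s): 1.0 for i in positions for r in rotamers[i]
          for j in positions if j != i for s in rotamers[j]}
print(dee_eliminate(self_e, pair_e, positions, rotamers))   # {(0, 'b')}
```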
These methods use rotamer libraries, which are collections of
favorable conformations for each residue type in proteins. Rotamer
libraries may contain information about the conformation, its frequency,
and the standard deviations about mean dihedral angles, which can be
used in sampling. Rotamer libraries are derived from structural bioinformatics
or other statistical analysis of side-chain conformations in known
experimental structures of proteins, such as by clustering the observed
conformations for tetrahedral carbons near the staggered (60°, 180°,
-60°) values.
Rotamer libraries can be backbone-independent,
secondary-structure-dependent, or backbone-dependent.
Backbone-independent rotamer libraries make no reference to backbone
conformation, and are calculated from all available side chains of a
certain type (for instance, the first example of a rotamer library, done
by Ponder and Richards at Yale in 1987). Secondary-structure-dependent libraries present different dihedral angles and/or rotamer frequencies for α-helix, β-sheet, or coil secondary structures.
Backbone-dependent rotamer libraries present conformations and/or
frequencies dependent on the local backbone conformation as defined by
the backbone dihedral angles φ and ψ, regardless of secondary structure.
The modern versions of these libraries as used in most software
are presented as multidimensional distributions of probability or
frequency, where the peaks correspond to the dihedral-angle
conformations considered as individual rotamers in the lists. Some
versions are based on very carefully curated data and are used primarily
for structure validation,
while others emphasize relative frequencies in much larger data sets
and are the form used primarily for structure prediction, such as the
Dunbrack rotamer libraries.
Side-chain packing methods are most useful for analyzing the protein's hydrophobic
core, where side chains are more closely packed; they have more
difficulty addressing the looser constraints and higher flexibility of
surface residues, which often occupy multiple rotamer conformations
rather than just one.
Prediction of structural classes
Statistical methods have been developed for predicting structural classes of proteins based on their amino acid composition, pseudo amino acid composition and functional domain composition.
Quaternary structure
In the case of complexes of two or more proteins, where the structures of the proteins are known or can be predicted with high accuracy, protein–protein docking
methods can be used to predict the structure of the complex.
Information on the effect of mutations at specific sites on the affinity
of the complex helps in understanding the complex structure and in guiding
docking methods.
Evaluation of automatic structure prediction servers
CASP,
which stands for Critical Assessment of Techniques for Protein Structure
Prediction, is a community-wide experiment for protein structure
prediction taking place every two years since 1994. CASP provides
an opportunity to assess the quality of available human, non-automated
methodology (human category) and automatic servers for protein structure
prediction (server category, introduced in CASP7).
The CAMEO3D
Continuous Automated Model EvaluatiOn Server evaluates automated
protein structure prediction servers on a weekly basis using blind
predictions for newly released protein structures. CAMEO publishes the
results on its website.