Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.
Generally, function can be thought of as "anything that happens to or through a protein". The Gene Ontology Consortium provides a useful classification of functions, based on a dictionary of well-defined terms divided into three main categories: molecular function, biological process and cellular component. Researchers can query this database with a protein name or accession number to retrieve associated Gene Ontology (GO) terms or annotations based on computational or experimental evidence.
While techniques such as microarray analysis, RNA interference, and the yeast two-hybrid system can be used to experimentally demonstrate the function of a protein, advances in sequencing technologies mean that new sequences now become available far faster than proteins can be experimentally characterized. Thus, new sequences are mostly annotated by computational prediction, which can often be done quickly and for many genes or proteins at once. The first such methods inferred function from homologous proteins with known functions (homology-based function prediction). The development of context-based and structure-based methods has expanded what information can be predicted, and a combination of methods can now be used to get a picture of complete cellular pathways based on sequence data. The importance and prevalence of computational prediction of gene function are underlined by an analysis of the 'evidence codes' used by the GO database: as of 2010, 98% of annotations were listed under the code IEA (inferred from electronic annotation), while only 0.6% were based on experimental evidence.
Function prediction methods
Homology-based methods
Proteins of similar sequence are usually homologous and thus have a similar function. Hence proteins in a newly sequenced genome are routinely annotated using the sequences of similar proteins in related genomes.
However, closely related proteins do not always share the same function. For example, the yeast Gal1 and Gal3 proteins are paralogs (73% identity and 92% similarity) that have evolved very different functions, with Gal1 being a galactokinase and Gal3 being a transcriptional inducer.
There is no hard sequence-similarity threshold for "safe"
function prediction; many proteins of barely detectable sequence
similarity have the same function while others (such as Gal1 and Gal3)
are highly similar but have evolved different functions. As a rule of
thumb, sequences that are more than 30-40% identical are usually
considered as having the same or a very similar function.
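The rule of thumb above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the two sequences are assumed to be already aligned (with `-` marking gaps), and identity is computed over gap-free columns, which is one of several conventions in use. The 30% and 40% cut-offs are taken from the rule of thumb and are not hard thresholds.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are ignored (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def transfer_confidence(identity: float, low: float = 30.0, high: float = 40.0) -> str:
    """Map percent identity onto the rough rule of thumb (not a hard threshold)."""
    if identity >= high:
        return "likely similar function"
    if identity >= low:
        return "borderline"
    return "no inference from identity alone"


identity = percent_identity("MKT-AYI", "MKTLAHI")
print(round(identity, 1), transfer_confidence(identity))
```

As the Gal1/Gal3 example shows, a high score from such a check is evidence for a shared function, not proof of it.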
For enzymes, predicting specific functions is especially difficult: because activity depends on only a few key residues in the active site, very different sequences can have very similar activities. Conversely, even at sequence identities of 70% or greater, about 10% of enzyme pairs have different substrates, and differences in the actual enzymatic reactions are not uncommon near 50% sequence identity.
Sequence motif-based methods
The development of protein domain databases such as Pfam (Protein Families Database) allows known domains to be found within a query sequence, providing evidence for likely functions. The dcGO website contains annotations for both individual domains and supra-domains (i.e., combinations of two or more successive domains), so that the dcGO Predictor can make function predictions in a more realistic manner. Within protein domains, shorter signatures known as 'motifs' are associated with particular functions, and motif databases such as PROSITE ('database of protein domains, families and functional sites') can be searched using a query sequence.
Motifs can, for example, be used to predict subcellular localization
of a protein (where in the cell the protein is sent after synthesis).
Short signal peptides direct certain proteins to a particular location
such as the mitochondria, and various tools exist for the prediction of
these signals in a protein sequence. One example is SignalP, which has been updated several times as methods have improved.
Thus, aspects of a protein's function can be predicted without comparison to other full-length homologous protein sequences.
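As a rough illustration of motif searching, the sketch below converts a PROSITE-style pattern into a regular expression and scans a sequence with it. The helper names are hypothetical and the converter covers only the common pattern elements (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors). The pattern used is the N-glycosylation site motif `N-{P}-[ST]-{P}`, and the query sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax (common elements only)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        counted = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if counted:                      # e.g. x(2,4) or A(3)
            element = counted.group(1)
            upper = "," + counted.group(3) if counted.group(3) else ""
            repeat = "{" + counted.group(2) + upper + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # single residue or [ABC] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(sequence: str, pattern: str) -> list:
    """Return (1-based position, matched text) for every hit, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(find_motif("MKNGSAANPTQ", "N-{P}-[ST]-{P}."))  # [(3, 'NGSA')]
```

A match of this kind only flags a candidate site; short motifs occur by chance, so hits are usually treated as hints rather than confirmed functions.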
Structure-based methods
Because 3D protein structure is generally better conserved than protein sequence, structural similarity is a good indicator of similar function in two or more proteins. Many programs have been developed to screen an unknown protein structure against the Protein Data Bank and report similar structures, for example FATCAT (Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists), CE (combinatorial extension) and DeepAlign (protein structure alignment beyond spatial proximity). Because many protein sequences have no solved structure, some function prediction servers such as RaptorX have also been developed that first predict a 3D model of a sequence and then apply structure-based methods to predict functions from the predicted model. In many cases, instead of the whole protein structure, the 3D structure of a particular motif representing an active site or binding site can be targeted. The Structurally Aligned Local Sites of Activity (SALSA) method, developed by Mary Jo Ondrechen and students, uses computed chemical properties of the individual amino acids to identify local biochemically active sites. Databases such as the Catalytic Site Atlas have been developed that can be searched using novel protein sequences to predict specific functional sites.
Genomic context-based methods
Many
of the newer methods for protein function prediction are not based on
comparison of sequence or structure as above, but on some type of
correlation between novel genes/proteins and those that already have
annotations. Also known as phylogenomic profiling, these genomic context-based
methods rest on the observation that two or more proteins
with the same pattern of presence or absence in many different genomes
most likely have a functional link.
Whereas homology-based methods can often be used to identify molecular
functions of a protein, context-based approaches can be used to predict
cellular function, or the biological process in which a protein acts.
For example, proteins involved in the same signal transduction pathway
are likely to share a genomic context across all species.
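The presence/absence comparison described above can be sketched as follows. Each protein is represented by a profile with one entry per genome (1 if a homolog is present, 0 if absent), and pairs with similar profiles are reported as possibly functionally linked. The function names, the toy profiles, the Jaccard similarity measure and the 0.8 cut-off are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(profile_a, profile_b) -> float:
    """Shared presences divided by genomes where at least one protein is present."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def linked_pairs(profiles: dict, threshold: float = 0.8) -> list:
    """Return (protein, protein, similarity) for pairs at or above the threshold."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        score = jaccard(prof_a, prof_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical profiles across six genomes (1 = homolog present, 0 = absent).
profiles = {
    "protA": (1, 1, 0, 1, 0, 1),
    "protB": (1, 1, 0, 1, 0, 1),
    "protC": (0, 1, 1, 0, 1, 0),
    "protD": (1, 1, 0, 1, 0, 0),
}
print(linked_pairs(profiles))
```

In practice, profiles span many more genomes, and the similarity measure is usually corrected for how closely related those genomes are.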
Gene fusion
Gene fusion occurs when two or more genes that encode separate proteins in one organism have, through evolution, combined to become a single gene in another organism (or vice versa for gene fission).
This concept has been used, for example, to search all E. coli
protein sequences for homology in other genomes and find over 6000
pairs of sequences with shared homology to single proteins in another
genome, indicating potential interaction between each of the pairs.
Because the two sequences in each protein pair are non-homologous,
these interactions could not be predicted using homology-based methods.
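A minimal sketch of this kind of search is shown below. It assumes that homology hits have already been computed by some search tool and are supplied as records giving the query protein, the matched protein in another genome, and the matched region on that protein. Two different queries that match non-overlapping regions of the same protein are reported as a candidate pair. The `Hit` record, the function name and the example data are hypothetical, and requiring non-overlapping regions is used here as a simple stand-in for checking that the two queries are not homologous to each other.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    query: str    # protein from the genome of interest
    subject: str  # matched protein in another genome
    s_start: int  # 1-based start of the matched region on the subject
    s_end: int    # inclusive end of the matched region on the subject


def fusion_candidates(hits, max_overlap: int = 0) -> list:
    """Pairs of queries that match separate regions of one subject protein."""
    by_subject = defaultdict(list)
    for hit in hits:
        by_subject[hit.subject].append(hit)

    candidates = set()
    for subject, group in by_subject.items():
        for h1, h2 in combinations(group, 2):
            if h1.query == h2.query:
                continue
            # Length of the shared region on the subject (<= 0 means disjoint).
            overlap = min(h1.s_end, h2.s_end) - max(h1.s_start, h2.s_start) + 1
            if overlap <= max_overlap:
                first, second = sorted((h1.query, h2.query))
                candidates.add((first, second, subject))
    return sorted(candidates)


hits = [
    Hit("geneA", "fusedX", 1, 200),
    Hit("geneB", "fusedX", 230, 480),
    Hit("geneC", "fusedX", 150, 300),
    Hit("geneA", "otherY", 1, 190),
]
print(fusion_candidates(hits))  # [('geneA', 'geneB', 'fusedX')]
```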
Co-location/co-expression
In prokaryotes, clusters of genes that are physically close together in the genome are often conserved together through evolution, and tend to encode proteins that interact or to form part of the same operon. Thus, chromosomal proximity, also called the gene neighbour method, can be used to predict functional similarity between proteins, at least in prokaryotes. Chromosomal proximity has also been seen to apply for some pathways in selected eukaryotic genomes, including Homo sapiens, and with further development gene neighbour methods may be valuable for studying protein interactions in eukaryotes.
Genes involved in similar functions are also often co-transcribed, so that an unannotated protein can often be predicted to have a function related to that of the proteins with which it is co-expressed. Guilt-by-association algorithms based on this approach can be used to analyze large amounts of sequence data and identify genes with expression patterns similar to those of known genes. Often, a guilt-by-association study compares a group of candidate genes (unknown function) to a target group (for example, a group of genes known to be associated with a particular disease), and ranks the candidate genes by their likelihood of belonging to the target group based on the data.
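The ranking step can be sketched as below. Each candidate gene is scored by its mean Pearson correlation with the expression profiles of the target group, and candidates are sorted by that score. This is a simplified illustration rather than any published algorithm: the function names and expression values are hypothetical, and mean correlation is only one possible scoring choice.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def rank_candidates(candidates: dict, targets: dict) -> list:
    """Sort candidates by mean correlation with the target group, best first."""
    scores = {
        name: sum(pearson(profile, t) for t in targets.values()) / len(targets)
        for name, profile in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression levels across four conditions.
targets = {"t1": [1, 2, 3, 4], "t2": [2, 4, 6, 9]}
candidates = {"c1": [10, 20, 30, 41], "c2": [4, 3, 2, 1], "c3": [5, 5, 5, 5]}
for name, score in rank_candidates(candidates, targets):
    print(name, round(score, 3))
```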
Recent studies, however, suggest that this type of analysis has some problems. For example, because many proteins are multifunctional, the genes encoding them may belong to several target groups. It is argued that such genes are more likely to be identified in guilt-by-association studies, and thus predictions are not specific.
With the accumulation of RNA-seq data, from which expression profiles of alternatively spliced isoforms can be estimated, machine learning algorithms have also been developed for predicting and differentiating functions at the isoform level. This emerging research area in function prediction integrates large-scale, heterogeneous genomic data.
Computational solvent mapping
One of the challenges involved in protein function prediction is discovery of the active site. This is complicated by the fact that certain active sites are not formed (in effect, do not exist) until the protein undergoes conformational changes brought on by the binding of small molecules. Most protein structures have been determined by X-ray crystallography, which requires a purified protein crystal. As a result, existing structural models are generally of a purified protein and so lack the conformational changes that occur when the protein interacts with small molecules.
Computational solvent mapping utilizes probes (small organic
molecules) that are computationally 'moved' over the surface of the
protein searching for sites where they tend to cluster. Multiple
different probes are generally applied with the goal being to obtain a
large number of different protein-probe conformations. The generated
clusters are then ranked based on the cluster's average free energy.
After computationally mapping multiple probes, the site of the protein
where relatively large numbers of clusters form typically corresponds to
an active site on the protein.
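The ranking and counting steps described above can be sketched as follows, assuming the probe docking itself has already been done and each cluster is summarized by its probe name, a centroid position and the free energies of its member poses. Lower free energy is treated as more favourable. The `Cluster` record, the greedy grouping, the 4.0 distance cut-off and the example values are all illustrative assumptions and are not taken from any specific mapping program.

```python
from math import dist
from statistics import mean
from typing import NamedTuple


class Cluster(NamedTuple):
    probe: str        # name of the probe molecule
    centroid: tuple   # (x, y, z) position of the cluster centre
    energies: list    # free energies of the poses in this cluster


def rank_clusters(clusters) -> list:
    """Order clusters by average free energy, most favourable (lowest) first."""
    return sorted(clusters, key=lambda c: mean(c.energies))


def consensus_sites(clusters, radius: float = 4.0) -> list:
    """Greedily group clusters whose centroids lie near a site's first cluster."""
    sites = []
    for cluster in rank_clusters(clusters):
        for site in sites:
            if dist(cluster.centroid, site[0].centroid) <= radius:
                site.append(cluster)
                break
        else:
            sites.append([cluster])  # no nearby site: start a new one
    # Sites holding the most clusters are the best active-site candidates.
    return sorted(sites, key=len, reverse=True)


clusters = [
    Cluster("ethanol", (0.0, 0.0, 0.0), [-5.0, -6.0]),
    Cluster("isopropanol", (1.0, 0.0, 0.0), [-4.0, -5.0]),
    Cluster("acetone", (0.0, 1.0, 1.0), [-4.0]),
    Cluster("ethanol", (20.0, 0.0, 0.0), [-3.0]),
]
best_site = consensus_sites(clusters)[0]
print(sorted(c.probe for c in best_site))  # ['acetone', 'ethanol', 'isopropanol']
```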
This technique is a computational adaptation of 'wet lab' work from 1996. That work showed that determining the structure of a protein while it is suspended in different solvents, and then superimposing those structures on one another, produces data in which the organic solvent molecules (those the protein was suspended in) typically cluster at the protein's active site. It was carried out in response to the realization that water molecules are visible in the electron density maps produced by X-ray crystallography. The water molecules interact with the protein and tend to cluster at the protein's polar regions. This led to the idea of immersing the purified protein crystal in other solvents (e.g. ethanol, isopropanol, etc.) to determine where these molecules cluster on the protein. The solvents can be chosen based on what they approximate, that is, what molecule the protein may interact with (e.g. ethanol can probe for interactions with the amino acid serine, isopropanol for threonine, etc.). It is vital that the protein crystal maintains its tertiary structure in each solvent. The process is repeated for multiple solvents, and the resulting data can then be used to try to determine potential active sites on the protein. Ten years later this technique was developed into an algorithm by Clodfelter et al.
Network-based methods
Guilt-by-association algorithms may be used to produce a functional association network for a given target group of genes or proteins. These networks serve as a representation of the evidence for shared or similar function within a group of genes: nodes represent genes or proteins, and edges between them represent evidence of shared function.
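One simple way to use such a network is weighted neighbour voting, sketched below. The network is stored as a dictionary of weighted edges, and each annotated neighbour of an unannotated gene votes for its own function terms with the weight of the connecting edge. The function names, edge weights and annotations are hypothetical, and neighbour voting is only one of many possible scoring schemes.

```python
from collections import defaultdict


def build_network(edges) -> dict:
    """Store undirected weighted edges as a node -> {neighbour: weight} mapping."""
    network = defaultdict(dict)
    for a, b, weight in edges:
        network[a][b] = weight
        network[b][a] = weight
    return network


def predict_function(network, annotations, gene) -> list:
    """Score each function term by the summed edge weight to neighbours that carry it."""
    votes = defaultdict(float)
    for neighbour, weight in network.get(gene, {}).items():
        for term in annotations.get(neighbour, ()):
            votes[term] += weight
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


# Hypothetical evidence of shared function (edge weights between 0 and 1).
edges = [("g1", "g2", 0.9), ("g1", "g3", 0.4), ("g1", "g4", 0.3), ("g2", "g3", 0.5)]
annotations = {"g2": {"DNA repair"}, "g3": {"transport"}, "g4": {"transport"}}

network = build_network(edges)
print(predict_function(network, annotations, "g1"))
```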
Integrated networks
Several
networks based on different data sources can be combined into a
composite network, which can then be used by a prediction algorithm to
annotate candidate genes or proteins. For example, the developers of the bioPIXIE system used a wide variety of Saccharomyces cerevisiae (yeast) genomic data to produce a composite functional network for that species.
This resource allows the visualization of known networks representing
biological processes, as well as the prediction of novel components of
those networks. Many algorithms have been developed to predict function
based on the integration of several data sources (e.g. genomic,
proteomic, protein interaction, etc.), and testing on previously
annotated genes indicates a high level of accuracy.
Disadvantages of some function prediction algorithms have included a lack of accessibility and the time required for analysis. Faster, more accurate algorithms such as GeneMANIA (multiple association network integration algorithm) have, however, been developed in recent years and are publicly available on the web, indicating the future direction of function prediction.
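The idea of combining several networks into a composite can be illustrated with a weighted average of edge scores, as sketched below. This is a deliberately simple scheme and is not the method used by bioPIXIE or GeneMANIA; the function name, the data sources, the edge scores and the weights are all hypothetical.

```python
def combine_networks(networks: dict, weights: dict) -> dict:
    """Merge per-source edge scores into one network using a weighted average.

    networks: source name -> {(gene_a, gene_b): score}
    weights:  source name -> relative trust placed in that source
    """
    total = sum(weights[name] for name in networks)
    composite = {}
    for name, edges in networks.items():
        for (a, b), score in edges.items():
            key = tuple(sorted((a, b)))  # treat edges as undirected
            composite[key] = composite.get(key, 0.0) + weights[name] * score / total
    return composite


networks = {
    "coexpression": {("g1", "g2"): 0.8, ("g2", "g3"): 0.2},
    "interaction": {("g2", "g1"): 1.0},
}
weights = {"coexpression": 1.0, "interaction": 3.0}

for edge, score in sorted(combine_networks(networks, weights).items()):
    print(edge, round(score, 2))
```

An edge missing from a source contributes nothing for that source, so edges supported by several sources score higher than edges supported by only one.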
Tools and databases for protein function prediction
STRING: web tool that integrates various data sources for function prediction.
VisANT: Visual analysis of networks and integrative visual data-mining.