A protein domain is a conserved part of a given protein sequence and (tertiary) structure that can evolve,
function, and exist independently of the rest of the protein chain.
Each domain forms a compact three-dimensional structure and often can be
independently stable and folded. Many proteins consist of several structural domains. One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from about 50 to 250 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.
Background
The concept of the domain was first proposed in 1973 by Wetlaufer after X-ray
crystallographic studies of hen lysozyme and papain
and limited proteolysis studies of immunoglobulins. Wetlaufer defined domains as stable units of protein structure that could fold autonomously. In the past, domains have been described as units of:
- compact structure
- function and evolution
- folding.
Each definition is valid and will often overlap, i.e. a compact
structural domain that is found among diverse proteins is likely to
fold independently within its structural environment. Nature often
brings several domains together to form multi-domain and multi-functional
proteins with a vast number of possibilities.
In a multi-domain protein, each domain may fulfill its own function
independently, or in a concerted manner with its neighbors. Domains can
either serve as modules for building up large assemblies such as virus
particles or muscle fibers, or can provide specific catalytic or binding
sites as found in enzymes or regulatory proteins.
Example: Pyruvate kinase
An appropriate example is pyruvate kinase
(see first figure), a glycolytic enzyme that plays an important role in
regulating the flux from fructose-1,6-bisphosphate to pyruvate. It
contains an all-β nucleotide binding domain (in blue), an α/β-substrate
binding domain (in grey) and an α/β-regulatory domain (in olive green), connected by several polypeptide linkers. Each domain in this protein occurs in diverse sets of protein families.
The central α/β-barrel substrate binding domain is one of the most common enzyme folds. It is seen in many different enzyme families catalysing completely unrelated reactions. The α/β-barrel is commonly called the TIM barrel, named after triose phosphate isomerase, which was the first such structure to be solved. It is currently classified into 26 homologous families in the CATH domain database.
The TIM barrel is formed from a sequence of β-α-β motifs closed by the
first and last strand hydrogen bonding together, forming an eight
stranded barrel. There is debate about the evolutionary origin of this
domain. One study has suggested
that a single ancestral enzyme could have diverged into several
families, while another suggests that a stable TIM-barrel structure has evolved
through convergent evolution.
The TIM-barrel in pyruvate kinase is 'discontinuous', meaning
that more than one segment of the polypeptide is required to form the
domain. This is likely to be the result of the insertion of one domain
into another during the protein's evolution. It has been shown from
known structures that about a quarter of structural domains are
discontinuous. The inserted β-barrel regulatory domain is 'continuous', made up of a single stretch of polypeptide.
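The difference between continuous and discontinuous domains can be made concrete by describing each domain as a list of residue ranges. The following is a minimal sketch only: the `Domain` class is a hypothetical helper, and the residue numbers are invented for illustration rather than taken from any pyruvate kinase structure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Domain:
    """A structural domain described by inclusive residue ranges (hypothetical model)."""
    name: str
    segments: List[Tuple[int, int]]  # (first_residue, last_residue), inclusive

    def is_continuous(self) -> bool:
        # A continuous domain is built from a single stretch of polypeptide.
        return len(self.segments) == 1

    def length(self) -> int:
        # Total number of residues over all segments.
        return sum(end - start + 1 for start, end in self.segments)


# Invented residue numbers: a barrel interrupted by an inserted domain.
barrel = Domain("barrel", [(1, 100), (201, 350)])
inserted = Domain("inserted", [(101, 200)])

print(barrel.is_continuous(), inserted.is_continuous())
```

In this toy layout the barrel needs two segments because the inserted domain sits in the middle of its sequence, which is the situation the text describes as the likely result of one domain being inserted into another.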
Units of protein structure
The primary structure (string of amino acids) of a protein ultimately encodes its uniquely folded three-dimensional (3D) conformation.
The most important factor governing the folding of a protein into 3D
structure is the distribution of polar and non-polar side chains.
Folding is driven by the burial of hydrophobic side chains into the
interior of the molecule so as to avoid contact with the aqueous
environment. Generally proteins have a core of hydrophobic residues
surrounded by a shell of hydrophilic residues. Since the peptide bonds
themselves are polar they are neutralized by hydrogen bonding with each
other when in the hydrophobic environment. This gives rise to regions of
the polypeptide that form regular 3D structural patterns called secondary structure. There are two main types of secondary structure: α-helices and β-sheets.
Some simple combinations of secondary structure elements have been found to frequently occur in protein structure and are referred to as supersecondary structure or motifs.
For example, the β-hairpin motif consists of two adjacent antiparallel
β-strands joined by a small loop. It is present in most antiparallel β
structures both as an isolated ribbon and as part of more complex
β-sheets. Another common super-secondary structure is the β-α-β motif,
which is frequently used to connect two parallel β-strands. The central
α-helix connects the C-terminus of the first strand to the N-terminus of
the second strand, packing its side chains against the β-sheet and
therefore shielding the hydrophobic residues of the β-strands from the
surface.
Covalent association of two domains represents a functional and
structural advantage since there is an increase in stability when
compared with the same structures non-covalently associated.
Other advantages are the protection of intermediates within
inter-domain enzymatic clefts that may
otherwise be unstable in aqueous environments, and a fixed
stoichiometric ratio of the enzymatic activity necessary for a
sequential set of reactions.
Structural alignment is an important tool for determining domains.
Tertiary structure
Several motifs pack together to form compact, local, semi-independent units called domains.
The overall 3D structure of the polypeptide chain is referred to as the protein's tertiary structure.
Domains are the fundamental units of tertiary structure, each domain
containing an individual hydrophobic core built from secondary
structural units connected by loop regions. The packing of the
polypeptide is usually much tighter in the interior than the exterior of
the domain producing a solid-like core and a fluid-like surface.
Core residues are often conserved in a protein family, whereas the
residues in loops are less conserved, unless they are involved in the
protein's function. Protein tertiary structure can be divided into four
main classes based on the secondary structural content of the domain.
- All-α domains have a domain core built exclusively from α-helices. This class is dominated by small folds, many of which form a simple bundle with helices running up and down.
- All-β domains have a core composed of antiparallel β-sheets, usually two sheets packed against each other. Various patterns can be identified in the arrangement of the strands, often giving rise to the identification of recurring motifs, for example the Greek key motif.
- α+β domains are a mixture of all-α and all-β motifs. Classification of proteins into this class is difficult because of overlaps with the other three classes and therefore is not used in the CATH domain database.
- α/β domains are made from a combination of β-α-β motifs that predominantly form a parallel β-sheet surrounded by amphipathic α-helices. The secondary structures are arranged in layers or barrels.
Limits on size
Domains have limits on size. The size of individual structural domains varies from 36 residues in E-selectin to 692 residues in lipoxygenase-1, but the majority, 90%, have fewer than 200 residues with an average of approximately 100 residues.
Very short domains, fewer than 40 residues, are often stabilised by
metal ions or disulfide bonds. Larger domains, more than 300
residues, are likely to consist of multiple hydrophobic cores.
Quaternary structure
Many proteins have a quaternary structure,
which consists of several polypeptide chains that associate into an
oligomeric molecule. Each polypeptide chain in such a protein is called a
subunit. Hemoglobin, for example, consists of two α and two β subunits.
Each of the four chains has an all-α globin fold with a heme pocket.
Domain swapping is a mechanism for forming oligomeric assemblies.
In domain swapping, a secondary or tertiary element of a monomeric
protein is replaced by the same element of another protein. Domain
swapping can range from secondary structure elements to whole structural
domains. It also represents a model of evolution for functional
adaptation by oligomerization, e.g. oligomeric enzymes that have their
active site at subunit interfaces.
Domains as evolutionary modules
As nature is a tinkerer and not an inventor,
new sequences are adapted from pre-existing sequences rather than
invented. Domains are the common material used by nature to generate new
sequences; they can be thought of as genetically mobile units, referred
to as 'modules'. Often, the C and N termini of domains are close
together in space, allowing them to easily be "slotted into" parent
structures during the process of evolution. Many domain families are
found in all three forms of life, Archaea, Bacteria and Eukarya.
Protein modules are a subset of protein domains which are found across a
range of different proteins with a particularly versatile structure.
Examples can be found among extracellular proteins associated with
clotting, fibrinolysis, complement, the extracellular matrix, cell
surface adhesion molecules and cytokine receptors. Four concrete examples of widespread protein modules are the following domains: SH2, immunoglobulin, fibronectin type 3 and the kringle.
Molecular evolution
gives rise to families of related proteins with similar sequence and
structure. However, sequence similarities can be extremely low between
proteins that share the same structure. Protein structures may be
similar because proteins have diverged from a common ancestor.
Alternatively, some folds may be more favored than others as they
represent stable arrangements of secondary structures and some proteins
may converge towards these folds over the course of evolution. There are
currently about 110,000 experimentally determined protein 3D structures
deposited within the Protein Data Bank (PDB).
However, this set contains many identical or very similar structures.
All proteins should be classified into structural families to understand
their evolutionary relationships. Structural comparisons are best
achieved at the domain level. For this reason many algorithms have been
developed to automatically assign domains in proteins with known 3D
structure.
The CATH domain database classifies domains into approximately
800 fold families; ten of these folds are highly populated and are
referred to as 'super-folds'. Super-folds are defined as folds for which
there are at least three structures without significant sequence
similarity. The most populated is the α/β-barrel super-fold, as described previously.
Multidomain proteins
The majority of proteins, two-thirds in unicellular organisms and more than 80% in metazoa, are multidomain proteins.
However, other studies concluded that 40% of prokaryotic proteins
consist of multiple domains while eukaryotes have approximately 65%
multi-domain proteins.
Many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes,
suggesting that domains in multidomain proteins once existed as
independent proteins. For example, vertebrates have a multi-enzyme
polypeptide containing the GAR synthetase, AIR synthetase and GAR transformylase
domains (GARs-AIRs-GARt; GAR: glycinamide ribonucleotide
synthetase/transferase; AIR: aminoimidazole ribonucleotide synthetase).
In insects, the polypeptide appears as GARs-(AIRs)2-GARt, in yeast
GARs-AIRs is encoded separately from GARt, and in bacteria each domain
is encoded separately.
Origin
Multidomain proteins are likely to have emerged from selective pressure during evolution
to create new functions. Various proteins have diverged from common
ancestors by different combinations and associations of domains. Modular
units frequently move about, within and between biological systems
through mechanisms of genetic shuffling:
- transposition of mobile elements including horizontal transfers (between species);
- gross rearrangements such as inversions, translocations, deletions and duplications;
- homologous recombination;
- slippage of DNA polymerase during replication.
Types of organization
The simplest multidomain organization seen in proteins is that of a single domain repeated in tandem. The domains may interact with each other or remain isolated, like beads on a string. The giant 30,000 residue muscle protein titin comprises about 120 fibronectin-III-type and Ig-type domains. In the serine proteases, a gene duplication event has led to the formation of a two β-barrel domain enzyme.
The repeats have diverged so widely that there is no obvious sequence
similarity between them. The active site is located at a cleft between
the two β-barrel domains, in which functionally important residues are
contributed from each domain. Genetically engineered mutants of the chymotrypsin serine protease
were shown to have some proteinase activity even though their active
site residues were abolished and it has therefore been postulated that
the duplication event enhanced the enzyme's activity.
Modules frequently display different connectivity relationships, as illustrated by the kinesins and ABC transporters. The kinesin motor domain can be at either end of a polypeptide chain that includes a coiled-coil region and a cargo domain.
ABC transporters are built with up to four domains consisting of two
unrelated modules, ATP-binding cassette and an integral membrane module,
arranged in various combinations.
Not only do domains recombine, but there are many examples of a
domain having been inserted into another. Sequence or structural
similarities to other
domains demonstrate that homologues of inserted and parent domains can
exist independently. An example is that of the 'fingers' inserted into
the 'palm' domain within the polymerases of the Pol I family.
Since a domain can be inserted into another, there should always be at
least one continuous domain in a multidomain protein. This is the main
difference between definitions of structural domains and
evolutionary/functional domains. An evolutionary domain will be limited
to one or two connections between domains, whereas structural domains
can have unlimited connections, within a given criterion of the
existence of a common core. Several structural domains could be assigned
to an evolutionary domain.
A superdomain consists of two or more conserved domains of
nominally independent origin, but subsequently inherited as a single
structural/functional unit.
This combined superdomain can occur in diverse proteins that are not
related by gene duplication alone. An example of a superdomain is the protein tyrosine phosphatase–C2 domain pair in PTEN, tensin, auxilin
and the membrane protein TPTE2. This superdomain is found in proteins
in animals, plants and fungi. A key feature of the PTP-C2 superdomain is
amino acid residue conservation in the domain interface.
Domains are autonomous folding units
Folding
Protein folding, the unsolved problem: Since the seminal work of Anfinsen in the early 1960s,
the goal of completely understanding the mechanism by which a polypeptide
rapidly folds into its stable native conformation has remained elusive. Many
experimental folding studies have contributed much to our understanding,
but the principles that govern protein folding are still based on those
discovered in the very first studies of folding. Anfinsen showed that
the native state of a protein is thermodynamically stable, the
conformation being at a global minimum of its free energy.
Folding is a directed search of conformational space allowing the protein to fold on a biologically feasible time scale. The Levinthal paradox
states that if an average-sized protein were to sample all possible
conformations before finding the one with the lowest energy, the whole
process would take billions of years.
Proteins typically fold within 0.1 to 1000 seconds. Therefore, the
protein folding process must be directed in some way through a specific
folding pathway. The forces
that direct this search are likely to be a combination of local and
global influences whose effects are felt at various stages of the
reaction.
Advances in experimental and theoretical studies have shown that folding can be viewed in terms of energy landscapes,
where folding kinetics is considered as a progressive organization of
an ensemble of partially folded structures through which a protein
passes on its way to the folded structure. This has been described in
terms of a folding funnel,
in which an unfolded protein has a large number of conformational
states available and there are fewer states available to the folded
protein. A funnel implies that for protein folding there is a decrease
in energy and loss of entropy with increasing tertiary structure
formation. The local roughness of the funnel reflects kinetic traps,
corresponding to the accumulation of misfolded intermediates. A folding
chain progresses toward lower intra-chain free-energies by increasing
its compactness. The chain's conformational options become increasingly
narrowed ultimately toward one native structure.
Advantage of domains in protein folding
The
organisation of large proteins by structural domains represents an
advantage for protein folding, with each domain being able to
individually fold, accelerating the folding process and reducing a
potentially large combination of residue interactions. Furthermore,
given the observed random distribution of hydrophobic residues in
proteins,
domain formation appears to be the optimal solution for a large protein
to bury its hydrophobic residues while keeping the hydrophilic residues
at the surface.
However, the role of inter-domain interactions in protein folding
and in energetics of stabilisation of the native structure, probably
differs for each protein. In T4 lysozyme, the influence of one domain on
the other is so strong that the entire molecule is resistant to
proteolytic cleavage. In this case, folding is a sequential process
where the C-terminal domain is required to fold independently in an
early step, and the other domain requires the presence of the folded
C-terminal domain for folding and stabilization.
It has been found that the folding of an isolated domain can take
place at the same rate or sometimes faster than that of the integrated
domain,
suggesting that unfavorable interactions with the rest of the protein
can occur during folding. Several arguments suggest that the slowest
step in the folding of large proteins is the pairing of the folded
domains.
This is either because the domains are not folded entirely correctly or
because the small adjustments required for their interaction are
energetically unfavorable, such as the removal of water from the domain interface.
Domains and protein flexibility
Protein domain dynamics play a key role in a multitude of molecular recognition and signaling processes.
Protein domains, connected by intrinsically disordered flexible linker domains, induce long-range allostery via protein domain dynamics.
The resultant dynamic modes cannot be generally predicted from static
structures of either the entire protein or individual domains. They can
however be inferred by comparing different structures of a protein. They can also be suggested by sampling in extensive molecular dynamics trajectories and principal component analysis, or they can be directly observed using spectra
measured by neutron spin echo spectroscopy.
Domain definition from structural co-ordinates
The
importance of domains as structural building blocks and elements of
evolution has brought about many automated methods for their
identification and classification in proteins of known structure.
Automatic procedures for reliable domain assignment are essential for the
generation of the domain databases, especially as the number of known
protein structures is increasing. Although the boundaries of a domain
can be determined by visual inspection, construction of an automated
method is not straightforward. Problems occur when faced with domains
that are discontinuous or highly associated.
The fact that there is no standard definition of what a domain really
is has meant that domain assignments have varied enormously, with each
researcher using a unique set of criteria.
A structural domain is a compact, globular sub-structure with more interactions within it than with the rest of the protein.
Therefore, a structural domain can be determined by two visual characteristics: its compactness and its extent of isolation. Measures of local compactness in proteins have been used in many of the early methods of domain assignment and in several of the more recent methods.
Methods
One of the first algorithms used a Cα-Cα distance map together with a hierarchical clustering
routine that considered proteins as several small segments, 10 residues
in length. The initial segments were clustered one after another based
on inter-segment distances; segments with the shortest distances were
clustered and considered as single segments thereafter. The step-wise
clustering finally included the full protein. Go also exploited the fact that inter-domain distances are normally larger than intra-domain distances; all possible Cα-Cα distances
were represented as diagonal plots in which there were distinct
patterns for helices, extended strands and combinations of secondary
structures.
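The segment-clustering idea described above can be sketched in a few lines. This is a simplified illustration and not the published algorithm: the use of the mean Cα-Cα distance as the inter-segment distance, the stopping rule (`n_clusters`) and the toy coordinates are all assumptions made for the example.

```python
import math
from itertools import combinations


def mean_distance(a, b, coords):
    """Average Cα-Cα distance between two sets of residue indices."""
    total = sum(math.dist(coords[i], coords[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def cluster_segments(coords, segment_length=10, n_clusters=2):
    """Agglomeratively merge fixed-length chain segments until n_clusters remain."""
    n = len(coords)
    clusters = [list(range(s, min(s + segment_length, n)))
                for s in range(0, n, segment_length)]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them into one.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: mean_distance(clusters[p[0]], clusters[p[1]], coords))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy Cα trace: two well-separated 20-residue blobs (invented coordinates, in Å).
coords = ([(4.0 * (i % 5), 4.0 * (i // 5), 0.0) for i in range(20)] +
          [(60.0 + 4.0 * (i % 5), 4.0 * (i // 5), 0.0) for i in range(20, 40)])
domains = cluster_segments(coords)
print([(d[0], d[-1]) for d in domains])
```

Because the clusters are sets of residue indices rather than chain intervals, nothing in this scheme prevents a cluster from being discontinuous in sequence.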
The method by Sowdhamini and Blundell clusters secondary
structures in a protein based on their Cα-Cα distances and identifies
domains from the pattern in
their dendrograms.
As the procedure does not consider the protein as a continuous chain of
amino acids there are no problems in treating discontinuous domains.
Specific nodes in these dendrograms are identified as tertiary
structural clusters of the protein; these include both super-secondary
structures and domains. The DOMAK algorithm is used to create the 3Dee
domain database.
It calculates a 'split value' from the number of each type of contact
when the protein is divided arbitrarily into two parts. This split value
is
large when the two parts of the structure are distinct.
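A contact-based split value of this kind can be sketched as follows. The functional form used here, (int_A / ext_AB) × (int_B / ext_AB), where int_A and int_B count contacts inside each part and ext_AB counts contacts between the parts, is an assumption made for illustration; so are the 8 Å Cα cutoff, the restriction to a single chain cut and the toy coordinates.

```python
import math


def contacts(coords, cutoff=8.0):
    """Residue pairs (i < j, not chain neighbours) whose Cα atoms lie within cutoff."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + 2, n)
            if math.dist(coords[i], coords[j]) <= cutoff]


def split_value(pairs, cut):
    """Assumed split value for dividing the chain into A = [0, cut) and B = [cut, n)."""
    int_a = sum(1 for i, j in pairs if j < cut)        # contacts inside A
    int_b = sum(1 for i, j in pairs if i >= cut)       # contacts inside B
    ext_ab = sum(1 for i, j in pairs if i < cut <= j)  # contacts between A and B
    if ext_ab == 0:
        return math.inf  # the two parts do not touch at all
    return (int_a / ext_ab) * (int_b / ext_ab)


def best_single_cut(coords, min_size=5, cutoff=8.0):
    """Cut position with the largest split value, keeping both parts >= min_size."""
    pairs = contacts(coords, cutoff)
    cuts = range(min_size, len(coords) - min_size + 1)
    return max(cuts, key=lambda c: split_value(pairs, c))


# Toy Cα trace: two compact 20-residue blobs that touch along one edge (invented, in Å).
coords = ([(3.0 * (i % 5), 3.0 * (i // 5), 0.0) for i in range(20)] +
          [(20.0 + 3.0 * (i % 5), 3.0 * ((i - 20) // 5), 0.0) for i in range(20, 40)])
print(best_single_cut(coords))
```

The value is largest where both parts are internally well connected but share few contacts with each other, which matches the description of a split value that is large when the two parts are distinct.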
The method of Wodak and Janin
was based on the calculated interface areas between two chain segments
repeatedly cleaved at various residue positions. Interface areas were
calculated by comparing surface areas of the cleaved segments with that
of the native structure. Potential domain boundaries were identified
at sites where the interface area was at a minimum. Other methods have
used measures of solvent accessibility to calculate compactness.
The PUU algorithm
incorporates a harmonic model used to approximate inter-domain
dynamics. The underlying physical concept is that many rigid
interactions will occur within each domain and loose interactions will
occur between domains. This algorithm is used to define domains in the FSSP domain database.
Swindells (1995) developed a method, DETECTIVE, for
identification of domains in protein structures based on the idea that
domains have a hydrophobic
interior. Deficiencies were found to occur when hydrophobic cores from
different domains continue through the interface region.
RigidFinder
is a novel method for identification of protein rigid blocks (domains
and loops) from two different conformations. Rigid blocks are defined as
blocks where all inter residue distances are conserved across
conformations.
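The definition of a rigid block given above can be illustrated with a small greedy grouping. This is a sketch under stated assumptions and not the RigidFinder algorithm itself: the greedy assignment order, the 1 Å tolerance and the toy coordinates are all invented for the example.

```python
import math


def rigid_blocks(conf_a, conf_b, tolerance=1.0):
    """Greedily group residues whose mutual distances agree in both conformations."""
    def conserved(i, j):
        change = math.dist(conf_a[i], conf_a[j]) - math.dist(conf_b[i], conf_b[j])
        return abs(change) <= tolerance

    blocks = []
    for residue in range(len(conf_a)):
        for block in blocks:
            # Join the first block in which every pairwise distance is conserved.
            if all(conserved(residue, member) for member in block):
                block.append(residue)
                break
        else:
            blocks.append([residue])
    return blocks


# Toy example: residues 3-5 move together by 8 Å between the two conformations.
conf_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0),
          (10.0, 0.0, 0.0), (13.0, 0.0, 0.0), (16.0, 0.0, 0.0)]
conf_b = conf_a[:3] + [(x, y + 8.0, z) for x, y, z in conf_a[3:]]
print(rigid_blocks(conf_a, conf_b))
```

Distances within each group of three residues are unchanged, while every distance between the groups changes by more than the tolerance, so two blocks are reported.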
A general method to identify dynamical domains, that is protein
regions that behave approximately as rigid units in the course of
structural fluctuations, has been introduced by Potestio et al. and, among other applications, was also used
to compare the consistency of the dynamics-based domain
subdivisions with standard structure-based ones. The method,
termed PiSQRD, is publicly available in the form of a web server. The latter allows users to optimally subdivide single-chain
or multimeric proteins into quasi-rigid domains based on the collective modes of fluctuation of the system. By default the
latter are calculated through an elastic network model;
alternatively pre-calculated essential dynamical spaces can be
uploaded by the user.
Example domains
- Armadillo repeats : named after the β-catenin-like Armadillo protein of the fruit fly Drosophila.
- Basic leucine zipper domain (bZIP domain) : is found in many DNA-binding eukaryotic proteins. One part of the domain contains a region that mediates sequence-specific DNA binding, and another forms the leucine zipper that is required for the dimerization of two DNA-binding regions. The DNA-binding region comprises a number of basic amino acids such as arginine and lysine.
- Cadherin repeats : Cadherins function as Ca2+-dependent cell–cell adhesion proteins. Cadherin domains are extracellular regions which mediate cell-to-cell homophilic binding between cadherins on the surface of adjacent cells.
- Death effector domain (DED) : allows protein–protein binding by homotypic interactions (DED-DED). Caspase proteases trigger apoptosis via proteolytic cascades. Pro-caspase-8 and pro-caspase-10 bind to specific adaptor molecules via DED domains and this leads to autoactivation of caspases.
- EF hand : a helix-loop-helix structural motif found in each structural domain of the signaling protein calmodulin and in the muscle protein troponin-C.
- Immunoglobulin-like domains : are found in proteins of the immunoglobulin superfamily (IgSF). They contain about 70-110 amino acids and are classified into different categories (IgV, IgC1, IgC2 and IgI) according to their size and function. They possess a characteristic fold in which two beta sheets form a "sandwich" that is stabilized by interactions between conserved cysteines and other charged amino acids. They are important for protein–protein interactions in processes of cell adhesion, cell activation, and molecular recognition. These domains are commonly found in molecules with roles in the immune system.
- Phosphotyrosine-binding domain (PTB) : PTB domains usually bind to phosphorylated tyrosine residues. They are often found in signal transduction proteins. PTB-domain binding specificity is determined by residues to the amino-terminal side of the phosphotyrosine. Examples: the PTB domains of both SHC and IRS-1 bind to a NPXpY sequence. PTB-containing proteins such as SHC and IRS-1 are important for insulin responses of human cells.
- Pleckstrin homology domain (PH) : PH domains bind phosphoinositides with high affinity. Specificity for PtdIns(3)P, PtdIns(4)P, PtdIns(3,4)P2, PtdIns(4,5)P2, and PtdIns(3,4,5)P3 has been observed in each case. Because phosphoinositides are sequestered to various cell membranes (due to their long lipophilic tail), PH domains usually cause recruitment of the protein in question to a membrane, where the protein can exert a certain function in cell signalling, cytoskeletal reorganization or membrane trafficking.
- Src homology 2 domain (SH2) : SH2 domains are often found in signal transduction proteins. SH2 domains confer binding to phosphorylated tyrosine (pTyr). Named after the phosphotyrosine binding domain of the src viral oncogene, which is itself a tyrosine kinase. See also: SH3 domain.
- Zinc finger DNA binding domain (ZnF_GATA) : ZnF_GATA domain-containing proteins are typically transcription factors that usually bind to the DNA sequence [AT]GATA[AG] of promoters.
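The consensus sequence [AT]GATA[AG] quoted for ZnF_GATA proteins in the list above is written in a notation that maps directly onto a regular expression. The sketch below scans one strand of a DNA string for candidate sites; the function name and the promoter sequence are invented for illustration, and a match only indicates agreement with the consensus, not demonstrated binding.

```python
import re

# Consensus from the text: [AT]GATA[AG]; the lookahead lets overlapping sites be reported.
GATA_SITE = re.compile(r"(?=([AT]GATA[AG]))")


def find_gata_sites(dna):
    """Return (0-based position, matched site) for each consensus match on the given strand."""
    return [(m.start(), m.group(1)) for m in GATA_SITE.finditer(dna.upper())]


promoter = "ccTGATAAggcAGATAGttCGATAC"  # made-up sequence
print(find_gata_sites(promoter))
```

Only the strand as given is searched; scanning the reverse complement as well would be needed to cover sites on the opposite strand.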
Domains of unknown function
A large fraction of domains are of unknown function. A domain of unknown function (DUF)
is a protein domain that has no characterized function. These families
have been collected together in the Pfam database using the prefix DUF
followed by a number, with examples being DUF2992 and DUF1220. There are
now over 3,000 DUF families within the Pfam database representing over
20% of known families.