Monday, February 11, 2019

Structural alignment (proteins)

From Wikipedia, the free encyclopedia

Structural alignment of thioredoxins from humans and the fly Drosophila melanogaster. The proteins are shown as ribbons, with the human protein in red, and the fly protein in yellow. Generated from PDB 3TRX and 1XWC.

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution is needed when using the results as evidence for shared evolutionary ancestry, because convergent evolution, by which multiple unrelated amino acid sequences converge on a common tertiary structure, is a possible confounding effect.

Structural alignments can compare two sequences or multiple sequences. Because these alignments rely on information about all the query sequences' three-dimensional conformations, the method can only be used on sequences where these structures are known. These are usually found by X-ray crystallography or NMR spectroscopy. It is possible to perform a structural alignment on structures produced by structure prediction methods. Indeed, evaluating such predictions often requires a structural alignment between the model and the true known structure to assess the model's quality. Structural alignments are especially useful in analyzing data from structural genomics and proteomics efforts, and they can be used as comparison points to evaluate alignments produced by purely sequence-based bioinformatics methods.

The outputs of a structural alignment are a superposition of the atomic coordinate sets and a minimal root mean square deviation (RMSD) between the structures. The RMSD of two aligned structures indicates their divergence from one another. Structural alignment can be complicated by the existence of multiple protein domains within one or more of the input structures, because changes in relative orientation of the domains between two structures to be aligned can artificially inflate the RMSD.
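
As a minimal sketch of the RMSD calculation, assuming the two structures have already been superposed and their atoms paired one-to-one, the following computes the value from two coordinate lists. The function name and the toy coordinates are illustrative only.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two paired lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example: three pseudo-atoms, the second set shifted by 1 angstrom along x.
a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(rmsd(a, b))  # 1.0
```

Because every paired atom contributes equally, a rigid shift of one domain relative to another raises the value even when each domain is individually well conserved, which is the inflation effect described above.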

Data produced by structural alignment

The minimum information produced from a successful structural alignment is a set of residues that are considered equivalent between the structures. This set of equivalences is then typically used to superpose the three-dimensional coordinates for each input structure. (Note that one input element may be fixed as a reference and therefore its superposed coordinates do not change.) The fitted structures can be used to calculate mutual RMSD values, as well as other more sophisticated measures of structural similarity such as the global distance test (GDT, the metric used in CASP). The structural alignment also implies a corresponding one-dimensional sequence alignment from which a sequence identity, or the percentage of residues that are identical between the input structures, can be calculated as a measure of how closely the two sequences are related.
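
The two derived measures mentioned above can be sketched as follows. This is an illustration under stated assumptions: sequence identity is computed over aligned (non-gap) columns only, which is one of several conventions, and the GDT-style score is evaluated for a single fixed superposition using the 1, 2, 4 and 8 Å cutoffs, whereas the full GDT procedure searches over many superpositions to maximize each fraction. Function names are hypothetical.

```python
def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

def gdt_ts_fixed(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean percentage of residue pairs within each cutoff, for ONE superposition.

    Simplification: the real GDT_TS maximizes each fraction over superpositions.
    """
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

identity = sequence_identity("ACD-EFG", "ACEKE-G")
print(identity)  # 80.0 (four of the five gap-free columns match)

score = gdt_ts_fixed([0.5, 1.5, 3.0, 6.0, 10.0])
print(round(score, 1))  # 50.0
```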

Types of comparisons

Because protein structures are composed of amino acids whose side chains are linked by a common protein backbone, a number of different possible subsets of the atoms that make up a protein macromolecule can be used in producing a structural alignment and calculating the corresponding RMSD values. When aligning structures with very different sequences, the side chain atoms generally are not taken into account because their identities differ between many aligned residues. For this reason it is common for structural alignment methods to use by default only the backbone atoms included in the peptide bond. For simplicity and efficiency, often only the alpha carbon positions are considered, since the peptide bond has a minimally variant planar conformation. Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions, in which case the RMSD reflects not only the conformation of the protein backbone but also the rotameric states of the side chains. Other comparison criteria that reduce noise and bolster positive matches include secondary structure assignment, native contact maps or residue interaction patterns, measures of side chain packing, and measures of hydrogen bond retention.

Structural superposition

The most basic possible comparison between protein structures makes no attempt to align the input structures and requires a precalculated alignment as input to determine which of the residues in the sequence are intended to be considered in the RMSD calculation. Structural superposition is commonly used to compare multiple conformations of the same protein (in which case no alignment is necessary, since the sequences are the same) and to evaluate the quality of alignments produced using only sequence information between two or more sequences whose structures are known. This method traditionally uses a simple least-squares fitting algorithm, in which the optimal rotations and translations are found by minimizing the sum of the squared distances among all structures in the superposition. More recently, maximum likelihood and Bayesian methods have greatly increased the accuracy of the estimated rotations, translations, and covariance matrices for the superposition.
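
The least-squares fit can be sketched in pure Python. The version below assumes the atoms are already paired and uses Horn's quaternion formulation of the optimal rotation; the dominant eigenvector is found with a shifted power iteration, which is a simple stand-in for a proper eigen-solver and can converge slowly for nearly linear point sets. Production code would normally rely on a linear algebra library instead.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]

def superpose(mobile, reference, iterations=500):
    """Rotate and translate `mobile` onto `reference`, minimizing squared distances."""
    cm, cr = centroid(mobile), centroid(reference)
    P = [[p[k] - cm[k] for k in range(3)] for p in mobile]
    Q = [[q[k] - cr[k] for k in range(3)] for q in reference]
    # Correlation matrix S[a][b] = sum over atoms of P[a] * Q[b].
    S = [[sum(p[a] * q[b] for p, q in zip(P, Q)) for b in range(3)] for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    # Horn's symmetric 4x4 matrix; its top eigenvector is the rotation quaternion.
    N = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shift so all eigenvalues are non-negative, then iterate towards the largest.
    shift = max(sum(abs(v) for v in row) for row in N) or 1.0
    q = [1.0, 0.5, 0.25, 0.125]  # arbitrary starting vector (assumption)
    for _ in range(iterations):
        q = [sum(N[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    w, x, y, z = q
    R = [
        [w*w + x*x - y*y - z*z, 2 * (x*y - w*z), 2 * (x*z + w*y)],
        [2 * (y*x + w*z), w*w - x*x + y*y - z*z, 2 * (y*z - w*x)],
        [2 * (z*x - w*y), 2 * (z*y + w*x), w*w - x*x - y*y + z*z],
    ]
    # Apply the rotation to the centred mobile atoms, then move to the reference centroid.
    return [tuple(sum(R[r][k] * p[k] for k in range(3)) + cr[r] for r in range(3))
            for p in P]

def rmsd(a, b):
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

# Toy check: the mobile set is the reference rotated 90 degrees about z and shifted.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0),
             (1.5, 1.5, 1.5), (0.0, 1.5, 1.5)]
mobile = [(-y + 10.0, x - 5.0, z + 2.0) for x, y, z in reference]
fitted = superpose(mobile, reference)
print(rmsd(mobile, reference) > 1.0, rmsd(fitted, reference) < 1e-6)  # True True
```

The quaternion form always yields a proper rotation, so the fit cannot silently mirror one structure onto the other.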

Algorithms based on multidimensional rotations and modified quaternions have been developed to identify topological relationships between protein structures without the need for a predetermined alignment. Such algorithms have successfully identified canonical folds such as the four-helix bundle. The SuperPose method is sufficiently extensible to correct for relative domain rotations and other structural pitfalls.

Algorithmic complexity

Optimal solution

The optimal "threading" of a protein sequence onto a known structure and the production of an optimal multiple sequence alignment have been shown to be NP-complete. However, this does not imply that the structural alignment problem is NP-complete. Strictly speaking, an optimal solution to the protein structure alignment problem is only known for certain protein structure similarity measures, such as the measures used in protein structure prediction experiments, GDT_TS and MaxSub. These measures can be rigorously optimized using an algorithm capable of maximizing the number of atoms in two proteins that can be superimposed under a predefined distance cutoff. Unfortunately, the algorithm for optimal solution is not practical, since its running time depends not only on the lengths but also on the intrinsic geometry of input proteins.

Approximate solution

Approximate polynomial-time algorithms for structural alignment that produce a family of "optimal" solutions within an approximation parameter for a given scoring function have been developed. Although these algorithms theoretically classify the approximate protein structure alignment problem as "tractable", they are still computationally too expensive for large-scale protein structure analysis. As a consequence, practical algorithms that converge to the global solutions of the alignment, given a scoring function, do not exist. Most algorithms are, therefore, heuristic, but algorithms that guarantee the convergence to at least local maximizers of the scoring functions, and are practical, have been developed.

Representation of structures

Protein structures have to be represented in some coordinate-independent space to make them comparable. This is typically achieved by constructing a sequence-to-sequence matrix or series of matrices that encompass comparative metrics, rather than absolute distances relative to a fixed coordinate space. An intuitive representation is the distance matrix, which is a two-dimensional matrix containing all pairwise distances between some subset of the atoms in each structure (such as the alpha carbons). The matrix increases in dimensionality as the number of structures to be simultaneously aligned increases. Reducing the protein to a coarse metric such as secondary structure elements (SSEs) or structural fragments can also produce sensible alignments, despite the loss of information from discarding distances, as noise is also discarded. Choosing a representation to facilitate computation is critical to developing an efficient alignment mechanism.
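
A short sketch of the distance-matrix representation, using made-up alpha-carbon coordinates, shows why it is coordinate-independent: rotating and translating the structure leaves every pairwise distance unchanged.

```python
import math

def distance_matrix(coords):
    """All pairwise Euclidean distances between the given atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]

# Four hypothetical alpha carbons, 3.8 angstroms apart along the chain.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
# The same atoms after a 90-degree rotation about z and a translation.
moved = [(-y + 10.0, x - 5.0, z + 2.0) for x, y, z in ca]

original = distance_matrix(ca)
transformed = distance_matrix(moved)
same = all(math.isclose(a, b, abs_tol=1e-9)
           for row_a, row_b in zip(original, transformed)
           for a, b in zip(row_a, row_b))
print(same)  # True
```

The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be stored in practice.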

Methods

Structural alignment techniques have been used in comparing individual structures or sets of structures as well as in the production of "all-to-all" comparison databases that measure the divergence between every pair of structures present in the Protein Data Bank (PDB). Such databases are used to classify proteins by their fold.

DALI

Illustration of the atom-to-atom vectors calculated in SSAP. From these vectors a series of vector differences, e.g., between (FA) in Protein 1 and (SI) in Protein 2 would be constructed. The two sequences are plotted on the two dimensions of a matrix to form a difference matrix between the two proteins. Dynamic programming is applied to all possible difference matrices to construct a series of optimal local alignment paths that are then summed to form the summary matrix, on which a second round of dynamic programming is performed.

A common and popular structural alignment method is the DALI, or Distance-matrix ALIgnment method, which breaks the input structures into hexapeptide fragments and calculates a distance matrix by evaluating the contact patterns between successive fragments. Secondary structure features that involve residues that are contiguous in sequence appear on the matrix's main diagonal; other diagonals in the matrix reflect spatial contacts between residues that are not near each other in the sequence. When these diagonals are parallel to the main diagonal, the features they represent are parallel; when they are perpendicular, their features are anti-parallel. This representation is memory-intensive because the features in the square matrix are symmetrical (and thus redundant) about the main diagonal. 

When two proteins' distance matrices share the same or similar features in approximately the same positions, they can be said to have similar folds with similar-length loops connecting their secondary structure elements. DALI's actual alignment process requires a similarity search after the two proteins' distance matrices are built; this is normally conducted via a series of overlapping submatrices of size 6x6. Submatrix matches are then reassembled into a final alignment via a standard score-maximization algorithm — the original version of DALI used a Monte Carlo simulation to maximize a structural similarity score that is a function of the distances between putative corresponding atoms. In particular, more distant atoms within corresponding features are exponentially downweighted to reduce the effects of noise introduced by loop mobility, helix torsions, and other minor structural variations. Because DALI relies on an all-to-all distance matrix, it can account for the possibility that structurally aligned features might appear in different orders within the two sequences being compared. 
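
The distance-based similarity score can be illustrated with a small sketch. The functional form below (a relative-deviation term multiplied by a Gaussian envelope) and the constants 0.2 and 20 Å follow published descriptions of DALI's "elastic" score as recalled here; they should be treated as assumptions rather than as the exact implementation, and the function names are hypothetical.

```python
import math

THETA = 0.2   # assumed tolerance: deviations above 20% of the mean distance score negative
ALPHA = 20.0  # assumed envelope length scale in angstroms

def pair_score(d_a, d_b):
    """Similarity of one intramolecular distance in A with its counterpart in B."""
    mean = 0.5 * (d_a + d_b)
    if mean == 0.0:
        return THETA  # a residue compared with itself
    weight = math.exp(-(mean / ALPHA) ** 2)  # down-weights long-range pairs
    return (THETA - abs(d_a - d_b) / mean) * weight

def dali_like_score(dist_a, dist_b, equivalences):
    """Sum pair scores over all pairs of equivalenced residues.

    `equivalences` is a list of (index_in_a, index_in_b); it need not be in
    sequence order, which is how non-sequential matches can be scored.
    """
    total = 0.0
    for ia, ib in equivalences:
        for ja, jb in equivalences:
            total += pair_score(dist_a[ia][ja], dist_b[ib][jb])
    return total

dist_a = [[0.0, 3.8, 6.0], [3.8, 0.0, 3.8], [6.0, 3.8, 0.0]]
similar = [[0.0, 3.8, 6.2], [3.8, 0.0, 3.8], [6.2, 3.8, 0.0]]
different = [[0.0, 3.8, 10.0], [3.8, 0.0, 3.8], [10.0, 3.8, 0.0]]
eq = [(0, 0), (1, 1), (2, 2)]
s_same = dali_like_score(dist_a, dist_a, eq)
s_similar = dali_like_score(dist_a, similar, eq)
s_different = dali_like_score(dist_a, different, eq)
print(s_same > s_similar > s_different)  # True
```

In the actual method this score is what the Monte Carlo search tries to maximize over candidate sets of equivalences.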

The DALI method has also been used to construct a database known as FSSP (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins) in which all known protein structures are aligned with each other to determine their structural neighbors and fold classification. There is a searchable database based on DALI as well as a downloadable program and web search based on a standalone version known as DaliLite.

Combinatorial extension

The combinatorial extension (CE) method is similar to DALI in that it too breaks each structure in the query set into a series of fragments that it then attempts to reassemble into a complete alignment. A series of pairwise combinations of fragments called aligned fragment pairs, or AFPs, are used to define a similarity matrix through which an optimal path is generated to identify the final alignment. Only AFPs that meet given criteria for local similarity are included in the matrix as a means of reducing the necessary search space and thereby increasing efficiency. A number of similarity metrics are possible; the original definition of the CE method included only structural superpositions and inter-residue distances but has since been expanded to include local environmental properties such as secondary structure, solvent exposure, hydrogen-bonding patterns, and dihedral angles.

An alignment path is calculated as the optimal path through the similarity matrix by linearly progressing through the sequences and extending the alignment with the next possible high-scoring AFP. The initial AFP that nucleates the alignment can occur at any point in the sequence matrix. Extensions then proceed with the next AFP that meets given distance criteria restricting the alignment to low gap sizes. The size of each AFP and the maximum gap size are required input parameters but are usually set to empirically determined values of 8 and 30 respectively. Like DALI and SSAP, CE has been used to construct an all-to-all fold classification database from the known protein structures in the PDB.
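
The path-extension idea can be sketched as a greedy search over aligned fragment pairs. This is a deliberately simplified illustration, not the published CE algorithm: the local-similarity measure, the 1.0 Å cutoff, and the rule that simply takes the best-scoring next AFP are all assumptions made for brevity, and the real method applies further checks against the whole path. Only the fragment size of 8 and maximum gap of 30 come from the text above.

```python
import math

def dist_matrix(coords):
    return [[math.dist(a, b) for b in coords] for a in coords]

def afp_distance(da, db, i, j, m):
    """Mean absolute difference of intra-fragment distances (fragments start at i and j)."""
    total = 0.0
    for k in range(m):
        for l in range(m):
            total += abs(da[i + k][i + l] - db[j + k][j + l])
    return total / (m * m)

def greedy_afp_path(coords_a, coords_b, m=8, max_gap=30, cutoff=1.0):
    da, db = dist_matrix(coords_a), dist_matrix(coords_b)
    na, nb = len(coords_a), len(coords_b)
    # Score every fragment pair, then keep only those passing the local filter.
    afps = {(i, j): afp_distance(da, db, i, j, m)
            for i in range(na - m + 1) for j in range(nb - m + 1)}
    afps = {key: v for key, v in afps.items() if v <= cutoff}
    best = []
    for start in sorted(afps):  # any retained AFP may nucleate a path
        path = [start]
        while True:
            i, j = path[-1]
            # Next AFP must start after the current one, within the gap limit.
            candidates = [(v, (i2, j2)) for (i2, j2), v in afps.items()
                          if i + m <= i2 <= i + m + max_gap
                          and j + m <= j2 <= j + m + max_gap]
            if not candidates:
                break
            path.append(min(candidates)[1])  # lowest distance, ties by position
        if len(path) > len(best):
            best = path
    return best

# Sanity check: an irregular toy chain aligned against itself follows the diagonal.
chain = [(1.5 * k, 2.0 * math.sin(0.7 * k * k), 2.0 * math.cos(1.3 * k) + 0.05 * k * k)
         for k in range(24)]
path = greedy_afp_path(chain, chain)
print(path)  # [(0, 0), (8, 8), (16, 16)]
```

Filtering AFPs before the search is what keeps the number of candidate extensions small, mirroring the efficiency argument made for CE.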

The RCSB PDB has recently released an updated version of CE and FATCAT as part of the RCSB PDB Protein Comparison Tool. It provides a new variation of CE that can detect circular permutations in protein structures.

SSAP

The SSAP (Sequential Structure Alignment Program) method uses double dynamic programming to produce a structural alignment based on atom-to-atom vectors in structure space. Instead of the alpha carbons typically used in structural alignment, SSAP constructs its vectors from the beta carbons for all residues except glycine, a method which thus takes into account the rotameric state of each residue as well as its location along the backbone. SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices are then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural alignment.
SSAP originally produced only pairwise alignments but has since been extended to multiple alignments as well. It has been applied in an all-to-all fashion to produce a hierarchical fold classification scheme known as CATH (Class, Architecture, Topology, Homology), which has been used to construct the CATH Protein Structure Classification database.
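
Double dynamic programming can be sketched in a few dozen lines. The version below is a simplification made for illustration: it compares scalar inter-residue distances rather than vectors expressed in each residue's local frame, it accumulates the lower-level path for every residue pair rather than only for selected pairs, and the scoring constants a = 500 and b = 10 are assumed values. Function names are hypothetical.

```python
import math

def align_path(score):
    """Global dynamic programming over a score matrix with no gap penalty.

    Returns the best total and the list of matched (row, column) cells.
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j], best[i][j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # trace back, preferring a match on ties
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            path.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return best[n][m], path[::-1]

def double_dp(coords_a, coords_b, a=500.0, b=10.0):
    da = [[math.dist(p, q) for q in coords_a] for p in coords_a]
    db = [[math.dist(p, q) for q in coords_b] for p in coords_b]
    na, nb = len(coords_a), len(coords_b)
    summary = [[0.0] * nb for _ in range(na)]
    for i in range(na):
        for j in range(nb):
            # Lower level: compare the structure as "seen" from residue i in A
            # with the structure as seen from residue j in B.
            view = [[a / ((da[i][k] - db[j][l]) ** 2 + b) for l in range(nb)]
                    for k in range(na)]
            _, lower_path = align_path(view)
            for k, l in lower_path:
                summary[k][l] += view[k][l]  # accumulate path scores
    # Upper level: a second round of dynamic programming on the summary matrix.
    return align_path(summary)[1]

chain = [(1.5 * k, 2.0 * math.sin(0.7 * k * k), 2.0 * math.cos(1.3 * k))
         for k in range(8)]
other = chain[1:]  # the same toy chain with its first residue removed
alignment = double_dp(chain, other)
print(alignment)
```

The cost grows with the fourth power of chain length in this naive form, which is one reason practical implementations restrict the lower level to promising residue pairs.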

Recent developments

Improvements in structural alignment methods constitute an active area of research, and new or modified methods are often proposed that are claimed to offer advantages over the older and more widely distributed techniques. A recent example, TM-align, uses a novel method for weighting its distance matrix, to which standard dynamic programming is then applied. The weighting is proposed to accelerate the convergence of dynamic programming and correct for effects arising from alignment lengths. In a benchmarking study, TM-align has been reported to improve in both speed and accuracy over DALI and CE.
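
The length correction can be illustrated with the TM-score, the similarity measure associated with TM-align. The sketch below uses the published length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8; the lower bound of 0.5 Å for short chains is an assumption made here to keep the scale positive, and the distances are taken as given for one fixed superposition.

```python
def tm_d0(target_length):
    """Length-dependent distance scale in angstroms."""
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5  # assumed floor; the cube-root formula is not usable here
    return max(d0, 0.5)

def tm_score(distances, target_length):
    """Sum of per-residue terms 1 / (1 + (d / d0)^2), normalized by target length."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# Fifty perfectly superposed residue pairs, normalized by two different lengths.
print(tm_score([0.0] * 50, 100))  # 0.5
print(tm_score([0.0] * 50, 50))   # 1.0
```

Unlike RMSD, each residue pair contributes at most 1 to the sum, so a few badly fitting regions cannot dominate the score, and the normalization by target length penalizes alignments that cover only part of the protein.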

Other promising methods of structural alignment are local structural alignment methods. These provide comparison of pre-selected parts of proteins (e.g. binding sites, user-defined structural motifs) against binding sites or whole-protein structural databases. The MultiBind and MAPPIS servers allow the identification of common spatial arrangements of physicochemical properties such as H-bond donor, acceptor, aliphatic, aromatic or hydrophobic in a set of user-provided protein binding sites defined by interactions with small molecules (MultiBind) or in a set of user-provided protein–protein interfaces (MAPPIS). Others provide comparison of entire protein structures against a number of user-submitted structures or against a large database of protein structures in reasonable time (ProBiS). Unlike global alignment approaches, local structural alignment approaches are suited to detection of locally conserved patterns of functional groups, which often appear in binding sites and have significant involvement in ligand binding. For example, in a comparison of G-Losa, a local structure alignment tool, with TM-align, a global structure alignment method, G-Losa predicted the positions of drug-like ligands in single-chain protein targets more precisely than TM-align, but the overall success rate of TM-align was better.

However, as algorithmic improvements and computer performance have erased purely technical deficiencies in older approaches, it has become clear that there is no one universal criterion for the 'optimal' structural alignment. TM-align, for instance, is particularly robust in quantifying comparisons between sets of proteins with great disparities in sequence lengths, but it only indirectly captures hydrogen bonding or secondary structure order conservation, which might be better metrics for alignment of evolutionarily related proteins. Thus recent developments have focused on optimizing particular attributes such as speed, quantification of scores, correlation to alternative gold standards, or tolerance of imperfection in structural data or ab initio structural models. An alternative methodology that is gaining popularity is to use the consensus of various methods to ascertain proteins' structural similarities.

RNA structural alignment

Structural alignment techniques have traditionally been applied exclusively to proteins, as the primary biological macromolecules that assume characteristic three-dimensional structures. However, large RNA molecules also form characteristic tertiary structures, which are mediated primarily by hydrogen bonds formed between base pairs as well as base stacking. Functionally similar noncoding RNA molecules can be especially difficult to extract from genomics data because structure is more strongly conserved than sequence in RNA as well as in proteins, and the more limited alphabet of RNA decreases the information content of any given nucleotide at any given position. 

However, because of the increasing interest in RNA structures and the growth in the number of experimentally determined 3D RNA structures, a few RNA structure similarity methods have been developed recently. One such method is SETTER, which decomposes each RNA structure into smaller parts called general secondary structure units (GSSUs). GSSUs are subsequently aligned, and these partial alignments are merged into the final RNA structure alignment and scored. The method has been implemented in the SETTER web server.

A recent method for pairwise structural alignment of RNA sequences with low sequence identity has been published and implemented in the program FOLDALIGN. However, this method is not truly analogous to protein structural alignment techniques because it computationally predicts the structures of the RNA input sequences rather than requiring experimentally determined structures as input. Although computational prediction of the protein folding process has not been particularly successful to date, RNA structures without pseudoknots can often be sensibly predicted using free energy-based scoring methods that account for base pairing and stacking.
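
To illustrate why pseudoknot-free RNA secondary structure is computationally tractable, the sketch below uses a Nussinov-style dynamic program that maximizes the number of nested base pairs. This is a deliberately simpler technique than the free energy-based scoring described above: it counts pairs instead of summing stacking and loop energies, and the minimum hairpin loop length of 3 and the inclusion of G-U wobble pairs are assumptions.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested (pseudoknot-free) base pairs in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j]; short spans stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            # Case 2: j pairs with some k, leaving at least min_loop bases between.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))  # 3: a three-pair stem closing an AAA loop
```

The recursion only ever combines an interval to the left of a pair with the interval enclosed by it, which is exactly the restriction that excludes pseudoknots.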

Software

Choosing a software tool for structural alignment can be a challenge due to the large variety of available packages that differ significantly in methodology and reliability. A partial solution to this problem has been made publicly accessible through the ProCKSI webserver. A more complete list of currently available and freely distributed structural alignment software can be found in the list of structural alignment software.

Properties of some structural alignment servers and software packages are summarized and tested with examples at Structural Alignment Tools in Proteopedia.Org.

Protein domain

From Wikipedia, the free encyclopedia

Pyruvate kinase, a protein with three domains (PDB: 1PKN).
 
A protein domain is a conserved part of a given protein sequence and (tertiary) structure that can evolve, function, and exist independently of the rest of the protein chain. Each domain forms a compact three-dimensional structure and often can be independently stable and folded. Many proteins consist of several structural domains. One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from about 50 to 250 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.

Background

The concept of the domain was first proposed in 1973 by Wetlaufer after X-ray crystallographic studies of hen lysozyme and papain and by limited proteolysis studies of immunoglobulins. Wetlaufer defined domains as stable units of protein structure that could fold autonomously. In the past domains have been described as units of:
  • compact structure
  • function and evolution
  • folding.
Each definition is valid and will often overlap, i.e. a compact structural domain that is found among diverse proteins is likely to fold independently within its structural environment. Nature often brings several domains together to form multi-domain and multi-functional proteins with a vast number of possibilities. In a multi-domain protein, each domain may fulfill its own function independently, or in a concerted manner with its neighbors. Domains can either serve as modules for building up large assemblies such as virus particles or muscle fibers, or can provide specific catalytic or binding sites as found in enzymes or regulatory proteins.

Example: Pyruvate kinase

An appropriate example is pyruvate kinase (see first figure), a glycolytic enzyme that plays an important role in regulating the flux from fructose-1,6-bisphosphate to pyruvate. It contains an all-β nucleotide binding domain (in blue), an α/β-substrate binding domain (in grey) and an α/β-regulatory domain (in olive green), connected by several polypeptide linkers. Each domain in this protein occurs in diverse sets of protein families.

The central α/β-barrel substrate binding domain is one of the most common enzyme folds. It is seen in many different enzyme families catalysing completely unrelated reactions. The α/β-barrel is commonly called the TIM barrel named after triose phosphate isomerase, which was the first such structure to be solved. It is currently classified into 26 homologous families in the CATH domain database. The TIM barrel is formed from a sequence of β-α-β motifs closed by the first and last strand hydrogen bonding together, forming an eight stranded barrel. There is debate about the evolutionary origin of this domain. One study has suggested that a single ancestral enzyme could have diverged into several families, while another suggests that a stable TIM-barrel structure has evolved through convergent evolution.

The TIM-barrel in pyruvate kinase is 'discontinuous', meaning that more than one segment of the polypeptide is required to form the domain. This is likely to be the result of the insertion of one domain into another during the protein's evolution. It has been shown from known structures that about a quarter of structural domains are discontinuous. The inserted β-barrel regulatory domain is 'continuous', made up of a single stretch of polypeptide.

Units of protein structure

The primary structure (string of amino acids) of a protein ultimately encodes its uniquely folded three-dimensional (3D) conformation. The most important factor governing the folding of a protein into 3D structure is the distribution of polar and non-polar side chains. Folding is driven by the burial of hydrophobic side chains into the interior of the molecule so as to avoid contact with the aqueous environment. Generally proteins have a core of hydrophobic residues surrounded by a shell of hydrophilic residues. Since the peptide bonds themselves are polar, they are neutralized by hydrogen bonding with each other when in the hydrophobic environment. This gives rise to regions of the polypeptide that form regular 3D structural patterns called secondary structure. There are two main types of secondary structure: α-helices and β-sheets.

Some simple combinations of secondary structure elements have been found to frequently occur in protein structure and are referred to as supersecondary structure or motifs. For example, the β-hairpin motif consists of two adjacent antiparallel β-strands joined by a small loop. It is present in most antiparallel β structures both as an isolated ribbon and as part of more complex β-sheets. Another common supersecondary structure is the β-α-β motif, which is frequently used to connect two parallel β-strands. The central α-helix connects the C-terminus of the first strand to the N-terminus of the second strand, packing its side chains against the β-sheet and therefore shielding the hydrophobic residues of the β-strands from the surface.

Covalent association of two domains represents a functional and structural advantage since there is an increase in stability when compared with the same structures non-covalently associated. Other advantages are the protection of intermediates within inter-domain enzymatic clefts that may otherwise be unstable in aqueous environments, and a fixed stoichiometric ratio of the enzymatic activity necessary for a sequential set of reactions.

Structural alignment is an important tool for determining domains.

Tertiary structure

Several motifs pack together to form compact, local, semi-independent units called domains. The overall 3D structure of the polypeptide chain is referred to as the protein's tertiary structure. Domains are the fundamental units of tertiary structure, each domain containing an individual hydrophobic core built from secondary structural units connected by loop regions. The packing of the polypeptide is usually much tighter in the interior than the exterior of the domain producing a solid-like core and a fluid-like surface. Core residues are often conserved in a protein family, whereas the residues in loops are less conserved, unless they are involved in the protein's function. Protein tertiary structure can be divided into four main classes based on the secondary structural content of the domain.
  • All-α domains have a domain core built exclusively from α-helices. This class is dominated by small folds, many of which form a simple bundle with helices running up and down.
  • All-β domains have a core composed of antiparallel β-sheets, usually two sheets packed against each other. Various patterns can be identified in the arrangement of the strands, often giving rise to the identification of recurring motifs, for example the Greek key motif.
  • α+β domains are a mixture of all-α and all-β motifs. Classification of proteins into this class is difficult because of overlaps with the other three classes and therefore is not used in the CATH domain database.
  • α/β domains are made from a combination of β-α-β motifs that predominantly form a parallel β-sheet surrounded by amphipathic α-helices. The secondary structures are arranged in layers or barrels.

Limits on size

Domains have limits on size. The size of individual structural domains varies from 36 residues in E-selectin to 692 residues in lipoxygenase-1, but the majority, 90%, have fewer than 200 residues with an average of approximately 100 residues. Very short domains, less than 40 residues, are often stabilised by metal ions or disulfide bonds. Larger domains, greater than 300 residues, are likely to consist of multiple hydrophobic cores.

Quaternary structure

Many proteins have a quaternary structure, which consists of several polypeptide chains that associate into an oligomeric molecule. Each polypeptide chain in such a protein is called a subunit. Hemoglobin, for example, consists of two α and two β subunits. Each of the four chains has an all-α globin fold with a heme pocket. 

Domain swapping is a mechanism for forming oligomeric assemblies. In domain swapping, a secondary or tertiary element of a monomeric protein is replaced by the same element of another protein. Domain swapping can range from secondary structure elements to whole structural domains. It also represents a model of evolution for functional adaptation by oligomerization, e.g. oligomeric enzymes that have their active site at subunit interfaces.

Domains as evolutionary modules

As nature is a tinkerer and not an inventor, new sequences are adapted from pre-existing sequences rather than invented. Domains are the common material used by nature to generate new sequences; they can be thought of as genetically mobile units, referred to as 'modules'. Often, the C and N termini of domains are close together in space, allowing them to easily be "slotted into" parent structures during the process of evolution. Many domain families are found in all three forms of life, Archaea, Bacteria and Eukarya. Protein modules are a subset of protein domains which are found across a range of different proteins with a particularly versatile structure. Examples can be found among extracellular proteins associated with clotting, fibrinolysis, complement, the extracellular matrix, cell surface adhesion molecules and cytokine receptors. Four concrete examples of widespread protein modules are the following domains: SH2, immunoglobulin, fibronectin type 3 and the kringle.

Molecular evolution gives rise to families of related proteins with similar sequence and structure. However, sequence similarities can be extremely low between proteins that share the same structure. Protein structures may be similar because proteins have diverged from a common ancestor. Alternatively, some folds may be more favored than others as they represent stable arrangements of secondary structures and some proteins may converge towards these folds over the course of evolution. There are currently about 110,000 experimentally determined protein 3D structures deposited within the Protein Data Bank (PDB). However, this set contains many identical or very similar structures. All proteins should be classified into structural families to understand their evolutionary relationships. Structural comparisons are best achieved at the domain level. For this reason many algorithms have been developed to automatically assign domains in proteins with known 3D structure.

The CATH domain database classifies domains into approximately 800 fold families; ten of these folds are highly populated and are referred to as 'super-folds'. Super-folds are defined as folds for which there are at least three structures without significant sequence similarity. The most populated is the α/β-barrel super-fold, as described previously.

Multidomain proteins

The majority of proteins, two-thirds in unicellular organisms and more than 80% in metazoa, are multidomain proteins. However, other studies concluded that 40% of prokaryotic proteins consist of multiple domains while eukaryotes have approximately 65% multi-domain proteins.

Many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes, suggesting that domains in multidomain proteins once existed as independent proteins. For example, vertebrates have a multi-enzyme polypeptide containing the GAR synthetase, AIR synthetase and GAR transformylase domains (GARs-AIRs-GARt; GAR: glycinamide ribonucleotide synthetase/transferase; AIR: aminoimidazole ribonucleotide synthetase). In insects, the polypeptide appears as GARs-(AIRs)2-GARt, in yeast GARs-AIRs is encoded separately from GARt, and in bacteria each domain is encoded separately.


Attractin-like protein 1 (ATRNL1) is a multi-domain protein found in animals, including humans. Each unit is one domain, e.g. the EGF or Kelch domains.

Origin

Multidomain proteins are likely to have emerged from selective pressure during evolution to create new functions. Various proteins have diverged from common ancestors by different combinations and associations of domains. Modular units frequently move about, within and between biological systems through mechanisms of genetic shuffling:
  • transposition of mobile elements including horizontal transfers (between species);
  • gross rearrangements such as inversions, translocations, deletions and duplications;
  • homologous recombination;
  • slippage of DNA polymerase during replication.

Types of organization

Insertions of similar PH domain modules (maroon) into two different proteins.

The simplest multidomain organization seen in proteins is that of a single domain repeated in tandem. The domains may interact with each other or remain isolated, like beads on a string. The giant 30,000-residue muscle protein titin comprises about 120 fibronectin-III-type and Ig-type domains. In the serine proteases, a gene duplication event has led to the formation of a two β-barrel domain enzyme. The repeats have diverged so widely that there is no obvious sequence similarity between them. The active site is located at a cleft between the two β-barrel domains, in which functionally important residues are contributed from each domain. Genetically engineered mutants of the chymotrypsin serine protease were shown to have some proteinase activity even though their active site residues were abolished, and it has therefore been postulated that the duplication event enhanced the enzyme's activity.

Modules frequently display different connectivity relationships, as illustrated by the kinesins and ABC transporters. The kinesin motor domain can be at either end of a polypeptide chain that includes a coiled-coil region and a cargo domain. ABC transporters are built with up to four domains consisting of two unrelated modules, ATP-binding cassette and an integral membrane module, arranged in various combinations. 

Not only do domains recombine, but there are many examples of a domain having been inserted into another. Sequence or structural similarities to other domains demonstrate that homologues of inserted and parent domains can exist independently. An example is that of the 'fingers' inserted into the 'palm' domain within the polymerases of the Pol I family. Since a domain can be inserted into another, there should always be at least one continuous domain in a multidomain protein. This is the main difference between definitions of structural domains and evolutionary/functional domains. An evolutionary domain will be limited to one or two connections between domains, whereas structural domains can have unlimited connections, within a given criterion of the existence of a common core. Several structural domains could be assigned to an evolutionary domain.

A superdomain consists of two or more conserved domains of nominally independent origin, but subsequently inherited as a single structural/functional unit. This combined superdomain can occur in diverse proteins that are not related by gene duplication alone. An example of a superdomain is the protein tyrosine phosphatase-C2 (PTP-C2) domain pair in PTEN, tensin, auxilin and the membrane protein TPTE2. This superdomain is found in proteins in animals, plants and fungi. A key feature of the PTP-C2 superdomain is amino acid residue conservation in the domain interface.

Domains are autonomous folding units

Folding

Protein folding - the unsolved problem : Since the seminal work of Anfinsen in the early 1960s, the goal to completely understand the mechanism by which a polypeptide rapidly folds into its stable native conformation remains elusive. Many experimental folding studies have contributed much to our understanding, but the principles that govern protein folding are still based on those discovered in the very first studies of folding. Anfinsen showed that the native state of a protein is thermodynamically stable, the conformation being at a global minimum of its free energy. 

Folding is a directed search of conformational space allowing the protein to fold on a biologically feasible time scale. The Levinthal paradox states that if an average-sized protein were to sample all possible conformations before finding the one with the lowest energy, the whole process would take billions of years. Proteins typically fold in between 0.1 and 1000 seconds. Therefore, the protein folding process must be directed in some way through a specific folding pathway. The forces that direct this search are likely to be a combination of local and global influences whose effects are felt at various stages of the reaction.
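The scale of the Levinthal argument can be illustrated with a back-of-the-envelope calculation. The sketch below is only an illustration: the chain length, the number of conformations per residue and the sampling rate are assumed values chosen for the example, not figures from the text.

```python
# Back-of-the-envelope Levinthal estimate.
# All three inputs are illustrative assumptions, not values from the text.
RESIDUES = 100                 # assumed chain length
STATES_PER_RESIDUE = 3         # assumed conformations available to each residue
SAMPLES_PER_SECOND = 1e13      # assumed conformational sampling rate
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(residues, states, rate):
    """Years needed to visit every conformation once at the given rate."""
    conformations = states ** residues
    return conformations / rate / SECONDS_PER_YEAR


years = exhaustive_search_years(RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
print(f"Exhaustive search would take about {years:.1e} years")
```

Under these assumptions the exhaustive search takes vastly longer than the observed folding times of 0.1 to 1000 seconds, which is the point of the paradox: folding cannot be a random search.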

Advances in experimental and theoretical studies have shown that folding can be viewed in terms of energy landscapes, where folding kinetics is considered as a progressive organization of an ensemble of partially folded structures through which a protein passes on its way to the folded structure. This has been described in terms of a folding funnel, in which an unfolded protein has a large number of conformational states available and there are fewer states available to the folded protein. A funnel implies that for protein folding there is a decrease in energy and loss of entropy with increasing tertiary structure formation. The local roughness of the funnel reflects kinetic traps, corresponding to the accumulation of misfolded intermediates. A folding chain progresses toward lower intra-chain free-energies by increasing its compactness. The chain's conformational options become increasingly narrowed ultimately toward one native structure.

Advantage of domains in protein folding

The organisation of large proteins by structural domains represents an advantage for protein folding, with each domain being able to individually fold, accelerating the folding process and reducing a potentially large combination of residue interactions. Furthermore, given the observed random distribution of hydrophobic residues in proteins, domain formation appears to be the optimal solution for a large protein to bury its hydrophobic residues while keeping the hydrophilic residues at the surface.

However, the role of inter-domain interactions in protein folding and in energetics of stabilisation of the native structure, probably differs for each protein. In T4 lysozyme, the influence of one domain on the other is so strong that the entire molecule is resistant to proteolytic cleavage. In this case, folding is a sequential process where the C-terminal domain is required to fold independently in an early step, and the other domain requires the presence of the folded C-terminal domain for folding and stabilization.

It has been found that the folding of an isolated domain can take place at the same rate or sometimes faster than that of the integrated domain, suggesting that unfavorable interactions with the rest of the protein can occur during folding. Several arguments suggest that the slowest step in the folding of large proteins is the pairing of the folded domains. This is either because the domains are not folded entirely correctly or because the small adjustments required for their interaction are energetically unfavorable, such as the removal of water from the domain interface.

Domains and protein flexibility

Protein domain dynamics play a key role in a multitude of molecular recognition and signaling processes. Protein domains, connected by intrinsically disordered flexible linker domains, induce long-range allostery via protein domain dynamics. The resultant dynamic modes cannot be generally predicted from static structures of either the entire protein or individual domains. They can however be inferred by comparing different structures of a protein. They can also be suggested by sampling in extensive molecular dynamics trajectories and principal component analysis, or they can be directly observed using spectra measured by neutron spin echo spectroscopy.

Domain definition from structural co-ordinates

The importance of domains as structural building blocks and elements of evolution has brought about many automated methods for their identification and classification in proteins of known structure. Automatic procedures for reliable domain assignment are essential for the generation of the domain databases, especially as the number of known protein structures is increasing. Although the boundaries of a domain can be determined by visual inspection, construction of an automated method is not straightforward. Problems occur when faced with domains that are discontinuous or highly associated. The fact that there is no standard definition of what a domain really is has meant that domain assignments have varied enormously, with each researcher using a unique set of criteria.

A structural domain is a compact, globular sub-structure with more interactions within it than with the rest of the protein. Therefore, a structural domain can be determined by two visual characteristics: its compactness and its extent of isolation. Measures of local compactness in proteins have been used in many of the early methods of domain assignment and in several of the more recent methods.

Methods

One of the first algorithms used a Cα-Cα distance map together with a hierarchical clustering routine that considered proteins as several small segments, 10 residues in length. The initial segments were clustered one after another based on inter-segment distances; segments with the shortest distances were clustered and considered as single segments thereafter. The step-wise clustering finally included the full protein. Go also exploited the fact that inter-domain distances are normally larger than intra-domain distances; all possible Cα-Cα distances were represented as diagonal plots in which there were distinct patterns for helices, extended strands and combinations of secondary structures. 
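The step-wise clustering idea can be sketched in a few lines. This is a minimal illustration rather than the published algorithm: the inter-segment distance is assumed to be the mean Cα-Cα distance between two clusters, and the coordinates come from a made-up helper, `toy_domain`, that places two compact spirals 40 Å apart.

```python
import math
from itertools import combinations


def distance_map(coords):
    """Full Cα-Cα distance matrix for a list of (x, y, z) tuples."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def cluster_segments(coords, seg_len=10):
    """Merge fixed-length segments step-wise, closest pair first.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of segment indices.
    """
    dmap = distance_map(coords)
    n_seg = math.ceil(len(coords) / seg_len)
    residues = {  # cluster (tuple of segment ids) -> residue indices
        (s,): list(range(s * seg_len, min((s + 1) * seg_len, len(coords))))
        for s in range(n_seg)
    }

    def mean_dist(a, b):
        # Assumed linkage: mean Cα-Cα distance between the two clusters.
        pairs = [(i, j) for i in residues[a] for j in residues[b]]
        return sum(dmap[i][j] for i, j in pairs) / len(pairs)

    history = []
    while len(residues) > 1:
        a, b = min(combinations(residues, 2), key=lambda p: mean_dist(*p))
        history.append((a, b, mean_dist(a, b)))
        residues[a + b] = residues.pop(a) + residues.pop(b)
    return history


def toy_domain(center_x, n=20):
    """Hypothetical compact 'domain': a small spiral around (center_x, 0, 0)."""
    return [(center_x + 5 * math.cos(i), 5 * math.sin(i), 0.5 * i) for i in range(n)]


coords = toy_domain(0.0) + toy_domain(40.0)
history = cluster_segments(coords)
for a, b, d in history:
    print(a, b, round(d, 1))
```

In this toy case the segments within each spiral merge first, and the last merge, at a much larger distance, joins the two spirals. That jump in merge distance is the kind of signal a distance-based method can use to propose a domain boundary.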

The method by Sowdhamini and Blundell clusters secondary structures in a protein based on their Cα-Cα distances and identifies domains from the pattern in their dendrograms. As the procedure does not consider the protein as a continuous chain of amino acids there are no problems in treating discontinuous domains. Specific nodes in these dendrograms are identified as tertiary structural clusters of the protein, these include both super-secondary structures and domains. The DOMAK algorithm is used to create the 3Dee domain database. It calculates a 'split value' from the number of each type of contact when the protein is divided arbitrarily into two parts. This split value is large when the two parts of the structure are distinct. 
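A contact-based split score of this kind can be sketched as follows. The exact scoring function is not given above, so the form used here, (int_A / ext_AB) × (int_B / ext_AB), is an assumption for illustration, as are the 8 Å contact cutoff, the restriction to a single cut point and the made-up `toy_domain` coordinates.

```python
import math


def contact_pairs(coords, cutoff=8.0):
    """Residue pairs (i, j), i < j, whose Cα atoms lie within `cutoff` Å."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) < cutoff]


def split_value(contacts, cut):
    """Score the division into residues [0, cut) and [cut, n).

    Assumed form: (int_A / ext_AB) * (int_B / ext_AB), where int_A and int_B
    count contacts inside each part and ext_AB counts contacts between them.
    """
    int_a = sum(1 for i, j in contacts if j < cut)
    int_b = sum(1 for i, j in contacts if i >= cut)
    ext_ab = len(contacts) - int_a - int_b
    ext_ab = max(ext_ab, 1)  # avoid division by zero when the parts never touch
    return (int_a / ext_ab) * (int_b / ext_ab)


def best_cut(coords, min_size=5):
    """Cut position with the largest split value (single continuous cut only)."""
    contacts = contact_pairs(coords)
    cuts = range(min_size, len(coords) - min_size + 1)
    return max(cuts, key=lambda c: split_value(contacts, c))


def toy_domain(center_x, n=20):
    """Hypothetical compact 'domain': a small spiral around (center_x, 0, 0)."""
    return [(center_x + 5 * math.cos(i), 5 * math.sin(i), 0.5 * i) for i in range(n)]


coords = toy_domain(0.0) + toy_domain(40.0)
print(best_cut(coords))
```

The score peaks where the two parts have many internal contacts and few contacts with each other. A single cut point can only describe continuous domains; discontinuous domains would need divisions made of several chain segments.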

The method of Wodak and Janin was based on the calculated interface areas between two chain segments repeatedly cleaved at various residue positions. Interface areas were calculated by comparing surface areas of the cleaved segments with that of the native structure. Potential domain boundaries can be identified at a site where the interface area was at a minimum. Other methods have used measures of solvent accessibility to calculate compactness.

The PUU algorithm incorporates a harmonic model used to approximate inter-domain dynamics. The underlying physical concept is that many rigid interactions will occur within each domain and loose interactions will occur between domains. This algorithm is used to define domains in the FSSP domain database.

Swindells (1995) developed a method, DETECTIVE, for identification of domains in protein structures based on the idea that domains have a hydrophobic interior. Deficiencies were found to occur when hydrophobic cores from different domains continue through the interface region. 

RigidFinder is a method for identification of protein rigid blocks (domains and loops) from two different conformations. Rigid blocks are defined as blocks where all inter-residue distances are conserved across conformations.
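The rigid-block criterion itself is easy to state in code. The sketch below only tests whether a given set of residues satisfies it and does not reproduce how RigidFinder searches for blocks. The 0.5 Å tolerance and the two toy conformations (a straight chain and the same chain bent by 90° at a hinge) are assumptions for illustration.

```python
import math
from itertools import combinations


def max_distance_change(block, conf_a, conf_b):
    """Largest change in any intra-block residue-residue distance."""
    return max(
        abs(math.dist(conf_a[i], conf_a[j]) - math.dist(conf_b[i], conf_b[j]))
        for i, j in combinations(block, 2)
    )


def is_rigid(block, conf_a, conf_b, tol=0.5):
    """True if every pairwise distance in `block` is conserved within `tol` Å."""
    return max_distance_change(block, conf_a, conf_b) <= tol


# Toy conformations: ten residues spaced 3.8 Å apart along the x axis, and the
# same chain with residues 5-9 swung by 90 degrees about residue 4.
SPACING = 3.8
conf_a = [(SPACING * i, 0.0, 0.0) for i in range(10)]
conf_b = conf_a[:5] + [(SPACING * 4, SPACING * (i - 4), 0.0) for i in range(5, 10)]

print(is_rigid(range(0, 5), conf_a, conf_b))    # first half moves as one block
print(is_rigid(range(5, 10), conf_a, conf_b))   # second half moves as one block
print(is_rigid(range(0, 10), conf_a, conf_b))   # the whole chain does not
```

Because the test uses only internal distances, no superposition of the two conformations is needed before comparing them.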

A general method to identify dynamical domains, that is protein regions that behave approximately as rigid units in the course of structural fluctuations, has been introduced by Potestio et al. and, among other applications was also used to compare the consistency of the dynamics-based domain subdivisions with standard structure-based ones. The method, termed PiSQRD, is publicly available in the form of a web server. The latter allows users to optimally subdivide single-chain or multimeric proteins into quasi-rigid domains based on the collective modes of fluctuation of the system. By default the latter are calculated through an elastic network model; alternatively pre-calculated essential dynamical spaces can be uploaded by the user.

Example domains

  • Armadillo repeats : named after the β-catenin-like Armadillo protein of the fruit fly Drosophila.
  • Basic leucine zipper domain (bZIP domain) : found in many DNA-binding eukaryotic proteins. One part of the domain contains a region that mediates sequence-specific DNA binding, and the other is the leucine zipper that is required for the dimerization of two DNA-binding regions. The DNA-binding region comprises a number of basic amino acids such as arginine and lysine.
  • Cadherin repeats : Cadherins function as Ca2+-dependent cell–cell adhesion proteins. Cadherin domains are extracellular regions which mediate cell-to-cell homophilic binding between cadherins on the surface of adjacent cells.
  • Death effector domain (DED) : allows protein–protein binding by homotypic interactions (DED-DED). Caspase proteases trigger apoptosis via proteolytic cascades. Pro-caspase-8 and pro-caspase-9 bind to specific adaptor molecules via DED domains, and this leads to autoactivation of caspases.
  • EF hand : a helix-turn-helix structural motif found in each structural domain of the signaling protein calmodulin and in the muscle protein troponin-C.
  • Immunoglobulin-like domains : are found in proteins of the immunoglobulin superfamily (IgSF). They contain about 70-110 amino acids and are classified into different categories (IgV, IgC1, IgC2 and IgI) according to their size and function. They possess a characteristic fold in which two beta sheets form a "sandwich" that is stabilized by interactions between conserved cysteines and other charged amino acids. They are important for protein–protein interactions in processes of cell adhesion, cell activation, and molecular recognition. These domains are commonly found in molecules with roles in the immune system.
  • Phosphotyrosine-binding domain (PTB) : PTB domains usually bind to phosphorylated tyrosine residues. They are often found in signal transduction proteins. PTB-domain binding specificity is determined by residues to the amino-terminal side of the phosphotyrosine. Examples: the PTB domains of both SHC and IRS-1 bind to a NPXpY sequence. PTB-containing proteins such as SHC and IRS-1 are important for insulin responses of human cells.
  • Pleckstrin homology domain (PH) : PH domains bind phosphoinositides with high affinity. Specificities for PtdIns(3)P, PtdIns(4)P, PtdIns(3,4)P2, PtdIns(4,5)P2, and PtdIns(3,4,5)P3 have all been observed. Because phosphoinositides are sequestered to various cell membranes (due to their long lipophilic tails), PH domains usually recruit the protein in question to a membrane, where it can exert a function in cell signalling, cytoskeletal reorganization or membrane trafficking.
  • Src homology 2 domain (SH2) : SH2 domains are often found in signal transduction proteins. SH2 domains confer binding to phosphorylated tyrosine (pTyr). Named after the phosphotyrosine binding domain of the src viral oncogene, which is itself a tyrosine kinase. See also: SH3 domain.
  • Zinc finger DNA binding domain (ZnF_GATA) : ZnF_GATA domain-containing proteins are typically transcription factors that usually bind to the DNA sequence [AT]GATA[AG] of promoters.
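The consensus [AT]GATA[AG] quoted for the ZnF_GATA domain is a pattern that can be searched for directly. The sketch below is a minimal illustration: the promoter sequence is made up, and only the given strand is scanned (the reverse complement is not checked).

```python
import re

# Consensus from the text: [AT]GATA[AG].
# The lookahead lets overlapping sites be reported as well.
GATA_SITE = re.compile(r"(?=([AT]GATA[AG]))")


def find_gata_sites(dna):
    """Return (start, site) for every match on the given strand, 0-based."""
    return [(m.start(), m.group(1)) for m in GATA_SITE.finditer(dna.upper())]


promoter = "ccTGATAAggcAGATAGttGATAc"  # made-up sequence for illustration
print(find_gata_sites(promoter))
# → [(2, 'TGATAA'), (11, 'AGATAG')]
```

The third GATA in the toy sequence is not reported because it is followed by C, which falls outside the [AG] position of the consensus.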

Domains of unknown function

A large fraction of domains are of unknown function. A domain of unknown function (DUF) is a protein domain that has no characterized function. These families have been collected together in the Pfam database using the prefix DUF followed by a number, with examples being DUF2992 and DUF1220. There are now over 3,000 DUF families within the Pfam database representing over 20% of known families.

At 200, Marx Is Still Wrong

Tuesday, May 15, 2018

Two hundred years ago this month, Karl Marx was born in Trier, Germany, a small town in the western part of the country. To celebrate his bicentennial, the People’s Republic of China donated a larger-than-life statue of the founder of Communism to the city of his birth, which the Trier City Council voted to accept. It goes without saying that this memorialization was controversial, not only because of the devastation caused throughout the world during the twentieth century in the name of Marxism, but also because of the still living memory of Communist rule in East Germany. Henceforth when one thinks of Trier, one should remember Tiananmen Square.

As if the China connection were not sufficiently provocative, the Marx commemoration in Trier included a panegyric delivered in the town’s cathedral by none other than Jean-Claude Juncker, the president of the European Commission and effective head of the European Union. Juncker is hardly known as a deep philosophical thinker and his efforts to present Marx as a “philosopher who thought into the future” were the insipid ramblings of a career Eurocrat. But his very presence at the event became a scandal because he casually dismissed the letters of protest he admitted having received from concerned members of the public in central and eastern Europe—the territories which had suffered most under Communist rule and where the memory of that dictatorship is still very much alive.

In Juncker’s telling, Marx was a mild social democrat. But Juncker failed to explore the implications of what was done in Marx’s name. What was it in Marx’s writings that lent itself to the interpretations—or misinterpretations, according to Juncker—of his Stalinist followers?  There is of course a legitimate tension between judging a work—Marx’s work—on its own merits and judging it based on its impact. But Juncker cannot praise Marx for “thinking into the future” while simultaneously trying to insulate Marx from his own legacy.

As we approach the thirty-year anniversary of the collapse of the Soviet Union, it is astonishing that Marx continues to be a popular figure, and not just among people like Juncker. Of course, there are the dogmatists in the few remaining Communist countries, such as China and Cuba, who continue to cling to his sclerotic ideas. But there are also, closer to home, intellectuals and academics who purvey versions of Marxism in the humanities departments of many college campuses. Meanwhile, outside of the universities, popularized versions of Marx’s ideas circulate among left-wing populists, like those of the “Occupy Wall Street” movement.

All the more reason to review what was rotten about Marx’s ideas—ideas that gave rise to brutal dictatorships and the killing machines of the gulags.  

If you read nearly any passage in Marx’s oeuvre, it’s hard not to be struck by his sense of absolute certainty. He pronounces statements in an apodictic manner, laying claim to an unquestionable sense of truth, with no opportunity to doubt. He is therefore always on the attack as he decimates opponents with unyielding polemic—and he was a master of polemical style, to be sure. Meanwhile there is no self-reflection, no interrogation of his own views, and no sense that he might possibly be wrong.  

Marx channels a voice of infallibility, making sweeping claims with no margin of error and no exploration of evidence: “All history is the history of class struggle” begins the Communist Manifesto, which he co-authored with Friedrich Engels. All history? Was there really nothing else than conflicts between different economic groups? For Marx, apparently, there was never any other dimension of human experience worthy of independent consideration: no history of technology, of ideas, of culture, or faith. He comes to this one-dimensional schema by deflating the philosophy of history he had found in his teacher, the German philosopher G.W.F. Hegel. In place of Hegelian complexity, he offered simplistic claims to predict the future in the form of “developmental laws of capitalism.”

Perhaps the kindest judgment on Marx is that he was just one more economist who thought he could predict the future. His delusion about his own predictive capacities is what made Marx so distasteful to a thinker like Friedrich von Hayek, who recognized that humans can get things wrong, so best not to endow any single human with too much power, and certainly not the government. Not so Marx, who claimed direct access to incontrovertible insights into the logic of history. For that reason he could conclude his Manifesto with a series of crushing verdicts on competing radical movements, denounced and condemned, without a shadow of doubt. These concluding diatribes against other socialist currents that dared to differ from Marx’s Communism are perhaps the most symptomatic elements of his work, setting up a pattern of defaming one’s opponents, especially those closest to oneself. Marx’s Bolshevik heirs would transform that confidence in condemnation into a rationale for sending political competitors to their deaths. On the long list of victims of Marxism, companions on the left figured prominently.

While we might associate Marx with politics, in fact he lacked any real appreciation for a political sphere in which one would interact productively with advocates of varying programs.  While for others, politics represents a realm of compromise and negotiation, for Marx, it was really the pursuit of power and the obligation to command. He described the state simply as “the executive committee of the bourgeoisie,” meaning that politics was secondary to the economy. Moreover he promised to abolish the state, and therefore politics, once Communism would eliminate class difference—or so the story went, as the ultimate outcome of Communism would be a libertarian utopia of statelessness. 

Nothing, however, would be further from the truth. In practice, what Communism provided for was the development of a nomenklatura, a new class elite which talked the egalitarian talk while claiming for itself the privilege of dictators. The Communist cadre always knew better than the unenlightened populace, and therefore the cadre would claim the power to impose their views and programs on the rest of society. The real political legacy of Marxism was not the abolition of the state but, on the contrary, the expansion of the state over society, and the elevation of a Marxist elite over the populace. No wonder the East Germans calling for the end of their Communist government in November 1989 chanted, “We are the people,” a people which the Communists, when all is said and done, simply deplored. Marxism was not about achieving an egalitarian society: it was the vehicle through which party activists and thugs could pursue their own will to power. (For this reason, the young radical Max Eastman described the Communist revolutionaries in Russia as “Nietzschean.”)

The Marxist pursuit of power also meant denouncing all religion, which Marx described as an opiate, a drug intended to lull its consumers into passivity and false consciousness, so as to keep them from the truth (his truth). Because Marxism emphasized labor and the primacy of human experience, it could appeal to various twentieth-century philosophers, existentialists among others, who emphasized the problem of alienation. Yet Marxism, which treated all thought in a reductionist manner as an expression of economy, could never shake its own anti-intellectual legacy, famously expressed in Marx’s Eleventh Thesis on Feuerbach: “Philosophers have only interpreted the world, the point however is to change it.” 

Discussion of the meaning of life, that is, interpreting the world, turns out to be of negligible import for Marxists, easily dismissed as only “epiphenomenal.” Marx’s alternative to reflective thought was changing the world, but without any room for the sort of ethical guidance that philosophical thinking might offer: change at all costs, change with no limits. The result was a modernizing fantasy of thorough-going transformation with scant attention to the human costs. As Hannah Arendt showed a century later in her Origins of Totalitarianism, this would lead to systematic violence in “experiments” to fashion a “new man,” no matter how much suffering would ensue. Ultimately, Marx had offered a false alternative: philosophical thinking or changing the world. In fact, what defines the human condition is the ability to engage in both, deep thinking and intentional action, and indeed we should prefer action to be guided by thinking, just as thinking should be informed by the experience of prior action.

The claim of infallibility, the will to political power, and the dismissal of ethical thought: such was the legacy that Karl Marx bequeathed to the Communist movement that once ruled half the world. President Juncker, in his celebration of Marx, recalled none of this, revealing himself to be just one more of Marx’s defenders who still insist on the fantasy of a good Marx motivated by sympathy for the poverty of the workers during the industrial revolution.

But Marx was hardly the only thinker to write about nineteenth century social conditions, and he was surely not the most interesting. A page of Dickens is worth a volume of Das Kapital.  Wherever Marxism dominated working class movements—by suppressing competing reform movements or manipulating unions—blue collar workers fared worse. Had Marx not been appropriated as the ideological figurehead of the Bolshevik Revolution in 1917 in order to justify the dictatorship in Russia, he would be barely remembered today. (A Google N-gram search shows that “Marxism” only takes off as a term after Lenin came to power.) Instead, he has become the symbol of decades of terror.

For those who want to talk about Marx, to erect statues in his memory or to defend him as a philosopher, it is high time to discover some intellectual integrity and face up to the crimes committed in his name. It is wrong to say, as one commonly hears in some circles, that his program of Communism was a good idea, but poorly implemented. On the contrary, it was a bad idea from the start and the brutality that always accompanied it was a consequence of its core character.

Turns out wind and solar have a secret friend: Natural gas


In this Feb. 25, 2015 photo, a gas flare is seen at a natural gas processing facility near Williston, N.D. (AP Photo/Matthew Brown)

We’re at a time of deeply ambitious plans for clean energy growth. Two of the U.S.’s largest states by population, California and New York, have both mandated that power companies get fully 50 percent of their electricity from renewable sources by the year 2030.

Only, there’s a problem: Because of the particular nature of clean energy sources like solar and wind, you can’t simply add them to the grid in large volumes and think that’s the end of the story. Rather, because these sources of electricity generation are “intermittent” — solar fluctuates with weather and the daily cycle, wind fluctuates with the wind — there has to be some means of continuing to provide electricity even when they go dark. And the more renewables you have, the bigger this problem can be.

Now, a new study suggests that at least so far, solving that problem has ironically involved more fossil fuels — and more particularly, installing a large number of fast-ramping natural gas plants, which can fill in quickly whenever renewable generation slips.

The new research, published recently as a working paper by the National Bureau of Economic Research, was conducted by Elena Verdolini of the Euro-Mediterranean Center on Climate Change and the Fondazione Eni Enrico Mattei in Milan, Italy, along with colleagues from Syracuse University and the French Economic Observatory.

In the study, the researchers took a broad look at the construction of wind, solar, and other renewable energy plants (not including large hydropower or biomass projects) across 26 member countries of the Organisation for Economic Co-operation and Development between 1990 and 2013. And they found a surprisingly tight relationship between renewables on the one hand, and gas on the other.

“All other things equal, a 1% percent increase in the share of fast reacting fossil technologies is associated with a 0.88% percent increase in renewable generation capacity in the long term,” the study reports. Again, this is over 26 separate countries, and more than two decades.

“Our paper calls attention to the fact that renewables and fast-reacting fossil technologies appear as highly complementary and that they should be jointly installed to meet the goals of cutting emissions and ensuring a stable supply,” the paper adds.

The type of “fast-reacting fossil technologies” being referred to here is natural gas plants that fire up quickly. For example, General Electric and EDF Energy currently feature a natural gas plant in France that “is capable of reaching full power in less than 30 minutes.” Full power, in this case, means rapidly adding over 600 megawatts, or million watts, of electricity to the grid.

“This allows partners to respond quickly to grid demand fluctuations, integrating renewables as necessary,” note the companies.

“When people assume that we can switch from fossil fuels to renewables they assume we can completely switch out of one path, to another path,” says Verdolini. But, she adds, the study suggests otherwise.

Verdolini emphasized this merely describes the past — not necessarily the future. That’s a critical distinction, because the study also notes that if we reach a time when fast-responding energy storage is prevalent — when, say, large-scale grid batteries store solar or wind-generated energy and can discharge it instantaneously when there’s a need — then the reliance on gas may no longer be so prevalent.

Other recent research has suggested that precisely because of this overlap between fast-firing natural gas plants and grid scale batteries — because they can play many of the same roles — extremely cheap natural gas prices have helped the industry out-compete the storage sector and slowed its growth.

Two other researchers contacted for reactions to Verdolini’s study largely agreed with its findings.

“I think policymakers haven’t really grasped what 50 percent renewables really means in a system, without at least cheap batteries available,” says Christopher Knittel, who directs the Center for Energy and Environmental Policy Research at MIT, and who said he found the study’s results quite plausible.

“It’s certainly true that as one adds more renewables, the value of flexible generation increases, and so I would expect to see some correlation as they found,” added Eric Hittinger, an energy system researcher at the Rochester Institute of Technology who like Knittel was not involved in the study.

Hittinger and Knittel agreed that adding flexible natural gas alongside renewable projects is not a major climate change concern because the gas plants wouldn’t be running all the time — so it’s not like adding coal plants. The emissions would be real, but considerably more limited. However, they said, the principal issue is that the research suggests renewable plants are more costly to build, because of the added backup requirement.

“It’s a reality check now,” said Knittel of the study. “I think it’s potentially bad news as we start to get higher and higher penetration levels of renewables.”

The study also lends some credence to the widespread description of natural gas as a so-called “bridge fuel” that allows for a transition into a world of more renewables, as it is both flexible and also contributes less carbon dioxide emissions than does coal, per unit of energy generated by burning the fuel. (Environmentalists like to point out that if there are enough methane leaks from the process of drilling for and transporting natural gas, this edge could be canceled out.)

Hittinger also questioned what the correlation found in the study actually means — does it mean that natural gas spurs on the development of more solar and wind, or vice versa?

Verdolini said the study implies that gas plants are added first, which then makes renewable projects easier to integrate. “It’s an enabling factor,” she said, although she cautioned that the study cannot fully demonstrate causation.

Verdolini agreed that the findings are something that decision-makers hoping to add more clean energy to the grid will have to take into account.

“If you have an electric car, you don’t need a diesel car in your garage sitting there,” said Verdolini. “But in the case of renewables, it’s different, because if you have renewable electricity and that fails, then you need the fast acting gas sitting in your garage, so to speak.”

Inequality (mathematics)

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Inequality...