Search This Blog

Wednesday, August 30, 2023

Matrix decomposition

From Wikipedia, the free encyclopedia

In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. There are many different matrix decompositions; each finds use among a particular class of problems.

Example

In numerical analysis, different decompositions are used to implement efficient matrix algorithms.

For instance, when solving a system of linear equations , the matrix A can be decomposed via the LU decomposition. The LU decomposition factorizes a matrix into a lower triangular matrix L and an upper triangular matrix U. The systems and require fewer additions and multiplications to solve, compared with the original system , though one might require significantly more digits in inexact arithmetic such as floating point.

Similarly, the QR decomposition expresses A as QR with Q an orthogonal matrix and R an upper triangular matrix. The system Q(Rx) = b is solved by Rx = QTb = c, and the system Rx = c is solved by 'back substitution'. The number of additions and multiplications required is about twice that of using the LU solver, but no more digits are required in inexact arithmetic because the QR decomposition is numerically stable.

Decompositions related to solving systems of linear equations

LU decomposition

LU reduction

Block LU decomposition

Rank factorization

Cholesky decomposition

  • Applicable to: square, hermitian, positive definite matrix
  • Decomposition: , where is upper triangular with real positive diagonal entries
  • Comment: if the matrix is Hermitian and positive semi-definite, then it has a decomposition of the form if the diagonal entries of are allowed to be zero
  • Uniqueness: for positive definite matrices Cholesky decomposition is unique. However, it is not unique in the positive semi-definite case.
  • Comment: if is real and symmetric, has all real elements
  • Comment: An alternative is the LDL decomposition, which can avoid extracting square roots.

QR decomposition

  • Applicable to: m-by-n matrix A with linearly independent columns
  • Decomposition: where is a unitary matrix of size m-by-m, and is an upper triangular matrix of size m-by-n
  • Uniqueness: In general it is not unique, but if is of full rank, then there exists a single that has all positive diagonal elements. If is square, also is unique.
  • Comment: The QR decomposition provides an effective way to solve the system of equations . The fact that is orthogonal means that , so that is equivalent to , which is very easy to solve since is triangular.

RRQR factorization

Interpolative decomposition

Decompositions based on eigenvalues and related concepts

Eigendecomposition

  • Also called spectral decomposition.
  • Applicable to: square matrix A with linearly independent eigenvectors (not necessarily distinct eigenvalues).
  • Decomposition: , where D is a diagonal matrix formed from the eigenvalues of A, and the columns of V are the corresponding eigenvectors of A.
  • Existence: An n-by-n matrix A always has n (complex) eigenvalues, which can be ordered (in more than one way) to form an n-by-n diagonal matrix D and a corresponding matrix of nonzero columns V that satisfies the eigenvalue equation . is invertible if and only if the n eigenvectors are linearly independent (that is, each eigenvalue has geometric multiplicity equal to its algebraic multiplicity). A sufficient (but not necessary) condition for this to happen is that all the eigenvalues are different (in this case geometric and algebraic multiplicity are equal to 1)
  • Comment: One can always normalize the eigenvectors to have length one (see the definition of the eigenvalue equation)
  • Comment: Every normal matrix A (that is, matrix for which , where is a conjugate transpose) can be eigendecomposed. For a normal matrix A (and only for a normal matrix), the eigenvectors can also be made orthonormal () and the eigendecomposition reads as . In particular all unitary, Hermitian, or skew-Hermitian (in the real-valued case, all orthogonal, symmetric, or skew-symmetric, respectively) matrices are normal and therefore possess this property.
  • Comment: For any real symmetric matrix A, the eigendecomposition always exists and can be written as , where both D and V are real-valued.
  • Comment: The eigendecomposition is useful for understanding the solution of a system of linear ordinary differential equations or linear difference equations. For example, the difference equation starting from the initial condition is solved by , which is equivalent to , where V and D are the matrices formed from the eigenvectors and eigenvalues of A. Since D is diagonal, raising it to power , just involves raising each element on the diagonal to the power t. This is much easier to do and understand than raising A to power t, since A is usually not diagonal.

Jordan decomposition

The Jordan normal form and the Jordan–Chevalley decomposition

  • Applicable to: square matrix A
  • Comment: the Jordan normal form generalizes the eigendecomposition to cases where there are repeated eigenvalues and cannot be diagonalized, the Jordan–Chevalley decomposition does this without choosing a basis.

Schur decomposition

Real Schur decomposition

  • Applicable to: square matrix A
  • Decomposition: This is a version of Schur decomposition where and only contain real numbers. One can always write where V is a real orthogonal matrix, is the transpose of V, and S is a block upper triangular matrix called the real Schur form. The blocks on the diagonal of S are of size 1×1 (in which case they represent real eigenvalues) or 2×2 (in which case they are derived from complex conjugate eigenvalue pairs).

QZ decomposition

  • Also called: generalized Schur decomposition
  • Applicable to: square matrices A and B
  • Comment: there are two versions of this decomposition: complex and real.
  • Decomposition (complex version): and where Q and Z are unitary matrices, the * superscript represents conjugate transpose, and S and T are upper triangular matrices.
  • Comment: in the complex QZ decomposition, the ratios of the diagonal elements of S to the corresponding diagonal elements of T, , are the generalized eigenvalues that solve the generalized eigenvalue problem (where is an unknown scalar and v is an unknown nonzero vector).
  • Decomposition (real version): and where A, B, Q, Z, S, and T are matrices containing real numbers only. In this case Q and Z are orthogonal matrices, the T superscript represents transposition, and S and T are block upper triangular matrices. The blocks on the diagonal of S and T are of size 1×1 or 2×2.

Takagi's factorization

  • Applicable to: square, complex, symmetric matrix A.
  • Decomposition: , where D is a real nonnegative diagonal matrix, and V is unitary. denotes the matrix transpose of V.
  • Comment: The diagonal elements of D are the nonnegative square roots of the eigenvalues of .
  • Comment: V may be complex even if A is real.
  • Comment: This is not a special case of the eigendecomposition (see above), which uses instead of . Moreover, if A is not real, it is not Hermitian and the form using also does not apply.

Singular value decomposition

  • Applicable to: m-by-n matrix A.
  • Decomposition: , where D is a nonnegative diagonal matrix, and U and V satisfy . Here is the conjugate transpose of V (or simply the transpose, if V contains real numbers only), and I denotes the identity matrix (of some dimension).
  • Comment: The diagonal elements of D are called the singular values of A.
  • Comment: Like the eigendecomposition above, the singular value decomposition involves finding basis directions along which matrix multiplication is equivalent to scalar multiplication, but it has greater generality since the matrix under consideration need not be square.
  • Uniqueness: the singular values of are always uniquely determined. and need not to be unique in general.

Scale-invariant decompositions

Refers to variants of existing matrix decompositions, such as the SVD, that are invariant with respect to diagonal scaling.

  • Applicable to: m-by-n matrix A.
  • Unit-Scale-Invariant Singular-Value Decomposition: , where S is a unique nonnegative diagonal matrix of scale-invariant singular values, U and V are unitary matrices, is the conjugate transpose of V, and positive diagonal matrices D and E.
  • Comment: Is analogous to the SVD except that the diagonal elements of S are invariant with respect to left and/or right multiplication of A by arbitrary nonsingular diagonal matrices, as opposed to the standard SVD for which the singular values are invariant with respect to left and/or right multiplication of A by arbitrary unitary matrices.
  • Comment: Is an alternative to the standard SVD when invariance is required with respect to diagonal rather than unitary transformations of A.
  • Uniqueness: The scale-invariant singular values of (given by the diagonal elements of S) are always uniquely determined. Diagonal matrices D and E, and unitary U and V, are not necessarily unique in general.
  • Comment: U and V matrices are not the same as those from the SVD.

Analogous scale-invariant decompositions can be derived from other matrix decompositions; for example, to obtain scale-invariant eigenvalues.

Hessenberg decomposition

Complete orthogonal decomposition

  • Also known as: UTV decomposition, ULV decomposition, URV decomposition.
  • Applicable to: m-by-n matrix A.
  • Decomposition: , where T is a triangular matrix, and U and V are unitary matrices.
  • Comment: Similar to the singular value decomposition and to the Schur decomposition.

Other decompositions

Polar decomposition

  • Applicable to: any square complex matrix A.
  • Decomposition: (right polar decomposition) or (left polar decomposition), where U is a unitary matrix and P and P' are positive semidefinite Hermitian matrices.
  • Uniqueness: is always unique and equal to (which is always hermitian and positive semidefinite). If is invertible, then is unique.
  • Comment: Since any Hermitian matrix admits a spectral decomposition with a unitary matrix, can be written as . Since is positive semidefinite, all elements in are non-negative. Since the product of two unitary matrices is unitary, taking one can write which is the singular value decomposition. Hence, the existence of the polar decomposition is equivalent to the existence of the singular value decomposition.

Algebraic polar decomposition

  • Applicable to: square, complex, non-singular matrix A.
  • Decomposition: , where Q is a complex orthogonal matrix and S is complex symmetric matrix.
  • Uniqueness: If has no negative real eigenvalues, then the decomposition is unique.
  • Comment: The existence of this decomposition is equivalent to being similar to .
  • Comment: A variant of this decomposition is , where R is a real matrix and C is a circular matrix.

Mostow's decomposition

  • Applicable to: square, complex, non-singular matrix A.
  • Decomposition: , where U is unitary, M is real anti-symmetric and S is real symmetric.
  • Comment: The matrix A can also be decomposed as , where U2 is unitary, M2 is real anti-symmetric and S2 is real symmetric.

Sinkhorn normal form

  • Applicable to: square real matrix A with strictly positive elements.
  • Decomposition: , where S is doubly stochastic and D1 and D2 are real diagonal matrices with strictly positive elements.

Sectoral decomposition

  • Applicable to: square, complex matrix A with numerical range contained in the sector .
  • Decomposition: , where C is an invertible complex matrix and with all .

Williamson's normal form

  • Applicable to: square, positive-definite real matrix A with order 2n×2n.
  • Decomposition: , where is a symplectic matrix and D is a nonnegative n-by-n diagonal matrix.

Matrix square root

  • Decomposition: , not unique in general.
  • In the case of positive semidefinite , there is a unique positive semidefinite such that .

Generalizations

There exist analogues of the SVD, QR, LU and Cholesky factorizations for quasimatrices and cmatrices or continuous matrices. A ‘quasimatrix’ is, like a matrix, a rectangular scheme whose elements are indexed, but one discrete index is replaced by a continuous index. Likewise, a ‘cmatrix’, is continuous in both indices. As an example of a cmatrix, one can think of the kernel of an integral operator.

These factorizations are based on early work by Fredholm (1903), Hilbert (1904) and Schmidt (1907). For an account, and a translation to English of the seminal papers, see Stewart (2011).

Protein engineering

From Wikipedia, the free encyclopedia

Protein engineering is the process of developing useful or valuable proteins through the design and production of unnatural polypeptides, often by altering amino acid sequences found in nature. It is a young discipline, with much research taking place into the understanding of protein folding and recognition for protein design principles. It has been used to improve the function of many enzymes for industrial catalysis. It is also a product and services market, with an estimated value of $168 billion by 2017.

There are two general strategies for protein engineering: rational protein design and directed evolution. These methods are not mutually exclusive; researchers will often apply both. In the future, more detailed knowledge of protein structure and function, and advances in high-throughput screening, may greatly expand the abilities of protein engineering. Eventually, even unnatural amino acids may be included, via newer methods, such as expanded genetic code, that allow encoding novel amino acids in genetic code.

Approaches

Rational design

In rational protein design, a scientist uses detailed knowledge of the structure and function of a protein to make desired changes. In general, this has the advantage of being inexpensive and technically easy, since site-directed mutagenesis methods are well-developed. However, its major drawback is that detailed structural knowledge of a protein is often unavailable, and, even when available, it can be very difficult to predict the effects of various mutations since structural information most often provide a static picture of a protein structure. However, programs such as Folding@home and Foldit have utilized crowdsourcing techniques in order to gain insight into the folding motifs of proteins.

Computational protein design algorithms seek to identify novel amino acid sequences that are low in energy when folded to the pre-specified target structure. While the sequence-conformation space that needs to be searched is large, the most challenging requirement for computational protein design is a fast, yet accurate, energy function that can distinguish optimal sequences from similar suboptimal ones.

Multiple sequence alignment

Without structural information about a protein, sequence analysis is often useful in elucidating information about the protein. These techniques involve alignment of target protein sequences with other related protein sequences. This alignment can show which amino acids are conserved between species and are important for the function of the protein. These analyses can help to identify hot spot amino acids that can serve as the target sites for mutations. Multiple sequence alignment utilizes data bases such as PREFAB, SABMARK, OXBENCH, IRMBASE, and BALIBASE in order to cross reference target protein sequences with known sequences. Multiple sequence alignment techniques are listed below.

This method begins by performing pair wise alignment of sequences using k-tuple or Needleman–Wunsch methods. These methods calculate a matrix that depicts the pair wise similarity among the sequence pairs. Similarity scores are then transformed into distance scores that are used to produce a guide tree using the neighbor joining method. This guide tree is then employed to yield a multiple sequence alignment.

Clustal omega

This method is capable of aligning up to 190,000 sequences by utilizing the k-tuple method. Next sequences are clustered using the mBed and k-means methods. A guide tree is then constructed using the UPGMA method that is used by the HH align package. This guide tree is used to generate multiple sequence alignments.

MAFFT

This method utilizes fast Fourier transform (FFT) that converts amino acid sequences into a sequence composed of volume and polarity values for each amino acid residue. This new sequence is used to find homologous regions.

K-Align

This method utilizes the Wu-Manber approximate string matching algorithm to generate multiple sequence alignments.

Multiple sequence comparison by log expectation (MUSCLE)

This method utilizes Kmer and Kimura distances to generate multiple sequence alignments.

T-Coffee

This method utilizes tree based consistency objective functions for alignment evolution. This method has been shown to be 5–10% more accurate than Clustal W.

Coevolutionary analysis

Coevolutionary analysis is also known as correlated mutation, covariation, or co-substitution. This type of rational design involves reciprocal evolutionary changes at evolutionarily interacting loci. Generally this method begins with the generation of a curated multiple sequence alignments for the target sequence. This alignment is then subjected to manual refinement that involves removal of highly gapped sequences, as well as sequences with low sequence identity. This step increases the quality of the alignment. Next, the manually processed alignment is utilized for further coevolutionary measurements using distinct correlated mutation algorithms. These algorithms result in a coevolution scoring matrix. This matrix is filtered by applying various significance tests to extract significant coevolution values and wipe out background noise. Coevolutionary measurements are further evaluated to assess their performance and stringency. Finally, the results from this coevolutionary analysis are validated experimentally.

Structural prediction

De novo synthesis of protein benefits from knowledge of existing protein structures. This knowledge of existing protein structure assists with the prediction of new protein structures. Methods for protein structure prediction fall under one of the four following classes: ab initio, fragment based methods, homology modeling, and protein threading.

Ab initio

These methods involve free modeling without using any structural information about the template. Ab initio methods are aimed at prediction of the native structures of proteins corresponding to the global minimum of its free energy. some examples of ab initio methods are AMBER, GROMOS, GROMACS, CHARMM, OPLS, and ENCEPP12. General steps for ab initio methods begin with the geometric representation of the protein of interest. Next, a potential energy function model for the protein is developed. This model can be created using either molecular mechanics potentials or protein structure derived potential functions. Following the development of a potential model, energy search techniques including molecular dynamic simulations, Monte Carlo simulations and genetic algorithms are applied to the protein.

Fragment based

These methods use database information regarding structures to match homologous structures to the created protein sequences. These homologous structures are assembled to give compact structures using scoring and optimization procedures, with the goal of achieving the lowest potential energy score. Webservers for fragment information are I-TASSER, ROSETTA, ROSETTA @ home, FRAGFOLD, CABS fold, PROFESY, CREF, QUARK, UNDERTAKER, HMM, and ANGLOR.

Homology modeling

These methods are based upon the homology of proteins. These methods are also known as comparative modeling. The first step in homology modeling is generally the identification of template sequences of known structure which are homologous to the query sequence. Next the query sequence is aligned to the template sequence. Following the alignment, the structurally conserved regions are modeled using the template structure. This is followed by the modeling of side chains and loops that are distinct from the template. Finally the modeled structure undergoes refinement and assessment of quality. Servers that are available for homology modeling data are listed here: SWISS MODEL, MODELLER, ReformAlign, PyMOD, TIP-STRUCTFAST, COMPASS, 3d-PSSM, SAMT02, SAMT99, HHPRED, FAGUE, 3D-JIGSAW, META-PP, ROSETTA, and I-TASSER.

Protein threading

Protein threading can be used when a reliable homologue for the query sequence cannot be found. This method begins by obtaining a query sequence and a library of template structures. Next, the query sequence is threaded over known template structures. These candidate models are scored using scoring functions. These are scored based upon potential energy models of both query and template sequence. The match with the lowest potential energy model is then selected. Methods and servers for retrieving threading data and performing calculations are listed here: GenTHREADER, pGenTHREADER, pDomTHREADER, ORFEUS, PROSPECT, BioShell-Threading, FFASO3, RaptorX, HHPred, LOOPP server, Sparks-X, SEGMER, THREADER2, ESYPRED3D, LIBRA, TOPITS, RAPTOR, COTH, MUSTER.

For more information on rational design see site-directed mutagenesis.

Multivalent binding

Multivalent binding can be used to increase the binding specificity and affinity through avidity effects. Having multiple binding domains in a single biomolecule or complex increases the likelihood of other interactions to occur via individual binding events. Avidity or effective affinity can be much higher than the sum of the individual affinities providing a cost and time-effective tool for targeted binding.

Multivalent proteins

Multivalent proteins are relatively easy to produce by post-translational modifications or multiplying the protein-coding DNA sequence. The main advantage of multivalent and multispecific proteins is that they can increase the effective affinity for a target of a known protein. In the case of an inhomogeneous target using a combination of proteins resulting in multispecific binding can increase specificity, which has high applicability in protein therapeutics.

The most common example for multivalent binding are the antibodies, and there is extensive research for bispecific antibodies. Applications of bispecific antibodies cover a broad spectrum that includes diagnosis, imaging, prophylaxis, and therapy.

Directed evolution

In directed evolution, random mutagenesis, e.g. by error-prone PCR or sequence saturation mutagenesis, is applied to a protein, and a selection regime is used to select variants having desired traits. Further rounds of mutation and selection are then applied. This method mimics natural evolution and, in general, produces superior results to rational design. An added process, termed DNA shuffling, mixes and matches pieces of successful variants to produce better results. Such processes mimic the recombination that occurs naturally during sexual reproduction. Advantages of directed evolution are that it requires no prior structural knowledge of a protein, nor is it necessary to be able to predict what effect a given mutation will have. Indeed, the results of directed evolution experiments are often surprising in that desired changes are often caused by mutations that were not expected to have some effect. The drawback is that they require high-throughput screening, which is not feasible for all proteins. Large amounts of recombinant DNA must be mutated and the products screened for desired traits. The large number of variants often requires expensive robotic equipment to automate the process. Further, not all desired activities can be screened for easily.

Natural Darwinian evolution can be effectively imitated in the lab toward tailoring protein properties for diverse applications, including catalysis. Many experimental technologies exist to produce large and diverse protein libraries and for screening or selecting folded, functional variants. Folded proteins arise surprisingly frequently in random sequence space, an occurrence exploitable in evolving selective binders and catalysts. While more conservative than direct selection from deep sequence space, redesign of existing proteins by random mutagenesis and selection/screening is a particularly robust method for optimizing or altering extant properties. It also represents an excellent starting point for achieving more ambitious engineering goals. Allying experimental evolution with modern computational methods is likely the broadest, most fruitful strategy for generating functional macromolecules unknown to nature.

The main challenges of designing high quality mutant libraries have shown significant progress in the recent past. This progress has been in the form of better descriptions of the effects of mutational loads on protein traits. Also computational approaches have showed large advances in the innumerably large sequence space to more manageable screenable sizes, thus creating smart libraries of mutants. Library size has also been reduced to more screenable sizes by the identification of key beneficial residues using algorithms for systematic recombination. Finally a significant step forward toward efficient reengineering of enzymes has been made with the development of more accurate statistical models and algorithms quantifying and predicting coupled mutational effects on protein functions.

Generally, directed evolution may be summarized as an iterative two step process which involves generation of protein mutant libraries, and high throughput screening processes to select for variants with improved traits. This technique does not require prior knowledge of the protein structure and function relationship. Directed evolution utilizes random or focused mutagenesis to generate libraries of mutant proteins. Random mutations can be introduced using either error prone PCR, or site saturation mutagenesis. Mutants may also be generated using recombination of multiple homologous genes. Nature has evolved a limited number of beneficial sequences. Directed evolution makes it possible to identify undiscovered protein sequences which have novel functions. This ability is contingent on the proteins ability to tolerant amino acid residue substitutions without compromising folding or stability.

Directed evolution methods can be broadly categorized into two strategies, asexual and sexual methods.

Asexual methods

Asexual methods do not generate any cross links between parental genes. Single genes are used to create mutant libraries using various mutagenic techniques. These asexual methods can produce either random or focused mutagenesis.

Random mutagenesis

Random mutagenic methods produce mutations at random throughout the gene of interest. Random mutagenesis can introduce the following types of mutations: transitions, transversions, insertions, deletions, inversion, missense, and nonsense. Examples of methods for producing random mutagenesis are below.

Error prone PCR

Error prone PCR utilizes the fact that Taq DNA polymerase lacks 3' to 5' exonuclease activity. This results in an error rate of 0.001–0.002% per nucleotide per replication. This method begins with choosing the gene, or the area within a gene, one wishes to mutate. Next, the extent of error required is calculated based upon the type and extent of activity one wishes to generate. This extent of error determines the error prone PCR strategy to be employed. Following PCR, the genes are cloned into a plasmid and introduced to competent cell systems. These cells are then screened for desired traits. Plasmids are then isolated for colonies which show improved traits, and are then used as templates the next round of mutagenesis. Error prone PCR shows biases for certain mutations relative to others. Such as biases for transitions over transversions.

Rates of error in PCR can be increased in the following ways:

  1. Increase concentration of magnesium chloride, which stabilizes non complementary base pairing.
  2. Add manganese chloride to reduce base pair specificity.
  3. Increased and unbalanced addition of dNTPs.
  4. Addition of base analogs like dITP, 8 oxo-dGTP, and dPTP.
  5. Increase concentration of Taq polymerase.
  6. Increase extension time.
  7. Increase cycle time.
  8. Use less accurate Taq polymerase.

Also see polymerase chain reaction for more information.

Rolling circle error-prone PCR

This PCR method is based upon rolling circle amplification, which is modeled from the method that bacteria use to amplify circular DNA. This method results in linear DNA duplexes. These fragments contain tandem repeats of circular DNA called concatamers, which can be transformed into bacterial strains. Mutations are introduced by first cloning the target sequence into an appropriate plasmid. Next, the amplification process begins using random hexamer primers and Φ29 DNA polymerase under error prone rolling circle amplification conditions. Additional conditions to produce error prone rolling circle amplification are 1.5 pM of template DNA, 1.5 mM MnCl2 and a 24 hour reaction time. MnCl2 is added into the reaction mixture to promote random point mutations in the DNA strands. Mutation rates can be increased by increasing the concentration of MnCl2, or by decreasing concentration of the template DNA. Error prone rolling circle amplification is advantageous relative to error prone PCR because of its use of universal random hexamer primers, rather than specific primers. Also the reaction products of this amplification do not need to be treated with ligases or endonucleases. This reaction is isothermal.

Chemical mutagenesis

Chemical mutagenesis involves the use of chemical agents to introduce mutations into genetic sequences. Examples of chemical mutagens follow.

Sodium bisulfate is effective at mutating G/C rich genomic sequences. This is because sodium bisulfate catalyses deamination of unmethylated cytosine to uracil.

Ethyl methane sulfonate alkylates guanidine residues. This alteration causes errors during DNA replication.

Nitrous acid causes transversion by de-amination of adenine and cytosine.

The dual approach to random chemical mutagenesis is an iterative two step process. First it involves the in vivo chemical mutagenesis of the gene of interest via EMS. Next, the treated gene is isolated and cloning into an untreated expression vector in order to prevent mutations in the plasmid backbone. This technique preserves the plasmids genetic properties.

Targeting glycosylases to embedded arrays for mutagenesis (TaGTEAM)

This method has been used to create targeted in vivo mutagenesis in yeast. This method involves the fusion of a 3-methyladenine DNA glycosylase to tetR DNA-binding domain. This has been shown to increase mutation rates by over 800 time in regions of the genome containing tetO sites.

Mutagenesis by random insertion and deletion

This method involves alteration in length of the sequence via simultaneous deletion and insertion of chunks of bases of arbitrary length. This method has been shown to produce proteins with new functionalities via introduction of new restriction sites, specific codons, four base codons for non-natural amino acids.

Transposon based random mutagenesis

Recently many methods for transposon based random mutagenesis have been reported. This methods include, but are not limited to the following: PERMUTE-random circular permutation, random protein truncation, random nucleotide triplet substitution, random domain/tag/multiple amino acid insertion, codon scanning mutagenesis, and multicodon scanning mutagenesis. These aforementioned techniques all require the design of mini-Mu transposons. Thermo scientific manufactures kits for the design of these transposons.

Random mutagenesis methods altering the target DNA length

These methods involve altering gene length via insertion and deletion mutations. An example is the tandem repeat insertion (TRINS) method. This technique results in the generation of tandem repeats of random fragments of the target gene via rolling circle amplification and concurrent incorporation of these repeats into the target gene.

Mutator strains

Mutator strains are bacterial cell lines which are deficient in one or more DNA repair mechanisms. An example of a mutator strand is the E. coli XL1-RED. This subordinate strain of E. coli is deficient in the MutS, MutD, MutT DNA repair pathways. Use of mutator strains is useful at introducing many types of mutation; however, these strains show progressive sickness of culture because of the accumulation of mutations in the strains own genome.

Focused mutagenesis

Focused mutagenic methods produce mutations at predetermined amino acid residues. These techniques require and understanding of the sequence-function relationship for the protein of interest. Understanding of this relationship allows for the identification of residues which are important in stability, stereoselectivity, and catalytic efficiency. Examples of methods that produce focused mutagenesis are below.

Site saturation mutagenesis

Site saturation mutagenesis is a PCR based method used to target amino acids with significant roles in protein function. The two most common techniques for performing this are whole plasmid single PCR, and overlap extension PCR.

Whole plasmid single PCR is also referred to as site directed mutagenesis (SDM). SDM products are subjected to Dpn endonuclease digestion. This digestion results in cleavage of only the parental strand, because the parental strand contains a GmATC which is methylated at N6 of adenine. SDM does not work well for large plasmids of over ten kilobases. Also, this method is only capable of replacing two nucleotides at a time.

Overlap extension PCR requires the use of two pairs of primers. One primer in each set contains a mutation. A first round of PCR using these primer sets is performed and two double stranded DNA duplexes are formed. A second round of PCR is then performed in which these duplexes are denatured and annealed with the primer sets again to produce heteroduplexes, in which each strand has a mutation. Any gaps in these newly formed heteroduplexes are filled with DNA polymerases and further amplified.

Sequence saturation mutagenesis (SeSaM)

Sequence saturation mutagenesis results in the randomization of the target sequence at every nucleotide position. This method begins with the generation of variable length DNA fragments tailed with universal bases via the use of template transferases at the 3' termini. Next, these fragments are extended to full length using a single stranded template. The universal bases are replaced with a random standard base, causing mutations. There are several modified versions of this method such as SeSAM-Tv-II, SeSAM-Tv+, and SeSAM-III.

Single primer reactions in parallel (SPRINP)

This site saturation mutagenesis method involves two separate PCR reaction. The first of which uses only forward primers, while the second reaction uses only reverse primers. This avoids the formation of primer dimer formation.

Mega primed and ligase free focused mutagenesis

This site saturation mutagenic technique begins with one mutagenic oligonucleotide and one universal flanking primer. These two reactants are used for an initial PCR cycle. Products from this first PCR cycle are used as mega primers for the next PCR.

Ω-PCR

This site saturation mutagenic method is based on overlap extension PCR. It is used to introduce mutations at any site in a circular plasmid.

PFunkel-ominchange-OSCARR

This method utilizes user defined site directed mutagenesis at single or multiple sites simultaneously. OSCARR is an acronym for one pot simple methodology for cassette randomization and recombination. This randomization and recombination results in randomization of desired fragments of a protein. Omnichange is a sequence independent, multisite saturation mutagenesis which can saturate up to five independent codons on a gene.

Trimer-dimer mutagenesis

This method removes redundant codons and stop codons.

Cassette mutagenesis

This is a PCR based method. Cassette mutagenesis begins with the synthesis of a DNA cassette containing the gene of interest, which is flanked on either side by restriction sites. The endonuclease which cleaves these restriction sites also cleaves sites in the target plasmid. The DNA cassette and the target plasmid are both treated with endonucleases to cleave these restriction sites and create sticky ends. Next the products from this cleavage are ligated together, resulting in the insertion of the gene into the target plasmid. An alternative form of cassette mutagenesis called combinatorial cassette mutagenesis is used to identify the functions of individual amino acid residues in the protein of interest. Recursive ensemble mutagenesis then utilizes information from previous combinatorial cassette mutagenesis. Codon cassette mutagenesis allows you to insert or replace a single codon at a particular site in double stranded DNA.

Sexual methods

Sexual methods of directed evolution involve in vitro recombination which mimic natural in vivo recombination. Generally these techniques require high sequence homology between parental sequences. These techniques are often used to recombine two different parental genes, and these methods do create cross overs between these genes.

In vitro homologous recombination

Homologous recombination can be categorized as either in vivo or in vitro. In vitro homologous recombination mimics natural in vivo recombination. These in vitro recombination methods require high sequence homology between parental sequences. These techniques exploit the natural diversity in parental genes by recombining them to yield chimeric genes. The resulting chimera show a blend of parental characteristics.

DNA shuffling

This in vitro technique was one of the first techniques in the era of recombination. It begins with the digestion of homologous parental genes into small fragments by DNase1. These small fragments are then purified from undigested parental genes. Purified fragments are then reassembled using primer-less PCR. This PCR involves homologous fragments from different parental genes priming for each other, resulting in chimeric DNA. The chimeric DNA of parental size is then amplified using end terminal primers in regular PCR.

Random priming in vitro recombination (RPR)

This in vitro homologous recombination method begins with the synthesis of many short gene fragments exhibiting point mutations using random sequence primers. These fragments are reassembled to full length parental genes using primer-less PCR. These reassembled sequences are then amplified using PCR and subjected to further selection processes. This method is advantageous relative to DNA shuffling because there is no use of DNase1, thus there is no bias for recombination next to a pyrimidine nucleotide. This method is also advantageous due to its use of synthetic random primers which are uniform in length, and lack biases. Finally this method is independent of the length of DNA template sequence, and requires a small amount of parental DNA.

Truncated metagenomic gene-specific PCR

This method generates chimeric genes directly from metagenomic samples. It begins with isolation of the desired gene by functional screening from metagenomic DNA sample. Next, specific primers are designed and used to amplify the homologous genes from different environmental samples. Finally, chimeric libraries are generated to retrieve the desired functional clones by shuffling these amplified homologous genes.

Staggered extension process (StEP)

This in vitro method is based on template switching to generate chimeric genes. This PCR based method begins with an initial denaturation of the template, followed by annealing of primers and a short extension time. All subsequent cycle generate annealing between the short fragments generated in previous cycles and different parts of the template. These short fragments and the templates anneal together based on sequence complementarity. This process of fragments annealing template DNA is known as template switching. These annealed fragments will then serve as primers for further extension. This method is carried out until the parental length chimeric gene sequence is obtained. Execution of this method only requires flanking primers to begin. There is also no need for Dnase1 enzyme.

Random chimeragenesis on transient templates (RACHITT)

This method has been shown to generate chimeric gene libraries with an average of 14 crossovers per chimeric gene. It begins by aligning fragments from a parental top strand onto the bottom strand of a uracil containing template from a homologous gene. 5' and 3' overhang flaps are cleaved and gaps are filled by the exonuclease and endonuclease activities of Pfu and taq DNA polymerases. The uracil containing template is then removed from the heteroduplex by treatment with a uracil DNA glcosylase, followed by further amplification using PCR. This method is advantageous because it generates chimeras with relatively high crossover frequency. However it is somewhat limited due to the complexity and the need for generation of single stranded DNA and uracil containing single stranded template DNA.

Synthetic shuffling

Shuffling of synthetic degenerate oligonucleotides adds flexibility to shuffling methods, since oligonucleotides containing optimal codons and beneficial mutations can be included.

In vivo Homologous Recombination

Cloning performed in yeast involves PCR dependent reassembly of fragmented expression vectors. These reassembled vectors are then introduced to, and cloned in yeast. Using yeast to clone the vector avoids toxicity and counter-selection that would be introduced by ligation and propagation in E. coli.

Mutagenic organized recombination process by homologous in vivo grouping (MORPHING)

This method introduces mutations into specific regions of genes while leaving other parts intact by utilizing the high frequency of homologous recombination in yeast.

Phage-assisted continuous evolution (PACE)

This method utilizes a bacteriophage with a modified life cycle to transfer evolving genes from host to host. The phage's life cycle is designed in such a way that the transfer is correlated with the activity of interest from the enzyme. This method is advantageous because it requires minimal human intervention for the continuous evolution of the gene.

In vitro non-homologous recombination methods

These methods are based upon the fact that proteins can exhibit similar structural identity while lacking sequence homology.

Exon shuffling

Exon shuffling is the combination of exons from different proteins by recombination events occurring at introns. Orthologous exon shuffling involves combining exons from orthologous genes from different species. Orthologous domain shuffling involves shuffling of entire protein domains from orthologous genes from different species. Paralogous exon shuffling involves shuffling of exon from different genes from the same species. Paralogous domain shuffling involves shuffling of entire protein domains from paralogous proteins from the same species. Functional homolog shuffling involves shuffling of non-homologous domains which are functional related. All of these processes being with amplification of the desired exons from different genes using chimeric synthetic oligonucleotides. This amplification products are then reassembled into full length genes using primer-less PCR. During these PCR cycles the fragments act as templates and primers. This results in chimeric full length genes, which are then subjected to screening.

Incremental truncation for the creation of hybrid enzymes (ITCHY)

Fragments of parental genes are created using controlled digestion by exonuclease III. These fragments are blunted using endonuclease, and are ligated to produce hybrid genes. THIOITCHY is a modified ITCHY technique which utilized nucleotide triphosphate analogs such as α-phosphothioate dNTPs. Incorporation of these nucleotides blocks digestion by exonuclease III. This inhibition of digestion by exonuclease III is called spiking. Spiking can be accomplished by first truncating genes with exonuclease to create fragments with short single stranded overhangs. These fragments then serve as templates for amplification by DNA polymerase in the presence of small amounts of phosphothioate dNTPs. These resulting fragments are then ligated together to form full length genes. Alternatively the intact parental genes can be amplified by PCR in the presence of normal dNTPs and phosphothioate dNTPs. These full length amplification products are then subjected to digestion by an exonuclease. Digestion will continue until the exonuclease encounters an α-pdNTP, resulting in fragments of different length. These fragments are then ligated together to generate chimeric genes.

SCRATCHY

This method generates libraries of hybrid genes inhibiting multiple crossovers by combining DNA shuffling and ITCHY. This method begins with the construction of two independent ITCHY libraries. The first with gene A on the N-terminus. And the other having gene B on the N-terminus. These hybrid gene fragments are separated using either restriction enzyme digestion or PCR with terminus primers via agarose gel electrophoresis. These isolated fragments are then mixed together and further digested using DNase1. Digested fragments are then reassembled by primerless PCR with template switching.

Recombined extension on truncated templates (RETT)

This method generates libraries of hybrid genes by template switching of uni-directionally growing polynucleotides in the presence of single stranded DNA fragments as templates for chimeras. This method begins with the preparation of single stranded DNA fragments by reverse transcription from target mRNA. Gene specific primers are then annealed to the single stranded DNA. These genes are then extended during a PCR cycle. This cycle is followed by template switching and annealing of the short fragments obtained from the earlier primer extension to other single stranded DNA fragments. This process is repeated until full length single stranded DNA is obtained.

Sequence homology-independent protein recombination (SHIPREC)

This method generates recombination between genes with little to no sequence homology. These chimeras are fused via a linker sequence containing several restriction sites. This construct is then digested using DNase1. Fragments are made are made blunt ended using S1 nuclease. These blunt end fragments are put together into a circular sequence by ligation. This circular construct is then linearized using restriction enzymes for which the restriction sites are present in the linker region. This results in a library of chimeric genes in which contribution of genes to 5' and 3' end will be reversed as compared to the starting construct.

Sequence independent site directed chimeragenesis (SISDC)

This method results in a library of genes with multiple crossovers from several parental genes. This method does not require sequence identity among the parental genes. This does require one or two conserved amino acids at every crossover position. It begins with alignment of parental sequences and identification of consensus regions which serve as crossover sites. This is followed by the incorporation of specific tags containing restriction sites followed by the removal of the tags by digestion with Bac1, resulting in genes with cohesive ends. These gene fragments are mixed and ligated in an appropriate order to form chimeric libraries.

Degenerate homo-duplex recombination (DHR)

This method begins with alignment of homologous genes, followed by identification of regions of polymorphism. Next the top strand of the gene is divided into small degenerate oligonucleotides. The bottom strand is also digested into oligonucleotides to serve as scaffolds. These fragments are combined in solution are top strand oligonucleotides are assembled onto bottom strand oligonucleotides. Gaps between these fragments are filled with polymerase and ligated.

Random multi-recombinant PCR (RM-PCR)

This method involves the shuffling of plural DNA fragments without homology, in a single PCR. This results in the reconstruction of complete proteins by assembly of modules encoding different structural units.

User friendly DNA recombination (USERec)

This method begins with the amplification of gene fragments which need to be recombined, using uracil dNTPs. This amplification solution also contains primers, PfuTurbo, and Cx Hotstart DNA polymerase. Amplified products are next incubated with USER enzyme. This enzyme catalyzes the removal of uracil residues from DNA creating single base pair gaps. The USER enzyme treated fragments are mixed and ligated using T4 DNA ligase and subjected to Dpn1 digestion to remove the template DNA. These resulting dingle stranded fragments are subjected to amplification using PCR, and are transformed into E. coli.

Golden Gate shuffling (GGS) recombination

This method allows you to recombine at least 9 different fragments in an acceptor vector by using type 2 restriction enzyme which cuts outside of the restriction sites. It begins with sub cloning of fragments in separate vectors to create Bsa1 flanking sequences on both sides. These vectors are then cleaved using type II restriction enzyme Bsa1, which generates four nucleotide single strand overhangs. Fragments with complementary overhangs are hybridized and ligated using T4 DNA ligase. Finally these constructs are then transformed into E. coli cells, which are screened for expression levels.

Phosphoro thioate-based DNA recombination method (PRTec)

This method can be used to recombine structural elements or entire protein domains. This method is based on phosphorothioate chemistry which allows the specific cleavage of phosphorothiodiester bonds. The first step in the process begins with amplification of fragments that need to be recombined along with the vector backbone. This amplification is accomplished using primers with phosphorothiolated nucleotides at 5' ends. Amplified PCR products are cleaved in an ethanol-iodine solution at high temperatures. Next these fragments are hybridized at room temperature and transformed into E. coli which repair any nicks.

Integron

This system is based upon a natural site specific recombination system in E. coli. This system is called the integron system, and produces natural gene shuffling. This method was used to construct and optimize a functional tryptophan biosynthetic operon in trp-deficient E. coli by delivering individual recombination cassettes or trpA-E genes along with regulatory elements with the integron system.

Y-Ligation based shuffling (YLBS)

This method generates single stranded DNA strands, which encompass a single block sequence either at the 5' or 3' end, complementary sequences in a stem loop region, and a D branch region serving as a primer binding site for PCR. Equivalent amounts of both 5' and 3' half strands are mixed and formed a hybrid due to the complementarity in the stem region. Hybrids with free phosphorylated 5' end in 3' half strands are then ligated with free 3' ends in 5' half strands using T4 DNA ligase in the presence of 0.1 mM ATP. Ligated products are then amplified by two types of PCR to generate pre 5' half and pre 3' half PCR products. These PCR product are converted to single strands via avidin-biotin binding to the 5' end of the primes containing stem sequences that were biotin labeled. Next, biotinylated 5' half strands and non-biotinylated 3' half strands are used as 5' and 3' half strands for the next Y-ligation cycle.

Semi-rational design

Semi-rational design uses information about a proteins sequence, structure and function, in tandem with predictive algorithms. Together these are used to identify target amino acid residues which are most likely to influence protein function. Mutations of these key amino acid residues create libraries of mutant proteins that are more likely to have enhanced properties.

Advances in semi-rational enzyme engineering and de novo enzyme design provide researchers with powerful and effective new strategies to manipulate biocatalysts. Integration of sequence and structure based approaches in library design has proven to be a great guide for enzyme redesign. Generally, current computational de novo and redesign methods do not compare to evolved variants in catalytic performance. Although experimental optimization may be produced using directed evolution, further improvements in the accuracy of structure predictions and greater catalytic ability will be achieved with improvements in design algorithms. Further functional enhancements may be included in future simulations by integrating protein dynamics.

Biochemical and biophysical studies, along with fine-tuning of predictive frameworks will be useful to experimentally evaluate the functional significance of individual design features. Better understanding of these functional contributions will then give feedback for the improvement of future designs.

Directed evolution will likely not be replaced as the method of choice for protein engineering, although computational protein design has fundamentally changed the way protein engineering can manipulate bio-macromolecules. Smaller, more focused and functionally-rich libraries may be generated by using in methods which incorporate predictive frameworks for hypothesis-driven protein engineering. New design strategies and technical advances have begun a departure from traditional protocols, such as directed evolution, which represents the most effective strategy for identifying top-performing candidates in focused libraries. Whole-gene library synthesis is replacing shuffling and mutagenesis protocols for library preparation. Also highly specific low throughput screening assays are increasingly applied in place of monumental screening and selection efforts of millions of candidates. Together, these developments are poised to take protein engineering beyond directed evolution and towards practical, more efficient strategies for tailoring biocatalysts.

Screening and selection techniques

Once a protein has undergone directed evolution, ration design or semi-ration design, the libraries of mutant proteins must be screened to determine which mutants show enhanced properties. Phage display methods are one option for screening proteins. This method involves the fusion of genes encoding the variant polypeptides with phage coat protein genes. Protein variants expressed on phage surfaces are selected by binding with immobilized targets in vitro. Phages with selected protein variants are then amplified in bacteria, followed by the identification of positive clones by enzyme linked immunosorbent assay. These selected phages are then subjected to DNA sequencing.

Cell surface display systems can also be utilized to screen mutant polypeptide libraries. The library mutant genes are incorporated into expression vectors which are then transformed into appropriate host cells. These host cells are subjected to further high throughput screening methods to identify the cells with desired phenotypes.

Cell free display systems have been developed to exploit in vitro protein translation or cell free translation. These methods include mRNA display, ribosome display, covalent and non covalent DNA display, and in vitro compartmentalization.

Enzyme engineering

Enzyme engineering is the application of modifying an enzyme's structure (and, thus, its function) or modifying the catalytic activity of isolated enzymes to produce new metabolites, to allow new (catalyzed) pathways for reactions to occur, or to convert from some certain compounds into others (biotransformation). These products are useful as chemicals, pharmaceuticals, fuel, food, or agricultural additives.

An enzyme reactor consists of a vessel containing a reactional medium that is used to perform a desired conversion by enzymatic means. Enzymes used in this process are free in the solution. Also Microorganisms are one of important origin for genuine enzymes .

Examples of engineered proteins

Computing methods have been used to design a protein with a novel fold, named Top7, and sensors for unnatural molecules. The engineering of fusion proteins has yielded rilonacept, a pharmaceutical that has secured Food and Drug Administration (FDA) approval for treating cryopyrin-associated periodic syndrome.

Another computing method, IPRO, successfully engineered the switching of cofactor specificity of Candida boidinii xylose reductase. Iterative Protein Redesign and Optimization (IPRO) redesigns proteins to increase or give specificity to native or novel substrates and cofactors. This is done by repeatedly randomly perturbing the structure of the proteins around specified design positions, identifying the lowest energy combination of rotamers, and determining whether the new design has a lower binding energy than prior ones.

Computation-aided design has also been used to engineer complex properties of a highly ordered nano-protein assembly. A protein cage, E. coli bacterioferritin (EcBfr), which naturally shows structural instability and an incomplete self-assembly behavior by populating two oligomerization states, is the model protein in this study. Through computational analysis and comparison to its homologs, it has been found that this protein has a smaller-than-average dimeric interface on its two-fold symmetry axis due mainly to the existence of an interfacial water pocket centered on two water-bridged asparagine residues. To investigate the possibility of engineering EcBfr for modified structural stability, a semi-empirical computational method is used to virtually explore the energy differences of the 480 possible mutants at the dimeric interface relative to the wild type EcBfr. This computational study also converges on the water-bridged asparagines. Replacing these two asparagines with hydrophobic amino acids results in proteins that fold into alpha-helical monomers and assemble into cages as evidenced by circular dichroism and transmission electron microscopy. Both thermal and chemical denaturation confirm that, all redesigned proteins, in agreement with the calculations, possess increased stability. One of the three mutations shifts the population in favor of the higher order oligomerization state in solution as shown by both size exclusion chromatography and native gel electrophoresis.

A in silico method, PoreDesigner, was successfully developed to redesign bacterial channel protein (OmpF) to reduce its 1 nm pore size to any desired sub-nm dimension. Transport experiments on the narrowest designed pores revealed complete salt rejection when assembled in biomimetic block-polymer matrices.

Lie point symmetry

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Lie_point_symmetry     ...