Friday, November 26, 2021

Protein design

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Protein_design

Protein design is the rational design of new protein molecules to engineer novel activity, behavior, or purpose, and to advance basic understanding of protein function. Proteins can be designed from scratch (de novo design) or by making calculated variants of a known protein structure and its sequence (termed protein redesign). Rational protein design approaches predict protein sequences that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.

Rational protein design dates back to the mid-1970s. Recently, however, there have been numerous examples of successful rational design of water-soluble and even transmembrane peptides and proteins, in part due to a better understanding of the different factors contributing to protein structure stability and the development of better computational methods.

Overview and history

The goal in rational protein design is to predict amino acid sequences that will fold to a specific protein structure. Although the number of possible protein sequences is vast, growing exponentially with the size of the protein chain, only a subset of them will fold reliably and quickly to one native state. Protein design involves identifying novel sequences within this subset. The native state of a protein is the conformational free energy minimum for the chain. Thus, protein design is the search for sequences that have the chosen structure as a free energy minimum. In a sense, it is the reverse of protein structure prediction. In design, a tertiary structure is specified, and a sequence that will fold to it is identified. Hence, it is also termed inverse folding. Protein design is then an optimization problem: using some scoring criteria, an optimized sequence that will fold to the desired structure is chosen.

When the first proteins were rationally designed during the 1970s and 1980s, their sequences were optimized manually based on analyses of other known proteins, sequence composition, amino acid charges, and the geometry of the desired structure. The first designed proteins are attributed to Bernd Gutte, who designed a reduced version of a known catalyst, bovine ribonuclease, as well as tertiary structures consisting of beta-sheets and alpha-helices, including a binder of DDT. Urry and colleagues later designed elastin-like fibrous peptides based on rules of sequence composition. Richardson and coworkers designed a 79-residue protein with no sequence homology to any known protein. In the 1990s, the advent of powerful computers, libraries of amino acid conformations, and force fields developed mainly for molecular dynamics simulations enabled the development of structure-based computational protein design tools. Following the development of these computational tools, great success has been achieved over the last 30 years in protein design. The first protein designed completely de novo was reported by Stephen Mayo and coworkers in 1997, and, shortly after, in 1999 Peter S. Kim and coworkers designed dimers, trimers, and tetramers of unnatural right-handed coiled coils. In 2003, David Baker's laboratory designed a full protein to a fold never seen before in nature. Later, in 2008, Baker's group computationally designed enzymes for two different reactions. In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe. Due to these and other successes (e.g., see examples below), protein design has become one of the most important tools available for protein engineering. There is great hope that the design of new proteins, small and large, will have uses in biomedicine and bioengineering.

Underlying models of protein structure and function

Protein design programs use computer models of the molecular forces that act on proteins in their in vivo environments. To make the problem tractable, these forces are simplified by protein design models. Although protein design programs vary greatly, they all have to address four main modeling questions: what is the target structure of the design, what flexibility is allowed on the target structure, which sequences are included in the search, and which force field will be used to score sequences and structures.

Target structure

The Top7 protein was one of the first proteins designed for a fold that had never been seen before in nature

Protein function is heavily dependent on protein structure, and rational protein design uses this relationship to design function by designing proteins that have a target structure or fold. Thus, by definition, in rational protein design the target structure or ensemble of structures must be known beforehand. This contrasts with other forms of protein engineering, such as directed evolution, where a variety of methods are used to find proteins that achieve a specific function, and with protein structure prediction where the sequence is known, but the structure is unknown.

Most often, the target structure is based on a known structure of another protein. However, designing novel folds not seen in nature has become increasingly possible. Peter S. Kim and coworkers designed trimers and tetramers of unnatural coiled coils that had not been seen before in nature. The protein Top7, developed in David Baker's lab, was designed completely using protein design algorithms, to a completely novel fold. More recently, Baker and coworkers developed a series of principles to design ideal globular-protein structures based on protein folding funnels that bridge between secondary structure prediction and tertiary structures. These principles, which build on both protein structure prediction and protein design, were used to design five different novel protein topologies.

Sequence space

FSD-1 (shown in blue, PDB id: 1FSV) was the first de novo computational design of a full protein. The target fold was that of the zinc finger in residues 33–60 of the structure of protein Zif268 (shown in red, PDB id: 1ZAA). The designed sequence had very little sequence identity with any known protein sequence.

In rational protein design, proteins can be redesigned from the sequence and structure of a known protein, or designed completely from scratch in de novo protein design. In protein redesign, most of the residues in the sequence are kept as their wild-type amino acids while a few are allowed to mutate. In de novo design, the entire sequence is designed anew, based on no prior sequence.

Both de novo designs and protein redesigns can establish rules on the sequence space: the specific amino acids that are allowed at each mutable residue position. For example, the composition of the surface of the RSC3 probe used to select HIV broadly neutralizing antibodies was restricted based on evolutionary data and charge balancing. Many of the earliest attempts at protein design were heavily based on empirical rules on the sequence space. Moreover, the design of fibrous proteins usually follows strict rules on the sequence space. Collagen-based designed proteins, for example, are often composed of Gly-Pro-X repeating patterns. The advent of computational techniques allows proteins to be designed with no human intervention in sequence selection.

Structural flexibility

Common protein design programs use rotamer libraries to simplify the conformational space of protein side chains. This animation loops through all the rotamers of the isoleucine amino acid based on the Penultimate Rotamer Library.

In protein design, the target structure (or structures) of the protein are known. However, a rational protein design approach must model some flexibility on the target structure in order to increase the number of sequences that can be designed for that structure and to minimize the chance of a sequence folding to a different structure. For example, in a protein redesign of one small amino acid (such as alanine) in the tightly packed core of a protein, very few mutants would be predicted by a rational design approach to fold to the target structure, if the surrounding side-chains are not allowed to be repacked.

Thus, an essential parameter of any design process is the amount of flexibility allowed for both the side-chains and the backbone. In the simplest models, the protein backbone is kept rigid while some of the protein side-chains are allowed to change conformations. However, side-chains can have many degrees of freedom in their bond lengths, bond angles, and χ dihedral angles. To simplify this space, protein design methods use rotamer libraries that assume ideal values for bond lengths and bond angles, while restricting χ dihedral angles to a few frequently observed low-energy conformations termed rotamers.

Rotamer libraries are derived from the statistical analysis of many protein structures. Backbone-independent rotamer libraries describe all rotamers regardless of the backbone conformation. Backbone-dependent rotamer libraries, in contrast, describe the rotamers and how likely they are to appear depending on the protein backbone arrangement around the side chain. Most protein design programs use one conformation (e.g., the modal value of the rotamer dihedrals) or several points in the region described by the rotamer; the OSPREY protein design program, in contrast, models the entire continuous region.
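As a rough sketch of how such a library can be organised inside a design program (the residue type, angle bins, chi values and probabilities below are illustrative placeholders, not values from the Penultimate or any other published library):

from dataclasses import dataclass

@dataclass
class Rotamer:
    chi_angles: tuple   # ideal chi dihedral angles in degrees
    probability: float  # frequency observed in the structural database

# Backbone-dependent entries are keyed by (amino acid, phi bin, psi bin);
# the values here are made up for illustration only.
BACKBONE_DEPENDENT_LIBRARY = {
    ("ILE", -60, -40): [
        Rotamer((-60.0, 170.0), 0.55),
        Rotamer((-60.0, -60.0), 0.25),
    ],
}

def lookup_rotamers(aa, phi, psi, bin_size=10):
    """Return candidate rotamers for residue type `aa` given backbone dihedrals."""
    key = (aa, round(phi / bin_size) * bin_size, round(psi / bin_size) * bin_size)
    return BACKBONE_DEPENDENT_LIBRARY.get(key, [])

print(lookup_rotamers("ILE", -62, -41))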

Although rational protein design must preserve the general backbone fold of a protein, allowing some backbone flexibility can significantly increase the number of sequences that fold to the target structure while maintaining its general fold. Backbone flexibility is especially important in protein redesign because sequence mutations often result in small changes to the backbone structure. Moreover, backbone flexibility can be essential for more advanced applications of protein design, such as binding prediction and enzyme design. Some models of protein design backbone flexibility include small and continuous global backbone movements, discrete backbone samples around the target fold, backrub motions, and protein loop flexibility.

Energy function

Comparison of various potential energy functions. The most accurate energy functions are those that use quantum mechanical calculations, but these are too slow for protein design. On the other extreme, heuristic energy functions are based on statistical terms and are very fast. In the middle are molecular mechanics energy functions, which are physically based but not as computationally expensive as quantum mechanical simulations.

Rational protein design techniques must be able to discriminate sequences that will be stable under the target fold from those that would prefer other low-energy competing states. Thus, protein design requires accurate energy functions that can rank and score sequences by how well they fold to the target structure. At the same time, however, these energy functions must consider the computational challenges behind protein design. One of the most challenging requirements for successful design is an energy function that is both accurate and computationally tractable.

The most accurate energy functions are those based on quantum mechanical simulations. However, such simulations are too slow and typically impractical for protein design. Instead, many protein design algorithms use either physics-based energy functions adapted from molecular mechanics simulation programs, knowledge-based energy functions, or a hybrid mix of both. The trend has been toward using more physics-based potential energy functions.

Physics-based energy functions, such as AMBER and CHARMM, are typically derived from quantum mechanical simulations and experimental data from thermodynamics, crystallography, and spectroscopy. These energy functions typically simplify the physical energy function and make it pairwise decomposable, meaning that the total energy of a protein conformation can be calculated by adding the pairwise energy between each atom pair, which makes them attractive for optimization algorithms. Physics-based energy functions typically model an attractive-repulsive Lennard-Jones term between atoms and a pairwise electrostatic Coulombic term between non-bonded atoms.
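For illustration, a minimal sketch of such a pairwise term in Python (parameter values are placeholders, not AMBER or CHARMM parameters):

# Pairwise, physics-based energy term: a Lennard-Jones attractive-repulsive
# term plus a Coulombic electrostatic term between two non-bonded atoms.
COULOMB_CONSTANT = 332.0636  # kcal*Angstrom/(mol*e^2), a common MM convention

def pairwise_energy(r, epsilon, sigma, q_i, q_j, dielectric=1.0):
    """Energy (kcal/mol) between two non-bonded atoms separated by r Angstroms."""
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_CONSTANT * q_i * q_j / (dielectric * r)
    return lj + coulomb

# Because the total energy is a sum of such terms, the energy of a conformation
# decomposes over atom pairs, which is what makes these functions convenient
# for the optimization algorithms described later.
print(pairwise_energy(r=4.0, epsilon=0.1, sigma=3.5, q_i=-0.5, q_j=0.3))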

Water-mediated hydrogen bonds play a key role in protein–protein binding. One such interaction is shown between residues D457 and S365 in the heavy chain of the HIV broadly neutralizing antibody VRC01 (green) and residues N58 and Y59 in the HIV envelope protein GP120 (purple).

Statistical potentials, in contrast to physics-based potentials, have the advantage of being fast to compute, of implicitly accounting for complex effects, and of being less sensitive to small changes in the protein structure. These energy functions derive energy values from the frequency of appearance of structural features in a structural database.

Protein design, however, has requirements that molecular mechanics force fields do not always meet. Molecular mechanics force fields, which have been used mostly in molecular dynamics simulations, are optimized for the simulation of single sequences, but protein design searches through many conformations of many sequences. Thus, molecular mechanics force fields must be tailored for protein design. In practice, protein design energy functions often incorporate both statistical terms and physics-based terms. For example, the Rosetta energy function, one of the most widely used energy functions, incorporates physics-based energy terms originating in the CHARMM energy function and statistical energy terms, such as rotamer probability and knowledge-based electrostatics. Typically, energy functions are highly customized between laboratories and specifically tailored for every design.

Challenges for effective design energy functions

Water makes up most of the molecules surrounding proteins and is the main driver of protein structure. Thus, modeling the interaction between water and protein is vital in protein design. The number of water molecules that interact with a protein at any given time is huge, and each one has a large number of degrees of freedom and interaction partners. To keep the problem tractable, protein design programs model most such water molecules as a continuum, capturing both the hydrophobic effect and solvation polarization.

Individual water molecules can sometimes have a crucial structural role in the core of proteins, and in protein–protein or protein–ligand interactions. Failing to model such waters can result in mispredictions of the optimal sequence of a protein–protein interface. As an alternative, water molecules can be added to rotamers.

As an optimization problem

This animation illustrates the complexity of a protein design search, which typically compares all the rotamer conformations from all possible mutations at all residues. In this example, the residues Phe36 and His106 are allowed to mutate to the amino acids Tyr and Asn, respectively. Phe and Tyr have 4 rotamers each in the rotamer library, while Asn and His have 7 and 8 rotamers, respectively (from Richardson's penultimate rotamer library). The animation loops through all (4 + 4) x (7 + 8) = 120 possibilities. The structure shown is that of myoglobin, PDB id: 1mbn.

The goal of protein design is to find a protein sequence that will fold to a target structure. A protein design algorithm must, thus, search all the conformations of each sequence, with respect to the target fold, and rank sequences according to the lowest-energy conformation of each one, as determined by the protein design energy function. Thus, a typical input to the protein design algorithm is the target fold, the sequence space, the structural flexibility, and the energy function, while the output is one or more sequences that are predicted to fold stably to the target structure.

The number of candidate protein sequences, however, grows exponentially with the number of protein residues; for example, there are 20^100 protein sequences of length 100. Furthermore, even if amino acid side-chain conformations are limited to a few rotamers (see Structural flexibility), this results in an exponential number of conformations for each sequence. Thus, in our 100 residue protein, and assuming that each amino acid has exactly 10 rotamers, a search algorithm that searches this space will have to search over 200^100 protein conformations.
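The arithmetic quoted above can be checked directly:

# Sequence space for a 100-residue protein, and conformation space if every
# amino acid is assumed to have exactly 10 rotamers (20 x 10 = 200 choices
# per position).
n_residues = 100
sequences = 20 ** n_residues
conformations = (20 * 10) ** n_residues
print(f"{sequences:.2e} sequences")        # about 1.27e+130
print(f"{conformations:.2e} conformations")  # about 1.27e+230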

The most common energy functions can be decomposed into pairwise terms between rotamers and amino acid types, which casts the problem as a combinatorial one, and powerful optimization algorithms can be used to solve it. In those cases, the total energy of each conformation belonging to each sequence can be formulated as a sum of individual and pairwise terms between residue positions. If a designer is interested only in the best sequence, the protein design algorithm only requires the lowest-energy conformation of the lowest-energy sequence. In these cases, the amino acid identity of each rotamer can be ignored and all rotamers belonging to different amino acids can be treated the same. Let r_i be a rotamer at residue position i in the protein chain, and E(r_i) the potential energy between the internal atoms of the rotamer. Let E(r_i, r_j) be the potential energy between r_i and rotamer r_j at residue position j. Then, we define the optimization problem as one of finding the conformation of minimum energy (E_T):

E_T = min over (r_1, ..., r_n) of [ Σ_i E(r_i) + Σ_i Σ_{j>i} E(r_i, r_j) ]     (1)

The problem of minimizing E_T is an NP-hard problem. Even though the class of problems is NP-hard, in practice many instances of protein design can be solved exactly or optimized satisfactorily through heuristic methods.
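As a toy illustration of Equation (1), the following sketch stores self and pairwise energies in tables and finds the minimum-energy rotamer assignment by brute force; the numbers are made up, and real instances are far too large for exhaustive enumeration:

from itertools import product

# Two residue positions with 2 and 3 candidate rotamers, respectively.
rotamers_per_position = [["A1", "A2"], ["B1", "B2", "B3"]]
E_self = {"A1": 1.0, "A2": 0.5, "B1": 0.0, "B2": 2.0, "B3": -0.5}
E_pair = {("A1", "B1"): 0.3, ("A1", "B2"): -1.0, ("A1", "B3"): 0.8,
          ("A2", "B1"): 0.1, ("A2", "B2"): 0.4, ("A2", "B3"): -0.2}

def total_energy(assignment):
    # E_T = sum of self energies plus pairwise energies over position pairs.
    e = sum(E_self[r] for r in assignment)
    for i in range(len(assignment)):
        for j in range(i + 1, len(assignment)):
            e += E_pair[(assignment[i], assignment[j])]
    return e

best = min(product(*rotamers_per_position), key=total_energy)
print(best, total_energy(best))  # ('A2', 'B3'), -0.2 for this toy instance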

Algorithms

Several algorithms have been developed specifically for the protein design problem. These algorithms can be divided into two broad classes: exact algorithms, such as dead-end elimination, that lack runtime guarantees but guarantee the quality of the solution; and heuristic algorithms, such as Monte Carlo, that are faster than exact algorithms but have no guarantees on the optimality of the results. Exact algorithms guarantee that the optimization process produced the optimal solution according to the protein design model. Thus, if the predictions of exact algorithms fail when they are experimentally validated, then the source of error can be attributed to the energy function, the allowed flexibility, the sequence space, or the target structure (e.g., if it cannot be designed for).

Some protein design algorithms are listed below. Although these algorithms address only the most basic formulation of the protein design problem, Equation (1), many of the extensions that improve modeling, such as greater structural flexibility (e.g., protein backbone flexibility) or more sophisticated energy terms, are built atop these algorithms. For example, Rosetta Design incorporates sophisticated energy terms and backbone flexibility, using Monte Carlo as the underlying optimization algorithm. OSPREY's algorithms build on the dead-end elimination algorithm and A* to incorporate continuous backbone and side-chain movements. Thus, these algorithms provide a good perspective on the different kinds of algorithms available for protein design.

In July 2020 scientists reported the development of an AI-based process using genome databases for evolution-based design of novel proteins. They used deep learning to identify design rules.

With mathematical guarantees

Dead-end elimination

The dead-end elimination (DEE) algorithm reduces the search space of the problem iteratively by removing rotamers that can be provably shown not to be part of the global minimum energy conformation (GMEC). On each iteration, the dead-end elimination algorithm compares all possible pairs of rotamers at each residue position, and removes each rotamer r'_i that can be shown to always be of higher energy than another rotamer r_i and is thus not part of the GMEC:
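(One common form is the original singles-elimination condition, shown here as a sketch; refinements such as the Goldstein criterion prune more aggressively.) Rotamer r'_i can be eliminated if another rotamer r_i exists such that

E(r'_i) + Σ_{j≠i} min_{r_j} E(r'_i, r_j)  >  E(r_i) + Σ_{j≠i} max_{r_j} E(r_i, r_j),

that is, when the best possible case for r'_i is still worse than the worst possible case for r_i.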

Other powerful extensions to the dead-end elimination algorithm include the pairs elimination criterion, and the generalized dead-end elimination criterion. This algorithm has also been extended to handle continuous rotamers with provable guarantees.

Although the dead-end elimination algorithm runs in polynomial time on each iteration, it cannot guarantee convergence. If, after a certain number of iterations, the dead-end elimination algorithm does not prune any more rotamers, then either rotamers have to be merged or another search algorithm must be used to search the remaining search space. In such cases, dead-end elimination acts as a pre-filtering algorithm to reduce the search space, while other algorithms, such as A*, Monte Carlo, linear programming, or FASTER, are used to search the remaining search space.

Branch and bound

The protein design conformational space can be represented as a tree, where the protein residues are ordered in an arbitrary way, and the tree branches at each of the rotamers in a residue. Branch and bound algorithms use this representation to efficiently explore the conformation tree: At each branching, branch and bound algorithms bound the conformation space and explore only the promising branches.

A popular search algorithm for protein design is the A* search algorithm. A* computes a lower-bound score on each partial tree path that lower bounds (with guarantees) the energy of each of the expanded rotamers. Each partial conformation is added to a priority queue, and at each iteration the partial path with the lowest lower bound is popped from the queue and expanded. The algorithm stops once a full conformation has been enumerated and guarantees that this conformation is optimal.

The A* score f in protein design consists of two parts, f=g+h. g is the exact energy of the rotamers that have already been assigned in the partial conformation. h is a lower bound on the energy of the rotamers that have not yet been assigned. Each is designed as follows, where d is the index of the last assigned residue in the partial conformation.
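One standard way of writing these terms (a sketch in the spirit of the Leach and Lemon bound; implementations differ in detail) is

g = Σ_{i=1..d} [ E(r_i) + Σ_{j=1..i-1} E(r_i, r_j) ]

h = Σ_{j=d+1..n} min_{r_j} [ E(r_j) + Σ_{i=1..d} E(r_i, r_j) + Σ_{k=j+1..n} min_{r_k} E(r_j, r_k) ],

so g counts only interactions among already-assigned rotamers, while h optimistically assumes the best possible rotamer at every unassigned position, which keeps the bound admissible.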

Integer linear programming

The problem of optimizing E_T (Equation (1)) can be easily formulated as an integer linear program (ILP). One of the most powerful formulations uses binary variables to represent the presence of a rotamer and of edges between rotamers in the final solution, and constrains the solution to have exactly one rotamer for each residue and one pairwise interaction for each pair of residues:

minimize   Σ_i Σ_{r_i} E(r_i) q_i(r_i) + Σ_i Σ_{j>i} Σ_{r_i} Σ_{r_j} E(r_i, r_j) q_ij(r_i, r_j)

s.t.       Σ_{r_i} q_i(r_i) = 1                  for every residue position i
           Σ_{r_j} q_ij(r_i, r_j) = q_i(r_i)     for every pair i, j and every rotamer r_i
           q_i(r_i), q_ij(r_i, r_j) ∈ {0, 1}

ILP solvers, such as CPLEX, can compute the exact optimal solution for large instances of protein design problems. These solvers use a linear programming relaxation of the problem, where qi and qij are allowed to take continuous values, in combination with a branch and cut algorithm to search only a small portion of the conformation space for the optimal solution. ILP solvers have been shown to solve many instances of the side-chain placement problem.
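As an illustration of the formulation above (a sketch, not any particular published implementation; it assumes the open-source PuLP package with its bundled CBC solver is installed), the toy two-position problem used earlier can be written and solved as an ILP:

import pulp

rotamers = {0: ["A1", "A2"], 1: ["B1", "B2", "B3"]}
E_self = {"A1": 1.0, "A2": 0.5, "B1": 0.0, "B2": 2.0, "B3": -0.5}
E_pair = {("A1", "B1"): 0.3, ("A1", "B2"): -1.0, ("A1", "B3"): 0.8,
          ("A2", "B1"): 0.1, ("A2", "B2"): 0.4, ("A2", "B3"): -0.2}

prob = pulp.LpProblem("toy_protein_design", pulp.LpMinimize)
# Binary rotamer variables q_i and pairwise variables q_ij, as in the formulation.
q = {r: pulp.LpVariable(f"q_{r}", cat="Binary") for rs in rotamers.values() for r in rs}
q2 = {p: pulp.LpVariable(f"q_{p[0]}_{p[1]}", cat="Binary") for p in E_pair}

# Objective: selected self energies plus selected pairwise energies.
prob += pulp.lpSum(E_self[r] * q[r] for r in q) + pulp.lpSum(E_pair[p] * q2[p] for p in q2)

# Exactly one rotamer per residue position.
for rs in rotamers.values():
    prob += pulp.lpSum(q[r] for r in rs) == 1

# Pairwise variables must agree with the chosen rotamers.
for ri in rotamers[0]:
    prob += pulp.lpSum(q2[(ri, rj)] for rj in rotamers[1]) == q[ri]
for rj in rotamers[1]:
    prob += pulp.lpSum(q2[(ri, rj)] for ri in rotamers[0]) == q[rj]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([r for r in q if q[r].value() > 0.5])  # expected: ['A2', 'B3']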

Message-passing based approximations to the linear programming dual

ILP solvers depend on linear programming (LP) algorithms, such as Simplex or barrier-based methods, to perform the LP relaxation at each branch. These LP algorithms were developed as general-purpose optimization methods and are not optimized for the protein design problem (Equation (1)). In consequence, the LP relaxation becomes the bottleneck of ILP solvers when the problem size is large. Recently, several alternatives based on message-passing algorithms have been designed specifically for the optimization of the LP relaxation of the protein design problem. These algorithms can approximate either the dual or the primal instance of the integer program, but in order to maintain guarantees on optimality they are most useful when used to approximate the dual of the protein design problem, because approximating the dual guarantees that no solutions are missed. Message-passing-based approximations include the tree-reweighted max-product message passing algorithm and the message passing linear programming algorithm.

Optimization algorithms without guarantees

Monte Carlo and simulated annealing

Monte Carlo is one of the most widely used algorithms for protein design. In its simplest form, a Monte Carlo algorithm selects a residue at random and, at that residue, evaluates a randomly chosen rotamer (of any amino acid). The new energy of the protein, E_new, is compared against the old energy, E_old, and the new rotamer is accepted with probability

P_accept = min( 1, exp( (E_old − E_new) / (βT) ) ),

where β is the Boltzmann constant and the temperature T can be chosen such that in the initial rounds it is high and is slowly annealed to overcome local minima.
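A minimal sketch of Monte Carlo with simulated annealing over rotamer choices (the move set, linear cooling schedule and function names are illustrative, not those of any specific design program; it can be run on the toy energy tables defined earlier):

import math
import random

def simulated_annealing(rotamers_per_position, energy, n_steps=10000,
                        t_start=100.0, t_end=1.0):
    # Start from a random rotamer assignment.
    current = [random.choice(rs) for rs in rotamers_per_position]
    e_old = energy(current)
    for step in range(n_steps):
        # Anneal the temperature from t_start down to t_end.
        t = t_start + (t_end - t_start) * step / n_steps
        i = random.randrange(len(current))                     # pick a residue at random
        proposal = current[:]
        proposal[i] = random.choice(rotamers_per_position[i])  # random rotamer at that residue
        e_new = energy(proposal)
        # Metropolis criterion: always accept downhill moves, sometimes uphill ones.
        if e_new <= e_old or random.random() < math.exp((e_old - e_new) / t):
            current, e_old = proposal, e_new
    return current, e_old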

FASTER

The FASTER algorithm uses a combination of deterministic and stochastic criteria to optimize amino acid sequences. FASTER first uses DEE to eliminate rotamers that are not part of the optimal solution. Then, a series of iterative steps optimize the rotamer assignment.

Belief propagation

In belief propagation for protein design, the algorithm exchanges messages that describe the belief that each residue has about the probability of each rotamer in neighboring residues. The algorithm updates messages on every iteration and iterates until convergence or until a fixed number of iterations. Convergence is not guaranteed in protein design. The message m_i→j(r_j) that a residue i sends to each rotamer r_j at a neighboring residue j is defined as:
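(One common max-product form, shown as a sketch since normalization and temperature factors vary between implementations:)

m_i→j(r_j) = max_{r_i} [ exp( −E(r_i) − E(r_i, r_j) ) × Π_{k ∈ N(i), k≠j} m_k→i(r_i) ],

where N(i) is the set of residue positions that interact with residue i.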

Both max-product and sum-product belief propagation have been used to optimize protein design.

Applications and examples of designed proteins

Enzyme design

The design of new enzymes is a use of protein design with huge bioengineering and biomedical applications. In general, designing a protein structure differs from designing an enzyme, because the design of enzymes must consider the many states involved in the catalytic mechanism. However, protein design is a prerequisite of de novo enzyme design because, at the very least, the design of catalysts requires a scaffold into which the catalytic mechanism can be inserted.

Great progress in de novo enzyme design, and redesign, was made in the first decade of the 21st century. In three major studies, David Baker and coworkers de novo designed enzymes for the retro-aldol reaction, a Kemp-elimination reaction, and for the Diels-Alder reaction. Furthermore, Stephen Mayo and coworkers developed an iterative method to design the most efficient known enzyme for the Kemp-elimination reaction. Also, in the laboratory of Bruce Donald, computational protein design was used to switch the specificity of one of the protein domains of the nonribosomal peptide synthetase that produces Gramicidin S, from its natural substrate phenylalanine to other noncognate substrates including charged amino acids; the redesigned enzymes had activities close to those of the wild-type.

Design for affinity

Protein–protein interactions are involved in most biological processes. Many of the hardest-to-treat diseases, such as Alzheimer's, many forms of cancer (e.g., TP53), and human immunodeficiency virus (HIV) infection, involve protein–protein interactions. Thus, to treat such diseases, it is desirable to design protein or protein-like therapeutics that bind one of the partners of the interaction and, thus, disrupt the disease-causing interaction. This requires designing protein therapeutics for affinity toward their binding partner.

Protein–protein interactions can be designed using protein design algorithms because the principles that rule protein stability also rule protein–protein binding. Protein–protein interaction design, however, presents challenges not commonly present in protein design. One of the most important challenges is that, in general, the interfaces between proteins are more polar than protein cores, and binding involves a tradeoff between desolvation and hydrogen bond formation. To overcome this challenge, Bruce Tidor and coworkers developed a method to improve the affinity of antibodies by focusing on electrostatic contributions. They found that, for the antibodies designed in the study, reducing the desolvation costs of the residues in the interface increased the affinity of the binding pair.

Scoring binding predictions

Protein design energy functions must be adapted to score binding predictions because binding involves a trade-off between the lowest-energy conformations of the free proteins (E_P and E_L) and the lowest-energy conformation of the bound complex (E_PL):

ΔE_binding = E_PL − E_P − E_L.

The K* algorithm approximates the binding constant by including conformational entropy in the free energy calculation. The K* algorithm considers only the lowest-energy conformations of the free and bound complexes (denoted by the sets P, L, and PL) to approximate the partition functions of each complex:
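(Sketched here; the published algorithm adds pruning and provable accuracy guarantees.) With R the gas constant and T the temperature,

K* ≈ q_PL / (q_P × q_L),   where q_S = Σ_{x ∈ S} exp(−E(x)/RT) for S ∈ {P, L, PL},

so the score is the ratio of the bound-complex partition function to the product of the free-partner partition functions, each estimated over its low-energy ensemble.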

Design for specificity

The design of protein–protein interactions must be highly specific because proteins can interact with a large number of proteins; successful design requires selective binders. Thus, protein design algorithms must be able to distinguish between on-target (or positive design) and off-target binding (or negative design). One of the most prominent examples of design for specificity is the design of specific bZIP-binding peptides by Amy Keating and coworkers for 19 out of the 20 bZIP families; 8 of these peptides were specific for their intended partner over competing peptides. Further, positive and negative design was also used by Anderson and coworkers to predict mutations in the active site of a drug target that conferred resistance to a new drug; positive design was used to maintain wild-type activity, while negative design was used to disrupt binding of the drug. Recent computational redesign by Costas Maranas and coworkers was also capable of experimentally switching the cofactor specificity of Candida boidinii xylose reductase from NADPH to NADH.

Protein resurfacing

Protein resurfacing consists of designing a protein's surface while preserving the overall fold, core, and boundary regions of the protein intact. Protein resurfacing is especially useful to alter the binding of a protein to other proteins. One of the most important applications of protein resurfacing was the design of the RSC3 probe to select broadly neutralizing HIV antibodies at the NIH Vaccine Research Center. First, residues outside of the binding interface between the gp120 HIV envelope protein and the previously discovered b12 antibody were selected to be designed. Then, the sequence space was selected based on evolutionary information, solubility, similarity with the wild-type, and other considerations. Then the RosettaDesign software was used to find optimal sequences in the selected sequence space. RSC3 was later used to discover the broadly neutralizing antibody VRC01 in the serum of a long-term HIV-infected non-progressor individual.

Design of globular proteins

Globular proteins are proteins that contain a hydrophobic core and a hydrophilic surface. Globular proteins often assume a stable structure, unlike fibrous proteins, which have multiple conformations. The three-dimensional structure of globular proteins is typically easier to determine through X-ray crystallography and nuclear magnetic resonance than that of fibrous and membrane proteins, which makes globular proteins more attractive for protein design than the other types of proteins. Most successful protein designs have involved globular proteins. Both FSD-1 and Top7 were de novo designs of globular proteins. Five more protein structures were designed, synthesized, and verified in 2012 by the Baker group. These new proteins serve no biological function, but the structures are intended to act as building blocks that can be expanded to incorporate functional active sites. The structures were found computationally by using new heuristics based on analyzing the connecting loops between parts of the sequence that specify secondary structures.

Design of membrane proteins

Several transmembrane proteins have been successfully designed, along with many other membrane-associated peptides and proteins. Recently, Costas Maranas and his coworkers developed an automated tool to redesign the pore size of Outer Membrane Porin Type-F (OmpF) from E. coli to any desired sub-nm size and assembled the designed pores in membranes to perform precise angstrom-scale separations.

Other applications

One of the most desirable uses for protein design is for biosensors, proteins that will sense the presence of specific compounds. Some attempts in the design of biosensors include sensors for unnatural molecules including TNT. More recently, Kuhlman and coworkers designed a biosensor of PAK1.

Artificial gene synthesis

From Wikipedia, the free encyclopedia
 
DNA Double Helix

Artificial gene synthesis, or gene synthesis, refers to a group of methods that are used in synthetic biology to construct and assemble genes from nucleotides de novo. Unlike DNA synthesis in living cells, artificial gene synthesis does not require template DNA, allowing virtually any DNA sequence to be synthesized in the laboratory. It comprises two main steps, the first of which is solid-phase DNA synthesis, sometimes known as DNA printing. This produces oligonucleotide fragments that are generally under 200 base pairs. The second step then involves connecting these oligonucleotide fragments using various DNA assembly methods. Because artificial gene synthesis does not require template DNA, it is theoretically possible to make a completely synthetic DNA molecule with no limits on the nucleotide sequence or size.

Synthesis of the first complete gene, a yeast tRNA, was demonstrated by Har Gobind Khorana and coworkers in 1972. Synthesis of the first peptide- and protein-coding genes was performed in the laboratories of Herbert Boyer and Alexander Markham, respectively. More recently, artificial gene synthesis methods have been developed that will allow the assembly of entire chromosomes and genomes. The first synthetic yeast chromosome was synthesised in 2014, and entire functional bacterial chromosomes have also been synthesised. In addition, artificial gene synthesis could in the future make use of novel nucleobase pairs (unnatural base pairs).

Standard methods for DNA synthesis

Oligonucleotide synthesis

Oligonucleotides are chemically synthesized using building blocks called nucleoside phosphoramidites. These can be normal or modified nucleosides which have protecting groups to prevent their amines, hydroxyl groups, and phosphate groups from interacting incorrectly. One phosphoramidite is added at a time: the 5' hydroxyl group of the growing chain is deprotected, a new base is coupled, and the cycle repeats. The chain grows in the 3' to 5' direction, which is backwards relative to biosynthesis. At the end, all the protecting groups are removed. Because the process is chemical, incorrect side reactions occur, leading to some defective products. The longer the oligonucleotide sequence being synthesized, the more defects there are, so this process is only practical for producing short sequences of nucleotides. The current practical limit is about 200 bp (base pairs) for an oligonucleotide of sufficient quality to be used directly in a biological application. HPLC can be used to isolate products with the proper sequence. Meanwhile, a large number of oligos can be synthesized in parallel on gene chips, but for optimal performance in subsequent gene synthesis procedures they should be prepared individually and at larger scales.
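A back-of-the-envelope calculation shows why defects accumulate with length; the 99% per-step coupling efficiency assumed here is an illustrative figure:

# If each coupling step succeeds with probability p, the fraction of
# full-length, error-free product decays exponentially with length.
def full_length_yield(length, coupling_efficiency=0.99):
    """Fraction of molecules with no failed coupling over `length - 1` steps."""
    return coupling_efficiency ** (length - 1)

for n in (20, 60, 100, 200):
    print(f"{n:>3} nt: {full_length_yield(n):.1%} full-length")
# 20 nt: ~83%, 60 nt: ~55%, 100 nt: ~37%, 200 nt: ~14%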

Annealing based connection of oligonucleotides

Usually, a set of individually designed oligonucleotides is made on automated solid-phase synthesizers, purified, and then connected by specific annealing and standard ligation or polymerase reactions. To improve the specificity of oligonucleotide annealing, the synthesis step relies on a set of thermostable DNA ligase and polymerase enzymes. To date, several methods for gene synthesis have been described, such as the ligation of phosphorylated overlapping oligonucleotides, the Fok I method, and a modified form of ligase chain reaction for gene synthesis. Additionally, several PCR assembly approaches have been described. They usually employ oligonucleotides 40-50 nucleotides long that overlap each other. These oligonucleotides are designed to cover most of the sequence of both strands, and the full-length molecule is generated progressively by overlap extension (OE) PCR, thermodynamically balanced inside-out (TBIO) PCR, or combined approaches. The most commonly synthesized genes range in size from 600 to 1,200 bp, although much longer genes have been made by connecting previously assembled fragments of under 1,000 bp. In this size range it is necessary to test several candidate clones, confirming the sequence of the cloned synthetic gene by automated sequencing methods.

Limitations

The assembly of the full-length gene product relies on the efficient and specific alignment of long single-stranded oligonucleotides, so critical parameters for synthesis success include extended sequence regions comprising secondary structures caused by inverted repeats, extraordinarily high or low GC-content, or repetitive structures. Usually these segments of a particular gene can only be synthesized by splitting the procedure into several consecutive steps and a final assembly of shorter sub-sequences, which in turn leads to a significant increase in the time and labor needed for production.

The result of a gene synthesis experiment depends strongly on the quality of the oligonucleotides used. For these annealing-based gene synthesis protocols, the quality of the product depends directly and exponentially on the correctness of the employed oligonucleotides. Alternatively, after performing gene synthesis with oligos of lower quality, more effort must be made in downstream quality assurance during clone analysis, which is usually done by time-consuming standard cloning and sequencing procedures. Another problem associated with all current gene synthesis methods is the high frequency of sequence errors caused by the use of chemically synthesized oligonucleotides. The error frequency increases with longer oligonucleotides, and as a consequence the percentage of correct product decreases dramatically as more oligonucleotides are used. The mutation problem could be addressed by using shorter oligonucleotides to assemble the gene. However, all annealing-based assembly methods require the primers to be mixed together in one tube. In this case, shorter overlaps do not always allow precise and specific annealing of complementary primers, resulting in the inhibition of full-length product formation.

Manual design of oligonucleotides is a laborious procedure and does not guarantee the successful synthesis of the desired gene. For optimal performance of almost all annealing-based methods, the melting temperatures of the overlapping regions are supposed to be similar for all oligonucleotides. The necessary primer optimisation should be performed using specialized oligonucleotide design programs, and several solutions for automated primer design for gene synthesis have been published.
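As a sketch of the kind of check such a design program performs, the following estimates overlap melting temperatures with the simple Wallace rule (real tools use nearest-neighbor thermodynamics) and flags overlaps that deviate from the mean; the overlap sequences are made up:

def wallace_tm(seq):
    """Rough melting temperature: 2 C per A/T plus 4 C per G/C (short oligos only)."""
    seq = seq.upper()
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

overlaps = ["ATGCGTACCGTTAGC", "GGGCGCCCGTAGCCA", "ATATATTTAAATATT"]
tms = [wallace_tm(o) for o in overlaps]
mean_tm = sum(tms) / len(tms)
for o, tm in zip(overlaps, tms):
    flag = "OK" if abs(tm - mean_tm) <= 5 else "re-design"
    print(f"{o}  Tm~{tm} C  {flag}")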

Error correction procedures

To overcome problems associated with oligonucleotide quality several elaborate strategies have been developed, employing either separately prepared fishing oligonucleotides, mismatch binding enzymes of the mutS family or specific endonucleases from bacteria or phages. Nevertheless, all these strategies increase time and costs for gene synthesis based on the annealing of chemically synthesized oligonucleotides.

Massively parallel sequencing has also been used as a tool to screen complex oligonucleotide libraries and enable the retrieval of accurate molecules. In one approach, oligonucleotides are sequenced on the 454 pyrosequencing platform and a robotic system images and picks individual beads corresponding to accurate sequence. In another approach, a complex oligonucleotide library is modified with unique flanking tags before massively parallel sequencing. Tag-directed primers then enable the retrieval of molecules with desired sequences by dial-out PCR.

Increasingly, genes are ordered in sets that include functionally related genes or multiple sequence variants of a single gene. Virtually all of the therapeutic proteins in development, such as monoclonal antibodies, are optimised by testing many gene variants for improved function or expression.

Unnatural base pairs

While traditional nucleic acid synthesis only uses the four natural bases (adenine, thymine, guanine and cytosine, which pair as A-T and G-C), oligonucleotide synthesis in the future could incorporate the use of unnatural base pairs, which are artificially designed and synthesized nucleobases that do not occur in nature.

In 2012, a group of American scientists led by Floyd Romesberg, a chemical biologist at the Scripps Research Institute in San Diego, California, published the design of an unnatural base pair (UBP). The two new artificial nucleotides were named d5SICS and dNaM. More technically, these artificial nucleotides bear hydrophobic nucleobases featuring two fused aromatic rings that form a (d5SICS–dNaM) complex or base pair in DNA. In 2014 the same team from the Scripps Research Institute reported that they had synthesized a stretch of circular DNA known as a plasmid containing natural T-A and C-G base pairs along with the best-performing UBP Romesberg's laboratory had designed, and inserted it into cells of the common bacterium E. coli, which successfully replicated the unnatural base pairs through multiple generations. This is the first known example of a living organism passing along an expanded genetic code to subsequent generations. This was achieved in part by the addition of a supportive algal gene that expresses a nucleotide triphosphate transporter which efficiently imports the triphosphates of both d5SICS and dNaM (d5SICSTP and dNaMTP) into E. coli bacteria. The natural bacterial replication pathways then use them to accurately replicate the plasmid containing d5SICS–dNaM.

The successful incorporation of a third base pair is a significant breakthrough toward the goal of greatly expanding the number of amino acids which can be encoded by DNA, from the existing 20 amino acids to a theoretically possible 172, thereby expanding the potential for living organisms to produce novel proteins. In the future, these unnatural base pairs could be synthesised and incorporated into oligonucleotides via DNA printing methods.

DNA assembly

DNA printing can thus be used to produce DNA parts, which are defined as sequences of DNA that encode a specific biological function (for example, promoters, transcription regulatory sequences or open reading frames). However, because oligonucleotide synthesis typically cannot accurately produce oligonucleotide sequences longer than a few hundred base pairs, DNA assembly methods have to be employed to assemble these parts together to create functional genes, multi-gene circuits or even entire synthetic chromosomes or genomes. Some DNA assembly techniques only define protocols for joining DNA parts, while other techniques also define the rules for the format of DNA parts that are compatible with them. These processes can be scaled up to enable the assembly of entire chromosomes or genomes. In recent years, there has been a proliferation in the number of different DNA assembly standards, with 14 different assembly standards developed as of 2015, each with its pros and cons. Overall, the development of DNA assembly standards has greatly facilitated the workflow of synthetic biology, aided the exchange of material between research groups and also allowed for the creation of modular and reusable DNA parts.

The various DNA assembly methods can be classified into three main categories – endonuclease-mediated assembly, site-specific recombination, and long-overlap-based assembly. Each group of methods has its own distinct characteristics, advantages, and limitations.

Endonuclease-mediated assembly

Endonucleases are enzymes that recognise and cleave nucleic acid segments and they can be used to direct DNA assembly. Of the different types of restriction enzymes, the type II restriction enzymes are the most commonly available and used because their cleavage sites are located near or in their recognition sites. Hence, endonuclease-mediated assembly methods make use of this property to define DNA parts and assembly protocols.

BioBricks

BBF RFC 10 assembly of two BioBricks-compatible parts. Treating the upstream part with EcoRI and SpeI, and the downstream part with EcoRI and XbaI, allows assembly in the desired order. Because SpeI and XbaI produce complementary overhangs, the two DNA fragments can be ligated together, producing a scar sequence at the junction. The original prefix and suffix restriction sites are maintained in the final construct, which can then be used for further BioBricks reactions.

The BioBricks assembly standard was described and introduced by Tom Knight in 2003 and it has been constantly updated since then. Currently, the most commonly used BioBricks standard is the assembly standard 10, or BBF RFC 10. BioBricks defines the prefix and suffix sequences required for a DNA part to be compatible with the BioBricks assembly method, allowing the joining together of all DNA parts which are in the BioBricks format.

The prefix contains the restriction sites for EcoRI, NotI and XbaI, while the suffix contains the SpeI, NotI and PstI restriction sites. Outside of the prefix and suffix regions, the DNA part must not contain any of these restriction sites. To join two BioBrick parts together, one of the plasmids is digested with EcoRI and SpeI while the second plasmid is digested with EcoRI and XbaI. The two EcoRI overhangs are complementary and will thus anneal together, while SpeI and XbaI also produce complementary overhangs which can be ligated together. As the resulting plasmid contains the original prefix and suffix sequences, it can be used to join with more BioBricks parts. Because of this property, the BioBricks assembly standard is said to be idempotent in nature. However, there will also be a "scar" sequence (either TACTAG or TACTAGAG) formed between the two fused BioBricks. This prevents BioBricks from being used to create fusion proteins, as the 6bp scar sequence codes for a tyrosine and a stop codon, causing translation to be terminated after the first domain is expressed, while the 8bp scar sequence causes a frameshift, preventing continuous readthrough of the codons. To offer alternative scar sequences, for example 6bp scars or scar sequences that do not contain stop codons, other assembly standards such as the BB-2 Assembly, BglBricks Assembly, Silver Assembly and the Freiburg Assembly were designed.
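A minimal sketch of an RFC 10 compatibility check; the recognition sequences are the standard ones for these enzymes, and the example part sequence is made up:

# A candidate BioBrick part must not contain, in its internal sequence, any of
# the restriction sites used by the prefix and suffix.
FORBIDDEN_SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
    "NotI": "GCGGCCGC",
}

def rfc10_compatible(part_seq):
    """Return (is_compatible, list of enzymes whose sites occur in the part)."""
    part_seq = part_seq.upper()
    problems = [name for name, site in FORBIDDEN_SITES.items() if site in part_seq]
    return (len(problems) == 0, problems)

ok, hits = rfc10_compatible("ATGGCTAGCAAAGGAGAATTCACCCTG")
print(ok, hits)  # False, ['EcoRI']: this part carries an internal EcoRI site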

While the easiest method to assemble BioBrick parts is described above, there also exist several other commonly used assembly methods that offer several advantages over the standard assembly. The 3 antibiotic (3A) assembly allows for the correct assembly to be selected via antibiotic selection, while the amplified insert assembly seeks to overcome the low transformation efficiency seen in 3A assembly.

The BioBrick assembly standard has also served as inspiration for using other types of endonucleases for DNA assembly. For example, both the iBrick standard and the HomeRun vector assembly standards employ homing endonucleases instead of type II restriction enzymes.

Type IIs restriction endonuclease assembly

Some assembly methods also make use of type IIs restriction endonucleases. These differ from other type II endonucleases as they cut several base pairs away from the recognition site. As a result, the overhang sequence can be modified to contain the desired sequence. This provides Type IIs assembly methods with two advantages – it enables "scar-less" assembly, and allows for one-pot, multi-part assembly. Assembly methods that use type IIs endonucleases include Golden Gate and its associated variants.

Golden Gate cloning
The sequence of DNA parts for the Golden Gate assembly can be directed by defining unique complementary overhangs for each part. Thus, to assemble gene 1 in order of fragment A, B and C, the 3' overhang for fragment A is complementary to the 5' overhang for fragment B, and similarly for fragment B and fragment C. For the destination plasmid, the selectable marker is flanked by outward-cutting BsaI restriction sites. This excises the selectable marker, allowing the insertion of the final construct. T4 ligase is used to ligate the fragments together and to the destination plasmid.

The Golden Gate assembly protocol was described by Engler et al. in 2008 as a DNA assembly method that gives a final construct without a scar sequence, while also lacking the original restriction sites. This allows the protein to be expressed without unwanted additional protein sequences that could negatively affect protein folding or expression. By using the BsaI restriction enzyme, which produces a 4 base pair overhang, up to 240 unique, non-palindromic sequences can be used for assembly.

Plasmid design and assembly

In Golden Gate cloning, each DNA fragment to be assembled is placed in a plasmid, flanked by inward facing BsaI restriction sites containing the programmed overhang sequences. For each DNA fragment, the 3' overhang sequence is complementary to the 5' overhang of the next downstream DNA fragment. For the first fragment, the 5' overhang is complementary to the 5' overhang of the destination plasmid, while the 3' overhang of the final fragment is complementary to the 3' overhang of the destination plasmid. Such a design allows for all DNA fragments to be assembled in a one-pot reaction (where all reactants are mixed together), with all fragments arranged in the correct sequence. Successfully assembled constructs are selected by detecting the loss of function of a screening cassette that was originally in the destination plasmid.
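A sketch of the bookkeeping behind such a design, with each fragment reduced to the 4 bp junction codes it exposes after BsaI digestion (a simplification; the overhang and vector codes here are illustrative):

fragments = [
    {"name": "A", "left": "AATG", "right": "GCAA"},
    {"name": "B", "left": "GCAA", "right": "TTCG"},
    {"name": "C", "left": "TTCG", "right": "CGCT"},
]

def check_golden_gate_order(frags, vector_left="AATG", vector_right="CGCT"):
    """Verify that junctions chain the fragments in order and are all unique."""
    junctions = [vector_left] + [f["right"] for f in frags]
    assert frags[0]["left"] == vector_left, "first fragment must match the vector"
    assert frags[-1]["right"] == vector_right, "last fragment must match the vector"
    for upstream, downstream in zip(frags, frags[1:]):
        assert upstream["right"] == downstream["left"], (
            f"{upstream['name']} and {downstream['name']} junctions do not match")
    assert len(set(junctions)) == len(junctions), "junction overhangs must be unique"
    return True

print(check_golden_gate_order(fragments))  # True for this toy design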

MoClo and Golden Braid

The original Golden Gate assembly only allows for a single construct to be made in the destination vector. To enable this construct to be used in a subsequent reaction as an entry vector, the MoClo and Golden Braid standards were designed.

The MoClo standard involves defining multiple tiers of DNA assembly:

  • Tier 1: Tier 1 assembly is the standard Golden Gate assembly, and genes are assembled from their component parts (DNA parts coding for genetic elements like UTRs, promoters, ribosome binding sites or terminator sequences). Flanking the insertion site of the tier 1 destination vectors are a pair of inward-cutting BpiI restriction sites. This allows these plasmids to be used as entry vectors for tier 2 destination vectors.
  • Tier 2: Tier 2 assembly involves further assembling the genes assembled in tier 1 into multi-gene constructs. If there is a need for further, higher-tier assembly, inward-cutting BsaI restriction sites can be added to flank the insertion sites. These vectors can then be used as entry vectors for higher-tier constructs.

The MoClo and Golden Braid assembly standards are derivatives of the original Golden Gate assembly standard. The MoClo assembly standard allows Golden Gate constructs to be further assembled in subsequent tiers; in the illustrated example, four genes assembled via tier 1 Golden Gate assembly are assembled into a multi-gene construct in a tier 2 assembly. The Golden Braid assembly standard also builds on the first tier of Golden Gate assembly and assembles further tiers via a pairwise protocol: four tier 1 destination vectors (assembled via Golden Gate assembly) are assembled into two tier 2 destination vectors, which are then used as tier 3 entry vectors for the tier 3 destination vector, alternating restriction enzymes (BpiI for tier 2 and BsaI for tier 3).

Each assembly tier alternates the use of BsaI and BpiI restriction sites to minimise the number of forbidden sites, and sequential assembly for each tier is achieved by following the Golden Gate plasmid design. Overall, the MoClo standard allows for the assembly of a construct that contains multiple transcription units, all assembled from different DNA parts, by a series of one-pot Golden Gate reactions. However, one drawback of the MoClo standard is that it requires the use of 'dummy parts' with no biological function if the final construct requires fewer than four component parts. The Golden Braid standard, on the other hand, introduced a pairwise Golden Gate assembly standard.

The Golden Braid standard uses the same tiered assembly as MoClo, but each tier only involves the assembly of two DNA fragments, i.e. a pairwise approach. Hence in each tier, pairs of genes are cloned into a destination fragment in the desired sequence, and these are subsequently assembled two at a time in successive tiers. Like MoClo, the Golden Braid standard alternates the BsaI and BpiI restriction enzymes between each tier.

The development of the Golden Gate assembly method and its variants has allowed researchers to design toolkits to speed up the synthetic biology workflow. For example, EcoFlex was developed as a toolkit for E. coli that uses the MoClo standard for its DNA parts, while a similar toolkit has also been developed for engineering the microalga Chlamydomonas reinhardtii.

Site-specific recombination

The Gateway Cloning entry vectors must first be produced using a synthesised DNA fragment containing the required attB sites. Recombination with the donor vector is catalysed by the BP clonase mix and produces the desired entry vector with attL sites.
The desired construct is obtained by recombining the entry vector with the destination vector. In this case, the final construct involves the assembly of two DNA fragments of interest. The attL sites on the entry vector recombine with the attR sites on the destination vector. The lethal gene ccdB is lost from the destination vector, and any bacteria that take up the unwanted vector will die, allowing selection of the desired vector.
The discovery of orthogonal att sites that are specific (i.e. each orthogonal attL only reacts with its partner attR) allowed for the development of Multisite Gateway Cloning technology. This allows multiple DNA fragments to be assembled, with the order of assembly directed by the att sites.
Summary of the key Gateway Cloning technologies.

Site-specific recombination makes use of phage integrases instead of restriction enzymes, eliminating the need for restriction sites in the DNA fragments. Instead, integrases make use of unique attachment (att) sites, and catalyse DNA rearrangement between the target fragment and the destination vector. The Invitrogen Gateway cloning system was invented in the late 1990s and uses two proprietary enzyme mixtures, BP clonase and LR clonase. The BP clonase mix catalyses the recombination between attB and attP sites, generating hybrid attL and attR sites, while the LR clonase mix catalyses the recombination of attL and attR sites to give attB and attP sites. As each enzyme mix recognises only specific att sites, recombination is highly specific and the fragments can be assembled in the desired sequence.

Vector design and assembly

Because Gateway cloning is a proprietary technology, all Gateway reactions must be carried out with the Gateway kit that is provided by the manufacturer. The reaction can be summarised into two steps. The first step involves assembling the entry clones containing the DNA fragment of interest, while the second step involves inserting this fragment of interest into the destination clone.

  1. Entry clones must be made using the supplied "Donor" vectors containing a Gateway cassette flanked by attP sites. The Gateway cassette contains a bacterial suicide gene (e.g. ccdB), whose loss upon recombination allows for the survival and selection of successfully recombined entry clones. A pair of attB sites is added to flank the DNA fragment of interest, and this allows recombination with the attP sites when the BP clonase mix is added. Entry clones are produced in which the fragment of interest is flanked by attL sites.
  2. The destination vector also comes with a Gateway cassette, but it is instead flanked by a pair of attR sites. Mixing this destination plasmid with the entry clones and the LR clonase mix allows recombination to occur between the attR and attL sites. A destination clone is produced, with the fragment of interest successfully inserted. The lethal gene is transferred to the by-product plasmid, and bacteria that take up this unwanted plasmid will die. The desired vector can thus be easily selected.

The earliest iterations of the Gateway cloning method allowed only one entry clone to be used for each destination clone produced. However, further research revealed that four more orthogonal att sequences could be generated, allowing for the assembly of up to four different DNA fragments, and this process is now known as the Multisite Gateway technology.
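Conceptually, a Multisite Gateway assembly is a matching problem: because each attL variant recombines only with its partner attR, the junctions fully determine the order of the fragments. The sketch below models that matching with hypothetical fragment names and att labels; it illustrates the ordering logic only, not the clonase chemistry.

```python
# Toy illustration of Multisite Gateway ordering: each entry clone is flanked
# by two orthogonal attL sites, and each attL recombines only with its cognate
# attR site in the destination vector, so the fragments can assemble in only
# one order. Fragment names and att labels below are hypothetical.

entry_clones = [
    {"fragment": "promoter",   "left": "attL1", "right": "attL2"},
    {"fragment": "gene",       "left": "attL2", "right": "attL3"},
    {"fragment": "terminator", "left": "attL3", "right": "attL4"},
]

def order_by_att(clones, first_site="attL1"):
    """Chain entry clones by matching the right att site of one to the left site of the next."""
    by_left = {c["left"]: c for c in clones}
    order, site = [], first_site
    while site in by_left:
        clone = by_left.pop(site)
        order.append(clone["fragment"])
        site = clone["right"]
    return order

print(order_by_att(entry_clones))   # ['promoter', 'gene', 'terminator']
```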

Besides Gateway cloning, non-commercial methods using other integrases have also been developed. For example, the Serine Integrase Recombinational Assembly (SIRA) method uses the ϕC31 integrase, while the Site-Specific Recombination-based Tandem Assembly (SSRTA) method uses the Streptomyces phage φBT1 integrase. Other methods, such as the HomeRun Vector Assembly System (HVAS), build on the Gateway cloning system and further incorporate homing endonucleases to design a protocol that could potentially support the industrial synthesis of synthetic DNA constructs.

[Figure: Long-overlap-based assembly - long overlap regions shared by adjacent DNA parts form complementary overhangs that anneal by base pairing, a principle used by methods such as Gibson assembly, CPEC and MODAL.]

Long-overlap-based assembly

A variety of long-overlap-based assembly methods have been developed in recent years. One of the most commonly used, the Gibson assembly method, was developed in 2009 and provides a one-pot DNA assembly method that does not require the use of restriction enzymes or integrases. Other similar overlap-based assembly methods include Circular Polymerase Extension Cloning (CPEC), Sequence and Ligation Independent Cloning (SLIC) and Seamless Ligation Cloning Extract (SLiCE). Despite the availability of many overlap assembly methods, the Gibson assembly method remains the most popular. Besides the methods listed above, other researchers have built on the concepts used in Gibson assembly and other assembly methods to develop new assembly strategies such as the Modular Overlap-Directed Assembly with Linkers (MODAL) strategy and the Biopart Assembly Standard for Idempotent Cloning (BASIC) method.

Gibson assembly

The Gibson assembly method is a relatively straightforward DNA assembly method, requiring only a few additional reagents: T5 exonuclease (a 5'→3' exonuclease), Phusion DNA polymerase, and Taq DNA ligase. The DNA fragments to be assembled are synthesised with overlapping 5' and 3' ends, in the order in which they are to be assembled. These reagents are mixed with the DNA fragments at 50 °C and the following reactions occur:

  1. The T5 exonuclease chews back DNA from the 5' end of each fragment, exposing 3' overhangs on each DNA fragment.
  2. The complementary overhangs on adjacent DNA fragments anneal via complementary base pairing.
  3. The Phusion DNA polymerase fills in any gaps where the fragments anneal.
  4. Taq DNA ligase repairs the nicks on both DNA strands.

Because the T5 exonuclease is heat-labile, it is inactivated at 50 °C after the initial chew-back step. The product is thus stable, and the fragments are assembled in the desired order. This one-pot protocol can assemble up to five different fragments accurately, while several commercial providers offer kits to accurately assemble up to 15 different fragments in a two-step reaction. However, while the Gibson assembly protocol is fast and uses relatively few reagents, it requires bespoke DNA synthesis, as each fragment has to be designed to contain overlapping sequences with the adjacent fragments and amplified via PCR. This reliance on PCR may also affect the fidelity of the reaction when long fragments, fragments with high GC content, or repeat sequences are used.
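The outcome of these overlap-directed steps can be mimicked in silico: when adjacent fragments are designed to share identical terminal sequences, the assembled product is simply each fragment joined to the next with the shared overlap counted once. The sketch below performs only this string-level joining, with invented sequences and a 12 nt overlap; it does not model the exonuclease, polymerase, or ligase chemistry.

```python
# In silico sketch of overlap-directed joining (the outcome of a Gibson
# reaction): adjacent fragments share identical terminal sequences, so the
# product is the fragments concatenated with each shared overlap counted once.
# The sequences and the 12 nt overlap length are invented for illustration.

def join_pair(upstream, downstream, min_overlap):
    """Join two fragments at the longest shared upstream-end / downstream-start sequence."""
    for size in range(min(len(upstream), len(downstream)), min_overlap - 1, -1):
        if upstream[-size:] == downstream[:size]:
            return upstream + downstream[size:]
    raise ValueError("no sufficient overlap between fragments")

def overlap_assembly(fragments, min_overlap=12):
    product = fragments[0]
    for fragment in fragments[1:]:
        product = join_pair(product, fragment, min_overlap)
    return product

# Three toy fragments; each shares a 12 nt overlap with the next:
f1 = "ATGGCTAGCTAGGACTTACGGATC"
f2 = "GACTTACGGATCCCGGTTAACTGA"   # starts with the last 12 nt of f1
f3 = "CCGGTTAACTGATTTGCACGTAGC"   # starts with the last 12 nt of f2
print(overlap_assembly([f1, f2, f3]))
```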

 
[Figure: The MODAL workflow - a first round of PCR attaches the adaptor prefix and suffix to each DNA part, a second round attaches the predefined linkers, and the order of the parts is directed by the linkers: the same linker sequence is attached to the 3' end of the upstream part and the 5' end of the downstream part, making the parts compatible with Gibson assembly and other overlap methods.]

MODAL

The MODAL strategy defines overlap sequences known as "linkers" to reduce the amount of customisation that needs to be done for each DNA fragment. The linkers were designed using the R2oDNA Designer software, and the overlap regions were designed to be 45 bp long to be compatible with Gibson assembly and other overlap assembly methods. To attach these linkers to the parts to be assembled, PCR is carried out using part-specific primers containing 15 bp prefix and suffix adaptor sequences. The linkers are then attached to the adaptor sequences via a second PCR reaction. To position the DNA fragments, the same linker is attached to the suffix of the desired upstream fragment and the prefix of the desired downstream fragment. Once the linkers are attached, Gibson assembly, CPEC, or other overlap assembly methods can be used to assemble the DNA fragments in the desired order.
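At the sequence level, the two PCR rounds amount to concatenating known sequences onto each part, after which the shared linkers determine which parts end up as neighbours. The sketch below models this with plain string operations; the adaptor and linker sequences are short placeholders, not real 15 bp adaptors or 45 bp R2oDNA Designer outputs.

```python
# String-level sketch of the MODAL workflow: a first "PCR" adds the adaptor
# prefix and suffix to a part, a second adds the predefined linkers, and the
# assembly order follows from which linker is shared between adjacent parts.
# Adaptor, linker and part sequences below are placeholders.

PREFIX_ADAPTOR = "GGCTCG"
SUFFIX_ADAPTOR = "CGAGCC"

def first_pcr(part):
    """Round 1: attach the adaptor prefix and suffix to the part of interest."""
    return PREFIX_ADAPTOR + part + SUFFIX_ADAPTOR

def second_pcr(adapted_part, upstream_linker, downstream_linker):
    """Round 2: attach the predefined linkers onto the adaptor ends."""
    return upstream_linker + adapted_part + downstream_linker

L1, L2, L3, L4 = "AAACCC", "CCCGGG", "GGGTTT", "TTTAAA"   # placeholder linkers

# The same linker (L2) ends the promoter part and starts the gene part,
# so overlap assembly will place them next to each other.
promoter = second_pcr(first_pcr("TTGACAGCTAGCTCAGTATAAT"), L1, L2)
gene     = second_pcr(first_pcr("ATGCGTAAAGGAGAAGAATAA"),  L2, L3)
term     = second_pcr(first_pcr("AAAAAAAGCCCGCTGA"),       L3, L4)

print(promoter.endswith(L2) and gene.startswith(L2))   # True: shared linker fixes the order
```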

BASIC

The BASIC assembly strategy was developed in 2015 and sought to address the limitations of previous assembly techniques, incorporating six key concepts from them: standard reusable parts; single-tier format (all parts are in the same format and are assembled using the same process); idempotent cloning; parallel (multipart) DNA assembly; size independence; automatability.

DNA parts and linker design

The DNA parts are designed and cloned into storage plasmids, with the part flanked by an integrated prefix (iP) and an integrated suffix (iS) sequence. The iP and iS sequences contain inward-facing BsaI restriction sites whose digestion generates overhangs complementary to the BASIC linkers. As in MODAL, the seven standard linkers used in BASIC were designed with the R2oDNA Designer software and screened to ensure that they contain no sequences with homology to chassis genomes and no unwanted sequences such as secondary-structure-forming sequences, restriction sites or ribosomal binding sites. Each linker sequence is split into two halves, each with a 4 bp overhang complementary to a BsaI-generated overhang, a 12 bp double-stranded sequence, and a 21 bp overlap sequence shared with the other half. The half that binds to the upstream DNA part is known as the suffix linker part (e.g. L1S) and the half that binds to the downstream part is known as the prefix linker part (e.g. L1P). These linkers form the basis of assembling the DNA parts together.
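Part of the screening described above is straightforward to automate. The sketch below checks a made-up 45 bp candidate linker for two of the unwanted features mentioned: internal BsaI recognition sites (on either strand) and a Shine-Dalgarno-like ribosome-binding motif. A real screen, such as that performed with R2oDNA Designer, would additionally check homology to the chassis genome and secondary structure.

```python
# Minimal screen of a candidate linker sequence for two unwanted features:
# internal BsaI recognition sites (GGTCTC on either strand) and a
# Shine-Dalgarno-like ribosome-binding motif (AGGAGG). The candidate sequence
# is invented; real linker design also screens for chassis-genome homology
# and secondary structure.

BSAI_SITE = "GGTCTC"
RBS_MOTIF = "AGGAGG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def screen_linker(seq):
    seq = seq.upper()
    problems = []
    if BSAI_SITE in seq or BSAI_SITE in reverse_complement(seq):
        problems.append("contains a BsaI recognition site")
    if RBS_MOTIF in seq:
        problems.append("contains an RBS-like motif")
    return problems

candidate = "CTATTTGCACGGACTAACCTGAGTCCATATCGTTCAGTCTAACGT"   # 45 bp placeholder
print(screen_linker(candidate) or "no unwanted motifs found")
```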

Besides directing the order of assembly, the standard BASIC linkers can also be modified to carry out other functions. To allow for idempotent assembly, linkers were also designed with additional methylated iP and iS sequences inserted to protect them from being recognised by BsaI. This methylation is lost following transformation and in vivo plasmid replication, and the plasmids can be extracted, purified, and used for further reactions.

Because the linker sequences are relatively long (45 bp for a standard linker), there is an opportunity to incorporate functional DNA sequences and so reduce the number of DNA parts needed during assembly. The BASIC assembly standard provides several linkers embedded with ribosome binding sites (RBSs) of different strengths. Similarly, to facilitate the construction of fusion proteins containing multiple protein domains, several fusion linkers were designed to allow full read-through of the DNA construct. These fusion linkers code for a 15-amino-acid glycine and serine polypeptide, which is an ideal linker peptide for fusion proteins with multiple domains.
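Whether a fusion linker really supports read-through can be checked by translating it: its length must be a multiple of three bases, it must contain no stop codons, and every codon should encode glycine or serine. The sketch below performs this check on a hypothetical 45 bp linker built from GGT (Gly) and AGC (Ser) codons; the published BASIC fusion linker sequences are not reproduced here.

```python
# Check that a hypothetical fusion linker supports in-frame read-through:
# length is a multiple of 3, there are no stop codons, and every codon
# encodes glycine or serine. The 45 bp sequence below is built from GGT (Gly)
# and AGC (Ser) codons purely for illustration; it is not the published
# BASIC fusion linker.

GLY_SER_CODONS = {
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",            # glycine
    "AGT": "S", "AGC": "S", "TCT": "S", "TCC": "S",
    "TCA": "S", "TCG": "S",                                     # serine
}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def translate_gs_linker(seq):
    assert len(seq) % 3 == 0, "linker must preserve the reading frame"
    peptide = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        assert codon not in STOP_CODONS, "stop codon would break read-through"
        peptide.append(GLY_SER_CODONS[codon])   # KeyError if not a Gly/Ser codon
    return "".join(peptide)

linker_dna = ("GGT" * 4 + "AGC") * 3   # (G4S)x3 pattern, 15 residues, 45 bp
print(translate_gs_linker(linker_dna))  # GGGGSGGGGSGGGGS
```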

[Figure: Example of a BASIC assembly construct using four DNA parts - each part is flanked by integrated prefix and suffix sequences (iP and iS) whose BsaI sites allow linker halves with complementary overhangs to be attached; ligating the suffix linker half to the 3' end of the upstream fragment and the prefix linker half to the 5' end of the downstream fragment directs the order of assembly.]

Assembly

There are three main steps in the assembly of the final construct.

  1. First, the DNA parts are excised from the storage plasmid, giving a DNA fragment with BsaI-generated overhangs at its 5' and 3' ends.
  2. Next, each linker part is attached to its respective DNA part by incubating with T4 DNA ligase. Each DNA part will have a suffix and prefix linker part from two different linkers to direct the order of assembly. For example, the first part in the sequence will have L1P and L2S, while the second part will have L2P and L3S attached. The linker parts can be changed to change the sequence of assembly.
  3. Finally, the parts with the attached linkers are assembled into a plasmid by incubating at 50 °C. The 21 bp overhangs of the P and S linker halves anneal, and the final construct can be transformed into bacterial cells for cloning. The single-stranded nicks are repaired in vivo following transformation, producing a stable final construct cloned into a plasmid (the ordering logic of this step is sketched below).
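The ordering logic of the final annealing step can be modelled as matching the two halves of each linker: the suffix half on the upstream part pairs with the prefix half of the same linker on the downstream part. The sketch below recovers the assembly order from that pairing, using invented part names and linker labels; it does not model ligation or the in vivo nick repair.

```python
# Toy model of the final BASIC annealing step: each part carries a prefix
# linker half (e.g. L2P) at its 5' end and a suffix linker half (e.g. L3S) at
# its 3' end, and two parts become neighbours when they carry halves of the
# same linker (L2S on the upstream part anneals to L2P on the downstream part
# via the shared 21 bp overlap). Part names and linker labels are illustrative.

parts_with_linkers = [
    {"part": "RBS-GFP",    "prefix": "L2P", "suffix": "L3S"},
    {"part": "promoter",   "prefix": "L1P", "suffix": "L2S"},
    {"part": "terminator", "prefix": "L3P", "suffix": "L1S"},   # closes the circle back to L1
]

def basic_order(parts, start="L1"):
    """Recover the assembly order by matching suffix and prefix halves of the same linker."""
    by_prefix = {p["prefix"].rstrip("P"): p for p in parts}
    order, linker = [], start
    for _ in parts:
        part = by_prefix[linker]
        order.append(part["part"])
        linker = part["suffix"].rstrip("S")
    return order

print(basic_order(parts_with_linkers))   # ['promoter', 'RBS-GFP', 'terminator']
```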

Applications

As DNA printing and DNA assembly methods have made commercial gene synthesis progressively cheaper in recent years, artificial gene synthesis represents a powerful and flexible engineering tool for creating and designing new DNA sequences and protein functions. Besides synthetic biology, research areas such as heterologous gene expression, vaccine development, gene therapy and molecular engineering would benefit greatly from fast and cheap methods to synthesise DNA coding for proteins and peptides. The methods used for DNA printing and assembly have even enabled the use of DNA as an information storage medium.
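As a minimal illustration of the last point, one naive way to store digital data in DNA is to map every two bits onto one base. The toy scheme below does exactly that and decodes it back; real DNA data-storage systems add error correction and avoid problematic sequences (homopolymers, extreme GC content), which this sketch deliberately ignores.

```python
# Toy illustration of DNA as an information medium: map each pair of bits to
# a base (00->A, 01->C, 10->G, 11->T) and back. Real schemes add error
# correction and sequence constraints that this sketch ignores.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> bytes:
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

message = b"ATG"
dna = encode(message)
print(dna)                       # 12-base strand encoding the 3-byte message
print(decode(dna) == message)    # True
```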

Synthesising bacterial genomes

Synthia and Mycoplasma laboratorium

On June 28, 2007, a team at the J. Craig Venter Institute published an article in Science Express reporting that they had successfully transplanted the natural DNA from a Mycoplasma mycoides bacterium into a Mycoplasma capricolum cell, creating a bacterium which behaved like M. mycoides.

On October 6, 2007, Craig Venter announced in an interview with the UK's The Guardian newspaper that the same team had synthesized a modified version of the single chromosome of Mycoplasma genitalium. The chromosome was modified to eliminate all genes which tests in live bacteria had shown to be unnecessary. The next planned step in this minimal genome project was to transplant the synthesized minimal genome into a bacterial cell with its old DNA removed; the resulting bacterium would be called Mycoplasma laboratorium. The next day, the Canadian bioethics group ETC Group issued a statement through its representative, Pat Mooney, saying Venter's "creation" was "a chassis on which you could build almost anything". At that time, the synthesized genome had not yet been transplanted into a working cell.

On May 21, 2010, Science reported that the Venter group had successfully synthesized the genome of the bacterium Mycoplasma mycoides from a computer record, and transplanted the synthesized genome into an existing cell of a Mycoplasma capricolum bacterium that had had its DNA removed. The "synthetic" bacterium was viable, i.e. capable of replicating billions of times. The team had originally planned to use the M. genitalium bacterium they had previously been working with, but switched to M. mycoides because the latter bacterium grows much faster, which translated into quicker experiments. Venter described it as "the first species ... to have its parents be a computer". The transformed bacterium was dubbed "Synthia" by ETC; a Venter spokesperson had initially declined to confirm any breakthrough.

Synthetic Yeast 2.0

As part of the Synthetic Yeast 2.0 project, various research groups around the world have participated in an effort to synthesise synthetic yeast genomes and, through this process, optimise the genome of the model organism Saccharomyces cerevisiae. The Yeast 2.0 project applied various DNA assembly methods discussed above, and in March 2014 Jef Boeke of the Langone Medical Center at New York University revealed that his team had synthesized chromosome III of S. cerevisiae. The procedure involved replacing the genes in the original chromosome with synthetic versions, and the finished synthetic chromosome was then integrated into a yeast cell. It required designing and creating 273,871 base pairs of DNA – fewer than the 316,667 pairs in the original chromosome. By March 2017, the synthesis of 6 of the 16 chromosomes had been completed, with synthesis of the others still ongoing.
