A Medley of Potpourri

Sunday, December 10, 2023

Algorithmic efficiency

From Wikipedia, the free encyclopedia

In computer science, algorithmic efficiency is a property of an algorithm which relates to the amount of computational resources used by the algorithm. An algorithm must be analyzed to determine its resource usage, and the efficiency of an algorithm can be measured based on the usage of different resources. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process.

For maximum efficiency it is desirable to minimize resource usage. However, different resources such as time and space complexity cannot be compared directly, so which of two algorithms is considered to be more efficient often depends on which measure of efficiency is considered most important.

For example, bubble sort and timsort are both algorithms to sort a list of items from smallest to largest. Bubble sort sorts the list in time proportional to the number of elements squared ( $O (n^{2})$ , see Big O notation), but only requires a small amount of extra memory which is constant with respect to the length of the list ( $O (1)$ ). Timsort sorts the list in time linearithmic (proportional to a quantity times its logarithm) in the list's length ( $O (n \log n)$ ), but has a space requirement linear in the length of the list ( $O (n)$ ). If large lists must be sorted at high speed for a given application, timsort is a better choice; however, if minimizing the memory footprint of the sorting is more important, bubble sort is a better choice.
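The trade-off described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: `bubble_sort` is a hypothetical helper written for this example, and the comparison relies on CPython's built-in `sorted`, which is based on Timsort.

```python
import random

def bubble_sort(items):
    """Sort a list in place: O(n^2) time in the worst case, O(1) extra memory."""
    n = len(items)
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                # Swap neighbours; no auxiliary list is ever allocated.
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:  # a full pass without swaps means the list is sorted
            break
    return items

data = [random.randint(0, 1000) for _ in range(200)]
# Both approaches must agree on the result; they differ in time and memory use.
assert bubble_sort(list(data)) == sorted(data)
```

Both calls produce the same ordering; the difference lies in how running time and working memory grow as the list gets longer.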

Background

The importance of efficiency with respect to time was emphasised by Ada Lovelace in 1843 as applied to Charles Babbage's mechanical analytical engine:

"In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selections amongst them for the purposes of a calculating engine. One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation"

Early electronic computers had both limited speed and limited random access memory. Therefore, a space–time trade-off occurred. A task could use a fast algorithm using a lot of memory, or it could use a slow algorithm using little memory. The engineering trade-off was then to use the fastest algorithm that could fit in the available memory.

Modern computers are significantly faster than the early computers, and have a much larger amount of memory available (gigabytes instead of kilobytes). Nevertheless, Donald Knuth emphasised that efficiency is still an important consideration:

"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"

Overview

An algorithm is considered efficient if its resource consumption, also known as computational cost, is at or below some acceptable level. Roughly speaking, 'acceptable' means: it will run in a reasonable amount of time or space on an available computer, typically as a function of the size of the input. Since the 1950s computers have seen dramatic increases in both the available computational power and in the available amount of memory, so current acceptable levels would have been unacceptable even 10 years ago. In fact, thanks to the approximate doubling of computer power every 2 years, tasks that are acceptably efficient on modern smartphones and embedded systems may have been unacceptably inefficient for industrial servers 10 years ago.

Computer manufacturers frequently bring out new models, often with higher performance. Software costs can be quite high, so in some cases the simplest and cheapest way of getting higher performance might be to just buy a faster computer, provided it is compatible with an existing computer.

There are many ways in which the resources used by an algorithm can be measured: the two most common measures are speed and memory usage; other measures could include transmission speed, temporary disk usage, long-term disk usage, power consumption, total cost of ownership, response time to external stimuli, etc. Many of these measures depend on the size of the input to the algorithm, i.e. the amount of data to be processed. They might also depend on the way in which the data is arranged; for example, some sorting algorithms perform poorly on data which is already sorted, or which is sorted in reverse order.

In practice, there are other factors which can affect the efficiency of an algorithm, such as requirements for accuracy and/or reliability. As detailed below, the way in which an algorithm is implemented can also have a significant effect on actual efficiency, though many aspects of this relate to optimization issues.

Theoretical analysis

In the theoretical analysis of algorithms, the normal practice is to estimate their complexity in the asymptotic sense. The most commonly used notation to describe resource consumption or "complexity" is Donald Knuth's Big O notation, representing the complexity of an algorithm as a function of the size of the input $n$ . Big O notation is an asymptotic measure of function complexity, where $f (n) = O (g (n))$ roughly means that the time requirement for an algorithm grows no faster than a constant multiple of $g (n)$ , omitting lower-order terms that contribute less than $g (n)$ to the growth of the function as $n$ grows arbitrarily large. This estimate may be misleading when $n$ is small, but is generally sufficiently accurate when $n$ is large, as the notation is asymptotic. For example, bubble sort may be faster than merge sort when only a few items are to be sorted; however, either implementation is likely to meet performance requirements for a small list. Typically, programmers are interested in algorithms that scale efficiently to large input sizes, and merge sort is preferred over bubble sort for lists of the lengths encountered in most data-intensive programs.
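For readers who want the precise statement behind the informal description, the usual textbook definition can be written as follows. The constants $c$ and $n_0$ are the standard symbols of that definition and do not appear in the text above.

$$f(n) = O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ \text{such that}\ |f(n)| \le c\, g(n) \ \text{for all}\ n \ge n_0$$

As a worked instance, take $f(n) = 3n^{2} + 5n + 2$. Choosing $c = 4$ and $n_0 = 6$ gives $3n^{2} + 5n + 2 \le 4n^{2}$ for all $n \ge 6$, because $n^{2} - 5n - 2 \ge 0$ from $n = 6$ onward; hence $f(n) = O(n^{2})$ and the lower-order terms $5n + 2$ are absorbed into the constant.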

Some examples of Big O notation applied to algorithms' asymptotic time complexity include:

Notation	Name	Examples
$O (1)$	constant	Finding the median from a sorted list of measurements; Using a constant-size lookup table; Using a suitable hash function for looking up an item.
$O (\log n)$	logarithmic	Finding an item in a sorted array with a binary search or a balanced search tree as well as all operations in a Binomial heap.
$O (n)$	linear	Finding an item in an unsorted list or a malformed tree (worst case) or in an unsorted array; Adding two n-bit integers by ripple carry.
$O (n \log n)$	linearithmic, loglinear, or quasilinear	Performing a Fast Fourier transform; heapsort, quicksort (best and average case), or merge sort
$O (n^{2})$	quadratic	Multiplying two n-digit numbers by a simple algorithm; bubble sort (worst case or naive implementation), Shell sort, quicksort (worst case), selection sort or insertion sort
$O (c^{n}), c > 1$	exponential	Finding the optimal (non-approximate) solution to the travelling salesman problem using dynamic programming; determining if two logical statements are equivalent using brute-force search
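To make the difference between two rows of the table concrete, the sketch below counts the work done by a linear search and a binary search on the same sorted data. The function names are hypothetical and were chosen for this illustration only.

```python
def linear_search_steps(items, target):
    """Scan left to right; return (index, comparisons). O(n) in the worst case."""
    for steps, value in enumerate(items, start=1):
        if value == target:
            return steps - 1, steps
    return -1, len(items)

def binary_search_steps(sorted_items, target):
    """Halve the search range each time; return (index, probes). O(log n)."""
    lo, hi, probes = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

data = list(range(1_000_000))
print(linear_search_steps(data, 999_999))   # one comparison per element scanned
print(binary_search_steps(data, 999_999))   # at most about 20 probes for a million items
```

For a million sorted items the linear scan may need up to a million comparisons, while the binary search never needs more than about twenty probes.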

Benchmarking: measuring performance

For new versions of software or to provide comparisons with competitive systems, benchmarks are sometimes used, which assist with gauging an algorithm's relative performance. If a new sort algorithm is produced, for example, it can be compared with its predecessors to ensure that it is at least as efficient as before with known data, taking into consideration any functional improvements. Benchmarks can be used by customers when comparing various products from alternative suppliers to estimate which product will best suit their specific requirements in terms of functionality and performance. For example, in the mainframe world certain proprietary sort products from independent software companies such as Syncsort compete with products from the major suppliers such as IBM for speed.

Some benchmarks provide opportunities for producing an analysis comparing the relative speed of various compiled and interpreted languages. For example, The Computer Language Benchmarks Game compares the performance of implementations of typical programming problems in several programming languages.

Even creating "do it yourself" benchmarks can demonstrate the relative performance of different programming languages, using a variety of user specified criteria. This is quite simple, as a "Nine language performance roundup" by Christopher W. Cowell-Shah demonstrates by example.

Implementation concerns

Implementation issues can also have an effect on efficiency, such as the choice of programming language, or the way in which the algorithm is actually coded, or the choice of a compiler for a particular language, or the compilation options used, or even the operating system being used. In many cases a language implemented by an interpreter may be much slower than a language implemented by a compiler. See the articles on just-in-time compilation and interpreted languages.

There are other factors which may affect time or space issues, but which may be outside of a programmer's control; these include data alignment, data granularity, cache locality, cache coherency, garbage collection, instruction-level parallelism, multi-threading (at either a hardware or software level), simultaneous multitasking, and subroutine calls.

Some processors have capabilities for vector processing, which allow a single instruction to operate on multiple operands; it may or may not be easy for a programmer or compiler to use these capabilities. Algorithms designed for sequential processing may need to be completely redesigned to make use of parallel processing, or they could be easily reconfigured. As parallel and distributed computing grew in importance in the late 2010s, more investments were made in efficient high-level APIs for parallel and distributed computing systems such as CUDA, TensorFlow, Hadoop, OpenMP and MPI.

Another problem which can arise in programming is that processors compatible with the same instruction set (such as x86-64 or ARM) may implement an instruction in different ways, so that instructions which are relatively fast on some models may be relatively slow on other models. This often presents challenges to optimizing compilers, which must have a great amount of knowledge of the specific CPU and other hardware available on the compilation target to best optimize a program for performance. In the extreme case, a compiler may be forced to emulate instructions not supported on a compilation target platform, forcing it to generate code or link an external library call to produce a result that is otherwise incomputable on that platform, even if it is natively supported and more efficient in hardware on other platforms. This is often the case in embedded systems with respect to floating-point arithmetic, where small and low-power microcontrollers often lack hardware support for floating-point arithmetic and thus require computationally expensive software routines to produce floating point calculations.

Measures of resource usage

Measures are normally expressed as a function of the size of the input $n$ .

The two most common measures are:

Time: how long does the algorithm take to complete?
Space: how much working memory (typically RAM) is needed by the algorithm? This has two aspects: the amount of memory needed by the code (auxiliary space usage), and the amount of memory needed for the data on which the code operates (intrinsic space usage).

For computers whose power is supplied by a battery (e.g. laptops and smartphones), or for very long/large calculations (e.g. supercomputers), other measures of interest are:

Direct power consumption: power needed directly to operate the computer.
Indirect power consumption: power needed for cooling, lighting, etc.

As of 2018, power consumption is growing as an important metric for computational tasks of all types and at all scales ranging from embedded Internet of things devices to system-on-chip devices to server farms. This trend is often referred to as green computing.

Less common measures of computational efficiency may also be relevant in some cases:

Transmission size: bandwidth could be a limiting factor. Data compression can be used to reduce the amount of data to be transmitted. Displaying a picture or image (e.g. Google logo) can result in transmitting tens of thousands of bytes (48K in this case) compared with transmitting six bytes for the text "Google". This is important for I/O bound computing tasks.
External space: space needed on a disk or other external memory device; this could be for temporary storage while the algorithm is being carried out, or it could be long-term storage needed to be carried forward for future reference.
Response time (latency): this is particularly relevant in a real-time application when the computer system must respond quickly to some external event.
Total cost of ownership: particularly if a computer is dedicated to one particular algorithm.

Time

Theory

Analyze the algorithm, typically using time complexity analysis to get an estimate of the running time as a function of the size of the input data. The result is normally expressed using Big O notation. This is useful for comparing algorithms, especially when a large amount of data is to be processed. More detailed estimates are needed to compare algorithm performance when the amount of data is small, although this is likely to be of less importance. Algorithms which include parallel processing may be more difficult to analyze.

Practice

Use a benchmark to time the use of an algorithm. Many programming languages have an available function which provides CPU time usage. For long-running algorithms the elapsed time could also be of interest. Results should generally be averaged over several tests.

Run-based profiling can be very sensitive to hardware configuration and the possibility of other programs or tasks running at the same time in a multi-processing and multi-programming environment.

This sort of test also depends heavily on the selection of a particular programming language, compiler, and compiler options, so algorithms being compared must all be implemented under the same conditions.
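The practice described above can be sketched with the Python standard library. This is a minimal example under stated assumptions: `insertion_sort` and `mean_cpu_time` are hypothetical helpers, `time.process_time` is used as the CPU-time function, and both algorithms are run on identical copies of the same data so that they are compared under the same conditions.

```python
import random
import statistics
import time

def insertion_sort(items):
    """Simple quadratic-time sort, used here only as something slow to measure."""
    for i in range(1, len(items)):
        key = items[i]
        j = i - 1
        while j >= 0 and items[j] > key:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = key

def mean_cpu_time(sort_func, data, repeats=5):
    """Average the CPU time of sort_func over several runs on fresh copies."""
    samples = []
    for _ in range(repeats):
        work = list(data)                # every run starts from the same input
        start = time.process_time()      # CPU time of this process, not wall-clock
        sort_func(work)
        samples.append(time.process_time() - start)
    return statistics.mean(samples)

random.seed(0)                           # fixed seed gives reproducible test data
data = [random.random() for _ in range(1000)]
print("insertion sort:", mean_cpu_time(insertion_sort, data))
print("built-in sort: ", mean_cpu_time(list.sort, data))
```

The absolute numbers depend on the machine, the interpreter version and whatever else is running, which is why only the comparison between the two measurements is meaningful. For elapsed (wall-clock) time, `time.perf_counter` could be substituted for `time.process_time`.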

Space

This section is concerned with use of memory resources (registers, cache, RAM, virtual memory, secondary memory) while the algorithm is being executed. As for time analysis above, analyze the algorithm, typically using space complexity analysis to get an estimate of the run-time memory needed as a function of the size of the input data. The result is normally expressed using Big O notation.

There are up to four aspects of memory usage to consider:

The amount of memory needed to hold the code for the algorithm.
The amount of memory needed for the input data.
The amount of memory needed for any output data.
- Some algorithms, such as sorting, often rearrange the input data and do not need any additional space for output data. This property is referred to as "in-place" operation.
The amount of memory needed as working space during the calculation.
- This includes local variables and any stack space needed by routines called during a calculation; this stack space can be significant for algorithms which use recursive techniques.
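The difference between an "in-place" operation and one that needs separate output space can be observed with Python's `tracemalloc` module. The sketch below is illustrative only: `peak_bytes`, `reverse_in_place` and `reverse_copy` are hypothetical helpers, and list reversal stands in for any algorithm that either rearranges its input or builds a new output.

```python
import tracemalloc

def peak_bytes(func, *args):
    """Return the peak number of bytes Python allocated while func(*args) ran."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()   # also clears the collected traces
    return peak

def reverse_in_place(items):
    items.reverse()          # rearranges the existing list; no output buffer

def reverse_copy(items):
    return items[::-1]       # builds a second list of the same length

data = list(range(100_000))
in_place = peak_bytes(reverse_in_place, data)
copied = peak_bytes(reverse_copy, data)
print(f"in place: {in_place} bytes, copy: {copied} bytes")
```

The in-place version allocates almost nothing beyond its input, while the copying version needs additional space proportional to the length of the list.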

Early electronic computers, and early home computers, had relatively small amounts of working memory. For example, the 1949 Electronic Delay Storage Automatic Calculator (EDSAC) had a maximum working memory of 1024 17-bit words, while the 1980 Sinclair ZX80 came initially with 1024 8-bit bytes of working memory. In the late 2010s, it was typical for personal computers to have between 4 and 32 GB of RAM, roughly 4 million to 32 million times as much memory as the ZX80.

Caching and memory hierarchy

Current computers can have relatively large amounts of memory (possibly gigabytes), so having to squeeze an algorithm into a confined amount of memory is much less of a problem than it used to be. But the presence of four different categories of memory can be significant:

Processor registers, the fastest of computer memory technologies with the least amount of storage space. Most direct computation on modern computers occurs with source and destination operands in registers before being updated to the cache, main memory and virtual memory if needed. On a processor core, there are typically on the order of hundreds of bytes or fewer of register availability, although a register file may contain more physical registers than architectural registers defined in the instruction set architecture.
Cache memory is the second fastest and second smallest memory available in the memory hierarchy. Caches are present in CPUs, GPUs, hard disk drives and external peripherals, and are typically implemented in static RAM. Memory caches are multi-leveled; lower levels are larger, slower and typically shared between processor cores in multi-core processors. In order to process operands in cache memory, a processing unit must fetch the data from the cache, perform the operation in registers and write the data back to the cache. This operates at speeds comparable (about 2-10 times slower) with the CPU or GPU's arithmetic logic unit or floating-point unit if in the L1 cache. It is about 10 times slower if there is an L1 cache miss and it must be retrieved from and written to the L2 cache, and a further 10 times slower if there is an L2 cache miss and it must be retrieved from an L3 cache, if present.
Main physical memory is most often implemented in dynamic RAM (DRAM). The main memory is much larger (typically gigabytes compared to ≈8 megabytes) than an L3 CPU cache, with read and write latencies typically 10-100 times slower. As of 2018, RAM is increasingly implemented on-chip of processors, as CPU or GPU memory.
Virtual memory is most often implemented in terms of secondary storage such as a hard disk, and is an extension to the memory hierarchy that has much larger storage space but much larger latency, typically around 1000 times slower than a cache miss for a value in RAM. While originally motivated to create the impression of higher amounts of memory being available than were truly available, virtual memory is more important in contemporary usage for its time-space tradeoff and enabling the usage of virtual machines. Cache misses from main memory are called page faults, and incur huge performance penalties on programs.

An algorithm whose memory needs will fit in cache memory will be much faster than an algorithm which fits in main memory, which in turn will be very much faster than an algorithm which has to resort to virtual memory. Because of this, cache replacement policies are extremely important to high-performance computing, as are cache-aware programming and data alignment. To further complicate the issue, some systems have up to three levels of cache memory, with varying effective speeds. Different systems will have different amounts of these various types of memory, so the effect of algorithm memory needs can vary greatly from one system to another.

In the early days of electronic computing, if an algorithm and its data would not fit in main memory then the algorithm could not be used. Nowadays the use of virtual memory appears to provide much more memory, but at the cost of performance. If an algorithm and its data will fit in cache memory, then very high speed can be obtained; in this case minimizing space will also help minimize time. This is called the principle of locality, and can be subdivided into locality of reference, spatial locality and temporal locality. An algorithm which will not fit completely in cache memory but which exhibits locality of reference may perform reasonably well.
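Spatial locality can be illustrated by visiting the same data in two different orders. The sketch below assumes a grid stored row by row in one contiguous buffer (Python's `array` module); the function names and grid size are hypothetical. In CPython the interpreter overhead tends to mask much of the cache effect, so the timing gap is usually far more pronounced in compiled languages; the example shows the access patterns rather than a guaranteed speed-up.

```python
from array import array

ROWS, COLS = 300, 300
# One contiguous buffer of 64-bit integers, laid out row by row (row-major).
grid = array("q", range(ROWS * COLS))

def sum_row_major(buf, rows, cols):
    """Visit elements in storage order: each access is adjacent to the last."""
    total = 0
    for r in range(rows):
        for c in range(cols):
            total += buf[r * cols + c]
    return total

def sum_column_major(buf, rows, cols):
    """Visit the same elements with a stride of `cols`: poorer spatial locality."""
    total = 0
    for c in range(cols):
        for r in range(rows):
            total += buf[r * cols + c]
    return total

# Both traversals compute the same result; only the memory access order differs.
assert sum_row_major(grid, ROWS, COLS) == sum_column_major(grid, ROWS, COLS)
```

Timing the two functions on a much larger grid is one way to observe how strongly a given system rewards access in storage order.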

Criticism of the current state of programming

David May FRS, a British computer scientist, currently Professor of Computer Science at the University of Bristol and founder and CTO of XMOS Semiconductor, believes one of the problems is that there is a reliance on Moore's law to solve inefficiencies. He has advanced an 'alternative' to Moore's law (May's law) stated as follows:

Software efficiency halves every 18 months, compensating Moore's Law

May goes on to state:

In ubiquitous systems, halving the instructions executed can double the battery life and big data sets bring big opportunities for better software and algorithms: Reducing the number of operations from N × N to N × log(N) has a dramatic effect when N is large ... for N = 30 billion, this change is as good as 50 years of technology improvements.
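The arithmetic behind the quoted figure can be checked roughly as follows. The quotation does not state the base of the logarithm or the doubling period, so this sketch assumes a base-2 logarithm and tries both an 18-month and a 24-month doubling period.

```python
import math

N = 30e9                                    # "N = 30 billion" from the quotation
speedup = (N * N) / (N * math.log2(N))      # N x N operations versus N x log(N)
doublings = math.log2(speedup)              # how many hardware doublings that equals

print(f"speed-up factor: about {speedup:.2e} ({doublings:.1f} doublings)")
for months in (18, 24):                     # assumed doubling periods
    years = doublings * months / 12
    print(f"doubling every {months} months: about {years:.0f} years")
```

Under these assumptions the algorithmic change is worth roughly 45 to 60 years of hardware doublings, which brackets the 50 years mentioned in the quotation.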

Software author Adam N. Rosenburg, in his blog "The failure of the Digital computer", has described the current state of programming as nearing the "Software event horizon" (alluding to the fictitious "shoe event horizon" described by Douglas Adams in his Hitchhiker's Guide to the Galaxy book). He estimates there has been a 70 dB factor loss of productivity or "99.99999 percent, of its ability to deliver the goods", since the 1980s: "When Arthur C. Clarke compared the reality of computing in 2001 to the computer HAL 9000 in his book 2001: A Space Odyssey, he pointed out how wonderfully small and powerful computers were but how disappointing computer programming had become".

Nonsense mutation

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Nonsense_mutation

In genetics, a nonsense mutation is a point mutation in a sequence of DNA that results in a nonsense codon, or a premature stop codon in the transcribed mRNA, and leads to a truncated, incomplete, and possibly nonfunctional protein product. A nonsense mutation is not always harmful; its functional effect depends on many aspects, such as the location of the stop codon within the coding DNA. For example, the effect of a nonsense mutation depends on the proximity of the nonsense mutation to the original stop codon, and the degree to which functional subdomains of the protein are affected. As nonsense mutations lead to premature termination of polypeptide chains, they are also called chain termination mutations.

Missense mutations differ from nonsense mutations since they are point mutations that exhibit a single nucleotide change to cause substitution of a different amino acid. A nonsense mutation also differs from a nonstop mutation, which is a point mutation that removes a stop codon. About 10% of patients facing genetic diseases have involvement with nonsense mutations. Some of the diseases that these mutations can cause are Duchenne muscular dystrophy (DMD), cystic fibrosis (CF), spinal muscular atrophy (SMA), cancers, metabolic diseases, and neurologic disorders. The rate of nonsense mutations is variable from gene-to-gene and tissue-to-tissue but gene silencing occurs in every patient with a nonsense mutation.

Simple example

    DNA: 5' - ATG ACT CAC CGA GCG CGA AGC TGA - 3'
         3' - TAC TGA GTG GCT CGC GCT TCG ACT - 5'
   mRNA: 5' - AUG ACU CAC CGA GCG CGA AGC UGA - 3'
Protein:      Met Thr His Arg Ala Arg Ser Stop

The example above begins with a 5' DNA sequence of 24 nucleotides (eight three-base codons), with its complementary strand shown below it. The next row shows the 5' mRNA strand, which is generated through transcription. The final row shows the amino acids that are translated from each respective codon, with the eighth and final codon representing the stop codon. The fourth codon (CGA), which encodes arginine, is the one that will undergo a nonsense mutation in the next part of this example.

    DNA: 5' - ATG ACT CAC TGA GCG CGA AGC TGA - 3'
         3' - TAC TGA GTG ACT CGC GCT TCG ACT - 5'
   mRNA: 5' - AUG ACU CAC UGA GCG CGA AGC UGA - 3'
Protein:      Met Thr His Stop

Now, suppose that a nonsense mutation was introduced at the fourth codon in the 5' DNA sequence (CGA) causing the cytosine to be replaced with thymine, yielding TGA in the 5' DNA sequence and ACT in the complementary strand. Because ACT is transcribed as UGA, it is translated as a stop codon. This leads the remaining codons of the mRNA to not be translated into protein because the stop codon is prematurely reached during translation. This can yield a truncated (i.e., abbreviated) protein product, which quite often lacks the functionality of the normal, non-mutant protein.
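The two cases above can be reproduced with a short sketch that transcribes the 5' DNA strand and translates codons until the first stop codon. The function names are hypothetical, and the codon table is deliberately limited to the codons that occur in this example; a complete table has 64 entries.

```python
# Only the codons used in the example above (a full table has 64 entries).
CODON_TABLE = {
    "AUG": "Met", "ACU": "Thr", "CAC": "His", "CGA": "Arg",
    "GCG": "Ala", "AGC": "Ser", "UGA": "Stop",
}

def transcribe(coding_dna):
    """mRNA has the same sequence as the 5' (coding) DNA strand, with U for T."""
    return coding_dna.replace(" ", "").replace("T", "U")

def translate(mrna):
    """Read codons from the 5' end and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break                      # translation terminates here
        protein.append(residue)
    return protein

normal = "ATG ACT CAC CGA GCG CGA AGC TGA"
mutant = "ATG ACT CAC TGA GCG CGA AGC TGA"   # C -> T in the fourth codon

print(translate(transcribe(normal)))   # seven residues before the final stop
print(translate(transcribe(mutant)))   # truncated after the third residue
```

The mutant sequence yields only Met, Thr and His, because the new UGA codon ends translation before the remaining codons are read.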

Possible outcomes

Deleterious

Deleterious outcomes represent the majority of nonsense mutations and are the most common outcome that is observed naturally. Deleterious nonsense mutations decrease the overall fitness and reproductive success of the organism. For example, a nonsense mutation occurring in a gene encoding a protein can cause structural or functional defects in the protein that disrupt cellular biology. Depending on the significance of the functions of this protein, this disruption could be detrimental to the fitness and survival of that organism.

Neutral

When a nonsense mutation is neutral, it does not provide benefits or harm. These occur when the effects of the mutation are unnoticed. In other words, this means that the mutation does not positively or negatively affect the organism. As this effect is unnoticed, there is a lack of papers describing such mutations. An example of this type of nonsense mutation is one that occurs directly before the original stop codon for that given protein. Because this mutation occurred in such close proximity to the end of the protein chain, the impact of this change might not be as significant. This would suggest that this amino acid that was mutated did not have a large impact on the overall structure or function of the protein or the organism as a whole. This scenario is rare, but possible.

Beneficial

Beneficial nonsense mutations are considered the rarest of the possible nonsense mutation outcomes. Beneficial nonsense mutations increase the overall fitness and reproductive success of an organism, the opposite of the effects of a deleterious mutation. Because a nonsense mutation introduces a premature stop codon within a sequence of DNA, it is extremely unlikely that this scenario can actually benefit the organism. An example of this would occur with a nonsense mutation that impacts a dysfunctional protein that releases toxins. The stop codon that this mutation brings would stop this dysfunctional protein from properly carrying out its function. Stopping this protein from performing at full strength causes less toxin to be released and the fitness of the organism to be improved. These types of situations with nonsense mutations occur a lot less frequently than the deleterious outcomes.

Suppressing nonsense mutations

Nonsense-mediated mRNA decay

Despite an expected tendency for premature termination codons to yield shortened polypeptide products, in fact the formation of truncated proteins does not occur often in vivo. Many organisms—including humans and lower species, such as yeast—employ a nonsense-mediated mRNA decay pathway, which degrades mRNAs containing nonsense mutations before they are able to be translated into nonfunctional polypeptides.

tRNA Suppression

Because nonsense mutations result in altered mRNA with a premature stop codon, one way of suppressing the damage done to the final protein's function is to alter the tRNA that reads the mRNA. These tRNAs are termed suppressor tRNAs. If the stop codon is UAG, any other amino acid tRNA could be altered from its original anticodon to AUC so it will recognize the UAG codon instead. This will result in the protein not being truncated, but it may still have an altered amino acid. These suppressor tRNA mutations are only possible if the cell has more than one tRNA that reads a particular codon; otherwise the mutation would kill the cell. The only stop codons are UAG, UAA, and UGA. UAG and UAA suppressors read their respective stop codons instead of their original codon, but UAA suppressors also read UAG due to wobble base pairing. UGA suppressors are very rare. Another hurdle for this technique is that stop codons are also recognized by release factors, so the tRNA still needs to compete with the release factors to keep translation going. Because of this, suppression is usually only 10-40% successful. These suppressor tRNA mutations also target stop codons that are not mutations, causing some proteins to be much longer than they should be. Only bacteria and lower eukaryotes can survive with these mutations; mammal and insect cells die as a result of a suppressor mutation.

Common disease-associated nonsense mutations

Nonsense mutations comprise around 20% of single nucleotide substitutions within protein coding sequences that result in human disease. Nonsense mutation-mediated pathology is often attributed to reduced amounts of full-length protein, because only 5-25% of transcripts possessing nonsense mutations do not undergo nonsense-mediated decay (NMD). Translation of the remaining nonsense-bearing mRNA may generate abbreviated protein variants with toxic effects.

Twenty-three different single-point nucleotide substitutions are capable of converting a non-stop codon into a stop codon, with the mutations CGA $⟶$ TGA and CAG $⟶$ TAG being the most common disease-related substitutions characterized in the Human Gene Mutation Database (HGMD). As a result of different substitution frequencies for each nucleotide, the proportions of the three stop codons generated by disease-inducing nonsense mutations differ from stop codon distributions in non-diseased gene variants. Notably, the codon TAG is overrepresented, while the TGA and TAA codons are underrepresented in disease-related nonsense mutations.
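The count of twenty-three can be checked by brute force: enumerate every non-stop codon, apply each possible single-base substitution, and keep the ones that produce a stop codon. The sketch below assumes the three stop codons of the standard genetic code (TAA, TAG and TGA in DNA notation); the function name is hypothetical.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}    # DNA equivalents of UAA, UAG, UGA

def nonsense_substitutions():
    """List (original codon, mutant codon) pairs where one base change creates a stop."""
    hits = []
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue                   # only non-stop codons can gain a stop
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue           # not a substitution
                mutant = codon[:pos] + base + codon[pos + 1:]
                if mutant in STOP_CODONS:
                    hits.append((codon, mutant))
    return hits

hits = nonsense_substitutions()
print(len(hits))                       # 23
print(("CGA", "TGA") in hits, ("CAG", "TAG") in hits)
```

The 23 substitutions arise from 18 distinct non-stop codons, since some codons (TAC and TAT, for example) can mutate into more than one stop codon.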

Translation termination efficiency is influenced by the specific stop codon sequence on the mRNA, with the UAA sequence yielding the highest termination efficiency. Sequences surrounding the stop codon also impact termination efficiency. Consequently, the underlying pathology of diseases caused by nonsense mutations is ultimately dependent on the identity of the mutated gene and the specific location of the mutation.

Examples of diseases induced by nonsense mutations include:

Cystic fibrosis (caused by the G542X mutation in the cystic fibrosis transmembrane conductance regulator (CFTR))
Beta thalassaemia (β-globin)
Hurler syndrome
Dravet syndrome
Usher syndrome

Nonsense mutations in other genes may also drive dysfunction of several tissue or organ systems:

SMAD8

SMAD8 is the eighth homolog of the ENDOGLIN gene family and is involved in the signaling between TGF-β/BMP. It has been identified that novel nonsense mutations in SMAD8 are associated with pulmonary arterial hypertension. The pulmonary system relies on SMAD1, SMAD5, and SMAD8 to regulate pulmonary vascular function. Downregulation and loss of signals that are normally operated by SMAD8 contributed to pathogenesis in pulmonary arterial hypertension. The ALK1 gene, a part of the TGF-β signaling family, was found to have been mutated while also down-regulating the SMAD8 gene in patients with pulmonary arterial hypertension. SMAD8 mutants were not phosphorylated by ALK1, disrupting interactions with SMAD4 that would normally allow for signaling in wild-type organisms.

LGR4

LGR4 binds R-spondins to activate the Wnt signaling pathway. Wnt signaling regulates bone mass and osteoblast differentiation and is important for the development of bone, heart, and muscle. An LGR4 nonsense mutation in a healthy population has been linked to low bone mass density and symptoms of osteoporosis. LGR4 mutant mice showed the observed low bone mass is not due to age-related bone loss. Mutations in LGR4 have been associated with family lineages with medical histories of rare bone disorders. Wild-type mice lacking LGR4 also displayed delayed osteoblast differentiation during development, showcasing the important role of LGR4 in bone mass regulation and development.

Therapeutics targeting nonsense mutation diseases

Therapeutics for diseases caused by nonsense mutations attempt to recapitulate wild-type function by decreasing the efficacy of NMD, facilitating readthrough of the premature stop codon during translation, or editing the genomic nonsense mutation.

Antisense oligonucleotides to suppress the expression of NMD and translation termination proteins are being explored in animal models of nonsense mutation-induced disease. Other RNA therapeutics under investigation include synthetic suppressor tRNAs that enable ribosomes to insert an amino acid, instead of initiating chain termination, upon encountering premature stop codons.

CRISPR-Cas9 based single nucleotide substitutions have been used to generate amino acid codons from stop codons, achieving an editing success rate of 10% in cell cultures.

Read-through has been achieved using small molecule drugs such as aminoglycosides and negamycin. An oxadiazole, ataluren (previously PTC124), facilitates the selective read-through of aberrant stop codons, rendering it a potential therapeutic against nonsense mutation-induced disease. Ataluren, sold under the tradename Translarna, is currently an approved treatment for Duchenne muscular dystrophy in the European Economic Area and Brazil. However, phase III trials of ataluren as a cystic fibrosis therapeutic have failed to meet their primary endpoints.

Point mutation

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Point_mutation

Schematic of a single-stranded RNA molecule illustrating a series of three-base codons. Each three-nucleotide codon corresponds to an amino acid when translated to protein. When one of these codons is changed by a point mutation, the corresponding amino acid of the protein is changed.

A to G point mutation detected with Sanger sequencing

A point mutation is a genetic mutation where a single nucleotide base is changed, inserted or deleted from a DNA or RNA sequence of an organism's genome. Point mutations have a variety of effects on the downstream protein product—consequences that are moderately predictable based upon the specifics of the mutation. These consequences can range from no effect (e.g. synonymous mutations) to deleterious effects (e.g. frameshift mutations), with regard to protein production, composition, and function.

Causes

Point mutations usually take place during DNA replication. DNA replication occurs when one double-stranded DNA molecule creates two single strands of DNA, each of which is a template for the creation of the complementary strand. A single point mutation can change the whole DNA sequence. Changing one purine or pyrimidine may change the amino acid that the nucleotides code for.

Point mutations may arise from spontaneous mutations that occur during DNA replication. The rate of mutation may be increased by mutagens. Mutagens can be physical, such as radiation from UV rays, X-rays or extreme heat, or chemical (molecules that misplace base pairs or disrupt the helical shape of DNA). Mutagens associated with cancers are often studied to learn about cancer and its prevention.

There are multiple ways for point mutations to occur. First, ultraviolet (UV) light and higher-frequency light are capable of ionizing electrons, which in turn can affect DNA. Second, reactive oxygen molecules with free radicals, which are a byproduct of cellular metabolism, can also be very harmful to DNA. These reactants can lead to both single-stranded and double-stranded DNA breaks. Third, bonds in DNA eventually degrade, which makes it harder to maintain the integrity of DNA. Finally, replication errors can lead to substitution, insertion, or deletion mutations.

Categorization

Transition/transversion categorization

Transitions (Alpha) and transversions (Beta).

In 1959 Ernst Freese coined the terms "transitions" and "transversions" to categorize different types of point mutations. Transitions are the replacement of a purine base with another purine, or of a pyrimidine with another pyrimidine. Transversions are the replacement of a purine with a pyrimidine or vice versa. There is a systematic difference in mutation rates for transitions (Alpha) and transversions (Beta). Transition mutations are about ten times more common than transversions.
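The purine/pyrimidine rule above can be written as a short check. This is only an illustrative sketch in Python; the function name classify_substitution and the choice to accept both T and U as pyrimidines are assumptions made for the example, not part of the original article.

```python
# Purines and pyrimidines, as used in the transition/transversion definition.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}  # T (DNA) and U (RNA) are both accepted here


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError(f"unknown base in substitution {ref!r} -> {alt!r}")
    if ref == alt:
        raise ValueError("reference and alternate base are identical")
    # Same chemical class on both sides means transition; otherwise transversion.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"
```

For instance, classify_substitution("A", "G") falls in the transition class (purine to purine), while classify_substitution("A", "C") is a transversion.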

Functional categorization

Nonsense mutations include stop-gain and start-loss. Stop-gain is a mutation that results in a premature termination codon (a stop was gained), which signals the end of translation. This interruption causes the protein to be abnormally shortened. The number of amino acids lost mediates the impact on the protein's functionality and whether it will function whatsoever. Stop-loss is a mutation in the original termination codon (a stop was lost), resulting in abnormal extension of a protein's carboxyl terminus. Start-gain creates an AUG start codon upstream of the original start site. If the new AUG is near the original start site, in-frame within the processed transcript and downstream to a ribosomal binding site, it can be used to initiate translation. The likely effect is additional amino acids added to the amino terminus of the original protein. Frame-shift mutations are also possible in start-gain mutations, but typically do not affect translation of the original protein. Start-loss is a point mutation in a transcript's AUG start codon, resulting in the reduction or elimination of protein production.

Missense mutations code for a different amino acid. A missense mutation changes a codon so that a different protein is created, a non-synonymous change. Conservative mutations result in an amino acid change in which the properties of the amino acid remain the same (e.g., hydrophobic, hydrophilic, etc.). At times, a change to one amino acid in the protein is not detrimental to the organism as a whole. Most proteins can withstand one or two point mutations before their function changes. Non-conservative mutations result in an amino acid change that has different properties than the wild type. The protein may lose its function, which can result in a disease in the organism. For example, sickle-cell disease is caused by a single point mutation (a missense mutation) in the beta-hemoglobin gene that converts a GAG codon into GUG, which encodes the amino acid valine rather than glutamic acid. The protein may also exhibit a "gain of function" or become activated, as is the case with the mutation changing a valine to glutamic acid in the BRAF gene; this leads to an activation of the RAF protein, which causes unlimited proliferative signalling in cancer cells. These are both examples of a non-conservative (missense) mutation.

Silent mutations code for the same amino acid (a "synonymous substitution"). A silent mutation does not affect the functioning of the protein. A single nucleotide can change, but the new codon specifies the same amino acid, resulting in an unmutated protein. This type of change is called synonymous change since the old and new codon code for the same amino acid. This is possible because 64 codons specify only 20 amino acids. Different codons can lead to differential protein expression levels, however.
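The functional categories above can be illustrated with the standard genetic code. The following Python sketch is a simplification: the names CODON_TABLE and classify_codon_change are invented for this example, and it looks at a single codon in isolation, so it cannot detect start-gain or start-loss, which depend on where the codon sits in the transcript.

```python
from itertools import product

# Standard genetic code in the conventional U, C, A, G ordering; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base change within one codon (RNA or DNA letters)."""
    ref = ref_codon.upper().replace("T", "U")
    alt = alt_codon.upper().replace("T", "U")
    if len(ref) != 3 or len(alt) != 3:
        raise ValueError("codons must be exactly three bases long")
    if sum(a != b for a, b in zip(ref, alt)) != 1:
        raise ValueError("expected exactly one changed base")
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    if ref_aa == alt_aa:
        return "silent"      # synonymous: same amino acid (or stop to stop)
    if alt_aa == "*":
        return "nonsense"    # stop-gain: premature termination codon
    if ref_aa == "*":
        return "stop-loss"   # original stop codon now encodes an amino acid
    return "missense"        # a different amino acid is encoded
```

With this table, the sickle-cell change from GAG to GUG is reported as missense (glutamic acid to valine), and GAA to GAG is reported as silent because both codons encode glutamic acid.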

Single base pair insertions and deletions

Sometimes the term point mutation is used to describe insertions or deletions of a single base pair. These have a more adverse effect on the synthesized protein because the nucleotides are still read in triplets but in a different frame, a mutation called a frameshift mutation.
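The difference between a substitution and a single-base deletion can be seen by splitting a sequence into triplets before and after the change. The sketch below is illustrative only; the helper names are made up for this example, and the sequence is the first five codons of the beta-globin mRNA as listed later in this article.

```python
def split_codons(seq: str) -> list[str]:
    """Split a sequence into complete triplets, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete_base(seq: str, index: int) -> str:
    """Return the sequence with the base at the given 0-based index removed."""
    return seq[:index] + seq[index + 1:]


def substitute_base(seq: str, index: int, base: str) -> str:
    """Return the sequence with the base at the given 0-based index replaced."""
    return seq[:index] + base + seq[index + 1:]


def changed_codons(original: str, mutated: str) -> int:
    """Count codon positions that differ, including codons lost or gained at the end."""
    a, b = split_codons(original), split_codons(mutated)
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


SEQ = "AUGGUGCACCUGACU"                    # AUG GUG CAC CUG ACU
substituted = substitute_base(SEQ, 3, "C")  # only the second codon changes
deleted = delete_base(SEQ, 3)               # every codon after the deletion shifts
```

Here split_codons(deleted) gives AUG, UGC, ACC, UGA: every codon downstream of the deletion is read differently, and in this particular case the shifted frame happens to expose a UGA stop codon. The substitution, by contrast, alters a single codon.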

General consequences

Point mutations that occur in non-coding sequences are most often without consequences, although there are exceptions. If the mutated base pair is in the promoter sequence of a gene, then the expression of the gene may change. Also, if the mutation occurs in the splicing site of an intron, then this may interfere with correct splicing of the transcribed pre-mRNA.

By altering just one amino acid, the entire peptide may change, thereby changing the entire protein. The new protein is called a protein variant. If the original protein functions in cellular reproduction then this single point mutation can change the entire process of cellular reproduction for this organism.

Point germline mutations can lead to beneficial as well as harmful traits or diseases. This leads to adaptations based on the environment where the organism lives. An advantageous mutation can create an advantage for that organism and lead to the trait's being passed down from generation to generation, improving and benefiting the entire population. The scientific theory of evolution is greatly dependent on point mutations in cells. The theory explains the diversity and history of living organisms on Earth. In relation to point mutations, it states that beneficial mutations allow the organism to thrive and reproduce, thereby passing its positively affected mutated genes on to the next generation. On the other hand, harmful mutations cause the organism to die or be less likely to reproduce in a phenomenon known as natural selection.

There are different short-term and long-term effects that can arise from mutations. Smaller ones would be a halting of the cell cycle at numerous points. This means that a codon coding for the amino acid glycine may be changed to a stop codon, causing the proteins that should have been produced to be deformed and unable to complete their intended tasks. Because the mutations can affect the DNA and thus the chromatin, it can prohibit mitosis from occurring due to the lack of a complete chromosome. Problems can also arise during the processes of transcription and replication of DNA. These all prohibit the cell from reproduction and thus lead to the death of the cell. Long-term effects can be a permanent changing of a chromosome, which can lead to a mutation. These mutations can be either beneficial or detrimental. Cancer is an example of how they can be detrimental.

Other effects of point mutations, or single nucleotide polymorphisms in DNA, depend on the location of the mutation within the gene. For example, if the mutation occurs in the coding region of the gene, the amino acid sequence of the encoded protein may be altered, causing a change in the function, localization, or stability of the protein or protein complex. Many methods have been proposed to predict the effects of missense mutations on proteins. Machine learning algorithms train their models to distinguish known disease-associated mutations from neutral ones, whereas other methods do not explicitly train their models; almost all methods, however, exploit evolutionary conservation, assuming that changes at conserved positions tend to be more deleterious. While the majority of methods provide a binary classification of mutation effects into damaging and benign, a new level of annotation is needed to explain why and how these mutations damage proteins.

Moreover, if the mutation occurs in the region of the gene where transcriptional machinery binds to the protein, the mutation can affect the binding of the transcription factors because the short nucleotide sequences recognized by the transcription factors will be altered. Mutations in this region can affect the efficiency of gene transcription, which in turn can alter levels of mRNA and, thus, protein levels in general.

Point mutations can have several effects on the behavior and reproduction of a protein depending on where the mutation occurs in the amino acid sequence of the protein. If the mutation occurs in the region of the gene that is responsible for coding for the protein, the amino acid may be altered. This slight change in the sequence of amino acids can cause a change in the function, activation of the protein meaning how it binds with a given enzyme, where the protein will be located within the cell, or the amount of free energy stored within the protein.

If the mutation occurs in the region of the gene where transcriptional machinery binds to the protein, the mutation can affect the way in which transcription factors bind to the protein. The mechanisms of transcription bind to a protein through recognition of short nucleotide sequences. A mutation in this region may alter these sequences and, thus, change the way the transcription factors bind to the protein. Mutations in this region can affect the efficiency of gene transcription, which controls both the levels of mRNA and overall protein levels.

Specific diseases caused by point mutations

Cancer

Point mutations in multiple tumor suppressor proteins cause cancer. For instance, point mutations in Adenomatous Polyposis Coli promote tumorigenesis. A novel assay, Fast parallel proteolysis (FASTpp), might help swift screening of specific stability defects in individual cancer patients.

Neurofibromatosis

Neurofibromatosis is caused by point mutations in the Neurofibromin 1 or Neurofibromin 2 gene.

Sickle-cell anemia

Sickle-cell anemia is caused by a point mutation in the β-globin chain of hemoglobin, causing the hydrophilic amino acid glutamic acid to be replaced with the hydrophobic amino acid valine at the sixth position.

The β-globin gene is found on the short arm of chromosome 11. The association of two wild-type α-globin subunits with two mutant β-globin subunits forms hemoglobin S (HbS). Under low-oxygen conditions (being at high altitude, for example), the absence of a polar amino acid at position six of the β-globin chain promotes the non-covalent polymerisation (aggregation) of hemoglobin, which distorts red blood cells into a sickle shape and decreases their elasticity.

Hemoglobin is a protein found in red blood cells, and is responsible for the transportation of oxygen through the body. There are two subunits that make up the hemoglobin protein: beta-globins and alpha-globins. Beta-hemoglobin is created from the genetic information on the HBB, or "hemoglobin, beta" gene found on chromosome 11p15.5. A single point mutation in this polypeptide chain, which is 147 amino acids long, results in the disease known as sickle-cell anemia. Sickle-cell anemia is an autosomal recessive disorder that affects 1 in 500 African Americans, and is one of the most common blood disorders in the United States. The single replacement of the sixth amino acid in the beta-globin, glutamic acid, with valine results in deformed red blood cells. These sickle-shaped cells cannot carry nearly as much oxygen as normal red blood cells and they get caught more easily in the capillaries, cutting off blood supply to vital organs. The single nucleotide change in the beta-globin means that even the smallest of exertions on the part of the carrier results in severe pain and even heart attack. Below is a chart depicting the first thirteen amino acids in the normal and abnormal sickle cell polypeptide chain.

Sequence for normal hemoglobin
AUG	GUG	CAC	CUG	ACU	CCU	GAG	GAG	AAG	UCU	GCC	GUU	ACU
START	Val	His	Leu	Thr	Pro	Glu	Glu	Lys	Ser	Ala	Val	Thr

Sequence for sickle-cell hemoglobin
AUG	GUG	CAC	CUG	ACU	CCU	GUG	GAG	AAG	UCU	GCC	GUU	ACU
START	Val	His	Leu	Thr	Pro	Val	Glu	Lys	Ser	Ala	Val	Thr
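The two sequences in the chart can be compared codon by codon to locate the mutation. This is a minimal Python sketch; the codons are copied from the chart above, while the names NORMAL, SICKLE and codon_differences are hypothetical and chosen only for this illustration.

```python
# Codons copied from the chart above (index 0 is the AUG start codon).
NORMAL = "AUG GUG CAC CUG ACU CCU GAG GAG AAG UCU GCC GUU ACU".split()
SICKLE = "AUG GUG CAC CUG ACU CCU GUG GAG AAG UCU GCC GUU ACU".split()


def codon_differences(reference: list[str], variant: list[str]) -> list[tuple]:
    """Return (codon index, reference codon, variant codon, changed base offsets)."""
    if len(reference) != len(variant):
        raise ValueError("sequences must contain the same number of codons")
    differences = []
    for index, (ref, alt) in enumerate(zip(reference, variant)):
        if ref != alt:
            # Record which of the three bases in the codon differ (0-based).
            offsets = [i for i in range(3) if ref[i] != alt[i]]
            differences.append((index, ref, alt, offsets))
    return differences
```

Calling codon_differences(NORMAL, SICKLE) reports a single difference at codon index 6 (the sixth codon after the start codon), where GAG becomes GUG through a change in the middle base, A to U.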

Tay–Sachs disease

The cause of Tay–Sachs disease is a genetic defect that is passed from parent to child. This genetic defect is located in the HEXA gene, which is found on chromosome 15.

The HEXA gene makes part of an enzyme called beta-hexosaminidase A, which plays a critical role in the nervous system. This enzyme helps break down a fatty substance called GM2 ganglioside in nerve cells. Mutations in the HEXA gene disrupt the activity of beta-hexosaminidase A, preventing the breakdown of the fatty substances. As a result, the fatty substances accumulate to deadly levels in the brain and spinal cord. The buildup of GM2 ganglioside causes progressive damage to the nerve cells. This is the cause of the signs and symptoms of Tay-Sachs disease.

Repeat-induced point mutation

In molecular biology, repeat-induced point mutation or RIP is a process by which DNA accumulates G:C to A:T transition mutations. Genomic evidence indicates that RIP occurs or has occurred in a variety of fungi while experimental evidence indicates that RIP is active in Neurospora crassa, Podospora anserina, Magnaporthe grisea, Leptosphaeria maculans, Gibberella zeae and Nectria haematococca. In Neurospora crassa, sequences mutated by RIP are often methylated de novo.

RIP occurs during the sexual stage in haploid nuclei after fertilization but prior to meiotic DNA replication. In Neurospora crassa, repeat sequences of at least 400 base pairs in length are vulnerable to RIP. Repeats with as low as 80% nucleotide identity may also be subject to RIP. Though the exact mechanism of repeat recognition and mutagenesis are poorly understood, RIP results in repeated sequences undergoing multiple transition mutations.
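The length and identity figures quoted above can be turned into a rough screening rule. The Python sketch below is only an illustration under strong assumptions: it takes two repeat copies that are already aligned and of equal length, and it treats the 400 bp and 80% figures as hard cut-offs. The function names are hypothetical, and since the actual recognition mechanism is poorly understood, this should not be read as a model of how RIP detects repeats.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return matches / len(a)


def is_rip_candidate(a: str, b: str,
                     min_length: int = 400,       # assumed cut-off, from the text
                     min_identity: float = 0.80   # assumed cut-off, from the text
                     ) -> bool:
    """Rough check of whether a repeat pair meets the quoted length and identity figures."""
    return len(a) >= min_length and percent_identity(a, b) >= min_identity
```

Both thresholds are exposed as parameters because the figures in the text describe Neurospora crassa and may not carry over to other fungi.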

The RIP mutations do not seem to be limited to repeated sequences. Indeed, for example, in the phytopathogenic fungus L. maculans, RIP mutations are found in single copy regions, adjacent to the repeated elements. These regions are either non-coding regions or genes encoding small secreted proteins including avirulence genes. The degree of RIP within these single copy regions was proportional to their proximity to repetitive elements.

Rep and Kistler have speculated that the presence of highly repetitive regions containing transposons, may promote mutation of resident effector genes. So the presence of effector genes within such regions is suggested to promote their adaptation and diversification when exposed to strong selection pressure.

As RIP mutation is traditionally observed to be restricted to repetitive regions and not single copy regions, Fudal et al. suggested that leakage of RIP mutation might occur within a relatively short distance of a RIP-affected repeat. Indeed, this has been reported in N. crassa, where leakage of RIP was detected in single copy sequences at least 930 bp from the boundary of neighbouring duplicated sequences. Elucidating the mechanism by which repeated sequences are detected for RIP may help explain how the flanking sequences are also affected.

Mechanism

RIP causes G:C to A:T transition mutations within repeats; however, the mechanism that detects the repeated sequences is unknown. RID is the only known protein essential for RIP. It is a DNA methyltransferase-like protein that, when mutated or knocked out, results in loss of RIP. Deletion of the rid homolog in Aspergillus nidulans, dmtA, results in loss of fertility, while deletion of the rid homolog in Ascobolus immersus, masc1, results in fertility defects and loss of methylation induced premeiotically (MIP).

Consequences

RIP is believed to have evolved as a defense mechanism against transposable elements, which resemble parasites by invading and multiplying within the genome. RIP creates multiple missense and nonsense mutations in the coding sequence. This hypermutation of G-C to A-T in repetitive sequences eliminates functional gene products of the sequence (if there were any to begin with). In addition, many of the C-bearing nucleotides become methylated, thus decreasing transcription.

Use in molecular biology

Because RIP is so efficient at detecting and mutating repeats, fungal biologists often use it as a tool for mutagenesis. A second copy of a single-copy gene is first transformed into the genome. The fungus must then mate and go through its sexual cycle to activate the RIP machinery. Many different mutations within the duplicated gene are obtained from even a single fertilization event so that inactivated alleles, usually due to nonsense mutations, as well as alleles containing missense mutations can be obtained.

History

The cellular reproduction process of meiosis was discovered by Oscar Hertwig in 1876. Mitosis was discovered several years later in 1882 by Walther Flemming.

Hertwig studied sea urchins, and noticed that each egg contained one nucleus prior to fertilization and two nuclei after. This discovery proved that one spermatozoon could fertilize an egg, and therefore proved the process of meiosis. Hermann Fol continued Hertwig's research by testing the effects of injecting several spermatozoa into an egg, and found that the process did not work with more than one spermatozoon.

Flemming began his research on cell division in 1868. The study of cells was an increasingly popular topic in this time period. By 1873, Schneider had already begun to describe the steps of cell division. Flemming furthered this description in 1874 and 1875 as he explained the steps in more detail. He also disputed Schneider's finding that the nucleus separated into rod-like structures, suggesting instead that the nucleus separated into threads that in turn separated. Flemming concluded that cells replicate through cell division, more specifically through mitosis.

Matthew Meselson and Franklin Stahl are credited with the discovery of DNA replication. Watson and Crick acknowledged that the structure of DNA did indicate that there is some form of replicating process. However, there was not a lot of research done on this aspect of DNA until after Watson and Crick. People considered all possible methods of determining the replication process of DNA, but none were successful until Meselson and Stahl. Meselson and Stahl introduced a heavy isotope into some DNA and traced its distribution. Through this experiment, Meselson and Stahl were able to prove that DNA reproduces semi-conservatively.
