A Medley of Potpourri

Monday, June 8, 2020

Algorithmic efficiency

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Algorithmic_efficiency

In computer science, algorithmic efficiency is a property of an algorithm which relates to the number of computational resources used by the algorithm. An algorithm must be analyzed to determine its resource usage, and the efficiency of an algorithm can be measured based on usage of different resources. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process.

For maximum efficiency we wish to minimize resource usage. However, different resources such as time and space complexity cannot be compared directly, so which of two algorithms is considered to be more efficient often depends on which measure of efficiency is considered most important.

For example, bubble sort and timsort are both algorithms to sort a list of items from smallest to largest. Bubble sort sorts the list in time proportional to the number of elements squared (

\scriptstyle {{\mathcal {O}}\left(n^{2}\right)}

, see Big O notation), but only requires a small amount of extra memory which is constant with respect to the length of the list (

{\textstyle \scriptstyle {{\mathcal {O}}\left(1\right)}}

). Timsort sorts the list in time linearithmic (proportional to a quantity times its logarithm) in the list's length (

{\textstyle \scriptstyle {\mathcal {O\left(n\log n\right)}}}

), but has a space requirement linear in the length of the list (

{\textstyle \scriptstyle {\mathcal {O\left(n\right)}}}

). If large lists must be sorted at high speed for a given application, timsort is a better choice; however, if minimizing the memory footprint of the sorting is more important, bubble sort is a better choice.

Background

The importance of efficiency with respect to time was emphasised by Ada Lovelace in 1843 as applying to Charles Babbage's mechanical analytical engine:

"In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selections amongst them for the purposes of a calculating engine. One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation"

Early electronic computers were severely limited both by the speed of operations and the amount of memory available. In some cases it was realized that there was a space–time trade-off, whereby a task could be handled either by using a fast algorithm which used quite a lot of working memory, or by using a slower algorithm which used very little working memory. The engineering trade-off was then to use the fastest algorithm which would fit in the available memory.

Modern computers are significantly faster than the early computers, and have a much larger amount of memory available (Gigabytes instead of Kilobytes). Nevertheless, Donald Knuth emphasised that efficiency is still an important consideration:

"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"

Overview

An algorithm is considered efficient if its resource consumption, also known as computational cost, is at or below some acceptable level. Roughly speaking, 'acceptable' means: it will run in a reasonable amount of time or space on an available computer, typically as a function of the size of the input. Since the 1950s computers have seen dramatic increases in both the available computational power and in the available amount of memory, so current acceptable levels would have been unacceptable even 10 years ago. In fact, thanks to the approximate doubling of computer power every 2 years, tasks that are acceptably efficient on modern smartphones and embedded systems may have been unacceptably inefficient for industrial servers 10 years ago.

Computer manufacturers frequently bring out new models, often with higher performance. Software costs can be quite high, so in some cases the simplest and cheapest way of getting higher performance might be to just buy a faster computer, provided it is compatible with an existing computer.

There are many ways in which the resources used by an algorithm can be measured: the two most common measures are speed and memory usage; other measures could include transmission speed, temporary disk usage, long-term disk usage, power consumption, total cost of ownership, response time to external stimuli, etc. Many of these measures depend on the size of the input to the algorithm, i.e. the amount of data to be processed. They might also depend on the way in which the data is arranged; for example, some sorting algorithms perform poorly on data which is already sorted, or which is sorted in reverse order.

In practice, there are other factors which can affect the efficiency of an algorithm, such as requirements for accuracy and/or reliability. As detailed below, the way in which an algorithm is implemented can also have a significant effect on actual efficiency, though many aspects of this relate to optimization issues.

Theoretical analysis

In the theoretical analysis of algorithms, the normal practice is to estimate their complexity in the asymptotic sense. The most commonly used notation to describe resource consumption or "complexity" is Donald Knuth's Big O notation, representing the complexity of an algorithm as a function of the size of the input

{\textstyle \scriptstyle n}

. Big O notation is an asymptotic measure of function complexity, where

{\textstyle \scriptstyle {f\left(n\right)\,=\,{\mathcal {O}}\left(g\left(n\right)\right)}}

roughly means the time requirement for an algorithm is proportional to

\scriptstyle {g\left(n\right)}

, omitting lower-order terms that contribute less than

\scriptstyle {g\left(n\right)}

to the growth of the function as

\scriptstyle n

grows arbitrarily large. This estimate may be misleading when ${\textstyle \scriptstyle n}$ is small, but is generally sufficiently accurate when

{\textstyle \scriptstyle n}

is large as the notation is asymptotic. For example, bubble sort may be faster than merge sort when only a few items are to be sorted; however either implementation is likely to meet performance requirements for a small list. Typically, programmers are interested in algorithms that scale efficiently to large input sizes, and merge sort is preferred over bubble sort for lists of length encountered in most data-intensive programs.

Some examples of Big O notation applied to algorithms' asymptotic time complexity include:

Notation	Name	Examples
${\mathcal {O}}(1)\,$	constant	Finding the median from a sorted list of measurements; Using a constant-size lookup table; Using a suitable hash function for looking up an item.
${\mathcal {O}}(\log {n})\,$	logarithmic	Finding an item in a sorted array with a binary search or a balanced search tree as well as all operations in a Binomial heap.
${\mathcal {O}}(n)\,$	linear	Finding an item in an unsorted list or a malformed tree (worst case) or in an unsorted array; Adding two n-bit integers by ripple carry.
${\mathcal {O}}(n\log n)\,$	linearithmic, loglinear, or quasilinear	Performing a Fast Fourier transform; heapsort, quicksort (best and average case), or merge sort
${\mathcal {O}}(n^{2})\,$	quadratic	Multiplying two n-digit numbers by a simple algorithm; bubble sort (worst case or naive implementation), Shell sort, quicksort (worst case), selection sort or insertion sort
${\mathcal {O}}(c^{n}),\;c>1$	exponential	Finding the optimal (non-approximate) solution to the travelling salesman problem using dynamic programming; determining if two logical statements are equivalent using brute-force search

Benchmarking: measuring performance

For new versions of software or to provide comparisons with competitive systems, benchmarks are sometimes used, which assist with gauging an algorithms relative performance. If a new sort algorithm is produced, for example, it can be compared with its predecessors to ensure that at least it is efficient as before with known data, taking into consideration any functional improvements. Benchmarks can be used by customers when comparing various products from alternative suppliers to estimate which product will best suit their specific requirements in terms of functionality and performance. For example, in the mainframe world certain proprietary sort products from independent software companies such as Syncsort compete with products from the major suppliers such as IBM for speed.

Some benchmarks provide opportunities for producing an analysis comparing the relative speed of various compiled and interpreted languages for example and The Computer Language Benchmarks Game compares the performance of implementations of typical programming problems in several programming languages.

Even creating "do it yourself" benchmarks can demonstrate the relative performance of different programming languages, using a variety of user specified criteria. This is quite simple, as a "Nine language performance roundup" by Christopher W. Cowell-Shah demonstrates by example.

Implementation concerns

Implementation issues can also have an effect on efficiency, such as the choice of programming language, or the way in which the algorithm is actually coded, or the choice of a compiler for a particular language, or the compilation options used, or even the operating system being used. In many cases a language implemented by an interpreter may be much slower than a language implemented by a compiler. See the articles on just-in-time compilation and interpreted languages.

There are other factors which may affect time or space issues, but which may be outside of a programmer's control; these include data alignment, data granularity, cache locality, cache coherency, garbage collection, instruction-level parallelism, multi-threading (at either a hardware or software level), simultaneous multitasking, and subroutine calls.

Some processors have capabilities for vector processing, which allow a single instruction to operate on multiple operands; it may or may not be easy for a programmer or compiler to use these capabilities. Algorithms designed for sequential processing may need to be completely redesigned to make use of parallel processing, or they could be easily reconfigured. As parallel and distributed computing grow in importance in the late 2010s, more investments are being made into efficient high-level APIs for parallel and distributed computing systems such as CUDA, TensorFlow, Hadoop, OpenMP and MPI.

Another problem which can arise in programming is that processors compatible with the same instruction set (such as x86-64 or ARM) may implement an instruction in different ways, so that instructions which are relatively fast on some models may be relatively slow on other models. This often presents challenges to optimizing compilers, which must have a great amount of knowledge of the specific CPU and other hardware available on the compilation target to best optimize a program for performance. In the extreme case, a compiler may be forced to emulate instructions not supported on a compilation target platform, forcing it to generate code or link an external library call to produce a result that is otherwise incomputable on that platform, even if it is natively supported and more efficient in hardware on other platforms. This is often the case in embedded systems with respect to floating-point arithmetic, where small and low-power microcontrollers often lack hardware support for floating-point arithmetic and thus require computationally expensive software routines to produce floating point calculations.

Measures of resource usage

Measures are normally expressed as a function of the size of the input

\scriptstyle {n}

The two most common measures are:

Time: how long does the algorithm take to complete?
Space: how much working memory (typically RAM) is needed by the algorithm? This has two aspects: the amount of memory needed by the code (auxiliary space usage), and the amount of memory needed for the data on which the code operates (intrinsic space usage).

For computers whose power is supplied by a battery (e.g. laptops and smartphones), or for very long/large calculations (e.g. supercomputers), other measures of interest are:

Direct power consumption: power needed directly to operate the computer.
Indirect power consumption: power needed for cooling, lighting, etc.

As of 2018, power consumption is growing as an important metric for computational tasks of all types and at all scales ranging from embedded Internet of things devices to system-on-chip devices to server farms. This trend is often referred to as green computing.

Less common measures of computational efficiency may also be relevant in some cases:

Transmission size: bandwidth could be a limiting factor. Data compression can be used to reduce the amount of data to be transmitted. Displaying a picture or image (e.g. Google logo) can result in transmitting tens of thousands of bytes (48K in this case) compared with transmitting six bytes for the text "Google". This is important for I/O bound computing tasks.
External space: space needed on a disk or other external memory device; this could be for temporary storage while the algorithm is being carried out, or it could be long-term storage needed to be carried forward for future reference.
Response time (latency): this is particularly relevant in a real-time application when the computer system must respond quickly to some external event.
Total cost of ownership: particularly if a computer is dedicated to one particular algorithm.

Time

Theory

Analyze the algorithm, typically using time complexity analysis to get an estimate of the running time as a function of the size of the input data. The result is normally expressed using Big O notation. This is useful for comparing algorithms, especially when a large amount of data is to be processed. More detailed estimates are needed to compare algorithm performance when the amount of data is small, although this is likely to be of less importance. Algorithms which include parallel processing may be more difficult to analyze.

Practice

Use a benchmark to time the use of an algorithm. Many programming languages have an available function which provides CPU time usage. For long-running algorithms the elapsed time could also be of interest. Results should generally be averaged over several tests.

Run-based profiling can be very sensitive to hardware configuration and the possibility of other programs or tasks running at the same time in a multi-processing and multi-programming environment.

This sort of test also depends heavily on the selection of a particular programming language, compiler, and compiler options, so algorithms being compared must all be implemented under the same conditions.

Space

This section is concerned with the use of memory resources (registers, cache, RAM, virtual memory, secondary memory) while the algorithm is being executed. As for time analysis above, analyze the algorithm, typically using space complexity analysis to get an estimate of the run-time memory needed as a function as the size of the input data. The result is normally expressed using Big O notation.

There are up to four aspects of memory usage to consider:

The amount of memory needed to hold the code for the algorithm.
The amount of memory needed for the input data.
The amount of memory needed for any output data.
- Some algorithms, such as sorting, often rearrange the input data and don't need any additional space for output data. This property is referred to as "in-place" operation.
The amount of memory needed as working space during the calculation.
- This includes local variables and any stack space needed by routines called during a calculation; this stack space can be significant for algorithms which use recursive techniques.

Early electronic computers, and early home computers, had relatively small amounts of working memory. For example, the 1949 Electronic Delay Storage Automatic Calculator (EDSAC) had a maximum working memory of 1024 17-bit words, while the 1980 Sinclair ZX80 came initially with 1024 8-bit bytes of working memory. In the late 2010s, it is typical for personal computers to have between 4 and 32 GB of RAM, an increase of over 300 million times as much memory.

Caching and memory hierarchy

Current computers can have relatively large amounts of memory (possibly Gigabytes), so having to squeeze an algorithm into a confined amount of memory is much less of a problem than it used to be. But the presence of four different categories of memory can be significant:

Processor registers, the fastest of computer memory technologies with the least amount of storage space. Most direct computation on modern computers occurs with source and destination operands in registers before being updated to the cache, main memory and virtual memory if needed. On a processor core, there are typically on the order of hundreds of bytes or fewer of register availability, although a register file may contain more physical registers than architectural registers defined in the instruction set architecture.
Cache memory is the second fastest and second smallest memory available in the memory hierarchy. Caches are present in CPUs, GPUs, hard disk drives and external peripherals, and are typically implemented in static RAM. Memory caches are multi-leveled; lower levels are larger, slower and typically shared between processor cores in multi-core processors. In order to process operands in cache memory, a processing unit must fetch the data from the cache, perform the operation in registers and write the data back to the cache. This operates at speeds comparable (about 2-10 times slower) with the CPU or GPU's arithmetic logic unit or floating-point unit if in the L1 cache. It is about 10 times slower if there is an L1 cache miss and it must be retrieved from and written to the L2 cache, and a further 10 times slower if there is an L2 cache miss and it must be retrieved from an L3 cache, if present.
Main physical memory is most often implemented in dynamic RAM (DRAM). The main memory is much larger (typically gigabytes compared to ≈8 megabytes) than an L3 CPU cache, with read and write latencies typically 10-100 times slower. As of 2018, RAM is increasingly implemented on-chip of processors, as CPU or GPU memory.
Virtual memory is most often implemented in terms of secondary storage such as a hard disk, and is an extension to the memory hierarchy that has much larger storage space but much larger latency, typically around 1000 times slower than a cache miss for a value in RAM. While originally motivated to create the impression of higher amounts of memory being available than were truly available, virtual memory is more important in contemporary usage for its time-space tradeoff and enabling the usage of virtual machines. Cache misses from main memory are called page faults, and incur huge performance penalties on programs.

An algorithm whose memory needs will fit in cache memory will be much faster than an algorithm which fits in main memory, which in turn will be very much faster than an algorithm which has to resort to virtual memory. Because of this, cache replacement policies are extremely important to high-performance computing, as are cache-aware programming and data alignment. To further complicate the issue, some systems have up to three levels of cache memory, with varying effective speeds. Different systems will have different amounts of these various types of memory, so the effect of algorithm memory needs can vary greatly from one system to another.

In the early days of electronic computing, if an algorithm and its data wouldn't fit in main memory then the algorithm couldn't be used. Nowadays the use of virtual memory appears to provide lots of memory, but at the cost of performance. If an algorithm and its data will fit in cache memory, then very high speed can be obtained; in this case minimizing space will also help minimize time. This is called the principle of locality, and can be subdivided into locality of reference, spatial locality and temporal locality. An algorithm which will not fit completely in cache memory but which exhibits locality of reference may perform reasonably well.

Criticism of the current state of programming

David May FRS a British computer scientist and currently Professor of Computer Science at University of Bristol and founder and CTO of XMOS Semiconductor, believes one of the problems is that there is a reliance on Moore's law to solve inefficiencies. He has advanced an 'alternative' to Moore's law (May's law) stated as follows:

Software efficiency halves every 18 months, compensating Moore's Law

May goes on to state:

In ubiquitous systems, halving the instructions executed can double the battery life and big data sets bring big opportunities for better software and algorithms: Reducing the number of operations from N x N to N x log(N) has a dramatic effect when N is large ... for N = 30 billion, this change is as good as 50 years of technology improvements.

Software author Adam N. Rosenburg in his blog "The failure of the Digital computer", has described the current state of programming as nearing the "Software event horizon", (alluding to the fictitious "shoe event horizon" described by Douglas Adams in his Hitchhiker's Guide to the Galaxy book). He estimates there has been a 70 dB factor loss of productivity or "99.99999 percent, of its ability to deliver the goods", since the 1980s—"When Arthur C. Clarke compared the reality of computing in 2001 to the computer HAL 9000 in his book 2001: A Space Odyssey, he pointed out how wonderfully small and powerful computers were but how disappointing computer programming had become".

Self-modifying code

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Self-modifying_code

In computer science, self-modifying code is code that alters its own instructions while it is executing – usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance. Self-modification is an alternative to the method of "flag setting" and conditional program branching, used primarily to reduce the number of times a condition needs to be tested. The term is usually only applied to code where the self-modification is intentional, not in situations where code accidentally modifies itself due to an error such as a buffer overflow.

The method is frequently used for conditionally invoking test/debugging code without requiring additional computational overhead for every input/output cycle.

The modifications may be performed:

only during initialization – based on input parameters (when the process is more commonly described as software 'configuration' and is somewhat analogous, in hardware terms, to setting jumpers for printed circuit boards). Alteration of program entry pointers is an equivalent indirect method of self-modification, but requiring the co-existence of one or more alternative instruction paths, increasing the program size.
throughout execution ("on the fly") – based on particular program states that have been reached during the execution

In either case, the modifications may be performed directly to the machine code instructions themselves, by overlaying new instructions over the existing ones (for example: altering a compare and branch to an unconditional branch or alternatively a 'NOP').

In the IBM/360 and Z/Architecture instruction set, an EXECUTE (EX) instruction logically overlays the second byte of its target instruction with the low-order 8 bits of register 1. This provides the effect of self-modification although the actual instruction in storage is not altered.

Application in low and high level languages

Self-modification can be accomplished in a variety of ways depending upon the programming language and its support for pointers and/or access to dynamic compiler or interpreter 'engines':

overlay of existing instructions (or parts of instructions such as opcode, register, flags or address) or
direct creation of whole instructions or sequences of instructions in memory
creating or modification of source code statements followed by a 'mini compile' or a dynamic interpretation
creating an entire program dynamically and then executing it

Assembly language

Self-modifying code is quite straightforward to implement when using assembly language. Instructions can be dynamically created in memory (or else overlaid over existing code in non-protected program storage), in a sequence equivalent to the ones that a standard compiler may generate as the object code. With modern processors, there can be unintended side effects on the CPU cache that must be considered. The method was frequently used for testing 'first time' conditions, as in this suitably commented IBM/360 assembler example. It uses instruction overlay to reduce the instruction path length by (N×1)−1 where N is the number of records on the file (−1 being the overhead to perform the overlay).

SUBRTN NOP OPENED      FIRST TIME HERE?
* The NOP is x'4700'
       OI    SUBRTN+1,X'F0'  YES, CHANGE NOP TO UNCONDITIONAL BRANCH (47F0...)
       OPEN   INPUT               AND  OPEN THE INPUT FILE SINCE IT'S THE FIRST TIME THRU
OPENED GET    INPUT        NORMAL PROCESSING RESUMES HERE
      ...

Alternative code might involve testing a "flag" each time through. The unconditional branch is slightly faster than a compare instruction, as well as reducing the overall path length. In later operating systems for programs residing in protected storage this technique could not be used and so changing the pointer to the subroutine would be used instead. The pointer would reside in dynamic storage and could be altered at will after the first pass to bypass the OPEN (having to load a pointer first instead of a direct branch & link to the subroutine would add N instructions to the path length – but there would be a corresponding reduction of N for the unconditional branch that would no longer be required).

Below is an example in Zilog Z80 assembly language. The code increments register "B" in range [0,5]. The "CP" compare instruction is modified on each loop.

;======================================================================
ORG 0H
CALL FUNC00
HALT
;======================================================================
FUNC00:
LD A,6
LD HL,label01+1
LD B,(HL)
label00:
INC B
LD (HL),B
label01:
CP $0
JP NZ,label00
RET
;======================================================================

Self-modifying code is sometimes used to overcome limitations in a machine's instruction set. For example, in the Intel 8080 instruction set, one cannot input a byte from an input port that is specified in a register. The input port is statically encoded in the instruction itself, as the second byte of a two byte instruction. Using self-modifying code, it is possible to store a register's contents into the second byte of the instruction, then execute the modified instruction in order to achieve the desired effect.

High-level languages

Some compiled languages explicitly permit self-modifying code. For example, the ALTER verb in COBOL may be implemented as a branch instruction that is modified during execution. Some batch programming techniques involve the use of self-modifying code. Clipper and SPITBOL also provide facilities for explicit self-modification. The Algol compiler on B6700 systems offered an interface to the operating system whereby executing code could pass a text string or a named disc file to the Algol compiler and was then able to invoke the new version of a procedure.

With interpreted languages, the "machine code" is the source text and may be susceptible to editing on-the-fly: in SNOBOL the source statements being executed are elements of a text array. Other languages, such as Perl and Python, allow programs to create new code at run-time and execute it using an eval function, but do not allow existing code to be mutated. The illusion of modification (even though no machine code is really being overwritten) is achieved by modifying function pointers, as in this JavaScript example:

    var f = function (x) {return x + 1};

    // assign a new definition to f:
    f = new Function('x', 'return x + 2');

Lisp macros also allow runtime code generation without parsing a string containing program code.

The Push programming language is a genetic programming system that is explicitly designed for creating self-modifying programs. While not a high level language, it is not as low level as assembly language.

Compound modification

Prior to the advent of multiple windows, command-line systems might offer a menu system involving the modification of a running command script. Suppose a DOS script (or "batch") file Menu.bat contains the following:

   :StartAfresh                <-a i="" line="">starting

with a colon marks a label. ShowMenu.exe

Upon initiation of Menu.bat from the command line, ShowMenu presents an on-screen menu, with possible help information, example usages and so forth. Eventually the user makes a selection that requires a command somename to be performed: ShowMenu exits after rewriting the file Menu.bat to contain

   :StartAfresh
   ShowMenu.exe
   CALL C:\Commands\somename.bat
   GOTO StartAfresh

Because the DOS command interpreter does not compile a script file and then execute it, nor does it read the entire file into memory before starting execution, nor yet rely on the content of a record buffer, when ShowMenu exits, the command interpreter finds a new command to execute (it is to invoke the script file somename, in a directory location and via a protocol known to ShowMenu), and after that command completes, it goes back to the start of the script file and reactivates ShowMenu ready for the next selection. Should the menu choice be to quit, the file would be rewritten back to its original state. Although this starting state has no use for the label, it, or an equivalent amount of text is required, because the DOS command interpreter recalls the byte position of the next command when it is to start the next command, thus the re-written file must maintain alignment for the next command start point to indeed be the start of the next command.

Aside from the convenience of a menu system (and possible auxiliary features), this scheme means that the ShowMenu.exe system is not in memory when the selected command is activated, a significant advantage when memory is limited.

Control tables

Control table interpreters can be considered to be, in one sense, 'self-modified' by data values extracted from the table entries (rather than specifically hand coded in conditional statements of the form "IF inputx = 'yyy'").

Channel programs

Some IBM access methods traditionally used self-modifying channel programs, where a value, such as a disk address, is read into an area referenced by a channel program, where it is used by a later channel command to access the disk.

History

The IBM SSEC, demonstrated in January 1948, had the ability to modify its instructions or otherwise treat them exactly like data. However, the capability was rarely used in practice. In the early days of computers, self-modifying code was often used to reduce use of limited memory, or improve performance, or both. It was also sometimes used to implement subroutine calls and returns when the instruction set only provided simple branching or skipping instructions to vary the control flow. This use is still relevant in certain ultra-RISC architectures, at least theoretically; see for example one instruction set computer. Donald Knuth's MIX architecture also used self-modifying code to implement subroutine calls.

Usage

Self-modifying code can be used for various purposes:

Semi-automatic optimizing of a state-dependent loop.
Run-time code generation, or specialization of an algorithm in runtime or loadtime (which is popular, for example, in the domain of real-time graphics) such as a general sort utility – preparing code to perform the key comparison described in a specific invocation.
Altering of inlined state of an object, or simulating the high-level construction of closures.
Patching of subroutine (pointer) address calling, usually as performed at load/initialization time of dynamic libraries, or else on each invocation, patching the subroutine's internal references to its parameters so as to use their actual addresses. (i.e. Indirect 'self-modification').
Evolutionary computing systems such as genetic programming.
Hiding of code to prevent reverse engineering (by use of a disassembler or debugger) or to evade detection by virus/spyware scanning software and the like.
Filling 100% of memory (in some architectures) with a rolling pattern of repeating opcodes, to erase all programs and data, or to burn-in hardware.
Compressing code to be decompressed and executed at runtime, e.g., when memory or disk space is limited.
Some very limited instruction sets leave no option but to use self-modifying code to perform certain functions. For example, a one instruction set computer (OISC) machine that uses only the subtract-and-branch-if-negative "instruction" cannot do an indirect copy (something like the equivalent of "*a = **b" in the C language) without using self-modifying code.
Booting. Early microcomputers often used self-modifying code in their bootloaders. Since the bootloader was keyed in via the front panel at every power-on, it did not matter if the bootloader modified itself. However, even today many bootstrap loaders are self-relocating, and a few are even self-modifying.
Altering instructions for fault-tolerance.

Optimizing a state-dependent loop

Pseudocode example:

repeat N times {
   if STATE is 1
      increase A by one
   else
      decrease A by one
   do something with A
}

Self-modifying code, in this case, would simply be a matter of rewriting the loop like this:

 repeat N times {
    increase A by one
    do something with A
    when STATE has to switch {
       replace the opcode "increase" above with the opcode to decrease, or vice versa
    }
 }

Note that 2-state replacement of the opcode can be easily written as 'xor var at address with the value "opcodeOf(Inc) xor opcodeOf(dec)"'.

Choosing this solution must depend on the value of 'N' and the frequency of state changing.

Specialization

Suppose a set of statistics such as average, extrema, location of extrema, standard deviation, etc. are to be calculated for some large data set. In a general situation, there may be an option of associating weights with the data, so each x_i is associated with a w_i and rather than test for the presence of weights at every index value, there could be two versions of the calculation, one for use with weights and one not, with one test at the start. Now consider a further option, that each value may have associated with it a boolean to signify whether that value is to be skipped or not. This could be handled by producing four batches of code, one for each permutation and code bloat results. Alternatively, the weight and the skip arrays could be merged into a temporary array (with zero weights for values to be skipped), at the cost of processing and still there is bloat. However, with code modification, to the template for calculating the statistics could be added as appropriate the code for skipping unwanted values, and for applying weights. There would be no repeated testing of the options and the data array would be accessed once, as also would the weight and skip arrays, if involved.

Use as camouflage

Self-modifying code was used to hide copy protection instructions in 1980s disk-based programs for platforms such as IBM PC and Apple II. For example, on an IBM PC (or compatible), the floppy disk drive access instruction 'int 0x13' would not appear in the executable program's image but it would be written into the executable's memory image after the program started executing.

Self-modifying code is also sometimes used by programs that do not want to reveal their presence, such as computer viruses and some shellcodes. Viruses and shellcodes that use self-modifying code mostly do this in combination with polymorphic code. Modifying a piece of running code is also used in certain attacks, such as buffer overflows.

Self-referential machine learning systems

Traditional machine learning systems have a fixed, pre-programmed learning algorithm to adjust their parameters. However, since the 1980s Jürgen Schmidhuber has published several self-modifying systems with the ability to change their own learning algorithm. They avoid the danger of catastrophic self-rewrites by making sure that self-modifications will survive only if they are useful according to a user-given fitness, error or reward function.

Operating systems

Because of the security implications of self-modifying code, all of the major operating systems are careful to remove such vulnerabilities as they become known. The concern is typically not that programs will intentionally modify themselves, but that they could be maliciously changed by an exploit.

As consequence of the troubles that can be caused by these exploits, an OS feature called W^X (for "write xor execute") has been developed that prohibits a program from making any page of memory both writable and executable. Some systems prevent a writable page from ever being changed to be executable, even if write permission is removed. Other systems provide a 'back door' of sorts, allowing multiple mappings of a page of memory to have different permissions. A relatively portable way to bypass W^X is to create a file with all permissions, then map the file into memory twice. On Linux, one may use an undocumented SysV shared memory flag to get executable shared memory without needing to create a file.

Regardless, at a meta-level, programs can still modify their own behavior by changing data stored elsewhere (see metaprogramming) or via use of polymorphism.

Interaction of cache and self-modifying code

On architectures without coupled data and instruction cache (some ARM and MIPS cores) the cache synchronization must be explicitly performed by the modifying code (flush data cache and invalidate instruction cache for the modified memory area).

In some cases short sections of self-modifying code execute more slowly on modern processors. This is because a modern processor will usually try to keep blocks of code in its cache memory. Each time the program rewrites a part of itself, the rewritten part must be loaded into the cache again, which results in a slight delay, if the modified codelet shares the same cache line with the modifying code, as is the case when the modified memory address is located within a few bytes to the one of the modifying code.

The cache invalidation issue on modern processors usually means that self-modifying code would still be faster only when the modification will occur rarely, such as in the case of a state switching inside an inner loop.Most modern processors load the machine code before they execute it, which means that if an instruction that is too near the instruction pointer is modified, the processor will not notice, but instead execute the code as it was before it was modified. See prefetch input queue (PIQ). PC processors must handle self-modifying code correctly for backwards compatibility reasons but they are far from efficient at doing so.

Massalin's Synthesis kernel

The Synthesis kernel presented in Alexia Massalin's Ph.D. thesis is a tiny Unix kernel that takes a structured, or even object oriented, approach to self-modifying code, where code is created for individual quajects, like filehandles. Generating code for specific tasks allows the Synthesis kernel to (as a JIT interpreter might) apply a number of optimizations such as constant folding or common subexpression elimination.

The Synthesis kernel was very fast, but was written entirely in assembly. The resulting lack of portability has prevented Massalin's optimization ideas from being adopted by any production kernel. However, the structure of the techniques suggests that they could be captured by a higher level language, albeit one more complex than existing mid-level languages. Such a language and compiler could allow development of faster operating systems and applications.

Paul Haeberli and Bruce Karsh have objected to the "marginalization" of self-modifying code, and optimization in general, in favor of reduced development costs.

Advantages

Fast paths can be established for a program's execution, reducing some otherwise repetitive conditional branches.
Self-modifying code can improve algorithmic efficiency.

Disadvantages

Self-modifying code is harder to read and maintain because the instructions in the source program listing are not necessarily the instructions that will be executed. Self-modification that consists of substitution of function pointers might not be as cryptic, if it is clear that the names of functions to be called are placeholders for functions to be identified later.

Self-modifying code can be rewritten as code that tests a flag and branches to alternative sequences based on the outcome of the test, but self-modifying code typically runs faster.

On modern processors with an instruction pipeline, code that modifies itself frequently may run more slowly, if it modifies instructions that the processor has already read from memory into the pipeline. On some such processors, the only way to ensure that the modified instructions are executed correctly is to flush the pipeline and reread many instructions.

Self-modifying code cannot be used at all in some environments, such as the following:

Application software running under an operating system with strict W^X security cannot execute instructions in pages it is allowed to write to—only the operating system is allowed to both write instructions to memory and later execute those instructions.
Many Harvard architecture microcontrollers cannot execute instructions in read-write memory, but only instructions in memory that it cannot write to, ROM or non-self-programmable flash memory.
A multithreaded application may have several threads executing the same section of self-modifying code, possibly resulting in computation errors and application failures.