Wednesday, April 22, 2026

Extensible programming

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Extensible_programming

In computer science, extensible programming is a style of computer programming that focuses on mechanisms to extend the programming language, compiler, and runtime system (environment). Extensible programming languages, supporting this style of programming, were an active area of work in the 1960s, but the movement was marginalized in the 1970s. Extensible programming has become a topic of renewed interest in the 21st century.

Historical movement

The first paper usually associated with the extensible programming language movement is M. Douglas McIlroy's 1960 paper on macros for high-level programming languages. Another early description of the principle of extensibility occurs in Brooker and Morris's 1960 paper on the compiler-compiler. The peak of the movement was marked by two academic symposia, in 1969 and 1971. By 1975, a survey article on the movement by Thomas A. Standish was essentially a post mortem. Forth was an exception, but it went essentially unnoticed.

Character of the historical movement

As typically envisioned, an extensible language consisted of a base language providing elementary computing facilities, and a metalanguage able to modify the base language. A program then consisted of metalanguage modifications and code in the modified base language.

The most prominent language-extension technique used in the movement was macro definition. Grammar modification was also closely associated with the movement, resulting in the eventual development of adaptive grammar formalisms. The Lisp language community remained separate from the extensible language community, apparently because, as one researcher observed,

any programming language in which programs and data are essentially interchangeable can be regarded as an extendible [sic] language. ... this can be seen very easily from the fact that Lisp has been used as an extendible language for years.

At the 1969 conference, Simula was presented as an extensible language.

Standish described three classes of language extension, which he named paraphrase, orthophrase, and metaphrase (paraphrase and metaphrase are otherwise terms from translation).

  • Paraphrase defines a facility by showing how to exchange it for something formerly defined (or to be defined). As examples, he mentions macro definitions, ordinary procedure definitions, grammatical extensions, data definitions, operator definitions, and control structure extensions.
  • Orthophrase adds features to a language that could not be achieved using the base language, such as adding an input/output (I/O) system to a base language formerly with no I/O primitives. Extensions must be understood as orthophrase relative to some given base language, since a feature not defined in terms of the base language must be defined in terms of some other language. This corresponds to the modern notion of plug-ins.
  • Metaphrase modifies the interpretation rules used for pre-existing expressions. This corresponds to the modern notion of reflective programming (reflection).

Death of the historical movement

Standish attributed the failure of the extensibility movement to the difficulty of programming successive extensions. A programmer might build a first shell of macros around a base language. Then, if a second shell of macros is built around that, any subsequent programmer must be intimately familiar with both the base language and the first shell. A third shell would require familiarity with the base and both the first and second shells, and so on. Shielding a programmer from lower-level details is the intent of the abstraction movement that supplanted the extensibility movement.

Despite the earlier presentation of Simula as extensible, Standish's 1975 survey does not seem in practice to have included the newer abstraction-based technologies (though he used a very general definition of extensibility that technically could have included them). A 1978 history of programming abstraction, covering the period from the invention of the computer up to that year, made no mention of macros and gave no hint that the extensible languages movement had ever occurred. Macros were tentatively admitted into the abstraction movement by the late 1980s (perhaps due to the advent of hygienic macros), under the pseudonym syntactic abstractions.

Modern movement

In the modern sense, a system that supports extensible programming will provide all of the features described below.

Extensible syntax

This simply means that the source language(s) to be compiled must not be closed, fixed, or static. It must be possible to add new keywords, concepts, and structures to the source language(s). Languages which allow the addition of constructs with user-defined syntax include Rocq, Racket, Camlp4, OpenC++, Seed7, Red, Rebol, and Felix. While it is acceptable for some fundamental and intrinsic language features to be immutable, the system must not rely solely on those language features. It must be possible to add new ones.

Extensible compiler

In extensible programming, a compiler is not a monolithic program that converts source code input into binary executable output. The compiler itself must be extensible to the point that it is really a collection of plugins that assist with the translation of source language input into anything. For example, an extensible compiler will support the generation of object code, code documentation, re-formatted source code, or any other desired output. The architecture of the compiler must permit its users to "get inside" the compilation process and provide alternative processing tasks at every reasonable step in the compilation process.

For just the task of translating source code into something that can be executed on a computer, an extensible compiler should:

  • use a plug-in or component architecture for nearly every aspect of its function
  • determine which language or language variant is being compiled and locate the appropriate plug-in to recognize and validate that language
  • use formal language specifications to syntactically and structurally validate arbitrary source languages
  • assist with the semantic validation of arbitrary source languages by invoking an appropriate validation plug-in
  • allow users to select from different kinds of code generators so that the resulting executable can be targeted for different processors, operating systems, virtual machines, or other execution environments
  • provide facilities for error generation and extensions to it
  • allow new kinds of nodes in the abstract syntax tree (AST)
  • allow new values in nodes of the AST
  • allow new kinds of edges between nodes,
  • support the transformation of the input AST, or portions thereof, by some external "pass"
  • support the translation of the input AST, or portions thereof, into another form by some external "pass"
  • assist with the flow of information between internal and external passes as they both transform and translate the AST into new ASTs or other representations
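
The pass-oriented items in the list above can be illustrated with a minimal sketch in Python, using the standard `ast` module as a stand-in for a compiler front end. The registry `PASSES`, the decorator `register_pass`, the pass `FoldConstantAddition`, and the function `compile_source` are hypothetical names chosen for this illustration, not part of any real extensible compiler. The sketch assumes Python 3.9 or later, since it relies on `ast.unparse`.

```python
import ast

# Hypothetical registry of external passes (plug-ins) for the toy compiler.
PASSES = []

def register_pass(cls):
    """Register an AST-to-AST pass with the toy compiler."""
    PASSES.append(cls)
    return cls

@register_pass
class FoldConstantAddition(ast.NodeTransformer):
    """Replace `<number> + <number>` with its computed value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # transform the operands first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.left.value, (int, float))
                and isinstance(node.right.value, (int, float))):
            folded = ast.Constant(value=node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node

def compile_source(source, backend="code"):
    """Parse, run every registered pass, then hand the tree to a back end."""
    tree = ast.parse(source)
    for pass_cls in PASSES:
        tree = pass_cls().visit(tree)
    ast.fix_missing_locations(tree)
    if backend == "code":      # back end 1: an executable code object
        return compile(tree, "<toy>", "exec")
    if backend == "source":    # back end 2: re-formatted source text
        return ast.unparse(tree)
    raise ValueError(f"unknown backend: {backend}")

print(compile_source("x = 1 + 2 + y", backend="source"))  # x = 3 + y
```

Here the same transformed tree feeds two different outputs, which mirrors the idea that code generation is only one of several possible translations.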

Extensible runtime

At runtime, extensible programming systems must permit languages to extend the set of operations that the system permits. For example, if the system uses a byte-code interpreter, it must allow new byte-code values to be defined. As with extensible syntax, it is acceptable for there to be some (smallish) set of fundamental or intrinsic operations that are immutable. However, it must be possible to overload or augment those intrinsic operations so that new or additional behavior can be supported.
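
As a rough illustration of a byte-code interpreter that accepts new byte-code values, the following Python sketch models a toy stack machine. The class `TinyVM`, its method `define_op`, and the opcode numbers are all assumptions made up for this example.

```python
class TinyVM:
    """A toy stack machine whose opcode table can be extended at run time."""

    def __init__(self):
        # A small set of intrinsic operations (opcode values are arbitrary).
        self.ops = {
            0x01: self._push,
            0x02: self._add,
        }

    def _push(self, stack, arg):
        stack.append(arg)

    def _add(self, stack, arg):
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)

    def define_op(self, opcode, handler):
        """Add a new opcode, or augment the table by replacing a handler."""
        self.ops[opcode] = handler

    def run(self, program):
        """Execute a list of (opcode, argument) pairs and return the stack."""
        stack = []
        for opcode, arg in program:
            self.ops[opcode](stack, arg)
        return stack

def op_mul(stack, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

vm = TinyVM()
vm.define_op(0x10, op_mul)  # 0x10 is a hypothetical, newly defined opcode

# (2 + 3) * 4
result = vm.run([(0x01, 2), (0x01, 3), (0x02, None), (0x01, 4), (0x10, None)])
print(result)  # [20]
```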

Content separated from form

Extensible programming systems should regard programs as data to be processed. Those programs should be completely devoid of any kind of formatting information. The visual display and editing of programs to users should be a translation function, supported by the extensible compiler, that translates the program data into forms more suitable for viewing or editing. Naturally, this should be a two-way translation. This is important because it must be possible to easily process extensible programs in a variety of ways. It is unacceptable for the only uses of source language input to be editing, viewing and translation to machine code. The arbitrary processing of programs is facilitated by de-coupling the source input from specifications of how it should be processed (formatted, stored, displayed, edited, etc.).
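
A small sketch of the idea, assuming Python 3.9 or later and using the standard `ast` module: two differently formatted texts are shown to denote the same program data, and a display form is produced by translating that data back to text. The variable names are illustrative only. Python's `ast` discards comments, so this is only an approximation of a fully two-way translation.

```python
import ast

# Two presentations of what is meant to be the same program.
source_a = "def area(w,h):   return w*h"
source_b = """
def area(w, h):
    # compute the area
    return w * h
"""

tree_a = ast.parse(source_a)
tree_b = ast.parse(source_b)

# ast.dump omits line and column attributes by default, so the comparison
# looks only at program structure, not at layout.
same_program = ast.dump(tree_a) == ast.dump(tree_b)

# Translating the program data into a form suitable for viewing.
rendered = ast.unparse(tree_a)

# Parsing the rendered text yields the same program data again.
round_trip_ok = ast.dump(ast.parse(rendered)) == ast.dump(tree_a)

print(same_program, round_trip_ok)
print(rendered)
```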

Source language debugging support

Extensible programming systems must support the debugging of programs using the constructs of the original source language regardless of the extensions or transformation the program has undergone in order to make it executable. Most notably, it cannot be assumed that the only way to display runtime data is in structures or arrays. The debugger, or more correctly 'program inspector', must permit the display of runtime data in forms suitable to the source language. For example, if the language supports a data structure for a business process or work flow, it must be possible for the debugger to display that data structure as a fishbone chart or other form provided by a plugin.

Self-modifying code

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Self-modifying_code

In computer science, self-modifying code (SMC or SMoC) is code that alters its own instructions while it is executing – usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance. The term is usually only applied to code where the self-modification is intentional, not in situations where code accidentally modifies itself due to an error such as a buffer overflow.

Self-modifying code can involve overwriting existing instructions or generating new code at run time and transferring control to that code.

Self-modification can be used as an alternative to the method of "flag setting" and conditional program branching, used primarily to reduce the number of times a condition needs to be tested.

The method is frequently used for conditionally invoking test/debugging code without requiring additional computational overhead for every input/output cycle.

The modifications may be performed:

  • only during initialization – based on input parameters (when the process is more commonly described as software "configuration" and is somewhat analogous, in hardware terms, to setting jumpers for printed circuit boards). Alteration of program entry pointers is an equivalent indirect method of self-modification, but requiring the co-existence of one or more alternative instruction paths, increasing the program size.
  • throughout execution ("on the fly") – based on particular program states that have been reached during the execution

In either case, the modifications may be performed directly to the machine code instructions themselves, by overlaying new instructions over the existing ones (for example, altering a compare and branch to an unconditional branch or alternatively a NOP).

In the IBM System/360 architecture and its successors up to z/Architecture, an EXECUTE (EX) instruction logically overlays the second byte of its target instruction with the low-order 8 bits of register 1. This provides the effect of self-modification, although the actual instruction in storage is not altered.

Application in low and high level languages

Self-modification can be accomplished in a variety of ways depending upon the programming language and its support for pointers and/or access to dynamic compiler or interpreter "engines":

  • overlay of existing instructions (or parts of instructions such as opcode, register, flags or addresses)
  • direct creation of whole instructions or sequences of instructions in memory
  • creation or modification of source code statements followed by a "mini compile" or a dynamic interpretation (see eval statement)
  • creating an entire program dynamically and then executing it

Assembly language

Self-modifying code is quite straightforward to implement when using assembly language. Instructions can be dynamically created in memory (or else overlaid over existing code in non-protected program storage), in a sequence equivalent to the ones that a standard compiler may generate as the object code. With modern processors, there can be unintended side effects on the CPU cache that must be considered. The method was frequently used for testing "first time" conditions, as in this suitably commented IBM/360 assembler example. It uses instruction overlay to reduce the instruction path length by (N × 1) − 1, where N is the number of records on the file (−1 being the overhead to perform the overlay).

SUBRTN NOP   OPENED           FIRST TIME HERE?
* The NOP is x'4700'<Address_of_opened>
       OI    SUBRTN+1,X'F0'   YES, CHANGE NOP TO UNCONDITIONAL BRANCH (47F0...)
       OPEN  INPUT            AND OPEN THE INPUT FILE SINCE IT'S THE FIRST TIME THRU
OPENED GET   INPUT            NORMAL PROCESSING RESUMES HERE
       ...

Alternative code might involve testing a "flag" each time through. The unconditional branch is slightly faster than a compare instruction, as well as reducing the overall path length. In later operating systems, for programs residing in protected storage, this technique could not be used, and so changing the pointer to the subroutine would be used instead. The pointer would reside in dynamic storage and could be altered at will after the first pass to bypass the OPEN. Having to load a pointer first, instead of a direct branch and link to the subroutine, would add N instructions to the path length, but there would be a corresponding reduction of N for the unconditional branch that would no longer be required.
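
The pointer-based alternative can be sketched in a high-level language. The following Python fragment is only an analogy, not a translation of the assembler above: the dictionary `dispatch` plays the role of the pointer in dynamic storage, and the names `open_input`, `get_record`, `first_call`, and `later_calls` are hypothetical.

```python
log = []

def open_input():
    log.append("OPEN")

def get_record(n):
    log.append(f"GET {n}")

def later_calls(n):
    """Steady-state path: normal processing only."""
    get_record(n)

def first_call(n):
    """Runs once: open the file, then repoint the entry at the steady-state path."""
    open_input()
    dispatch["subrtn"] = later_calls  # alter the pointer after the first pass
    later_calls(n)

# The "pointer in dynamic storage" through which every call is made.
dispatch = {"subrtn": first_call}

for i in range(3):
    dispatch["subrtn"](i)

print(log)  # ['OPEN', 'GET 0', 'GET 1', 'GET 2']
```

No flag is tested on any pass; the cost is one extra indirection per call.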

Below is an example in Zilog Z80 assembly language. The code increments register B in range [0, 5]. The CP compare instruction is modified on each loop.

;==========
ORG 0H
CALL FUNC00
HALT
;==========
FUNC00:
LD A,6
LD HL,label01+1
LD B,(HL)
label00:
INC B
LD (HL),B
label01:
CP $0
JP NZ,label00
RET
;==========

Self-modifying code is sometimes used to overcome limitations in a machine's instruction set. For example, in the Intel 8080 instruction set, one cannot input a byte from an input port that is specified in a register. The input port is statically encoded in the instruction itself, as the second byte of a two-byte instruction. Using self-modifying code, it is possible to store a register's contents into the second byte of the instruction, then execute the modified instruction in order to achieve the desired effect.
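
The operand-patching idea can be modeled with a toy simulation in Python. This is a sketch under stated assumptions: the two-instruction machine, the helper names `run` and `read_port`, and the port values are invented for illustration, and the store into the instruction is done here by the host program, whereas on a real 8080 it would be an instruction of the program itself.

```python
IN_OPCODE = 0xDB    # opcode byte used for IN in this toy model
HALT_OPCODE = 0x76  # opcode byte used for HLT in this toy model

def run(memory, ports, pc=0):
    """Execute the toy machine from `pc` and return the accumulator."""
    acc = 0
    while True:
        op = memory[pc]
        if op == IN_OPCODE:
            # The port number is fixed in the instruction's second byte.
            acc = ports[memory[pc + 1]]
            pc += 2
        elif op == HALT_OPCODE:
            return acc
        else:
            raise ValueError(f"unknown opcode {op:#x}")

def read_port(memory, ports, port_in_register):
    """Read from a port chosen at run time by patching the IN operand."""
    memory[1] = port_in_register  # overwrite the second byte of the IN instruction
    return run(memory, ports)

program = bytearray([IN_OPCODE, 0x00, HALT_OPCODE])
ports = {0x10: 0xAA, 0x20: 0x55}  # hypothetical device values

values = [read_port(program, ports, p) for p in (0x10, 0x20)]
print(values)  # [170, 85]
```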

High-level languages

Some compiled languages explicitly permit self-modifying code. For example, the ALTER verb in COBOL may be implemented as a branch instruction that is modified during execution. Some batch programming techniques involve the use of self-modifying code. Clipper and SPITBOL also provide facilities for explicit self-modification. The Algol compiler on B6700 systems offered an interface to the operating system whereby executing code could pass a text string or a named disc file to the Algol compiler and was then able to invoke the new version of a procedure.

With interpreted languages, the "machine code" is the source text and may be susceptible to editing on-the-fly: in SNOBOL the source statements being executed are elements of a text array. Other languages, such as Perl and Python, allow programs to create new code at run-time and execute it using an eval function, but do not allow existing code to be mutated. The illusion of modification (even though no machine code is really being overwritten) is achieved by modifying function pointers, as in this JavaScript example:

    var f = function (x) {return x + 1};

    // assign a new definition to f:
    f = new Function('x', 'return x + 2');

Lisp macros also allow runtime code generation without parsing a string containing program code.
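
For comparison, a similar effect can be sketched in Python. The example uses the built-in `exec` function rather than `eval`, because `eval` accepts only expressions while a `def` is a statement. As in the JavaScript case, a name is rebound to newly created code and the original function object is left untouched.

```python
def f(x):
    return x + 1

before = f(1)

# Build new source text at run time and execute it in a scratch namespace.
namespace = {}
exec("def f(x):\n    return x + 2", namespace)

# Rebind the name f; the original function object is not mutated.
original = f
f = namespace["f"]
after = f(1)

print(before, after, original(1))  # 2 3 2
```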

The Push programming language is a genetic programming system that is explicitly designed for creating self-modifying programs. While not a high-level language, it is not as low-level as assembly language.

Compound modification

Prior to the advent of multiple windows, command-line systems might offer a menu system involving the modification of a running command script. Suppose an MS-DOS batch file MENU.BAT contains the following:

   :start
   SHOWMENU.EXE

Upon initiation of MENU.BAT from the command line, SHOWMENU presents an on-screen menu, with possible help information, example usages and so forth. Eventually the user makes a selection that requires a command SOMENAME to be performed: SHOWMENU exits after rewriting the file MENU.BAT to contain

   :start
   SHOWMENU.EXE
   CALL SOMENAME.BAT
   GOTO start

When SHOWMENU exits, the command interpreter finds a new command to execute: it invokes the script file SOMENAME, in a directory location and via a protocol known to SHOWMENU. This works because the command interpreter does not compile a script file before executing it, does not read the entire file into memory before starting execution, and does not rely on the content of a record buffer. After that command completes, the interpreter goes back to the start of the script file and reactivates SHOWMENU, ready for the next selection. Should the menu choice be to quit, the file would be rewritten back to its original state. Although this starting state has no use for the label, the label or an equivalent amount of text is required: the command interpreter recalls the byte position of the next command when it is about to start it, so the rewritten file must maintain alignment for the next command's start point to indeed be the start of the next command.

Aside from the convenience of a menu system (and possible auxiliary features), this scheme means that the SHOWMENU.EXE system is not in memory when the selected command is activated, a significant advantage when memory is limited.

Control tables

Control table interpreters can be considered to be, in one sense, "self-modified" by data values extracted from the table entries (rather than specifically hand coded in conditional statements of the form IF inputx = 'yyy').
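
A minimal sketch of a control table in Python follows; the table `CONTROL_TABLE`, the handler names, and the function `interpret` are hypothetical. The interpreter's code never changes, but its behavior is determined by, and can be altered through, the table entries.

```python
def handle_add(a, b):
    return a + b

def handle_sub(a, b):
    return a - b

# The control table: input values mapped to the action to perform,
# in place of a chain of IF inputx = 'yyy' tests.
CONTROL_TABLE = {
    "add": handle_add,
    "sub": handle_sub,
}

def interpret(op, a, b):
    """Look the operation up in the table and run its handler."""
    return CONTROL_TABLE[op](a, b)

first = interpret("add", 6, 3)

# Changing the data changes the behavior; interpret() itself is untouched.
CONTROL_TABLE["mul"] = lambda a, b: a * b
second = interpret("mul", 6, 3)

print(first, second)  # 9 18
```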

Channel programs

Some IBM access methods traditionally used self-modifying channel programs, where a value, such as a disk address, is read into an area referenced by a channel program, where it is used by a later channel command to access the disk.

History

The IBM SSEC, demonstrated in January 1948, had the ability to modify its instructions or otherwise treat them exactly like data. However, the capability was rarely used in practice. In the early days of computers, self-modifying code was often used to reduce use of limited memory, or improve performance, or both. It was also sometimes used to implement subroutine calls and returns when the instruction set only provided simple branching or skipping instructions to vary the control flow. This use is still relevant in certain ultra-RISC architectures, at least theoretically; see for example one-instruction set computer. Donald Knuth's MIX architecture also used self-modifying code to implement subroutine calls.

Usage

Self-modifying code can be used for various purposes:

  • Semi-automatic optimizing of a state-dependent loop.
  • Dynamic in-place code optimization for speed depending on load environment.
  • Run-time code generation, or specialization of an algorithm in runtime or loadtime (which is popular, for example, in the domain of real-time graphics) such as a general sort utility – preparing code to perform the key comparison described in a specific invocation.
  • Altering of inlined state of an object, or simulating the high-level construction of closures.
  • Patching of subroutine (pointer) address calling, usually as performed at load/initialization time of dynamic libraries, or else on each invocation, patching the subroutine's internal references to its parameters so as to use their actual addresses (i.e. indirect self-modification).
  • Evolutionary computing systems such as neuroevolution, genetic programming and other evolutionary algorithms.
  • Hiding of code to prevent reverse engineering (by use of a disassembler or debugger) or to evade detection by virus/spyware scanning software and the like.
  • Filling all memory (in some architectures) with a rolling pattern of repeating opcodes, to erase all programs and data, or to burn-in hardware or perform RAM tests.
  • Compressing code to be decompressed and executed at runtime, e.g., when memory or disk space is limited.
  • Some very limited instruction sets leave no option but to use self-modifying code to perform certain functions. For example, a one-instruction set computer (OISC) machine that uses only the subtract-and-branch-if-negative "instruction" cannot do an indirect copy (something like the equivalent of *a = **b in the C language) without using self-modifying code.
  • Booting. Early microcomputers often used self-modifying code in their bootloaders. Since the bootloader was keyed in via the front panel at every power-on, it did not matter if the bootloader modified itself. However, even today many bootstrap loaders are self-relocating, and a few are even self-modifying.
  • Altering instructions for fault-tolerance.

Optimizing a state-dependent loop

Pseudocode example:

repeat N times {
    if STATE is 1
        increase A by one
    else
        decrease A by one
    do something with A
}

Self-modifying code, in this case, would simply be a matter of rewriting the loop like this:

repeat N times {
    increase A by one
    do something with A
    when STATE has to switch {
        replace the opcode "increase" above with the opcode to decrease, or vice versa
    }
}

Note that two-state replacement of the opcode can be easily written as "xor var at address with the value opcodeOf(inc) xor opcodeOf(dec)".

Whether to choose this solution depends on the value of N and the frequency of state changes.
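
The XOR trick can be checked with a small simulation in Python. The opcode values `OP_INC` and `OP_DEC` are arbitrary and the function `run_loop` is a hypothetical helper. The `if` inside the loop stands in for the hardware's instruction decode, so it is not the per-iteration STATE test that the technique removes.

```python
OP_INC, OP_DEC = 0x3C, 0x3D  # arbitrary illustrative opcode values
TOGGLE = OP_INC ^ OP_DEC     # opcodeOf(inc) xor opcodeOf(dec)

def run_loop(n, switch_at):
    """Run the loop n times, flipping the opcode at the given iterations."""
    code = bytearray([OP_INC])  # the single "instruction" in the loop body
    a = 0
    trace = []
    for i in range(n):
        if i in switch_at:
            code[0] ^= TOGGLE   # flips INC to DEC, or DEC back to INC
        # Simulated decode and execute of the instruction stored in code[0].
        if code[0] == OP_INC:
            a += 1
        else:
            a -= 1
        trace.append(a)         # "do something with A"
    return trace

print(run_loop(6, {3}))     # [1, 2, 3, 2, 1, 0]
print(run_loop(4, {1, 2}))  # [1, 0, 1, 2]
```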

Specialization

Suppose a set of statistics such as average, extrema, location of extrema, standard deviation, etc. are to be calculated for some large data set. In a general situation, there may be an option of associating weights with the data, so each x_i is associated with a w_i. Rather than test for the presence of weights at every index value, there could be two versions of the calculation, one for use with weights and one without, with one test at the start. Now consider a further option, that each value may have associated with it a Boolean to signify whether that value is to be skipped or not. This could be handled by producing four batches of code, one for each combination of options, but code bloat results. Alternatively, the weight and the skip arrays could be merged into a temporary array (with zero weights for values to be skipped), at the cost of extra processing, and still there is bloat. However, with code modification, the code for skipping unwanted values and for applying weights could be added to the template for calculating the statistics as appropriate. There would be no repeated testing of the options and the data array would be accessed once, as also would the weight and skip arrays, if involved.
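
One way to picture the template approach is run-time source generation. The Python sketch below builds a mean function containing only the code needed for the options in use; the function `make_mean` and its parameters are hypothetical, and only the mean is computed, for brevity.

```python
def make_mean(weighted, skipping):
    """Build a mean function specialized for the options actually in use."""
    lines = [
        "def mean(x, w=None, skip=None):",
        "    total = 0.0",
        "    norm = 0.0",
        "    for i in range(len(x)):",
    ]
    if skipping:   # code for skipping is added only when needed
        lines.append("        if skip[i]: continue")
    if weighted:   # code for weights is added only when needed
        lines += ["        total += w[i] * x[i]",
                  "        norm += w[i]"]
    else:
        lines += ["        total += x[i]",
                  "        norm += 1.0"]
    lines.append("    return total / norm")

    namespace = {}
    exec("\n".join(lines), namespace)  # compile the specialized template
    return namespace["mean"]

plain = make_mean(weighted=False, skipping=False)
full = make_mean(weighted=True, skipping=True)

print(plain([1, 2, 3]))                                   # 2.0
print(full([1, 2, 4], [1, 1, 3], [False, True, False]))   # 3.25
```

The options are tested once, when the function is built, and the generated loop contains no test for whether weights or skip flags are present.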

Use as camouflage

Self-modifying code is more complex to analyze than standard code and can therefore be used as a protection against reverse engineering and software cracking. Self-modifying code was used to hide copy-protection instructions in 1980s disk-based programs for systems such as IBM PC compatibles and Apple II. For example, on an IBM PC, the floppy disk drive access instruction int 0x13 would not appear in the executable program's image but would be written into the executable's memory image after the program started executing.

Self-modifying code is also sometimes used by programs that do not want to reveal their presence, such as computer viruses and some shellcodes. Viruses and shellcodes that use self-modifying code mostly do this in combination with polymorphic code. Modifying a piece of running code is also used in certain attacks, such as buffer overflows.

Self-referential machine-learning systems

Traditional machine-learning systems have a fixed, pre-programmed learning algorithm to adjust their parameters. However, since the 1980s Jürgen Schmidhuber has published several self-modifying systems with the ability to change their own learning algorithm. They avoid the danger of catastrophic self-rewrites by making sure that self-modifications will survive only if they are useful according to a user-given fitness, error or reward function.

Operating systems

The Linux kernel notably makes wide use of self-modifying code; it does so to be able to distribute a single binary image for each major architecture (e.g. IA-32, x86-64, 32-bit ARM, ARM64...) while adapting the kernel code in memory during boot depending on the specific CPU model detected, e.g. to be able to take advantage of new CPU instructions or to work around hardware bugs. To a lesser extent, the DR-DOS kernel also optimizes speed-critical sections of itself at loadtime depending on the underlying processor generation.

Regardless, at a meta-level, programs can still modify their own behavior by changing data stored elsewhere (see metaprogramming) or via use of polymorphism.

Massalin's Synthesis kernel

The Synthesis kernel presented in Alexia Massalin's Ph.D. thesis is a tiny Unix kernel that takes a structured, or even object-oriented, approach to self-modifying code, where code is created for individual quajects, like filehandles. Generating code for specific tasks allows the Synthesis kernel to (as a JIT interpreter might) apply a number of optimizations such as constant folding or common subexpression elimination.

The Synthesis kernel was very fast, but was written entirely in assembly. The resulting lack of portability has prevented Massalin's optimization ideas from being adopted by any production kernel. However, the structure of the techniques suggests that they could be captured by a higher-level language, albeit one more complex than existing mid-level languages. Such a language and compiler could allow development of faster operating systems and applications.

Paul Haeberli and Bruce Karsh have objected to the "marginalization" of self-modifying code, and optimization in general, in favor of reduced development costs.

Interaction of cache and self-modifying code

On architectures without coupled data and instruction cache (for example, some SPARC, ARM, and MIPS cores) the cache synchronization must be explicitly performed by the modifying code (flush data cache and invalidate instruction cache for the modified memory area).

In some cases short sections of self-modifying code execute more slowly on modern processors. This is because a modern processor will usually try to keep blocks of code in its cache memory. Each time the program rewrites a part of itself, the rewritten part must be loaded into the cache again. This causes a slight delay when the modified codelet shares a cache line with the modifying code, as happens when the modified memory address lies within a few bytes of the modifying code.

The cache invalidation issue on modern processors usually means that self-modifying code would still be faster only when the modification will occur rarely, such as in the case of a state switching inside an inner loop.

Most modern processors load the machine code before they execute it, which means that if an instruction that is too near the instruction pointer is modified, the processor will not notice, but instead execute the code as it was before it was modified. See prefetch input queue (PIQ). PC processors must handle self-modifying code correctly for backwards compatibility reasons but they are far from efficient at doing so.

Security issues

Because of the security implications of self-modifying code, all of the major operating systems are careful to remove such vulnerabilities as they become known. The concern is typically not that programs will intentionally modify themselves, but that they could be maliciously changed by an exploit.

One mechanism for preventing malicious code modification is an operating system feature called W^X (for "write xor execute"). This mechanism prohibits a program from making any page of memory both writable and executable. Some systems prevent a writable page from ever being changed to be executable, even if write permission is removed.[citation needed] Other systems provide a "backdoor" of sorts, allowing multiple mappings of a page of memory to have different permissions. A relatively portable way to bypass W^X is to create a file with all permissions, then map the file into memory twice. On Linux, one may use an undocumented SysV shared-memory flag to get executable shared memory without needing to create a file.

Advantages

Disadvantages

Self-modifying code is harder to read and maintain because the instructions in the source program listing are not necessarily the instructions that will be executed. Self-modification that consists of substitution of function pointers might not be as cryptic, if it is clear that the names of functions to be called are placeholders for functions to be identified later.

Self-modifying code can be rewritten as code that tests a flag and branches to alternative sequences based on the outcome of the test, but self-modifying code typically runs faster.

Self-modifying code conflicts with authentication of the code and may require exceptions to policies requiring that all code running on a system be signed.

Modified code must be stored separately from its original form, conflicting with memory management solutions that normally discard the code in RAM and reload it from the executable file as needed.

On modern processors with an instruction pipeline, code that modifies itself frequently may run more slowly, if it modifies instructions that the processor has already read from memory into the pipeline. On some such processors, the only way to ensure that the modified instructions are executed correctly is to flush the pipeline and reread many instructions.

Self-modifying code cannot be used at all in some environments, such as the following:

  • Application software running under an operating system with strict W^X security cannot execute instructions in pages it is allowed to write to—only the operating system is allowed to both write instructions to memory and later execute those instructions.
  • Many Harvard architecture microcontrollers cannot execute instructions in read-write memory, but only instructions in memory that they cannot write to, such as ROM or non-self-programmable flash memory.
  • A multithreaded application may have several threads executing the same section of self-modifying code, possibly resulting in computation errors and application failures.

Domain-specific language

From Wikipedia, the free encyclopedia

A domain-specific language (DSL) is a computer language specialized to a specific application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging from widely used languages for common domains, such as HTML for web pages, down to languages used by only one or a few pieces of software, such as MUSH soft code. DSLs can be further subdivided by the kind of language, and include domain-specific markup languages, domain-specific modeling languages (more generally, specification languages), and domain-specific programming languages. Special-purpose computer languages have always existed in the computer age, but the term "domain-specific language" has become more popular due to the rise of domain-specific modeling. Simpler DSLs, specifically ones used by a single application, are sometimes informally called mini-languages.

The line between general-purpose languages and domain-specific languages is not always sharp, as a language may have specialized features for a given domain, but be applicable more broadly, or conversely may in principle be capable of broad application but in practice used mainly for a specific domain. For example, Perl was originally developed as a text-processing and glue language, for the same domain as AWK and shell scripts, but was mostly used as a general-purpose programming language later on. In contrast, PostScript is a Turing-complete language, and in principle can be used for any task, but in practice is narrowly used as a page description language.

Use

The design and use of appropriate DSLs is a key part of domain engineering, by using a language suitable to the domain at hand – this may consist of using an existing DSL or GPL, or developing a new DSL. Language-oriented programming considers the creation of special-purpose languages for expressing problems as a standard part of the problem-solving process. Creating a domain-specific language (with software to support it), rather than reusing an existing language, can be worthwhile if the language allows a specific type of problem or solution to be expressed more clearly than an existing language would allow and the type of problem in question reappears sufficiently often. Pragmatically, a DSL may be specialized to a specific problem domain, a specific problem representation technique, a specific solution technique, or other aspects of a domain.

Overview

A domain-specific language is created specifically to solve problems in a specific domain and is not intended to be able to solve problems outside of it (although that may be technically possible). In contrast, general-purpose languages are created to solve problems in many domains. The domain can also be a business area. Some examples of business areas include:

  • life insurance policies (developed internally by a large insurance enterprise)
  • combat simulation
  • salary calculation
  • billing

A domain-specific language is somewhere between a tiny programming language and a scripting language, and is often used in a way analogous to a programming library. The boundaries between these concepts are quite blurry, much like the boundary between scripting languages and general-purpose languages.

In design and implementation

Domain-specific languages are languages (or often, declared syntaxes or grammars) with very specific goals in design and implementation. A domain-specific language can be a visual diagramming language, such as those created by the Generic Eclipse Modeling System; a programmatic abstraction, such as the Eclipse Modeling Framework; or a textual language. For instance, the command line utility grep has a regular expression syntax which matches patterns in lines of text. The sed utility defines a syntax for matching and replacing regular expressions. Often, these tiny languages can be used together inside a shell to perform more complex programming tasks.
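
The grep and sed mini-languages mentioned above can be approximated from a general-purpose language. The following is a minimal sketch in Python, assuming the standard re module as a stand-in for the grep and sed regular expression syntaxes (the dialects differ in detail). The sample text and the helper names grep_lines and sed_substitute are hypothetical.

```python
import re

# Hypothetical sample input; any multi-line text would do.
LOG = """error: disk full
info: backup started
error: network down
"""

def grep_lines(pattern, text):
    """Return the lines of text that contain a match for pattern (roughly like grep)."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

def sed_substitute(pattern, replacement, text):
    """Replace every match of pattern, line by line (roughly like sed 's/.../.../g')."""
    # re.MULTILINE makes ^ and $ match at each line boundary, as sed works per line.
    return re.sub(pattern, replacement, text, flags=re.MULTILINE)

# Chain the two steps, much as a shell pipeline would chain grep and sed.
errors = grep_lines(r"^error:", LOG)
shouted = sed_substitute(r"^error:", "ERROR:", "\n".join(errors))
```

The pattern strings are the domain-specific part; the surrounding Python only feeds text in and collects the results.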

The line between domain-specific languages and scripting languages is somewhat blurred, but domain-specific languages often lack low-level functions for file system access, interprocess control, and other functions that characterize full-featured programming languages, scripting or otherwise. Many domain-specific languages do not compile to bytecode or executable code, but to various kinds of media objects: GraphViz exports to PostScript, Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG), etc., whereas Csound compiles to audio files, and a ray-tracing domain-specific language like Persistence of Vision Ray Tracer (POV-Ray) compiles to graphics files.

Data definition languages

A data definition language like SQL presents an interesting case: it can be deemed a domain-specific language because it is specific to a particular domain (in SQL's case, accessing and managing relational databases), and is often called from another application, but SQL has more keywords and functions than many scripting languages, and is often viewed as a full language, in part because of the prevalence of database manipulation in programming and the amount of mastery needed to be expert in the language.

Further blurring this line, many domain-specific languages have exposed APIs, and can be accessed from other programming languages without breaking the flow of execution or calling a separate process, and can thus operate as programming libraries.

Programming tools

Some domain-specific languages expand over time to include full-featured programming tools, which further complicates the question of whether a language is domain-specific or not. A good example is the functional language Extensible Stylesheet Language Transformation (XSLT), specifically designed to transform one XML graph into another, which has been extended since its inception to allow (specifically in its 2.0 version) various forms of file system interaction, string and date manipulation, and data typing.

In model-driven engineering, many examples of domain-specific languages may be found, such as Object Constraint Language (OCL), a language for decorating models with assertions, or Query/View/Transformation (QVT), a domain-specific transformation language. However, languages like Unified Modeling Language (UML) are typically general-purpose modeling languages.

To summarize, an analogy is useful: a very little language is like a knife, which can be used in thousands of different ways, from cutting food to self-defense. A domain-specific language is like an electric drill: it is a powerful tool with a wide variety of uses, but a specific context, namely, making holes in things. A general purpose language is a complete workbench, with a variety of tools intended for performing a variety of tasks. Domain-specific languages should be used by programmers who, looking at their current workbench, realize they need a better drill and find that a given domain-specific language provides exactly that.[citation needed]

Domain-specific language topics

External and embedded domain specific languages

DSLs implemented via an independent interpreter or compiler are termed external domain specific languages. Well known examples include TeX and AWK. A separate category, termed embedded (or internal) domain specific languages, is typically implemented within a host language as a library and tends to be limited to the syntax of the host language, though this depends on host language abilities.

Usage patterns

There are several usage patterns for domain-specific languages:

  • Processing with standalone tools, invoked via direct user operation, often on the command line or from a Makefile (e.g., grep for regular expression matching, sed, lex, yacc, the GraphViz toolset, etc.)
  • Domain-specific languages which are implemented using programming language macro systems, and which are converted or expanded into a host general purpose language at compile-time or realtime
  • An embedded domain-specific language (eDSL), also known as an internal domain-specific language, is a DSL that is implemented as a library in a "host" programming language. The embedded domain-specific language leverages the syntax, semantics and runtime environment of the host language (sequencing, conditionals, iteration, functions, etc.) and adds domain-specific primitives that allow programmers to use the "host" programming language to create programs that generate code in the "target" programming language. Multiple eDSLs can easily be combined into a single program and the facilities of the host language can be used to extend an existing eDSL. Other possible advantages of using an eDSL are improved type safety and better integrated development environment (IDE) tooling. eDSL examples: SQLAlchemy "Core", an SQL eDSL in Python; JOOQ Object Oriented Querying (jOOQ), an SQL eDSL in Java; Language Integrated Query's (LINQ) "method syntax", an SQL eDSL in C#; and kotlinx.html, an HTML eDSL in Kotlin.
  • Domain-specific languages which are called (at runtime) from programs written in general purpose languages like C or Perl, to perform a specific function, often returning the results of operation to the "host" programming language for further processing; generally, an interpreter or virtual machine for the domain-specific language is embedded into the host application (e.g., format strings, a regular expression engine)
  • Domain-specific languages which are embedded into user applications (e.g., macro languages within spreadsheets) and which are (1) used to execute code that is written by users of the application, (2) dynamically generated by the application, or (3) both.
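
The embedded (internal) pattern in the list above can be illustrated with a toy example. The following is a minimal sketch, assuming Python as the host language and SQL text as the target; the Column, Condition, and select names are hypothetical and do not reproduce the API of SQLAlchemy, jOOQ, or any other real library.

```python
class Condition:
    """A fragment of SQL text that can be combined with other fragments."""
    def __init__(self, sql):
        self.sql = sql

    def __and__(self, other):
        return Condition(f"({self.sql} AND {other.sql})")

    def __or__(self, other):
        return Condition(f"({self.sql} OR {other.sql})")


class Column:
    """A named column; host-language comparison operators build Conditions."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # repr() quotes strings and leaves numbers bare. This is for
        # illustration only and is NOT safe against SQL injection.
        return Condition(f"{self.name} = {value!r}")

    def __gt__(self, value):
        return Condition(f"{self.name} > {value!r}")


def select(table, where):
    """Generate target-language (SQL) text from a host-language expression."""
    return f"SELECT * FROM {table} WHERE {where.sql}"


age = Column("age")
city = Column("city")
query = select("users", (age > 30) & (city == "Oslo"))
# query is now: SELECT * FROM users WHERE (age > 30 AND city = 'Oslo')
```

The domain-specific primitives are ordinary Python objects, so sequencing, functions, and tooling come from the host language, while the syntax is limited to what Python's operators allow.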

Design goals

Adopting a domain-specific language approach to software engineering involves both risks and opportunities. The well-designed domain-specific language manages to find the proper balance between these.

Domain-specific languages have important design goals that contrast with those of general-purpose languages:

  • Domain-specific languages are less comprehensive.
  • Domain-specific languages are much more expressive in their domain.
  • Domain-specific languages should exhibit minimal redundancy.

Idioms

In programming, idioms are methods imposed by programmers to handle common development tasks, e.g.:

  • Ensure data is saved before the window is closed.
  • Edit code whenever command-line parameters change because they affect program behavior.

General purpose programming languages rarely support such idioms, but domain-specific languages can describe them, e.g.:

  • A script can automatically save data.
  • A domain-specific language can parameterize command line input.

Examples

Examples of domain-specific programming languages include HTML, Logo for pencil-like drawing, Verilog and VHDL hardware description languages, MATLAB and GNU Octave for matrix programming, Mathematica, Maple and Maxima for symbolic mathematics, Specification and Description Language for reactive and distributed systems, spreadsheet formulas and macros, SQL for relational database queries, Yacc grammars for creating parsers, regular expressions for specifying lexers, the Generic Eclipse Modeling System for creating diagramming languages, Csound for sound and music synthesis, the input languages of GraphViz and GrGen (software packages used for graph layout and graph rewriting), the HashiCorp Configuration Language used for Terraform and other HashiCorp tools, and Puppet's own configuration language.

GameMaker Language

GML is a domain-specific language used by GameMaker Studio designed to help novice programmers learn the fundamentals of coding more easily. It functions as a blend of several languages, including Delphi, C++, and BASIC. Most GML functions, once compiled, call runtime functions written in the specific language of the target platform; consequently, their final implementation remains hidden from the user. The language's primary goal is to lower the barrier to entry for game development. The GameMaker runtime, which manages the main game loop and handles function implementation, allows a simple game to use only a few lines of code instead of thousands.

ColdFusion Markup Language

ColdFusion's associated scripting language is another example of a domain-specific language for data-driven websites. The language is used to weave together languages and services such as Java, .NET, C++, SMS, email, email servers, http, ftp, exchange, directory services, and file systems for use in websites.

The ColdFusion Markup Language (CFML) includes a set of tags that can be used in ColdFusion pages to interact with data sources, manipulate data, and display output. CFML tag syntax is similar to HTML element syntax.

FilterMeister

FilterMeister is a programming environment, with a programming language that is based on C, for the specific purpose of creating Photoshop-compatible image processing filter plug-ins; FilterMeister runs as a Photoshop plug-in itself and it can load and execute scripts or compile and export them as independent plug-ins. Although the FilterMeister language reproduces a significant portion of the C language and function library, it contains only those features which can be used within the context of Photoshop plug-ins and adds a number of specific features only useful in this specific domain.

MediaWiki templates

The Template feature of MediaWiki is an embedded domain-specific language whose fundamental purpose is to support the creation of page templates and the transclusion (inclusion by reference) of MediaWiki pages into other MediaWiki pages.

Software engineering uses

There has been much interest in domain-specific languages to improve the productivity and quality of software engineering. Domain-specific languages could provide a robust set of tools for efficient software engineering. Such tools are beginning to make their way into the development of critical software systems.

The Software Cost Reduction Toolkit[7] is an example of this. The toolkit is a suite of utilities including a specification editor to create a requirements specification, a dependency graph browser to display variable dependencies, a consistency checker to catch missing cases in well-formed formulas in the specification, a model checker and a theorem prover to check program properties against the specification, and an invariant generator that automatically constructs invariants based on the requirements.

A newer development is language-oriented programming, an integrated software engineering methodology based mainly on creating, optimizing, and using domain-specific languages.

Metacompilers

Complementing language-oriented programming, and all other forms of domain-specific languages, is the class of compiler-writing tools called metacompilers. A metacompiler is useful for generating parsers and code generators for domain-specific languages, and it is itself a compiler for a domain-specific metalanguage designed for the domain of metaprogramming.

Besides parsing domain-specific languages, metacompilers are useful for generating a wide range of software engineering and analysis tools. The meta-compiler methodology is often found in program transformation systems.

Metacompilers that played a significant role in both computer science and the computer industry include META II, and its descendant TreeMeta.

Unreal Engine before version 4 and other games

Unreal and Unreal Tournament unveiled a language named UnrealScript. This allowed rapid development of modifications relative to the competitor Quake (using the Id Tech 2 engine). The Id Tech engine used standard C code meaning C had to be learned and properly applied, while UnrealScript was optimized for ease of use and efficiency. Similarly, more recent games have introduced their own specific languages for development. One more common example is Lua for scripting.

Rules engines for policy automation

Various business rules engines have been developed for automating policy and business rules used in both government and private industry. ILOG, Oracle Policy Automation, DTRules, Drools and others provide support for DSLs aimed to support various problem domains. DTRules goes so far as to define an interface for the use of multiple DSLs within a rule set.

The purpose of business rules engines is to define a representation of business logic in as human-readable a fashion as possible. This allows both subject-matter experts and developers to work with and understand the same representation of the business logic. Most rules engines provide both an approach to simplifying the control structures for business logic (for example, using declarative rules or decision tables) and alternatives to programming syntax in the form of DSLs.
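
The decision-table approach mentioned above can be sketched in a few lines. This is a minimal illustration in Python, assuming a first-match-wins table; the rule contents, field names, and the decide function are hypothetical and do not correspond to the format of ILOG, Drools, DTRules, or any other product.

```python
# A decision table as data: each row pairs a condition with an outcome.
# Rows are checked top to bottom and the first matching row wins.
RULES = [
    (lambda a: a["age"] < 18,       "reject: applicant is a minor"),
    (lambda a: a["income"] < 20000, "refer: manual review"),
    (lambda a: True,                "accept"),  # default row
]

def decide(applicant, rules=RULES):
    """Return the outcome of the first rule whose condition holds."""
    for condition, outcome in rules:
        if condition(applicant):
            return outcome
    raise ValueError("no rule matched")

decision = decide({"age": 40, "income": 15000})
```

Keeping the rules in a table separate from the evaluation loop is what lets a subject-matter expert review or change the policy without touching control flow.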

Statistical modelling languages

Statistical modelers have developed domain-specific languages such as R (an implementation of the S language), Bugs, Jags, and Stan. These languages provide a syntax for describing a Bayesian model and generate a method for solving it using simulation.

Generating models and services for multiple programming languages

Object handling and services can be generated from a definition written in an interface description language, a kind of domain-specific language, for several target languages, such as JavaScript for web applications, HTML for documentation, C++ for high-performance code, etc. This is done by cross-language frameworks such as Apache Thrift or Google Protocol Buffers.
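
The idea of generating code for several languages from one interface description can be sketched as follows. This is a toy illustration in Python: the SCHEMA dictionary, the type names, and both generator functions are hypothetical, and the format is not that of Apache Thrift or Protocol Buffers.

```python
# A hypothetical, minimal interface description: one record with typed fields.
SCHEMA = {"name": "User", "fields": [("id", "int"), ("email", "string")]}

# How each abstract type is spelled in each target language (assumed mapping).
TYPE_MAP = {
    "python":     {"int": "int",    "string": "str"},
    "typescript": {"int": "number", "string": "string"},
}

def to_python(schema):
    """Emit Python dataclass source text for the schema."""
    lines = ["@dataclass", f"class {schema['name']}:"]
    lines += [f"    {name}: {TYPE_MAP['python'][kind]}"
              for name, kind in schema["fields"]]
    return "\n".join(lines)

def to_typescript(schema):
    """Emit TypeScript interface source text for the same schema."""
    lines = [f"interface {schema['name']} {{"]
    lines += [f"  {name}: {TYPE_MAP['typescript'][kind]};"
              for name, kind in schema["fields"]]
    lines.append("}")
    return "\n".join(lines)

python_source = to_python(SCHEMA)
typescript_source = to_typescript(SCHEMA)
```

A single description drives both outputs, so the two generated types cannot drift apart.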

Gherkin

Gherkin is a language designed to define test cases to check the behavior of software, without specifying how that behavior is implemented. It is meant to be read and used by non-technical users using a natural language syntax and a line-oriented design. The tests defined with Gherkin must then be implemented in a general programming language. The steps in a Gherkin program then act as a syntax for method invocation accessible to non-developers.
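
How Gherkin steps can act as a syntax for method invocation is sketched below. This is a minimal illustration in Python, assuming a hand-written registry that maps step text to functions with regular expressions; the step decorator, the run function, and the scenario are hypothetical and are not the API of Cucumber, behave, or any other real test runner.

```python
import re

STEPS = []  # registry of (compiled pattern, implementing function)

def step(pattern):
    """Register a function as the implementation of steps matching pattern."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register

@step(r"the balance is (\d+)")
def set_balance(ctx, amount):
    ctx["balance"] = int(amount)

@step(r"I withdraw (\d+)")
def withdraw(ctx, amount):
    ctx["balance"] -= int(amount)

@step(r"the balance should be (\d+)")
def check_balance(ctx, amount):
    assert ctx["balance"] == int(amount)

# A hypothetical scenario using the Given/When/Then keywords.
SCENARIO = """
Given the balance is 100
When I withdraw 30
Then the balance should be 70
"""

def run(scenario):
    """Execute each line by dispatching its text to the matching step function."""
    ctx = {}
    for line in scenario.strip().splitlines():
        _keyword, _, text = line.strip().partition(" ")
        for pattern, func in STEPS:
            match = pattern.fullmatch(text)
            if match:
                func(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step matches: {text}")
    return ctx

result = run(SCENARIO)
```

The scenario text stays readable to non-developers, while each line resolves to an ordinary function call with the captured values as arguments.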

Advantages and disadvantages

Some of the advantages:

  • Domain-specific languages allow solutions to be expressed in the idiom and at the level of abstraction of the problem domain. The idea is that domain experts themselves may understand, validate, modify, and often even develop domain-specific language programs. However, this is seldom the case.
  • Domain-specific languages allow validation at the domain level. As long as the language constructs are safe any sentence written with them can be considered safe.
  • Domain-specific languages can help to shift the development of business information systems from traditional software developers to the typically larger group of domain-experts who (despite having less technical expertise) have a deeper knowledge of the domain.
  • Domain-specific languages are easier to learn, given their limited scope.

Some of the disadvantages:

  • Cost of learning a new language
  • Limited applicability
  • Cost of designing, implementing, and maintaining a language and the tools needed to develop with it, such as an integrated development environment (IDE)
  • Finding, setting, and maintaining proper scope.
  • Difficulty of balancing trade-offs between domain-specificity and general-purpose programming language constructs
  • Potential loss of processor efficiency compared with hand-coded software
  • Proliferation of similar non-standard domain-specific languages, for example, a DSL used in one insurance company versus a DSL used in another insurance company.
  • Non-technical domain experts can find it hard to write or modify DSL programs by themselves
  • Increased difficulty of integrating the DSL with other components of the IT system, relative to integrating with a general-purpose language
  • Low supply of experts in a specific DSL tends to raise labor costs
  • Harder to find code examples

Tools for designing domain-specific languages

  • JetBrains MPS is a tool for designing domain-specific languages. It uses projectional editing which allows overcoming the limits of language parsers and building DSL editors, such as ones with tables and diagrams. It implements language-oriented programming. MPS combines an environment for language definition, a language workbench, and an integrated development environment (IDE) for such languages.
  • MontiCore is a language workbench for the efficient development of domain-specific languages. It processes an extended grammar format that defines the DSL and generates Java components for processing the DSL documents.
  • Xtext is an open-source software framework for developing programming languages and domain-specific languages (DSLs). Unlike standard parser generators, Xtext generates not only a parser but also a class model for the abstract syntax tree. In addition, it provides a fully featured, customizable Eclipse-based IDE. The project was archived in April 2023.
  • Racket is a cross-platform language toolchain including native code, JIT and JavaScript compiler, IDE (in addition to supporting Emacs, Vim, VSCode and others) and command line tools designed to accommodate creating both domain-specific and general purpose languages.

Programming paradigm

A programming paradigm is a relatively high-level way to conceptualize and structure the implementation of a computer program. A programming language can be classified as supporting one or many paradigms.

Paradigms are separated along and described by different dimensions of programming. Some paradigms are about implications of the execution model, such as allowing side effects, or whether the sequence of operations is defined by the execution model. Other paradigms are about the way code is organized, such as grouping into units that include both state and behavior. Yet others are about syntax and grammar.

Some common programming paradigms include (shown in hierarchical relationship):

  • Imperative – code directly controls execution flow and state change, explicit statements that change a program state
    • procedural – organized as procedures that call each other
    • object-oriented – organized as objects that contain both data structure and associated behavior, uses data structures consisting of data fields and methods together with their interactions (objects) to design programs
      • Class-based – object-oriented programming in which abstract data types and inheritance are achieved by defining classes of objects, versus the objects themselves
      • Object-based – paradigm in which the object has a construct to encapsulate state and behavior, but without inheritance or subtyping
      • Prototype-based – object-oriented programming that avoids classes and implements inheritance via cloning of instances
      • Data, Context, and Interaction – paradigm that emphasizes mental models and run-time behavior of networks of objects, whose responsibilities are granted dynamically based on roles they play in interactions with other objects
  • Declarative – code declares properties of the desired result, but not how to compute it, describes what computation should perform, without specifying detailed state changes
    • functional – a desired result is declared as the value of a series of function evaluations, uses evaluation of mathematical functions and avoids state and mutable data
    • logic – a desired result is declared as the answer to a question about a system of facts and rules, uses explicit mathematical logic for programming
    • reactive – a desired result is declared with data streams and the propagation of change
  • Concurrent programming – has language constructs for concurrency, these may involve multi-threading, support for distributed computing, message passing, shared resources (including shared memory), or futures
    • Actor programming – concurrent computation with actors that make local decisions in response to the environment (capable of selfish or competitive behaviour)
  • Constraint programming – relations between variables are expressed as constraints (or constraint networks), directing allowable solutions (uses constraint satisfaction or simplex algorithm)
  • Dataflow programming – forced recalculation of formulas when data values change (e.g. spreadsheets)
  • Distributed programming – has support for multiple autonomous computers that communicate via computer networks
  • Generic programming – uses algorithms written in terms of to-be-specified-later types that are then instantiated as needed for specific types provided as parameters
  • Metaprogramming – writing programs that write or manipulate other programs (or themselves) as their data, or that do part of the work at compile time that would otherwise be done at runtime
    • Template metaprogramming – metaprogramming methods in which a compiler uses templates to generate temporary source code, which is merged by the compiler with the rest of the source code and then compiled
    • Reflective programming – metaprogramming methods in which a program modifies or extends itself
  • Pipeline programming – a simple syntax change that adds a way to express nested function calls to a language originally designed with none
  • Rule-based programming – a network of rules of thumb that comprise a knowledge base and can be used for expert systems and problem deduction & resolution
  • Visual programming – manipulating program elements graphically rather than by specifying them textually (e.g. Simulink); also termed diagrammatic programming
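
Two of the entries above, metaprogramming and reflective programming, can be illustrated together. The following is a minimal sketch in Python; the Point class and the add_getters helper are hypothetical names chosen for illustration.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def add_getters(cls, field_names):
    """Metaprogramming: attach a get_<field> method to cls for each field."""
    for name in field_names:
        # Bind name as a default argument so each getter keeps its own field.
        def getter(self, _name=name):
            return getattr(self, _name)
        setattr(cls, f"get_{name}", getter)
    return cls

# The program extends one of its own classes at run time.
add_getters(Point, ["x", "y"])
p = Point(3, 4)

# Reflection: discover the generated methods by inspecting the class itself.
generated = sorted(n for n in vars(Point) if n.startswith("get_"))
```

Here the methods get_x and get_y never appear in the source text of Point; they are written by other code and found again by inspection.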

Overview

Overview of the various programming paradigms according to Peter Van Roy

Programming paradigms come from computer science research into existing practices of software development. The findings allow for describing and comparing programming practices and the languages used to code programs. For perspective, other fields of research study software engineering processes and describe various methodologies to describe and compare them.

A programming language can be described in terms of paradigms. Some languages support only one paradigm. For example, Smalltalk supports object-oriented and Haskell supports functional. Most languages support multiple paradigms. For example, a program written in C++, Object Pascal, or PHP can be purely procedural, purely object-oriented, or can contain aspects of both paradigms, or others.

When using a language that supports multiple paradigms, the developer chooses which paradigm elements to use. But, this choice may not involve considering paradigms per se. The developer often uses the features of a language as the language provides them and to the extent that the developer knows them. Categorizing the resulting code by paradigm is often an academic activity done in retrospect.

Languages categorized as imperative paradigm have two main features: they state the order in which operations occur, with constructs that explicitly control that order, and they allow side effects, in which state can be modified at one point in time, within one unit of code, and then later read at a different point in time inside a different unit of code. The communication between the units of code is not explicit.

In contrast, languages in the declarative paradigm do not state the order in which to execute operations. Instead, they supply a number of available operations in the system, along with the conditions under which each is allowed to execute. The implementation of the language's execution model tracks which operations are free to execute and chooses the order independently. More at Comparison of multi-paradigm programming languages.
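
The contrast between the two styles can be made concrete with one task solved twice. The following is a minimal sketch in Python, assuming the standard sqlite3 module with an in-memory database as the declarative engine; the orders data is made up for illustration.

```python
import sqlite3

# Hypothetical data: (customer, amount) pairs.
orders = [("alice", 30), ("bob", 5), ("alice", 12)]

# Imperative: the code states the order of operations and mutates state.
totals = {}
for customer, amount in orders:
    totals[customer] = totals.get(customer, 0) + amount

# Declarative: the query states the desired result; the database engine
# chooses how and in what order to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()
declared = dict(rows)
conn.close()
```

Both versions produce the same per-customer totals, but only the first one spells out the loop and the intermediate state.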

In object-oriented programming, code is organized into objects that contain state that is owned by and (usually) controlled by the code of the object. Most object-oriented languages are also imperative languages.

In object-oriented programming, programs are treated as a set of interacting objects. In functional programming, programs are treated as a sequence of stateless function evaluations. When programming computers or systems with many processors, in process-oriented programming, programs are treated as sets of concurrent processes that act on logical shared data structures.
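
The first two views can be compared on a single small task. The following is a minimal sketch in Python; the Account class and the apply_deposit function are hypothetical names used only for illustration.

```python
from functools import reduce

deposits = [10, 25, 7]

# Object-oriented view: an object owns its state and changes it over time.
class Account:
    def __init__(self, balance=0):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount  # mutates the object's own state

account = Account()
for amount in deposits:
    account.deposit(amount)

# Functional view: a stateless function maps an old value to a new value,
# and the result is a chain of evaluations with no mutation.
def apply_deposit(balance, amount):
    return balance + amount

final_balance = reduce(apply_deposit, deposits, 0)
```

Both approaches arrive at the same balance; they differ in whether the intermediate state lives inside an object or is passed from one function evaluation to the next.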

Many programming paradigms are as well known for the techniques they forbid as for those they support. For instance, pure functional programming disallows side-effects, while structured programming disallows the goto construct. Partly for this reason, new paradigms are often regarded as doctrinaire or overly rigid by those accustomed to older ones. Yet, avoiding certain techniques can make it easier to understand program behavior, and to prove theorems about program correctness.

Programming paradigms can also be compared with programming models, which allow invoking an execution model by using only an API. Programming models can also be classified into paradigms based on features of the execution model.

For parallel computing, using a programming model instead of a language is common. The reason is that details of the parallel hardware leak into the abstractions used to program the hardware. This causes the programmer to have to map patterns in the algorithm onto patterns in the execution model (which have been inserted due to leakage of hardware into the abstraction). As a consequence, no one parallel programming language maps well to all computation problems. Thus, it is more convenient to use a base sequential language and insert API calls to parallel execution models via a programming model. Such parallel programming models can be classified according to abstractions that reflect the hardware, such as shared memory, distributed memory with message passing, notions of place visible in the code, and so forth. These can be considered flavors of programming paradigm that apply to only parallel languages and programming models.
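
Using a sequential base language with API calls into a parallel execution model can look like the following. This is a minimal sketch in Python, assuming the standard concurrent.futures module as the programming model; the square function and the worker count are arbitrary choices for illustration. In standard CPython builds, threads running pure Python code may not execute in parallel, so the example shows the shape of the API rather than a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """An ordinary sequential function; it knows nothing about threads."""
    return n * n

# The parallel execution model is reached only through API calls:
# the executor decides which worker runs which item and when.
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order, whatever the execution order was.
    results = list(pool.map(square, range(8)))
```

The language itself stays sequential; swapping in a different executor changes the execution model without changing square.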

Criticism

Some programming language researchers criticise the notion of paradigms as a classification of programming languages, e.g. Harper, and Krishnamurthi. They argue that many programming languages cannot be strictly classified into one paradigm, but rather include features from several paradigms. See Comparison of multi-paradigm programming languages.

History

Different approaches to programming have developed over time. Classification of each approach was sometimes described at the time the approach was first developed, but often not until some time later, retrospectively. An early approach consciously identified as such is structured programming, advocated since the mid-1960s. The concept of a programming paradigm as such dates at least to 1978, in the Turing Award lecture of Robert W. Floyd, entitled The Paradigms of Programming, which cites the notion of paradigm as used by Thomas Kuhn in his The Structure of Scientific Revolutions (1962). Early programming languages did not have clearly defined programming paradigms and sometimes programs made extensive use of goto statements, liberal use of which led to spaghetti code which is difficult to understand and maintain. This led to the development of structured programming paradigms that disallowed the use of goto statements, only allowing the use of more structured programming constructs.

Languages and paradigms

Machine code

Machine code is the lowest level of computer programming, as it consists of machine instructions that define behavior at the lowest level of abstraction possible for a computer. As it is the most prescriptive way to code, it is classified as imperative.

It is sometimes called the first-generation programming language.

Assembly

Assembly language introduced mnemonics for machine instructions and memory addresses. Assembly is classified as imperative and is sometimes called the second-generation programming language.

In the 1960s, assembly languages were developed to support library COPY and quite sophisticated conditional macro generation and preprocessing abilities, CALL to subroutine, external variables and common sections (globals), enabling significant code re-use and isolation from hardware specifics via the use of logical operators such as READ/WRITE/GET/PUT. Assembly was, and still is, used for time-critical systems and often in embedded systems as it gives the most control of what the machine does.

Procedural languages

Procedural languages, also called third-generation programming languages, are the first described as high-level languages. They support vocabulary related to the problem being solved. For example,

  • COmmon Business Oriented Language (COBOL) – uses terms like file, move and copy.
  • FORmula TRANslation (FORTRAN) – using mathematical language terminology, it was developed mainly for scientific and engineering problems.
  • ALGOrithmic Language (ALGOL) – focused on being an appropriate language to define algorithms, while using mathematical language terminology, targeting scientific and engineering problems, just like FORTRAN.
  • Programming Language One (PL/I) – a hybrid commercial-scientific general purpose language supporting pointers.
  • Beginner's All-purpose Symbolic Instruction Code (BASIC) – developed to enable more people to write programs.
  • C – a general-purpose programming language, initially developed by Dennis Ritchie between 1969 and 1973 at AT&T Bell Labs.

These languages belong to the procedural paradigm. They directly control the step-by-step process that a computer program follows. The efficacy and efficiency of such a program is therefore highly dependent on the programmer's skill.
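As a minimal sketch of this step-by-step style (written in Python purely for illustration; the function name `average` is hypothetical and does not come from any of the languages listed above), the programmer spells out the loop, the running total, and the handling of the empty case:

```python
def average(values):
    """Compute the arithmetic mean by explicitly controlling each step."""
    total = 0.0
    count = 0
    for v in values:   # visit the input one element at a time
        total += v     # update the running total
        count += 1     # update the element count
    if count == 0:     # the empty case must be handled by hand
        return 0.0
    return total / count
```

Every decision about order of evaluation and intermediate state is made by the programmer, which is why the quality of the result depends so heavily on the programmer's skill.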

Object-oriented programming

In an attempt to improve on procedural languages, object-oriented programming (OOP) languages were created, such as Simula, Smalltalk, C++, Eiffel, Python, PHP, Java, and C#. In these languages, data and the methods that manipulate the data are kept in the same code unit, called an object. This encapsulation ensures that the only way other code can access the data is through the methods of the object that contains it. Thus, an object's inner workings may be changed without affecting code that uses the object.
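Encapsulation can be sketched as follows. The class name `Counter` and its methods are hypothetical, chosen only for illustration. Note that Python marks the field as private by naming convention (a leading underscore) rather than enforcing it, whereas languages such as Java or C++ can enforce access restrictions at compile time.

```python
class Counter:
    """A counter whose internal state is reached only through its methods."""

    def __init__(self):
        # Leading underscore: private by convention, not enforced by Python.
        self._count = 0

    def increment(self):
        """Advance the counter by one."""
        self._count += 1

    def value(self):
        """Return the current count without exposing how it is stored."""
        return self._count


counter = Counter()
counter.increment()
counter.increment()
current = counter.value()
```

Because callers use only `increment` and `value`, the storage behind `_count` could later be changed (for example, to a list of events) without affecting code that uses the object.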

Alexander Stepanov, Richard Stallman and other programmers have raised controversy concerning the efficacy of the OOP paradigm versus the procedural paradigm. The need for every object to have associated methods leads some skeptics to associate OOP with software bloat; an attempt to resolve this dilemma came through polymorphism.

Although most OOP languages are third-generation, it is possible to create an object-oriented assembler language. High Level Assembly (HLA) is an example of this that fully supports advanced data types and object-oriented assembly language programming – despite its early origins. Thus, differing programming paradigms can be seen more as motivational memes of their advocates than as necessarily representing progress from one level to the next. Precise comparisons of competing paradigms' efficacy are frequently made more difficult by new and differing terminology applied to similar entities and processes, together with numerous implementation distinctions across languages.

Declarative languages

A declarative program describes what the problem is, not how to solve it. The program is structured as a set of properties to find in the expected result, not as a procedure to follow. Given a database or a set of rules, the computer tries to find a solution matching all the desired properties. Archetypes of declarative languages are the fourth-generation language SQL and the families of functional and logic programming languages.
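The contrast between "what" and "how" can be sketched with an SQL query run through Python's standard `sqlite3` module. The table name `language` and its rows are made up for illustration, and the years are approximate.

```python
import sqlite3

# An in-memory database with a small, illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE language (name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO language VALUES (?, ?)",
    [("FORTRAN", 1957), ("COBOL", 1959), ("C", 1972)],
)

# The query states the properties of the desired result (which rows,
# in which order); it does not say how to scan or sort the table.
rows = conn.execute(
    "SELECT name FROM language WHERE year < 1960 ORDER BY name"
).fetchall()
names = [name for (name,) in rows]
conn.close()
```

The database engine, not the programmer, chooses the procedure used to find the matching rows.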

Functional programming is a subset of declarative programming. Programs written using this paradigm use functions, blocks of code intended to behave like mathematical functions. Functional languages discourage changes in the value of variables through assignment, making a great deal of use of recursion instead.
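A minimal sketch of this style, in Python for illustration only (the function name `total` is hypothetical): the sum of a sequence is defined by recursion, and no variable is ever reassigned.

```python
def total(values):
    """Sum a tuple recursively, without reassigning any variable."""
    if not values:          # base case: the empty tuple sums to zero
        return 0
    # recursive case: first element plus the sum of the rest
    return values[0] + total(values[1:])
```

Like a mathematical function, `total` returns the same result for the same argument and has no side effects. Languages designed for functional programming typically optimize such recursion far better than Python does.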

The logic programming paradigm views computation as automated reasoning over a body of knowledge. Facts about the problem domain are expressed as logic formulas, and programs are executed by applying inference rules over them until an answer to the problem is found, or the set of formulas is proved inconsistent.
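The idea of applying inference rules to facts until nothing new follows can be sketched with a very small forward-chaining loop. This is a simplification under stated assumptions: facts are plain strings, rules have no variables, and the names `forward_chain`, `facts`, and `rules` are hypothetical. Real logic languages such as Prolog work differently, using backward chaining with unification.

```python
def forward_chain(facts, rules):
    """Apply rules (premises -> conclusion) until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


facts = {"socrates_is_human"}
rules = [
    (("socrates_is_human",), "socrates_is_mortal"),
    (("socrates_is_mortal", "socrates_is_greek"), "socrates_is_mortal_greek"),
]
derived = forward_chain(facts, rules)
```

Here the first rule fires and adds `socrates_is_mortal`; the second never fires because `socrates_is_greek` is not among the known facts.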

Other paradigms

Symbolic programming is a paradigm that describes programs able to manipulate formulas and program components as data.[4] Programs can thus effectively modify themselves, and appear to "learn", making them suited for applications such as artificial intelligence, expert systems, natural-language processing and computer games. Languages that support this paradigm include Lisp and Prolog.
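Treating formulas as data can be sketched by representing an expression as nested tuples and writing a function that builds its derivative as a new expression. The representation and the names `diff` and `evaluate` are assumptions made for this illustration, which is written in Python rather than Lisp or Prolog, and only `+` and `*` are supported.

```python
def diff(expr, var):
    """Differentiate a formula held as nested tuples, returning a new formula."""
    if isinstance(expr, (int, float)):   # constant
        return 0
    if isinstance(expr, str):            # variable name
        return 1 if expr == var else 0
    op, left, right = expr
    if op == "+":                        # sum rule
        return ("+", diff(left, var), diff(right, var))
    if op == "*":                        # product rule
        return ("+", ("*", diff(left, var), right),
                     ("*", left, diff(right, var)))
    raise ValueError(f"unsupported operator: {op!r}")


def evaluate(expr, env):
    """Compute the numeric value of a formula given variable bindings."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    a, b = evaluate(left, env), evaluate(right, env)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    raise ValueError(f"unsupported operator: {op!r}")


formula = ("+", ("*", "x", "x"), 3)      # x*x + 3
derivative = diff(formula, "x")          # a new formula, not a number
slope_at_5 = evaluate(derivative, {"x": 5})
```

The result of `diff` is itself a formula that can be inspected, transformed again, or evaluated, which is the sense in which such programs manipulate program components as data.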

Differentiable programming structures programs so that they can be differentiated throughout, usually via automatic differentiation.
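One way automatic differentiation can work is forward mode with dual numbers, sketched below. This is an illustrative toy, not how any particular framework is implemented; the names `Dual`, `derivative`, and `f` are hypothetical, and only addition and multiplication are supported.

```python
class Dual:
    """A value paired with its derivative (forward-mode automatic differentiation)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f at x while carrying the derivative along with the value."""
    return f(Dual(x, 1.0)).deriv


def f(x):
    return 3 * x * x + 2 * x + 1


slope = derivative(f, 2.0)   # f'(x) = 6x + 2, so f'(2) = 14
```

The function `f` is ordinary code; it becomes differentiable simply because every operation it uses knows how to propagate a derivative.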

Literate programming, as a form of imperative programming, structures programs as a human-centered web, as in a hypertext essay: documentation is integral to the program, and the program is structured following the logic of prose exposition, rather than compiler convenience.

Symbolic programming techniques such as reflective programming (reflection), which allow a program to refer to itself, might also be considered as a programming paradigm. However, this is compatible with the major paradigms and thus is not a real paradigm in its own right.
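As a brief sketch of reflection, the following Python code inspects an object at run time to discover its methods and then calls one by name. The class `Greeter` and the helper functions are hypothetical; `dir`, `getattr`, and `callable` are Python built-ins.

```python
class Greeter:
    def hello(self):
        return "hello"

    def goodbye(self):
        return "goodbye"


def public_methods(obj):
    """List the names of an object's public callable attributes, found at run time."""
    return sorted(
        name for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name))
    )


def call_by_name(obj, name):
    """Look up a method from its name (a string) and invoke it."""
    return getattr(obj, name)()


greeter = Greeter()
methods = public_methods(greeter)
greeting = call_by_name(greeter, "hello")
```

The same technique works equally well in procedural or object-oriented code, which illustrates why reflection is compatible with the major paradigms.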
