Search This Blog

Friday, November 17, 2023

Memory leak

From Wikipedia, the free encyclopedia
 
In computer science, a memory leak is a type of resource leak that occurs when a computer program incorrectly manages memory allocations in a way that memory which is no longer needed is not released. A memory leak may also happen when an object is stored in memory but cannot be accessed by the running code (i.e. unreachable memory). A memory leak has symptoms similar to a number of other problems and generally can only be diagnosed by a programmer with access to the program's source code.

A related concept is the "space leak", which is when a program consumes excessive memory but does eventually release it.

Because they can exhaust available system memory as an application runs, memory leaks are often the cause of or a contributing factor to software aging.

Consequences

A memory leak reduces the performance of the computer by reducing the amount of available memory. Eventually, in the worst case, too much of the available memory may become allocated and all or part of the system or device stops working correctly, the application fails, or the system slows down vastly due to thrashing.

Memory leaks may not be serious or even detectable by normal means. In modern operating systems, normal memory used by an application is released when the application terminates. This means that a memory leak in a program that only runs for a short time may not be noticed and is rarely serious.

Much more serious leaks include those:

  • where a program runs for a long time and consumes added memory over time, such as background tasks on servers, and especially in embedded systems which may be left running for many years
  • where new memory is allocated frequently for one-time tasks, such as when rendering the frames of a computer game or animated video
  • where a program can request memory, such as shared memory, that is not released, even when the program terminates
  • where memory is very limited, such as in an embedded system or portable device, or where the program requires a very large amount of memory to begin with, leaving little margin for leaks
  • where a leak occurs within the operating system or memory manager
  • when a system device driver causes a leak
  • running on an operating system that does not automatically release memory on program termination.

An example of memory leak

The following example, written in pseudocode, is intended to show how a memory leak can come about, and its effects, without needing any programming knowledge. The program in this case is part of some very simple software designed to control an elevator. This part of the program is run whenever anyone inside the elevator presses the button for a floor.

When a button is pressed:
  Get some memory, which will be used to remember the floor number
  Put the floor number into the memory
  Are we already on the target floor?
    If so, we have nothing to do: finished
    Otherwise:
      Wait until the lift is idle
      Go to the required floor
      Release the memory we used to remember the floor number

The memory leak would occur if the floor number requested is the same floor that the elevator is on; the condition for releasing the memory would be skipped. Each time this case occurs, more memory is leaked.

Cases like this would not usually have any immediate effects. People do not often press the button for the floor they are already on, and in any case, the elevator might have enough spare memory that this could happen hundreds or thousands of times. However, the elevator will eventually run out of memory. This could take months or years, so it might not be discovered despite thorough testing.

The consequences would be unpleasant; at the very least, the elevator would stop responding to requests to move to another floor (such as when an attempt is made to call the elevator or when someone is inside and presses the floor buttons). If other parts of the program need memory (a part assigned to open and close the door, for example), then no one would be able to enter, and if someone happens to be inside, they will become trapped (assuming the doors cannot be opened manually).

The memory leak lasts until the system is reset. For example: if the elevator's power were turned off or in a power outage, the program would stop running. When power was turned on again, the program would restart and all the memory would be available again, but the slow process of memory leak would restart together with the program, eventually prejudicing the correct running of the system.

The leak in the above example can be corrected by bringing the 'release' operation outside of the conditional:

When a button is pressed:
  Get some memory, which will be used to remember the floor number
  Put the floor number into the memory
  Are we already on the target floor?
    If not:
      Wait until the lift is idle
      Go to the required floor
  Release the memory we used to remember the floor number

Programming issues

Memory leaks are a common error in programming, especially when using languages that have no built in automatic garbage collection, such as C and C++. Typically, a memory leak occurs because dynamically allocated memory has become unreachable. The prevalence of memory leak bugs has led to the development of a number of debugging tools to detect unreachable memory. BoundsChecker, Deleaker, Memory Validator, IBM Rational Purify, Valgrind, Parasoft Insure++, Dr. Memory and memwatch are some of the more popular memory debuggers for C and C++ programs. "Conservative" garbage collection capabilities can be added to any programming language that lacks it as a built-in feature, and libraries for doing this are available for C and C++ programs. A conservative collector finds and reclaims most, but not all, unreachable memory.

Although the memory manager can recover unreachable memory, it cannot free memory that is still reachable and therefore potentially still useful. Modern memory managers therefore provide techniques for programmers to semantically mark memory with varying levels of usefulness, which correspond to varying levels of reachability. The memory manager does not free an object that is strongly reachable. An object is strongly reachable if it is reachable either directly by a strong reference or indirectly by a chain of strong references. (A strong reference is a reference that, unlike a weak reference, prevents an object from being garbage collected.) To prevent this, the developer is responsible for cleaning up references after use, typically by setting the reference to null once it is no longer needed and, if necessary, by deregistering any event listeners that maintain strong references to the object.

In general, automatic memory management is more robust and convenient for developers, as they do not need to implement freeing routines or worry about the sequence in which cleanup is performed or be concerned about whether or not an object is still referenced. It is easier for a programmer to know when a reference is no longer needed than to know when an object is no longer referenced. However, automatic memory management can impose a performance overhead, and it does not eliminate all of the programming errors that cause memory leaks.

RAII

Resource acquisition is initialization (RAII) is an approach to the problem commonly taken in C++, D, and Ada. It involves associating scoped objects with the acquired resources, and automatically releasing the resources once the objects are out of scope. Unlike garbage collection, RAII has the advantage of knowing when objects exist and when they do not. Compare the following C and C++ examples:

/* C version */
#include <stdlib.h>

void f(int n)
{
  int* array = calloc(n, sizeof(int));
  do_some_work(array);
  free(array);
}
// C++ version
#include <vector>

void f(int n)
{
  std::vector<int> array (n);
  do_some_work(array);
}

The C version, as implemented in the example, requires explicit deallocation; the array is dynamically allocated (from the heap in most C implementations), and continues to exist until explicitly freed.

The C++ version requires no explicit deallocation; it will always occur automatically as soon as the object array goes out of scope, including if an exception is thrown. This avoids some of the overhead of garbage collection schemes. And because object destructors can free resources other than memory, RAII helps to prevent the leaking of input and output resources accessed through a handle, which mark-and-sweep garbage collection does not handle gracefully. These include open files, open windows, user notifications, objects in a graphics drawing library, thread synchronisation primitives such as critical sections, network connections, and connections to the Windows Registry or another database.

However, using RAII correctly is not always easy and has its own pitfalls. For instance, if one is not careful, it is possible to create dangling pointers (or references) by returning data by reference, only to have that data be deleted when its containing object goes out of scope.

D uses a combination of RAII and garbage collection, employing automatic destruction when it is clear that an object cannot be accessed outside its original scope, and garbage collection otherwise.

Reference counting and cyclic references

More modern garbage collection schemes are often based on a notion of reachability – if you do not have a usable reference to the memory in question, it can be collected. Other garbage collection schemes can be based on reference counting, where an object is responsible for keeping track of how many references are pointing to it. If the number goes down to zero, the object is expected to release itself and allow its memory to be reclaimed. The flaw with this model is that it does not cope with cyclic references, and this is why nowadays most programmers are prepared to accept the burden of the more costly mark and sweep type of systems.

The following Visual Basic code illustrates the canonical reference-counting memory leak:

Dim A, B
Set A = CreateObject("Some.Thing")
Set B = CreateObject("Some.Thing")
' At this point, the two objects each have one reference,

Set A.member = B
Set B.member = A
' Now they each have two references.

Set A = Nothing   ' You could still get out of it...

Set B = Nothing   ' And now you've got a memory leak!

End

In practice, this trivial example would be spotted straight away and fixed. In most real examples, the cycle of references spans more than two objects, and is more difficult to detect.

A well-known example of this kind of leak came to prominence with the rise of AJAX programming techniques in web browsers in the lapsed listener problem. JavaScript code which associated a DOM element with an event handler, and failed to remove the reference before exiting, would leak memory (AJAX web pages keep a given DOM alive for a lot longer than traditional web pages, so this leak was much more apparent).

Effects

If a program has a memory leak and its memory usage is steadily increasing, there will not usually be an immediate symptom. Every physical system has a finite amount of memory, and if the memory leak is not contained (for example, by restarting the leaking program) it will eventually cause problems.

Most modern consumer desktop operating systems have both main memory which is physically housed in RAM microchips, and secondary storage such as a hard drive. Memory allocation is dynamic – each process gets as much memory as it requests. Active pages are transferred into main memory for fast access; inactive pages are pushed out to secondary storage to make room, as needed. When a single process starts consuming a large amount of memory, it usually occupies more and more of main memory, pushing other programs out to secondary storage – usually significantly slowing performance of the system. Even if the leaking program is terminated, it may take some time for other programs to swap back into main memory, and for performance to return to normal.

When all the memory on a system is exhausted (whether there is virtual memory or only main memory, such as on an embedded system) any attempt to allocate more memory will fail. This usually causes the program attempting to allocate the memory to terminate itself, or to generate a segmentation fault. Some programs are designed to recover from this situation (possibly by falling back on pre-reserved memory). The first program to experience the out-of-memory may or may not be the program that has the memory leak.

Some multi-tasking operating systems have special mechanisms to deal with an out-of-memory condition, such as killing processes at random (which may affect "innocent" processes), or killing the largest process in memory (which presumably is the one causing the problem). Some operating systems have a per-process memory limit, to prevent any one program from hogging all of the memory on the system. The disadvantage to this arrangement is that the operating system sometimes must be re-configured to allow proper operation of programs that legitimately require large amounts of memory, such as those dealing with graphics, video, or scientific calculations.

The "sawtooth" pattern of memory utilization: the sudden drop in used memory is a candidate symptom for a memory leak.

If the memory leak is in the kernel, the operating system itself will likely fail. Computers without sophisticated memory management, such as embedded systems, may also completely fail from a persistent memory leak.

Publicly accessible systems such as web servers or routers are prone to denial-of-service attacks if an attacker discovers a sequence of operations which can trigger a leak. Such a sequence is known as an exploit.

A "sawtooth" pattern of memory utilization may be an indicator of a memory leak within an application, particularly if the vertical drops coincide with reboots or restarts of that application. Care should be taken though because garbage collection points could also cause such a pattern and would show a healthy usage of the heap.

Other memory consumers

Note that constantly increasing memory usage is not necessarily evidence of a memory leak. Some applications will store ever increasing amounts of information in memory (e.g. as a cache). If the cache can grow so large as to cause problems, this may be a programming or design error, but is not a memory leak as the information remains nominally in use. In other cases, programs may require an unreasonably large amount of memory because the programmer has assumed memory is always sufficient for a particular task; for example, a graphics file processor might start by reading the entire contents of an image file and storing it all into memory, something that is not viable where a very large image exceeds available memory.

To put it another way, a memory leak arises from a particular kind of programming error, and without access to the program code, someone seeing symptoms can only guess that there might be a memory leak. It would be better to use terms such as "constantly increasing memory use" where no such inside knowledge exists.

A simple example in C++

The following C++ program deliberately leaks memory by losing the pointer to the allocated memory.

int main() {
     int* a = new int(5);
     a = nullptr;
    /* The pointer in the 'a' no longer exists, and therefore cannot be freed,
     but the memory is still allocated by the system.
     If the program continues to create such pointers without freeing them, 
     it will consume memory continuously.
     Therefore, a leak would occur. */
}

Memory management

From Wikipedia, the free encyclopedia
 
Memory management is a form of resource management applied to computer memory. The essential requirement of memory management is to provide ways to dynamically allocate portions of memory to programs at their request, and free it for reuse when no longer needed. This is critical to any advanced computer system where more than a single process might be underway at any time.

Several methods have been devised that increase the effectiveness of memory management. Virtual memory systems separate the memory addresses used by a process from actual physical addresses, allowing separation of processes and increasing the size of the virtual address space beyond the available amount of RAM using paging or swapping to secondary storage. The quality of the virtual memory manager can have an extensive effect on overall system performance.

In some operating systems, e.g. OS/360 and successors, memory is managed by the operating system. In other operating systems, e.g. Unix-like operating systems, memory is managed at the application level.

Memory management within an address space is generally categorized as either manual memory management or automatic memory management.

Manual memory management

An example of external fragmentation

The task of fulfilling an allocation request consists of locating a block of unused memory of sufficient size. Memory requests are satisfied by allocating portions from a large pool of memory called the heap or free store. At any given time, some parts of the heap are in use, while some are "free" (unused) and thus available for future allocations. In the C language, the function which allocates memory from the heap is called malloc and the function which takes previously allocated memory and marks it as "free" (to be used by future allocations) is called free

Several issues complicate the implementation, such as external fragmentation, which arises when there are many small gaps between allocated memory blocks, which invalidates their use for an allocation request. The allocator's metadata can also inflate the size of (individually) small allocations. This is often managed by chunking. The memory management system must track outstanding allocations to ensure that they do not overlap and that no memory is ever "lost" (i.e. that there are no "memory leaks").

Efficiency

The specific dynamic memory allocation algorithm implemented can impact performance significantly. A study conducted in 1994 by Digital Equipment Corporation illustrates the overheads involved for a variety of allocators. The lowest average instruction path length required to allocate a single memory slot was 52 (as measured with an instruction level profiler on a variety of software).

Implementations

Since the precise location of the allocation is not known in advance, the memory is accessed indirectly, usually through a pointer reference. The specific algorithm used to organize the memory area and allocate and deallocate chunks is interlinked with the kernel, and may use any of the following methods:

Fixed-size blocks allocation

Fixed-size blocks allocation, also called memory pool allocation, uses a free list of fixed-size blocks of memory (often all of the same size). This works well for simple embedded systems where no large objects need to be allocated, but suffers from fragmentation, especially with long memory addresses. However, due to the significantly reduced overhead this method can substantially improve performance for objects that need frequent allocation / de-allocation and is often used in video games.

Buddy blocks

In this system, memory is allocated into several pools of memory instead of just one, where each pool represents blocks of memory of a certain power of two in size, or blocks of some other convenient size progression. All blocks of a particular size are kept in a sorted linked list or tree and all new blocks that are formed during allocation are added to their respective memory pools for later use. If a smaller size is requested than is available, the smallest available size is selected and split. One of the resulting parts is selected, and the process repeats until the request is complete. When a block is allocated, the allocator will start with the smallest sufficiently large block to avoid needlessly breaking blocks. When a block is freed, it is compared to its buddy. If they are both free, they are combined and placed in the correspondingly larger-sized buddy-block list.

Slab allocation

This memory allocation mechanism preallocates memory chunks suitable to fit objects of a certain type or size. These chunks are called caches and the allocator only has to keep track of a list of free cache slots. Constructing an object will use any one of the free cache slots and destructing an object will add a slot back to the free cache slot list. This technique alleviates memory fragmentation and is efficient as there is no need to search for a suitable portion of memory, as any open slot will suffice.

Stack allocation

Many Unix-like systems as well as Microsoft Windows implement a function called alloca for dynamically allocating stack memory in a way similar to the heap-based malloc. A compiler typically translates it to inlined instructions manipulating the stack pointer. Although there is no need of manually freeing memory allocated this way as it is automatically freed when the function that called alloca returns, there exists a risk of overflow. And since alloca is an ad hoc expansion seen in many systems but never in POSIX or the C standard, its behavior in case of a stack overflow is undefined.

A safer version of alloca called _malloca, which reports errors, exists on Microsoft Windows. It requires the use of _freea. gnulib provides an equivalent interface, albeit instead of throwing an SEH exception on overflow, it delegates to malloc when an overlarge size is detected. A similar feature can be emulated using manual accounting and size-checking, such as in the uses of alloca_account in glibc.

Automated memory management

The proper management of memory in an application is a difficult problem, and several different strategies for handling memory management have been devised.

Automatic Management of Call Stack Variables

In many programming language implementations, the runtime environment for the program automatically allocates memory in the call stack for non-static local variables of a subroutine, called automatic variables, when the subroutine is called, and automatically releases that memory when the subroutine is exited. Special declarations may allow local variables to retain values between invocations of the procedure, or may allow local variables to be accessed by other subroutines. The automatic allocation of local variables makes recursion possible, to a depth limited by available memory.

Garbage collection

Garbage collection is a strategy for automatically detecting memory allocated to objects that are no longer usable in a program, and returning that allocated memory to a pool of free memory locations. This method is in contrast to "manual" memory management where a programmer explicitly codes memory requests and memory releases in the program. While automatic garbage collection has the advantages of reducing programmer workload and preventing certain kinds of memory allocation bugs, garbage collection does require memory resources of its own, and can compete with the application program for processor time.

Reference Counting

Reference counting is a strategy for detecting that memory is no longer usable by a program by maintaining a counter for how many independent pointers point to the memory. Whenever a new pointer points to a piece of memory, the programmer is supposed to increase the counter. When the pointer changes where it points, or when the pointer is no longer pointing to anything or has itself been freed, the counter should decrease. When the counter drops to zero, the memory should be considered unused and freed. Some reference counting systems require programmer involvement and some are implemented automatically by the compiler. A disadvantage of reference counting is that circular references can develop which cause a memory leak to occur. This can be mitigated by either adding the concept of a "weak reference" (a reference that does not participate in reference counting, but is notified when the thing it is pointing to is no longer valid) or by combining reference counting and garbage collection together.

Memory Pools

A memory pool is a technique of automatically deallocating memory based on the state of the application, such as the lifecycle of a request or transaction. The idea is that many applications execute large chunks of code which may generate memory allocations, but that there is a point in execution where all of those chunks are known to be no longer valid. For example, in a web service, after each request the web service no longer needs any of the memory allocated during the execution of the request. Therefore, rather than keeping track of whether or not memory is currently being referenced, the memory is allocated according to the request and/or lifecycle stage it is associated with. When that request or stage has passed, all associated memory is deallocated simultaneously.

Systems with virtual memory

Virtual memory is a method of decoupling the memory organization from the physical hardware. The applications operate on memory via virtual addresses. Each attempt by the application to access a particular virtual memory address results in the virtual memory address being translated to an actual physical address. In this way the addition of virtual memory enables granular control over memory systems and methods of access.

In virtual memory systems the operating system limits how a process can access the memory. This feature, called memory protection, can be used to disallow a process to read or write to memory that is not allocated to it, preventing malicious or malfunctioning code in one program from interfering with the operation of another.

Even though the memory allocated for specific processes is normally isolated, processes sometimes need to be able to share information. Shared memory is one of the fastest techniques for inter-process communication.

Memory is usually classified by access rate into primary storage and secondary storage. Memory management systems, among other operations, also handle the moving of information between these two levels of memory.

Memory management in OS/360 and successors

IBM System/360 does not support virtual memory. Memory isolation of jobs is optionally accomplished using protection keys, assigning storage for each job a different key, 0 for the supervisor or 1–15. Memory management in OS/360 is a supervisor function. Storage is requested using the GETMAIN macro and freed using the FREEMAIN macro, which result in a call to the supervisor (SVC) to perform the operation.

In OS/360 the details vary depending on how the system is generated, e.g., for PCP, MFT, MVT.

In OS/360 MVT, suballocation within a job's region or the shared System Queue Area (SQA) is based on subpools, areas a multiple of 2 KB in size—the size of an area protected by a protection key. Subpools are numbered 0–255. Within a region subpools are assigned either the job's storage protection or the supervisor's key, key 0. Subpools 0–127 receive the job's key. Initially only subpool zero is created, and all user storage requests are satisfied from subpool 0, unless another is specified in the memory request. Subpools 250–255 are created by memory requests by the supervisor on behalf of the job. Most of these are assigned key 0, although a few get the key of the job. Subpool numbers are also relevant in MFT, although the details are much simpler. MFT uses fixed partitions redefinable by the operator instead of dynamic regions and PCP has only a single partition.

Each subpool is mapped by a list of control blocks identifying allocated and free memory blocks within the subpool. Memory is allocated by finding a free area of sufficient size, or by allocating additional blocks in the subpool, up to the region size of the job. It is possible to free all or part of an allocated memory area.

The details for OS/VS1 are similar to those for MFT and for MVT; the details for OS/VS2 are similar to those for MVT, except that the page size is 4 KiB. For both OS/VS1 and OS/VS2 the shared System Queue Area (SQA) is nonpageable.

In MVS the address space includes an additional pageable shared area, the Common Storage Area (CSA), and an additional private area, the System Work area (SWA). Also, the storage keys 0-7 are all reserved for use by privileged code.

Software regression

From Wikipedia, the free encyclopedia

A software regression is a type of software bug where a feature that has worked before stops working. This may happen after changes are applied to the software's source code, including the addition of new features and bug fixes. They may also be introduced by changes to the environment in which the software is running, such as system upgrades, system patching or a change to daylight saving time. A software performance regression is a situation where the software still functions correctly, but performs more slowly or uses more memory or resources than before. Various types of software regressions have been identified in practice, including the following:

  • Local – a change introduces a new bug in the changed module or component.
  • Remote – a change in one part of the software breaks functionality in another module or component.
  • Unmasked – a change unmasks an already existing bug that had no effect before the change.

Regressions are often caused by encompassed bug fixes included in software patches. One approach to avoiding this kind of problem is regression testing. A properly designed test plan aims at preventing this possibility before releasing any software. Automated testing and well-written test cases can reduce the likelihood of a regression.

Prevention and detection

Techniques have been proposed that try to prevent regressions from being introduced into software at various stages of development, as outlined below.

Prior to release

In order to avoid regressions being seen by the end-user after release, developers regularly run regression tests after changes are introduced to the software. These tests can include unit tests to catch local regressions as well as integration tests to catch remote regressions. Regression testing techniques often leverage existing test cases to minimize the effort involved in creating them. However, due to the volume of these existing tests, it is often necessary to select a representative subset, using techniques such as test-case prioritization.

For detecting performance regressions, software performance tests are run on a regular basis, to monitor the response time and resource usage metrics of the software after subsequent changes. Unlike functional regression tests, the results of performance tests are subject to variance - that is, results can differ between tests due to variance in performance measurements; as a result, a decision must be made on whether a change in performance numbers constitutes a regression, based on experience and end-user demands. Approaches such as statistical significance testing and change point detection are sometimes used to aid in this decision.

Prior to commit

Since debugging and localizing the root cause of a software regression can be expensive, there also exist some methods that try to prevent regressions from being committed into the code repository in the first place. For example, Git Hooks enable developers to run test scripts before code changes are committed or pushed to the code repository. In addition, change impact analysis has been applied to software to predict the impact of a code change on various components of the program, and to supplement test case selection and prioritization. Software linters are also often added to commit hooks to ensure consistent coding style, thereby minimizing stylistic issues that can make the software prone to regressions.

Localization

Many of the techniques used to find the root cause of non-regression software bugs can also be used to debug software regressions, including breakpoint debugging, print debugging, and program slicing. The techniques described below are often used specifically to debug software regressions.

Functional regressions

A common technique used to localize functional regressions is bisection, which takes both a buggy commit and a previously working commit as input, and tries to find the root cause by doing a binary search on the commits in between. Version control systems such as Git and Mercurial provide built-in ways to perform bisection on a given pair of commits.

Other options include directly associating the result of a regression test with code changes; setting divergence breakpoints; or using incremental data-flow analysis, which identifies test cases - including failing ones - that are relevant to a set of code changes, among others.

Performance regressions

Profiling measures the performance and resource usage of various components of a program, and is used to generate data useful in debugging performance issues. In the context of software performance regressions, developers often compare the call trees (also known as "timelines") generated by profilers for both the buggy version and the previously working version, and mechanisms exist to simplify this comparison. Web development tools typically provide developers the ability to record these performance profiles.

Logging also helps with performance regression localization, and similar to call trees, developers can compare systematically-placed performance logs of multiple versions of the same software. A tradeoff exists when adding these performance logs, as adding many logs can help developers pinpoint which portions of the software are regressing at smaller granularities, while adding only a few logs will also reduce overhead when executing the program.

Additional approaches include writing performance-aware unit tests to help with localization, and ranking subsystems based on performance counter deviations. Bisection can also be repurposed for performance regressions by considering commits that perform below (or above) a certain baseline value as buggy, and taking either the left or the right side of the commits based on the results of this comparison.

Software bug

From Wikipedia, the free encyclopedia

Bugs in software can arise from mistakes and errors made in interpreting and extracting users' requirements, planning a program's design, writing its source code, and from interaction with humans, hardware and programs, such as operating systems or libraries. A program with many, or serious, bugs is often described as buggy. Bugs can trigger errors that may have ripple effects. The effects of bugs may be subtle, such as unintended text formatting, through to more obvious effects such as causing a program to crash, freezing the computer, or causing damage to hardware. Other bugs qualify as security bugs and might, for example, enable a malicious user to bypass access controls in order to obtain unauthorized privileges.

Some software bugs have been linked to disasters. Bugs in code that controlled the Therac-25 radiation therapy machine were directly responsible for patient deaths in the 1980s. In 1996, the European Space Agency's US$1 billion prototype Ariane 5 rocket was destroyed less than a minute after launch due to a bug in the on-board guidance computer program. In 1994, an RAF Chinook helicopter crashed, killing 29; this was initially blamed on pilot error, but was later thought to have been caused by a software bug in the engine-control computer. Buggy software caused the early 21st century British Post Office scandal, the most widespread miscarriage of justice in British legal history.

In 2002, a study commissioned by the US Department of Commerce's National Institute of Standards and Technology concluded that "software bugs, or errors, are so prevalent and so detrimental that they cost the US economy an estimated $59 billion annually, or about 0.6 percent of the gross domestic product".

History

The Middle English word bugge is the basis for the terms "bugbear" and "bugaboo" as terms used for a monster.

The term "bug" to describe defects has been a part of engineering jargon since the 1870s and predates electronics and computers; it may have originally been used in hardware engineering to describe mechanical malfunctions. For instance, Thomas Edison wrote in a letter to an associate in 1878:

... difficulties arise—this thing gives out and [it is] then that "Bugs"—as such little faults and difficulties are called—show themselves

Baffle Ball, the first mechanical pinball game, was advertised as being "free of bugs" in 1931. Problems with military gear during World War II were referred to as bugs (or glitches). In a book published in 1942, Louise Dickinson Rich, speaking of a powered ice cutting machine, said, "Ice sawing was suspended until the creator could be brought in to take the bugs out of his darling."

Isaac Asimov used the term "bug" to relate to issues with a robot in his short story "Catch That Rabbit", published in 1944.

A page from the Harvard Mark II electromechanical computer's log, featuring a dead moth that was removed from the device

The term "bug" was used in an account by computer pioneer Grace Hopper, who publicized the cause of a malfunction in an early electromechanical computer. A typical version of the story is:

In 1946, when Hopper was released from active duty, she joined the Harvard Faculty at the Computation Laboratory where she continued her work on the Mark II and Mark III. Operators traced an error in the Mark II to a moth trapped in a relay, coining the term bug. This bug was carefully removed and taped to the log book. Stemming from the first bug, today we call errors or glitches in a program a bug.

Hopper was not present when the bug was found, but it became one of her favorite stories. The date in the log book was September 9, 1947. The operators who found it, including William "Bill" Burke, later of the Naval Weapons Laboratory, Dahlgren, Virginia, were familiar with the engineering term and amusedly kept the insect with the notation "First actual case of bug being found." This log book, complete with attached moth, is part of the collection of the Smithsonian National Museum of American History.

The related term "debug" also appears to predate its usage in computing: the Oxford English Dictionary's etymology of the word contains an attestation from 1945, in the context of aircraft engines.

The concept that software might contain errors dates back to Ada Lovelace's 1843 notes on the analytical engine, in which she speaks of the possibility of program "cards" for Charles Babbage's analytical engine being erroneous:

... an analysing process must equally have been performed in order to furnish the Analytical Engine with the necessary operative data; and that herein may also lie a possible source of error. Granted that the actual mechanism is unerring in its processes, the cards may give it wrong orders.

Terminology

While the use of the term "bug" to describe software errors is common, many have suggested that it should be abandoned. One argument is that the word "bug" is divorced from a sense that a human being caused the problem, and instead implies that the defect arose on its own, leading to a push to abandon the term "bug" in favor of terms such as "defect", with limited success.

The term "bug" may also be used to cover up an intentional design decision. In 2011, after receiving scrutiny from US Senator Al Franken for recording and storing users' locations in unencrypted files, Apple called the behavior a bug. However, Justin Brookman of the Center for Democracy and Technology directly challenged that portrayal, stating "I'm glad that they are fixing what they call bugs, but I take exception with their strong denial that they track users."

In software engineering, mistake metamorphism (from Greek meta = "change", morph = "form") refers to the evolution of a defect in the final stage of software deployment. Transformation of a "mistake" committed by an analyst in the early stages of the software development lifecycle, which leads to a "defect" in the final stage of the cycle has been called 'mistake metamorphism'.

Different stages of a "mistake" in the entire cycle may be described as "mistakes", "anomalies", "faults", "failures", "errors", "exceptions", "crashes", "glitches", "bugs", "defects", "incidents", or "side effects".

Prevention

Error resulting from a software bug displayed on two screens at La Croix de Berny station in France

The software industry has put much effort into reducing bug counts. These include:

Typographical errors

Bugs usually appear when the programmer makes a logic error. Various innovations in programming style and defensive programming are designed to make these bugs less likely, or easier to spot. Some typos, especially of symbols or operators, allow the program to operate incorrectly, while others such as a missing symbol or misspelled name may prevent the program from operating. Compiled languages can reveal some typos when the source code is compiled.

Development methodologies

Several schemes assist managing programmer activity so that fewer bugs are produced. Software engineering (which addresses software design issues as well) applies many techniques to prevent defects. For example, formal program specifications state the exact behavior of programs so that design bugs may be eliminated. Formal specifications are impractical for anything but the shortest programs, because of problems of combinatorial explosion and indeterminacy.

Unit testing involves writing a test for every function (unit) that a program is to perform.

In test-driven development unit tests are written before the code and the code is not considered complete until all tests complete successfully.

Agile software development involves frequent software releases with relatively small changes. Defects are revealed by user feedback.

Open source development allows anyone to examine source code. A school of thought popularized by Eric S. Raymond as Linus's law says that popular open-source software has more chance of having few or no bugs than other software, because "given enough eyeballs, all bugs are shallow". This assertion has been disputed, however: computer security specialist Elias Levy wrote that "it is easy to hide vulnerabilities in complex, little understood and undocumented source code," because, "even if people are reviewing the code, that doesn't mean they're qualified to do so." An example of an open-source software bug was the 2008 OpenSSL vulnerability in Debian.

Programming language support

Programming languages include features to help prevent bugs, such as static type systems, restricted namespaces and modular programming. For example, when a programmer writes (pseudocode) LET REAL_VALUE PI = "THREE AND A BIT", although this may be syntactically correct, the code fails a type check. Compiled languages catch this without having to run the program. Interpreted languages catch such errors at runtime. Some languages deliberately exclude features that easily lead to bugs, at the expense of slower performance: the general principle being that, it is almost always better to write simpler, slower code than inscrutable code that runs slightly faster, especially considering that maintenance cost is substantial. For example, the Java programming language does not support pointer arithmetic; implementations of some languages such as Pascal and scripting languages often have runtime bounds checking of arrays, at least in a debugging build.

Code analysis

Tools for code analysis help developers by inspecting the program text beyond the compiler's capabilities to spot potential problems. Although in general the problem of finding all programming errors given a specification is not solvable (see halting problem), these tools exploit the fact that human programmers tend to make certain kinds of simple mistakes often when writing software.

Instrumentation

Tools to monitor the performance of the software as it is running, either specifically to find problems such as bottlenecks or to give assurance as to correct working, may be embedded in the code explicitly (perhaps as simple as a statement saying PRINT "I AM HERE"), or provided as tools. It is often a surprise to find where most of the time is taken by a piece of code, and this removal of assumptions might cause the code to be rewritten.

Testing

Software testers are people whose primary task is to find bugs, or write code to support testing. On some efforts, more resources may be spent on testing than in developing the program.

Measurements during testing can provide an estimate of the number of likely bugs remaining; this becomes more reliable the longer a product is tested and developed.

Debugging

The typical bug history (GNU Classpath project data). A new bug submitted by the user is unconfirmed. Once it has been reproduced by a developer, it is a confirmed bug. The confirmed bugs are later fixed. Bugs belonging to other categories (unreproducible, will not be fixed, etc.) are usually in the minority.

Finding and fixing bugs, or debugging, is a major part of computer programming. Maurice Wilkes, an early computing pioneer, described his realization in the late 1940s that much of the rest of his life would be spent finding mistakes in his own programs.

Usually, the most difficult part of debugging is finding the bug. Once it is found, correcting it is usually relatively easy. Programs known as debuggers help programmers locate bugs by executing code line by line, watching variable values, and other features to observe program behavior. Without a debugger, code may be added so that messages or values may be written to a console or to a window or log file to trace program execution or show values.

However, even with the aid of a debugger, locating bugs is something of an art. It is not uncommon for a bug in one section of a program to cause failures in a completely different section, thus making it especially difficult to track (for example, an error in a graphics rendering routine causing a file I/O routine to fail), in an apparently unrelated part of the system.

Sometimes, a bug is not an isolated flaw, but represents an error of thinking or planning on the part of the programmer. Such logic errors require a section of the program to be overhauled or rewritten. As a part of code review, stepping through the code and imagining or transcribing the execution process may often find errors without ever reproducing the bug as such.

More typically, the first step in locating a bug is to reproduce it reliably. Once the bug is reproducible, the programmer may use a debugger or other tool while reproducing the error to find the point at which the program went astray.

Some bugs are revealed by inputs that may be difficult for the programmer to re-create. One cause of the Therac-25 radiation machine deaths was a bug (specifically, a race condition) that occurred only when the machine operator very rapidly entered a treatment plan; it took days of practice to become able to do this, so the bug did not manifest in testing or when the manufacturer attempted to duplicate it. Other bugs may stop occurring whenever the setup is augmented to help find the bug, such as running the program with a debugger; these are called heisenbugs (humorously named after the Heisenberg uncertainty principle).

Since the 1990s, particularly following the Ariane 5 Flight 501 disaster, interest in automated aids to debugging rose, such as static code analysis by abstract interpretation.

Some classes of bugs have nothing to do with the code. Faulty documentation or hardware may lead to problems in system use, even though the code matches the documentation. In some cases, changes to the code eliminate the problem even though the code then no longer matches the documentation. Embedded systems frequently work around hardware bugs, since to make a new version of a ROM is much cheaper than remanufacturing the hardware, especially if they are commodity items.

Benchmark of bugs

To facilitate reproducible research on testing and debugging, researchers use curated benchmarks of bugs:

  • the Siemens benchmark
  • ManyBugs is a benchmark of 185 C bugs in nine open-source programs.
  • Defects4J is a benchmark of 341 Java bugs from 5 open-source projects. It contains the corresponding patches, which cover a variety of patch type.

Bug management

Bug management includes the process of documenting, categorizing, assigning, reproducing, correcting and releasing the corrected code. Proposed changes to software – bugs as well as enhancement requests and even entire releases – are commonly tracked and managed using bug tracking systems or issue tracking systems. The items added may be called defects, tickets, issues, or, following the agile development paradigm, stories and epics. Categories may be objective, subjective or a combination, such as version number, area of the software, severity and priority, as well as what type of issue it is, such as a feature request or a bug.

A bug triage reviews bugs and decides whether and when to fix them. The decision is based on the bug's priority, and factors such as development schedules. The triage is not meant to investigate the cause of bugs, but rather the cost of fixing them. The triage happens regularly, and goes through bugs opened or reopened since the previous meeting. The attendees of the triage process typically are the project manager, development manager, test manager, build manager, and technical experts.

Severity

Severity is the intensity of the impact the bug has on system operation. This impact may be data loss, financial, loss of goodwill and wasted effort. Severity levels are not standardized. Impacts differ across industry. A crash in a video game has a totally different impact than a crash in a web browser, or real time monitoring system. For example, bug severity levels might be "crash or hang", "no workaround" (meaning there is no way the customer can accomplish a given task), "has workaround" (meaning the user can still accomplish the task), "visual defect" (for example, a missing image or displaced button or form element), or "documentation error". Some software publishers use more qualified severities such as "critical", "high", "low", "blocker" or "trivial". The severity of a bug may be a separate category to its priority for fixing, and the two may be quantified and managed separately.

Priority

Priority controls where a bug falls on the list of planned changes. The priority is decided by each software producer. Priorities may be numerical, such as 1 through 5, or named, such as "critical", "high", "low", or "deferred". These rating scales may be similar or even identical to severity ratings, but are evaluated as a combination of the bug's severity with its estimated effort to fix; a bug with low severity but easy to fix may get a higher priority than a bug with moderate severity that requires excessive effort to fix. Priority ratings may be aligned with product releases, such as "critical" priority indicating all the bugs that must be fixed before the next software release.

A bug severe enough to delay or halt the release of the product is called a "show stopper" or "showstopper bug". It is named so because it "stops the show" – causes unacceptable product failure.

Software releases

It is common practice to release software with known, low-priority bugs. Bugs of sufficiently high priority may warrant a special release of part of the code containing only modules with those fixes. These are known as patches. Most releases include a mixture of behavior changes and multiple bug fixes. Releases that emphasize bug fixes are known as maintenance releases, to differentiate it from major releases that emphasize feature additions or changes.

Reasons that a software publisher opts not to patch or even fix a particular bug include:

  • A deadline must be met and resources are insufficient to fix all bugs by the deadline.
  • The bug is already fixed in an upcoming release, and it is not of high priority.
  • The changes required to fix the bug are too costly or affect too many other components, requiring a major testing activity.
  • It may be suspected, or known, that some users are relying on the existing buggy behavior; a proposed fix may introduce a breaking change.
  • The problem is in an area that will be obsolete with an upcoming release; fixing it is unnecessary.
  • "It's not a bug, it's a feature". A misunderstanding has arisen between expected and perceived behavior or undocumented feature.

Types

In software development, a mistake or error may be introduced at any stage. Bugs arise from oversight or misunderstanding by a software team during specification, design, coding, configuration, data entry or documentation. For example, a relatively simple program to alphabetize a list of words, the design might fail to consider what should happen when a word contains a hyphen. Or when converting an abstract design into code, the coder might inadvertently create an off-by-one error which can be a "<" where "<=" was intended, and fail to sort the last word in a list.

Another category of bug is called a race condition that may occur when programs have multiple components executing at the same time. If the components interact in a different order than the developer intended, they could interfere with each other and stop the program from completing its tasks. These bugs may be difficult to detect or anticipate, since they may not occur during every execution of a program.

Conceptual errors are a developer's misunderstanding of what the software must do. The resulting software may perform according to the developer's understanding, but not what is really needed. Other types:

Arithmetic

In operations on numerical values, problems can arise that result in unexpected output, slowing of a process, or crashing. These can be from a lack of awareness of the qualities of the data storage such as a loss of precision due to rounding, numerically unstable algorithms, arithmetic overflow and underflow, or from lack of awareness of how calculations are handled by different software coding languages such as division by zero which in some languages may throw an exception, and in others may return a special value such as NaN or infinity.

Control flow

Control flow bugs are those found in processes with valid logic, but that lead to unintended results, such as infinite loops and infinite recursion, incorrect comparisons for conditional statements such as using the incorrect comparison operator, and off-by-one errors (counting one too many or one too few iterations when looping).

Interfacing

  • Incorrect API usage.
  • Incorrect protocol implementation.
  • Incorrect hardware handling.
  • Incorrect assumptions of a particular platform.
  • Incompatible systems. A new API or communications protocol may seem to work when two systems use different versions, but errors may occur when a function or feature implemented in one version is changed or missing in another. In production systems which must run continually, shutting down the entire system for a major update may not be possible, such as in the telecommunication industry or the internet. In this case, smaller segments of a large system are upgraded individually, to minimize disruption to a large network. However, some sections could be overlooked and not upgraded, and cause compatibility errors which may be difficult to find and repair.
  • Incorrect code annotations.

Concurrency

Resourcing

Syntax

  • Use of the wrong token, such as performing assignment instead of equality test. For example, in some languages x=5 will set the value of x to 5 while x==5 will check whether x is currently 5 or some other number. Interpreted languages allow such code to fail. Compiled languages can catch such errors before testing begins.

Teamwork

  • Unpropagated updates; e.g. programmer changes "myAdd" but forgets to change "mySubtract", which uses the same algorithm. These errors are mitigated by the Don't Repeat Yourself philosophy.
  • Comments out of date or incorrect: many programmers assume the comments accurately describe the code.
  • Differences between documentation and product.

Implications

The amount and type of damage a software bug may cause naturally affects decision-making, processes and policy regarding software quality. In applications such as human spaceflight, aviation, nuclear power, health care, public transport or automotive safety, since software flaws have the potential to cause human injury or even death, such software will have far more scrutiny and quality control than, for example, an online shopping website. In applications such as banking, where software flaws have the potential to cause serious financial damage to a bank or its customers, quality control is also more important than, say, a photo editing application.

Other than the damage caused by bugs, some of their cost is due to the effort invested in fixing them. In 1978, Lientz et al. showed that the median of projects invest 17 percent of the development effort in bug fixing. In 2020, research on GitHub repositories showed the median is 20%.

Residual bugs in delivered product

In 1994, NASA's Goddard Space Flight Center managed to reduce their average number of errors from 4.5 per 1000 lines of code (SLOC) down to 1 per 1000 SLOC.

Another study in 1990 reported that exceptionally good software development processes can achieve deployment failure rates as low as 0.1 per 1000 SLOC. This figure is iterated in literature such as Code Complete by Steve McConnell, and the NASA study on Flight Software Complexity. Some projects even attained zero defects: the firmware in the IBM Wheelwriter typewriter which consists of 63,000 SLOC, and the Space Shuttle software with 500,000 SLOC.

Well-known bugs

A number of software bugs have become well-known, usually due to their severity: examples include various space and military aircraft crashes. Possibly the most famous bug is the Year 2000 problem or Y2K bug, which caused many programs written long before the transition from 19xx to 20xx dates to malfunction, for example treating a date such as "25 Dec 04" as being in 1904, displaying "19100" instead of "2000", and so on. A huge effort at the end of the 20th century resolved the most severe problems, and there were no major consequences.

The 2012 stock trading disruption involved one such incompatibility between the old API and a new API.

In politics

"Bugs in the System" report

The Open Technology Institute, run by the group, New America, released a report "Bugs in the System" in August 2016 stating that U.S. policymakers should make reforms to help researchers identify and address software bugs. The report "highlights the need for reform in the field of software vulnerability discovery and disclosure." One of the report's authors said that Congress has not done enough to address cyber software vulnerability, even though Congress has passed a number of bills to combat the larger issue of cyber security.

Government researchers, companies, and cyber security experts are the people who typically discover software flaws. The report calls for reforming computer crime and copyright laws.

The Computer Fraud and Abuse Act, the Digital Millennium Copyright Act and the Electronic Communications Privacy Act criminalize and create civil penalties for actions that security researchers routinely engage in while conducting legitimate security research, the report said.

In popular culture

  • In video gaming, the term "glitch" is sometimes used to refer to a software bug. An example is the glitch and unofficial Pokémon species MissingNo.
  • In both the 1968 novel 2001: A Space Odyssey and the corresponding 1968 film 2001: A Space Odyssey, a spaceship's onboard computer, HAL 9000, attempts to kill all its crew members. In the follow-up 1982 novel, 2010: Odyssey Two, and the accompanying 1984 film, 2010, it is revealed that this action was caused by the computer having been programmed with two conflicting objectives: to fully disclose all its information, and to keep the true purpose of the flight secret from the crew; this conflict caused HAL to become paranoid and eventually homicidal.
  • In the English version of the Nena 1983 song 99 Luftballons (99 Red Balloons) as a result of "bugs in the software", a release of a group of 99 red balloons are mistaken for an enemy nuclear missile launch, requiring an equivalent launch response, resulting in catastrophe.
  • In the 1999 American comedy Office Space, three employees attempt (unsuccessfully) to exploit their company's preoccupation with the Y2K computer bug using a computer virus that sends rounded-off fractions of a penny to their bank account—a long-known technique described as salami slicing.
  • The 2004 novel The Bug, by Ellen Ullman, is about a programmer's attempt to find an elusive bug in a database application.
  • The 2008 Canadian film Control Alt Delete is about a computer programmer at the end of 1999 struggling to fix bugs at his company related to the year 2000 problem.

Authorship of the Bible

From Wikipedia, the free encyclopedia ...