Tuesday, June 16, 2020

Automatic bug fixing

From Wikipedia, the free encyclopedia
 
Automatic bug-fixing is the automatic repair of software bugs without the intervention of a human programmer. It is also commonly referred to as automatic patch generation, automatic bug repair, or automatic program repair. The typical goal of such techniques is to automatically generate correct patches to eliminate bugs in software programs without causing software regression.

Specification

Automatic bug fixing is performed according to a specification of the expected behavior, which can be, for instance, a formal specification or a test suite.

In a test suite, the input/output pairs specify the functionality of the program, possibly captured in assertions, and can therefore be used as a test oracle to drive the search. This oracle can in fact be divided into the bug oracle, which exposes the faulty behavior, and the regression oracle, which encapsulates the functionality any program repair method must preserve. Note that a test suite is typically incomplete and does not cover all possible cases. Therefore, it is often possible for a validated patch to produce expected outputs for all inputs in the test suite but incorrect outputs for other inputs. The existence of such validated but incorrect patches is a major challenge for generate-and-validate techniques. Recent successful automatic bug-fixing techniques often rely on information beyond the test suite, such as information learned from previous human patches, to further identify correct patches among validated patches.
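As a minimal illustration of the two oracles (a hypothetical JUnit example; the method and test names are invented), the failing test below acts as the bug oracle, while the passing tests form the regression oracle that any candidate patch must continue to satisfy:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class MaxTest {
        // Hypothetical buggy method under repair: returns -1 when the inputs are equal.
        static int max(int a, int b) { return a > b ? a : (b > a ? b : -1); }

        @Test public void regression1() { assertEquals(5, max(5, 3)); } // passes: regression oracle
        @Test public void regression2() { assertEquals(7, max(2, 7)); } // passes: regression oracle
        @Test public void exposesBug()  { assertEquals(4, max(4, 4)); } // fails: bug oracle
    }

A repair tool searches for a patch that makes exposesBug pass without breaking regression1 and regression2.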

Another way to specify the expected behavior is to use formal specifications. Verification against full specifications that describe the whole program behavior, including functionality, is less common, because such specifications are typically not available in practice and the computational cost of such verification is prohibitive. For specific classes of errors, however, implicit partial specifications are often available. For example, there are targeted bug-fixing techniques that validate that the patched program can no longer trigger overflow errors on the same execution path.

Techniques

Generate-and-validate

Generate-and-validate approaches compile and test each candidate patch to collect all validated patches that produce expected outputs for all inputs in the test suite. Such a technique typically starts with a test suite of the program, i.e., a set of test cases, at least one of which exposes the bug. An early generate-and-validate bug-fixing system is GenProg. The effectiveness of generate-and-validate techniques remains controversial, because they typically do not provide patch correctness guarantees. Nevertheless, the reported results of recent state-of-the-art techniques are generally promising. For example, on 69 systematically collected real-world bugs in eight large C software programs, the state-of-the-art bug-fixing system Prophet generates correct patches for 18 of the 69 bugs.

One way to generate candidate patches is to apply mutation operators to the original program. Mutation operators manipulate the original program, either via its abstract syntax tree representation or via a more coarse-grained representation operating at the statement or block level. Earlier genetic improvement approaches operate at the statement level and carry out simple delete/replace operations, such as deleting an existing statement or replacing an existing statement with another statement from the same source file. Recent approaches use more fine-grained operators at the abstract syntax tree level to generate a more diverse set of candidate patches.
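A minimal sketch of statement-level mutation (a toy illustration, not GenProg's actual implementation; the class name is invented): each candidate patch deletes one statement or replaces it with a statement taken from elsewhere in the same file:

    import java.util.ArrayList;
    import java.util.List;

    public class LineMutator {
        // Treats the program as a list of statements (one per line) and enumerates
        // candidate patches; each candidate is then compiled and run against the test suite.
        public static List<List<String>> candidates(List<String> lines) {
            List<List<String>> patches = new ArrayList<>();
            for (int i = 0; i < lines.size(); i++) {
                List<String> deletion = new ArrayList<>(lines);
                deletion.remove(i);                      // delete statement i
                patches.add(deletion);
                for (int j = 0; j < lines.size(); j++) {
                    if (i == j) continue;
                    List<String> replacement = new ArrayList<>(lines);
                    replacement.set(i, lines.get(j));    // replace statement i with statement j
                    patches.add(replacement);
                }
            }
            return patches;
        }
    }

Note how the replacement operator only reuses statements already present in the file, which is the redundancy insight discussed below.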

Another way to generate candidate patches is to use fix templates. Fix templates are typically predefined changes for fixing specific classes of bugs. Examples of fix templates include inserting a conditional statement to check whether the value of a variable is null, to fix a null pointer exception, or changing an integer constant by one, to fix an off-by-one error. It is also possible to automatically mine fix templates for generate-and-validate approaches.
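For instance, a null-check template might transform a hypothetical method as follows (the names and the fallback behavior are invented for illustration; real tools must also decide what the guarded branch should do):

    interface User { String getName(); }

    class Greeter {
        // Before repair: throws NullPointerException when user is null.
        static String greet(User user) {
            return "Hello, " + user.getName();
        }

        // After applying a null-check fix template.
        static String greetPatched(User user) {
            if (user == null) {
                return "Hello, stranger";
            }
            return "Hello, " + user.getName();
        }
    }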

Many generate-and-validate techniques rely on the redundancy insight: the code of the patch can be found elsewhere in the application. This idea was introduced in the GenProg system, where two operators, addition and replacement of AST nodes, were based on code taken from elsewhere (i.e., adding an existing AST node). This idea has been validated empirically by two independent studies showing that a significant proportion of commits (3%-17%) are composed of existing code. Beyond the fact that the code to reuse exists elsewhere, it has also been shown that the context of the potential repair ingredients is useful: often, the donor context is similar to the recipient context.

Synthesis-based

Repair techniques exist that are based on symbolic execution. For example, SemFix uses symbolic execution to extract a repair constraint. Angelix introduced the concept of an angelic forest in order to deal with multiline patches.

Under certain assumptions, it is possible to state the repair problem as a synthesis problem. SemFix and Nopol use component-based synthesis. Dynamoth uses dynamic synthesis. S3 is based on syntax-guided synthesis. SearchRepair converts potential patches into an SMT formula and queries for candidate patches that allow the patched program to pass all supplied test cases.
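The following toy sketch conveys the spirit of condition synthesis (in the style of what tools like SemFix and Nopol automate, though they use constraint solving rather than the brute-force enumeration shown here; all names are invented): candidate conditions for a buggy repair site are checked against the input/output examples drawn from the test suite, and only conditions that satisfy every example are kept as validated patches:

    import java.util.List;
    import java.util.function.BiPredicate;

    public class ConditionSynthesis {
        record Example(int a, int b, boolean expected) {}

        public static void main(String[] args) {
            // Input/output examples from the test suite; the buggy site is "return a > b;"
            // and the bug-exposing test requires the a == b case to return true.
            List<Example> tests = List.of(
                new Example(5, 3, true),
                new Example(2, 7, false),
                new Example(4, 4, true));

            List<String> names = List.of("a > b", "a >= b", "a < b", "a <= b", "a == b");
            List<BiPredicate<Integer, Integer>> space = List.of(
                (a, b) -> a > b, (a, b) -> a >= b,
                (a, b) -> a < b, (a, b) -> a <= b,
                (a, b) -> a.equals(b));

            for (int i = 0; i < space.size(); i++) {
                BiPredicate<Integer, Integer> cand = space.get(i);
                boolean valid = tests.stream()
                    .allMatch(t -> cand.test(t.a(), t.b()) == t.expected());
                if (valid) System.out.println("validated condition: " + names.get(i)); // "a >= b"
            }
        }
    }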

Data-driven

Machine learning techniques can improve the effectiveness of automatic bug-fixing systems. One example of such a technique learns from past successful patches written by human developers, collected from open-source repositories on GitHub and SourceForge. It then uses the learned information to recognize and prioritize potentially correct patches among all generated candidate patches. Alternatively, patches can be mined directly from existing sources; example approaches include mining patches from donor applications or from Q&A web sites.

SequenceR uses sequence-to-sequence learning on source code in order to generate one-line patches. It defines a neural network architecture that works well with source code, including a copy mechanism that allows it to produce patches containing tokens that are not in the learned vocabulary. Those tokens are taken from the code of the Java class under repair.

Other

Targeted automatic bug-fixing techniques generate repairs for specific classes of errors, such as null pointer exceptions, integer overflows, buffer overflows, and memory leaks. Such techniques often use empirical fix templates to fix bugs in the targeted scope, for example by inserting a conditional statement to check whether the value of a variable is null, or by inserting missing memory deallocation statements. Compared with generate-and-validate techniques, targeted techniques tend to have better bug-fixing accuracy but a much narrower scope.

Use

There are multiple uses of automatic bug fixing:
  • in the development environment: when the developer encounters a bug, she activates a feature to search for a patch (for instance, by clicking on a button). This search can even happen in the background, with the IDE proactively searching for solutions to potential problems, without waiting for an explicit developer action.
  • in the continuous integration server: when a build fails during continuous integration, a patch search can be attempted as soon as the build has failed. If the search is successful, the patch is given to the developer before she has started working on the bug, or before she has found the solution. When a synthesized patch is suggested to the developers as a pull request, an explanation has to be provided in addition to the code changes (e.g., a pull request title and description). An experiment has shown that generated patches can be accepted by open-source developers and merged into the code repository.
  • at runtime: when a failure happens at runtime, a binary patch can be searched for and applied online. An example of such a repair system is ClearView, which repairs x86 code with x86 binary patches. The Itzal system is different from ClearView: while the repair search happens at runtime, in production, the produced patches are at the source code level. The BikiniProxy system does online repair of JavaScript errors happening in the browser.

Search space

In essence, automatic bug fixing is a search activity, whether deductive or heuristic. The search space of automatic bug fixing is composed of all edits that can possibly be made to a program. There have been studies to understand the structure of this search space. Qi et al. showed that the original fitness function of GenProg is no better than random search at driving the search. Martinez et al. explored the imbalance between possible repair actions, showing its significant impact on the search. Long et al.'s study indicated that correct patches are sparse in the search space and that incorrect, overfitting patches are vastly more abundant (see also the discussion of overfitting below).

If one explicitly enumerates all possible variants of a repair algorithm, this defines a design space for program repair. Each variant selects an algorithm involved at some point in the repair process (e.g., the fault localization algorithm) or selects a specific heuristic, which yields different patches. For instance, in the design space of generate-and-validate program repair, there is one variation point concerning the granularity of the program elements to be modified: an expression, a statement, a block, etc.

Limitations of automatic bug-fixing

Automatic bug-fixing techniques that rely on a test suite do not provide patch correctness guarantees, because the test suite is incomplete and does not cover all cases. A weak test suite may cause generate-and-validate techniques to produce validated but incorrect patches that have negative effects such as eliminating desirable functionalities, causing memory leaks, and introducing security vulnerabilities.

Sometimes, in test-suite-based program repair, tools generate patches that pass the test suite yet are actually incorrect; this is known as the "overfitting" problem. "Overfitting" in this context refers to the fact that the patch overfits to the test inputs. There are different kinds of overfitting: incomplete fixing means that only some buggy inputs are fixed, while regression introduction means that some previously working features are broken after the patch (because they were poorly tested). Early prototypes for automatic repair suffered heavily from overfitting: on the ManyBugs C benchmark, Qi et al. reported that 104 of 110 plausible GenProg patches were overfitting; on the Defects4J Java benchmark, Martinez et al. reported 73 of 84 plausible patches as overfitting. In the context of synthesis-based repair, Le et al. obtained more than 80% overfitting patches.
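A toy example makes the overfitting problem concrete (hypothetical code): if the only bug-exposing test checks abs(-3), the degenerate patch below passes the entire test suite yet is wrong for every other negative input:

    class AbsExample {
        // Weak test suite: abs(5) == 5, abs(0) == 0, abs(-3) == 3.

        // Correct repair: passes the tests and generalizes.
        static int abs(int x) { return x < 0 ? -x : x; }

        // Overfitting "repair": also passes all three tests,
        // but abs(-7) still returns -7.
        static int absOverfit(int x) { return x == -3 ? 3 : x; }
    }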

Another limitation of generate-and-validate systems is the search space explosion. For a program, there are a large number of statements to change and for each statement there are a large number of possible modifications. State-of-the-art systems address this problem by assuming that a small modification is enough for fixing a bug, resulting in a search space reduction.

The limitation of approaches based on symbolic analysis is that real-world programs are often converted to intractably large formulas, especially when modifying statements with side effects.

Benchmarks

The benchmark collected by the GenProg authors contains 69 real-world defects and is widely used to evaluate many other bug-fixing tools for C as well. The state-of-the-art tool evaluated on the GenProg benchmark is Prophet, which generates correct patches for 18 of the 69 defects. In Java, the main benchmark is Defects4J, initially explored by Martinez et al. and now extensively used in most research papers on program repair for Java. Alternative benchmarks exist, such as the QuixBugs benchmark, which contains original bugs for program repair. Other benchmarks of Java bugs include Bugs.jar, based on past commits, and BEARS, a benchmark of continuous integration build failures.

Example tools

Automatic bug-fixing is an active research topic in computer science. There are many implementations of various bug-fixing techniques, especially for C and Java programs. Note that most of these implementations are research prototypes for demonstrating their techniques; it is unclear whether their current implementations are ready for industrial use.

C

  • ClearView: A generate-and-validate tool for generating binary patches for deployed systems. It was evaluated on 10 security vulnerability cases; a later study showed that it generates correct patches for at least 4 of the 10 cases.
  • GenProg: A seminal generate-and-validate bug-fixing tool. It has been extensively studied in the context of the ManyBugs benchmark.
  • SemFix: The first solver-based bug-fixing tool for C.
  • CodePhage: The first bug-fixing tool that directly transfers code across programs to generate patches for C programs. Note that although it generates C patches, it can extract code from binary programs without source code.
  • LeakFix: A tool that automatically fixes memory leaks in C programs.
  • Prophet: The first generate-and-validate tool that uses machine learning techniques to learn useful knowledge from past human patches in order to recognize correct patches. It is evaluated on the same benchmark as GenProg and generates correct patches (i.e., equivalent to human patches) for 18 of 69 cases.
  • SearchRepair: A tool for replacing buggy code using snippets of code from elsewhere. It is evaluated on the IntroClass benchmark and generates much higher quality patches on that benchmark than GenProg, RSRepair, and AE.
  • Angelix: An improved solver-based bug-fixing tool. It is evaluated on the GenProg benchmark. For 10 of the 69 cases, it generates patches equivalent to the human patches.

Java

  • PAR: A generate-and-validate tool that uses a set of manually defined fix templates. A later study raised concerns about the generalizability of the fix templates in PAR.
  • NOPOL: A solver-based tool focusing on modifying condition statements.
  • QACrashFix: A tool that fixes Java crash bugs by mining fixes from Q&A web sites.
  • Astor: An automatic repair library for Java, containing jGenProg, a Java implementation of GenProg.
  • NpeFix: An automatic repair tool for NullPointerException in Java, available on GitHub.

Other languages

  • AutoFixE: A bug-fixing tool for the Eiffel language. It relies on the contracts (i.e., a form of formal specification) in Eiffel programs to validate generated patches.

Software bug

From Wikipedia, the free encyclopedia

A software bug is an error, flaw or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. The process of finding and fixing bugs is termed "debugging" and often uses formal techniques or tools to pinpoint bugs, and since the 1950s, some computer systems have been designed to also deter, detect or auto-correct various computer bugs during operations.

Most bugs arise from mistakes and errors made in either a program's design or its source code, or in components and operating systems used by such programs. A few are caused by compilers producing incorrect code. A program that contains many bugs, and/or bugs that seriously interfere with its functionality, is said to be buggy (defective). Bugs can trigger errors that may have ripple effects. Bugs may have subtle effects or cause the program to crash or freeze the computer. Other bugs qualify as security bugs and might, for example, enable a malicious user to bypass access controls in order to obtain unauthorized privileges.

Some software bugs have been linked to disasters. Bugs in code that controlled the Therac-25 radiation therapy machine were directly responsible for patient deaths in the 1980s. In 1996, the European Space Agency's US$1 billion prototype Ariane 5 rocket had to be destroyed less than a minute after launch due to a bug in the on-board guidance computer program. In June 1994, a Royal Air Force Chinook helicopter crashed into the Mull of Kintyre, killing 29. This was initially dismissed as pilot error, but an investigation by Computer Weekly convinced a House of Lords inquiry that it may have been caused by a software bug in the aircraft's engine-control computer.

In 2002, a study commissioned by the US Department of Commerce's National Institute of Standards and Technology concluded that "software bugs, or errors, are so prevalent and so detrimental that they cost the US economy an estimated $59 billion annually, or about 0.6 percent of the gross domestic product".

History

The Middle English word bugge is the basis for the terms "bugbear" and "bugaboo" as terms used for a monster.

The term "bug" to describe defects has been a part of engineering jargon since the 1870s and predates electronic computers and computer software; it may have originally been used in hardware engineering to describe mechanical malfunctions. For instance, Thomas Edison wrote the following words in a letter to an associate in 1878:
It has been just so in all of my inventions. The first step is an intuition, and comes with a burst, then difficulties arise—this thing gives out and [it is] then that "Bugs"—as such little faults and difficulties are called—show themselves and months of intense watching, study and labor are requisite before commercial success or failure is certainly reached.
Baffle Ball, the first mechanical pinball game, was advertised as being "free of bugs" in 1931. Problems with military gear during World War II were referred to as bugs (or glitches). In the 1940 film, Flight Command, a defect in a piece of direction-finding gear is called a "bug". In a book published in 1942, Louise Dickinson Rich, speaking of a powered ice cutting machine, said, "Ice sawing was suspended until the creator could be brought in to take the bugs out of his darling."

Isaac Asimov used the term "bug" to relate to issues with a robot in his short story "Catch That Rabbit", published in 1944.

A page from the Harvard Mark II electromechanical computer's log, featuring a dead moth that was removed from the device.
 
The term "bug" was used in an account by computer pioneer Grace Hopper, who publicized the cause of a malfunction in an early electromechanical computer. A typical version of the story is:
In 1946, when Hopper was released from active duty, she joined the Harvard Faculty at the Computation Laboratory where she continued her work on the Mark II and Mark III. Operators traced an error in the Mark II to a moth trapped in a relay, coining the term bug. This bug was carefully removed and taped to the log book. Stemming from the first bug, today we call errors or glitches in a program a bug.
Hopper did not find the bug, as she readily acknowledged. The date in the log book was September 9, 1947. The operators who found it, including William "Bill" Burke, later of the Naval Weapons Laboratory, Dahlgren, Virginia, were familiar with the engineering term and amusedly kept the insect with the notation "First actual case of bug being found." Hopper loved to recount the story. This log book, complete with attached moth, is part of the collection of the Smithsonian National Museum of American History.

The related term "debug" also appears to predate its usage in computing: the Oxford English Dictionary's etymology of the word contains an attestation from 1945, in the context of aircraft engines.

The concept that software might contain errors dates back to Ada Lovelace's 1843 notes on the analytical engine, in which she speaks of the possibility of program "cards" for Charles Babbage's analytical engine being erroneous:
... an analysing process must equally have been performed in order to furnish the Analytical Engine with the necessary operative data; and that herein may also lie a possible source of error. Granted that the actual mechanism is unerring in its processes, the cards may give it wrong orders.

"Bugs in the System" report

The Open Technology Institute, run by the group New America, released a report "Bugs in the System" in August 2016 stating that U.S. policymakers should make reforms to help researchers identify and address software bugs. The report "highlights the need for reform in the field of software vulnerability discovery and disclosure." One of the report's authors said that Congress has not done enough to address cyber software vulnerability, even though Congress has passed a number of bills to combat the larger issue of cyber security.

Government researchers, companies, and cyber security experts are the people who typically discover software flaws. The report calls for reforming computer crime and copyright laws.
The Computer Fraud and Abuse Act, the Digital Millennium Copyright Act and the Electronic Communications Privacy Act criminalize and create civil penalties for actions that security researchers routinely engage in while conducting legitimate security research, the report said.

Terminology

While the use of the term "bug" to describe software errors is common, many have suggested that it should be abandoned. One argument is that the word "bug" is divorced from the sense that a human being caused the problem, and instead implies that the defect arose on its own, leading to a push to abandon the term in favor of terms such as "defect", with limited success. Since the 1970s, Gary Kildall somewhat humorously suggested using the term "blunder".

In software engineering, mistake metamorphism (from Greek meta = "change", morph = "form") refers to the evolution of a defect through the development lifecycle: a "mistake" committed by an analyst in the early stages of the software development lifecycle is transformed into a "defect" in the final stage of the cycle. Different stages of a "mistake" in the entire cycle may be described as "mistakes", "anomalies", "faults", "failures", "errors", "exceptions", "crashes", "glitches", "bugs", "defects", "incidents", or "side effects".

Prevention

The software industry has put much effort into reducing bug counts. These efforts include the following:

Typographical errors

Bugs usually appear when the programmer makes a logic error. Various innovations in programming style and defensive programming are designed to make these bugs less likely, or easier to spot. Some typos, especially of symbols or logical/mathematical operators, allow the program to operate incorrectly, while others such as a missing symbol or misspelled name may prevent the program from operating. Compiled languages can reveal some typos when the source code is compiled.

Development methodologies

Several schemes assist in managing programmer activity so that fewer bugs are produced. Software engineering (which addresses software design issues as well) applies many techniques to prevent defects. For example, formal program specifications state the exact behavior of programs so that design bugs may be eliminated. Unfortunately, formal specifications are impractical for anything but the shortest programs, because of problems of combinatorial explosion and indeterminacy.

Unit testing involves writing a test for every function (unit) that a program is to perform.

In test-driven development, unit tests are written before the code, and the code is not considered complete until all tests complete successfully.
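A minimal JUnit-style sketch of the practice (the class and method names are invented): the test is written first and fails until the implementation handles every case correctly:

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class LeapYearTest {
        // Written before the implementation, per test-driven development.
        @Test
        public void centuryYearsAreNotLeapUnlessDivisibleBy400() {
            assertFalse(LeapYear.isLeap(1900)); // the naive "year % 4 == 0" rule gets this wrong
            assertTrue(LeapYear.isLeap(2000));
            assertTrue(LeapYear.isLeap(2024));
            assertFalse(LeapYear.isLeap(2023));
        }
    }

    class LeapYear {
        // Implementation written afterwards to make the test pass.
        static boolean isLeap(int y) {
            return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
        }
    }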

Agile software development involves frequent software releases with relatively small changes. Defects are revealed by user feedback.

Open source development allows anyone to examine source code. A school of thought popularized by Eric S. Raymond as Linus's law says that popular open-source software has more chance of having few or no bugs than other software, because "given enough eyeballs, all bugs are shallow". This assertion has been disputed, however: computer security specialist Elias Levy wrote that "it is easy to hide vulnerabilities in complex, little understood and undocumented source code," because, "even if people are reviewing the code, that doesn't mean they're qualified to do so." An example of this actually happening, accidentally, was the 2008 OpenSSL vulnerability in Debian.

Programming language support

Programming languages include features to help prevent bugs, such as static type systems, restricted namespaces and modular programming. For example, when a programmer writes (pseudocode) LET REAL_VALUE PI = "THREE AND A BIT", although this may be syntactically correct, the code fails a type check. Compiled languages catch this without having to run the program; interpreted languages catch such errors at runtime. Some languages deliberately exclude features that easily lead to bugs, at the expense of slower performance: the general principle is that it is almost always better to write simpler, slower code than inscrutable code that runs slightly faster, especially since maintenance cost is substantial. For example, the Java programming language does not support pointer arithmetic, and implementations of some languages such as Pascal and scripting languages often have runtime bounds checking of arrays, at least in a debugging build.
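The pseudocode above corresponds to the following Java (class name invented for illustration), where the static type checker rejects the program before it can ever run:

    public class TypeCheckDemo {
        public static void main(String[] args) {
            // Compile-time error: incompatible types,
            // String cannot be converted to double.
            double pi = "THREE AND A BIT";
        }
    }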

Code analysis

Tools for code analysis help developers by inspecting the program text beyond the compiler's capabilities to spot potential problems. Although in general the problem of finding all programming errors given a specification is not solvable, these tools exploit the fact that human programmers tend to make certain kinds of simple mistakes often when writing software.

Instrumentation

Tools to monitor the performance of the software as it is running, either specifically to find problems such as bottlenecks or to give assurance as to correct working, may be embedded in the code explicitly (perhaps as simple as a statement saying PRINT "I AM HERE"), or provided as tools. It is often a surprise to find where most of the time is taken by a piece of code, and this removal of assumptions might cause the code to be rewritten.

Testing

Software testers are people whose primary task is to find bugs, or write code to support testing. On some projects, more resources may be spent on testing than in developing the program.

Measurements during testing can provide an estimate of the number of likely bugs remaining; this becomes more reliable the longer a product is tested and developed.

Debugging

The typical bug history (GNU Classpath project data): a new bug submitted by the user is unconfirmed. Once it has been reproduced by a developer, it is a confirmed bug. Confirmed bugs are later fixed. Bugs belonging to other categories (unreproducible, will not be fixed, etc.) are usually in the minority.

Finding and fixing bugs, or debugging, is a major part of computer programming. Maurice Wilkes, an early computing pioneer, described his realization in the late 1940s that much of the rest of his life would be spent finding mistakes in his own programs.

Usually, the most difficult part of debugging is finding the bug. Once it is found, correcting it is usually relatively easy. Programs known as debuggers help programmers locate bugs by executing code line by line, watching variable values, and other features to observe program behavior. Without a debugger, code may be added so that messages or values may be written to a console or to a window or log file to trace program execution or show values.

However, even with the aid of a debugger, locating bugs is something of an art. It is not uncommon for a bug in one section of a program to cause failures in a completely different, apparently unrelated section, making it especially difficult to track (for example, an error in a graphics rendering routine causing a file I/O routine to fail).

Sometimes, a bug is not an isolated flaw, but represents an error of thinking or planning on the part of the programmer. Such logic errors require a section of the program to be overhauled or rewritten. As a part of code review, stepping through the code and imagining or transcribing the execution process may often find errors without ever reproducing the bug as such. 

More typically, the first step in locating a bug is to reproduce it reliably. Once the bug is reproducible, the programmer may use a debugger or other tool while reproducing the error to find the point at which the program went astray. 

Some bugs are revealed by inputs that may be difficult for the programmer to re-create. One cause of the Therac-25 radiation machine deaths was a bug (specifically, a race condition) that occurred only when the machine operator very rapidly entered a treatment plan; it took days of practice to become able to do this, so the bug did not manifest in testing or when the manufacturer attempted to duplicate it. Other bugs may stop occurring whenever the setup is augmented to help find the bug, such as running the program with a debugger; these are called heisenbugs (humorously named after the Heisenberg uncertainty principle). 

Since the 1990s, particularly following the Ariane 5 Flight 501 disaster, interest in automated aids to debugging rose, such as static code analysis by abstract interpretation.

Some classes of bugs have nothing to do with the code. Faulty documentation or hardware may lead to problems in system use, even though the code matches the documentation. In some cases, changes to the code eliminate the problem even though the code then no longer matches the documentation. Embedded systems frequently work around hardware bugs, since to make a new version of a ROM is much cheaper than remanufacturing the hardware, especially if they are commodity items.

Benchmark of bugs

To facilitate reproducible research on testing and debugging, researchers use curated benchmarks of bugs:
  • the Siemens benchmark
  • ManyBugs is a benchmark of 185 C bugs in nine open-source programs.
  • Defects4J is a benchmark of 341 Java bugs from five open-source projects. It contains the corresponding patches, which cover a variety of patch types.
  • BEARS is a benchmark of continuous integration build failures focusing on test failures. It has been created by monitoring builds from open-source projects on Travis CI.

Bug management

Bug management includes the process of documenting, categorizing, assigning, reproducing, correcting and releasing the corrected code. Proposed changes to software – bugs as well as enhancement requests and even entire releases – are commonly tracked and managed using bug tracking systems or issue tracking systems. The items added may be called defects, tickets, issues, or, following the agile development paradigm, stories and epics. Categories may be objective, subjective or a combination, such as version number, area of the software, severity and priority, as well as what type of issue it is, such as a feature request or a bug.

Severity

Severity is the impact the bug has on system operation. This impact may be data loss, financial loss, loss of goodwill, or wasted effort. Severity levels are not standardized, and impacts differ across industries. A crash in a video game has a totally different impact than a crash in a web browser or a real-time monitoring system. For example, bug severity levels might be "crash or hang", "no workaround" (meaning there is no way the customer can accomplish a given task), "has workaround" (meaning the user can still accomplish the task), "visual defect" (for example, a missing image or a displaced button or form element), or "documentation error". Some software publishers use more qualified severities, such as "critical", "high", "low", "blocker", or "trivial". The severity of a bug may be a separate category from its priority for fixing, and the two may be quantified and managed separately.

Priority

Priority controls where a bug falls on the list of planned changes. The priority is decided by each software producer. Priorities may be numerical, such as 1 through 5, or named, such as "critical", "high", "low", or "deferred". These rating scales may be similar or even identical to severity ratings, but are evaluated as a combination of the bug's severity with its estimated effort to fix; a bug with low severity but easy to fix may get a higher priority than a bug with moderate severity that requires excessive effort to fix. Priority ratings may be aligned with product releases, such as "critical" priority indicating all the bugs that must be fixed before the next software release.

Software releases

It is common practice to release software with known, low-priority bugs. Most big software projects maintain two lists of "known bugs" – those known to the software team, and those to be told to users. The second list informs users about bugs that are not fixed in a specific release and workarounds may be offered. Releases are of different kinds. Bugs of sufficiently high priority may warrant a special release of part of the code containing only modules with those fixes. These are known as patches. Most releases include a mixture of behavior changes and multiple bug fixes. Releases that emphasize bug fixes are known as maintenance releases. Releases that emphasize feature additions/changes are known as major releases and often have names to distinguish the new features from the old.
Reasons that a software publisher opts not to patch or even fix a particular bug include:
  • A deadline must be met and resources are insufficient to fix all bugs by the deadline.
  • The bug is already fixed in an upcoming release, and it is not of high priority.
  • The changes required to fix the bug are too costly or affect too many other components, requiring a major testing activity.
  • It may be suspected, or known, that some users are relying on the existing buggy behavior; a proposed fix may introduce a breaking change.
  • The problem is in an area that will be obsolete with an upcoming release; fixing it is unnecessary.
  • It's "not a bug". A misunderstanding has arisen between expected and perceived behavior, when such misunderstanding is not due to confusion arising from design flaws, or faulty documentation.

Types

In software development projects, a "mistake" or "fault" may be introduced at any stage. Bugs arise from oversights or misunderstandings made by a software team during specification, design, coding, data entry or documentation. For example, in a relatively simple program to alphabetize a list of words, the design might fail to consider what should happen when a word contains a hyphen. Or, when converting an abstract design into code, the coder might inadvertently create an off-by-one error and fail to sort the last word in the list. Errors may be as simple as a typing error: a "<" where a ">" was intended.

Another category of bug is the race condition, which may occur when programs have multiple components executing at the same time. If the components interact in a different order than the developer intended, they could interfere with each other and stop the program from completing its tasks. These bugs may be difficult to detect or anticipate, since they may not occur during every execution of a program.

Conceptual errors are a developer's misunderstanding of what the software must do. The resulting software may perform according to the developer's understanding, but not what is really needed. Other types:

Arithmetic

Logic

Syntax

  • Use of the wrong operator, such as performing assignment instead of an equality test. For example, in some languages x=5 will set the value of x to 5 while x==5 will check whether x is currently 5 or some other number. Interpreted languages allow such code to fail only at runtime, while compiled languages can catch such errors before testing begins (see the sketch below).
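A minimal sketch in Java (hypothetical class name): the compiler rejects the accidental assignment because an if condition must be boolean, whereas the equivalent C code compiles, assigns 5 to x, and always takes the branch:

    public class AssignVsCompare {
        public static void main(String[] args) {
            int x = 3;
            if (x == 5) {              // equality test: false here, branch skipped
                System.out.println("five");
            }
            // if (x = 5) { ... }      // rejected by the Java compiler:
            //                         // "incompatible types: int cannot be converted to boolean"
            //                         // (in C this compiles and the branch always runs)
        }
    }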

Resource

Multi-threading

Interfacing

  • Incorrect API usage.
  • Incorrect protocol implementation.
  • Incorrect hardware handling.
  • Incorrect assumptions of a particular platform.
  • Incompatible systems. A new API or communications protocol may seem to work when two systems use different versions, but errors may occur when a function or feature implemented in one version is changed or missing in another. In production systems which must run continually, shutting down the entire system for a major update may not be possible, such as in the telecommunication industry or the internet. In this case, smaller segments of a large system are upgraded individually, to minimize disruption to a large network. However, some sections could be overlooked and not upgraded, and cause compatibility errors which may be difficult to find and repair.
  • Incorrect code annotations.

Teamworking

  • Unpropagated updates; e.g., the programmer changes "myAdd" but forgets to change "mySubtract", which uses the same algorithm. These errors are mitigated by the Don't Repeat Yourself philosophy.
  • Comments out of date or incorrect: many programmers assume the comments accurately describe the code.
  • Differences between documentation and product.

Implications

The amount and type of damage a software bug may cause naturally affects decision-making, processes and policy regarding software quality. In applications such as manned space travel or automotive safety, since software flaws have the potential to cause human injury or even death, such software will have far more scrutiny and quality control than, for example, an online shopping website. In applications such as banking, where software flaws have the potential to cause serious financial damage to a bank or its customers, quality control is also more important than, say, a photo editing application. NASA's Software Assurance Technology Center managed to reduce the number of errors to fewer than 0.1 per 1000 lines of code (SLOC) but this was not felt to be feasible for projects in the business world.

Well-known bugs

A number of software bugs have become well-known, usually due to their severity: examples include various space and military aircraft crashes. Possibly the most famous bug is the Year 2000 problem, also known as the Y2K bug, in which it was feared that worldwide economic collapse would happen at the start of the year 2000 as a result of computers thinking it was 1900. (In the end, no major problems occurred.) The 2012 stock trading disruption involved one such incompatibility between the old API and a new API.

In popular culture

  • In both the 1968 novel 2001: A Space Odyssey and the corresponding 1968 film 2001: A Space Odyssey, a spaceship's onboard computer, HAL 9000, attempts to kill all its crew members. In the follow-up 1982 novel, 2010: Odyssey Two, and the accompanying 1984 film, 2010, it is revealed that this action was caused by the computer having been programmed with two conflicting objectives: to fully disclose all its information, and to keep the true purpose of the flight secret from the crew; this conflict caused HAL to become paranoid and eventually homicidal.
  • In the 1999 American comedy Office Space, three employees attempt to exploit their company's preoccupation with fixing the Y2K computer bug by infecting the company's computer system with a virus that sends rounded off pennies to a separate bank account. The plan backfires as the virus itself has its own bug, which sends large amounts of money to the account prematurely.
  • The 2004 novel The Bug, by Ellen Ullman, is about a programmer's attempt to find an elusive bug in a database application.
  • The 2008 Canadian film Control Alt Delete is about a computer programmer at the end of 1999 struggling to fix bugs at his company related to the year 2000 problem.

Glitch

From Wikipedia, the free encyclopedia

A glitch is a short-lived fault in a system, such as a transient fault that corrects itself, making it difficult to troubleshoot. The term is particularly common in the computing and electronics industries, in circuit bending, as well as among players of video games. More generally, all types of systems including human organizations and nature experience glitches.

A glitch, which is slight and often temporary, differs from a more serious bug, which is a genuine, functionality-breaking problem. Alex Pieschel, writing for Arcade Review, said: "“bug” is often cast as the weightier and more blameworthy pejorative, while “glitch” suggests something more mysterious and unknowable inflicted by surprise inputs or stuff outside the realm of code."

Etymology

Some reference books, including Random House's American Slang, claim that the term comes from the German word glitschen ("to slip") and the Yiddish word gletshn ("to slide or skid"). Either way, it is a relatively new term. It was first widely defined for the American people by Bennett Cerf on the June 20, 1965 episode of What's My Line as "a kink... when anything goes wrong down there [Cape Kennedy], they say there's been a slight glitch." Astronaut John Glenn explained the term in his section of the book Into Orbit, writing that
Another term we adopted to describe some of our problems was "glitch." Literally, a glitch is a spike or change in voltage in an electrical circuit which takes place when the circuit suddenly has a new load put on it. You have probably noticed a dimming of lights in your home when you turn a switch or start the dryer or the television set. Normally, these changes in voltage are protected by fuses. A glitch, however, is such a minute change in voltage that no fuse could protect against it.
John Daily further defined the word on the July 4, 1965, episode of the same show, saying that it is a term used by the Air Force at Cape Kennedy in the process of launching rockets: "it means something's gone wrong and you can't figure out what it is so you call it a 'glitch'." Later, on July 23, 1965, Time Magazine felt it necessary to define it in an article: "Glitches—a spaceman's word for irritating disturbances." Following the Time Magazine reference, the term is believed to have entered common usage during the American Space Race of the 1950s, where it was used to describe minor faults in rocket hardware that were difficult to pinpoint.

Electronics glitch

An electronics glitch or logic hazard is a transition that occurs on a signal before the signal settles to its intended value, particularly in a digital circuit. Generally, this implies an electrical pulse of short duration, often due to a race condition between two signals derived from a common source but with different delays. In some cases, such as a well-timed synchronous circuit, this could be a harmless and well-tolerated effect that occurs normally in a design. In other contexts, a glitch can be an undesirable result of a fault or design error that produces a malfunction. Some electronic components, such as flip-flops, are triggered by a pulse that must not be shorter than a specified minimum duration in order to function correctly; a pulse shorter than the specified minimum may be called a glitch. Related concepts are the runt pulse, a pulse whose amplitude is smaller than the minimum level specified for correct operation, and the spike, a short pulse similar to a glitch but often caused by ringing or crosstalk.

Computer glitch

A computer glitch is the failure of a system, usually containing a computing device, to complete its functions or to perform them properly. 

In public declarations, "glitch" is used to suggest a minor fault which will soon be rectified, and is therefore used as a euphemism for a bug, since calling something a bug amounts to a factual statement that a programming fault is to blame for a system failure.

It frequently refers to an error which is not detected at the time it occurs but shows up later in data errors or incorrect human decisions. Situations frequently called computer glitches include incorrectly written software (software bugs), incorrect instructions given by the operator (operator errors; a failure to account for this possibility might also be considered a software bug), undetected invalid input data (which might also be considered a software bug), undetected communications errors, computer viruses, Trojan attacks, and computer exploits (sometimes called "hacking").

Such glitches could produce problems such as keyboard malfunction, number key failures, screen abnormalities (turned left, right or upside-down), random program malfunctions, and abnormal program registering. 

Examples of computer glitches causing disruption include an unexpected shutdown of a water filtration plant in New Canaan in 2010, failures in the computer-aided dispatch system used by the police in Austin that resulted in unanswered 911 calls, and an unexpected bit flip causing the Cassini spacecraft to enter "safe mode" in November 2010. Glitches can also be costly: in 2015, a bank was unable to raise interest rates for weeks, resulting in losses of more than a million dollars per day.

Video game glitches

The start-up screen of the Virtual Boy is affected by a visual glitch

Glitches, or bugs, are software errors that can cause drastic problems within the code, and typically go unnoticed or unresolved during the production of said software. These errors can break the game or be exploited until a developer or development team repairs them with patches. Complex software is rarely bug-free or otherwise free from errors upon first release.

Texture/model glitches are a kind of bug or other error that causes a specific model or texture to become distorted or otherwise not look as intended by the developers. Bethesda's The Elder Scrolls V: Skyrim is notorious for texture glitches, as well as other errors that affect many of the company's popular titles. Many games that use ragdoll physics for their character models can have such glitches happen to them.

Physics glitches are errors in a game's physics engine that cause a specific entity, be it a physics object or an NPC (non-player character), to be unintentionally moved to some degree. Unlike many other kinds of errors, these can often be exploited. The chance of a physics error happening can be entirely random, or it can be caused accidentally.

Sound glitches are errors in a game's audio. These can range from sounds playing when not intended to sounds not playing at all. Occasionally, a certain sound will loop, or the player will be given the option to continuously play a sound when not intended. Often, games will play sounds incorrectly due to corrupt data altering the values predefined in the code. Examples include, but are not limited to, extremely high- or low-pitched sounds, volume that is muted or too loud to be intelligible, and, more rarely, sounds playing in reverse.

Glitches may include incorrectly displayed graphics, collision detection errors, game freezes/crashes, sound errors, and other issues. Graphical glitches are especially notorious in platforming games, where malformed textures can directly affect gameplay (for example, by displaying a ground texture where the code calls for an area that should damage the character, or by not displaying a wall texture where there should be one, resulting in an invisible wall). Some glitches are potentially dangerous to the game's stored data.

"Glitching" is the practice of players exploiting faults in a video game's programming to achieve tasks that give them an unfair advantage in the game, over NPC's or other players, such as running through walls or defying the game's physics. Glitches can be deliberately induced in certain home video game consoles by manipulating the game medium, such as tilting a ROM cartridge to disconnect one or more connections along the edge connector and interrupt part of the flow of data between the cartridge and the console. This can result in graphic, music, or gameplay errors. Doing this, however, carries the risk of crashing the game or even causing permanent damage to the game medium.

Glitches are often used heavily when performing a speedrun of a video game. One type of glitch often used for speedrunning is the stack overflow, referred to as "overflowing". Another type of speedrunning glitch, which is almost impossible for humans to perform and is mostly used in tool-assisted speedruns, is arbitrary code execution, which causes an object in a game to do something outside of its intended function.

Part of the quality assurance process (as performed by game testers for video games) is locating and reproducing glitches, and then compiling reports on the glitches to be fed back to the programmers so that they can repair the bugs. Certain games have a cloud-type system for updates to the software that can be used to repair coding faults and other errors in the games.

Glitches can also be found in electronic toys. For example, in 2013, Hasbro released a game called Bop It Beats. Several players discovered that the DJ Expert and Lights Only modes have a bug that gives players a fail sound upon reaching a pattern with six actions and completing them successfully. The more difficult DJ modes can be completed in the Party mode as long as there is a "Pass It" on the last few patterns. Hasbro was informed about this glitch, but as it was discovered after manufacture, existing units could no longer be updated or upgraded. Foreign versions of the game, however, were shipped with this glitch already patched.

Glitches in games should not be confused with exploits. Although both involve unintended behavior, an exploit is not a programming error but an oversight by the developers (e.g., bunny hopping or lag exploits).

Television glitch

In broadcasting, a corrupted signal may glitch in the form of jagged lines on the screen, misplaced squares, static-looking effects, freezing problems, or inverted colors. The glitches may affect the video, the audio (usually as audio dropout), or the transmission. These glitches may be caused by a variety of issues, such as interference from portable electronics or microwaves, damaged cables at the broadcasting center, or weather.

In popular culture

Multiple works of popular culture deal with glitches; those with the word "glitch" or derivations thereof are detailed in Glitch (disambiguation).
  • The nonfiction book CB Bible (1976) includes glitch in its glossary of citizens band radio slang, defining it as "an indefinable technical defect in CB equipment", indicating the term was already then in use on citizens band.
  • The short film The Glitch (2008), opening film and best science fiction finalist at Dragon Con Independent Film Festival 2008, deals with the disorientation of late-night TV viewer Harry Owen (Scott Charles Blamphin), who experiences 'heavy brain-splitting digital breakdowns.'

Soft error

From Wikipedia, the free encyclopedia

In electronics and computing, a soft error is a type of error where a signal or datum is wrong. Errors may be caused by a defect, usually understood either as a mistake in design or construction or as a broken component. A soft error, by contrast, is a signal or datum which is wrong but is not assumed to imply such a mistake or breakage; after observing a soft error, there is no implication that the system is any less reliable than before. One cause of soft errors is single-event upsets from cosmic rays.

In a computer's memory system, a soft error changes an instruction in a program or a data value. Soft errors typically can be remedied by cold booting the computer. A soft error will not damage a system's hardware; the only damage is to the data that is being processed.

There are two types of soft errors: chip-level soft errors and system-level soft errors. Chip-level soft errors occur when particles hit the chip, e.g., when secondary particles from cosmic rays land on the silicon die. If a particle with certain properties hits a memory cell, it can cause the cell to change state to a different value. The atomic reaction in this example is so tiny that it does not damage the physical structure of the chip. System-level soft errors occur when the data being processed is hit by a noise phenomenon, typically while the data is on a data bus. The computer tries to interpret the noise as a data bit, which can cause errors in addressing or processing program code. The bad data bit can even be saved in memory and cause problems at a later time.

If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems use error correction to correct soft errors on the fly. However, in many systems, it may be impossible to determine the correct data, or even to discover that an error is present at all. In addition, before the correction can occur, the system may have crashed, in which case the recovery procedure must include a reboot. Soft errors involve changes to data (the electrons in a storage circuit, for example) but not changes to the physical circuit itself (the atoms). If the data is rewritten, the circuit will work perfectly again. Soft errors can occur on transmission lines, in digital logic, analog circuits, magnetic storage, and elsewhere, but are most commonly known in semiconductor storage.
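As a sketch of the principle (a minimal Hamming(7,4) code, not a production ECC implementation; real memory systems typically use wider codes with extra detection capability), any single flipped bit in a stored 4-bit value can be located and corrected on read:

    import java.util.Arrays;

    public class Hamming74 {
        // Encode 4 data bits (d[0..3]) into 7 code bits at positions 1..7 (index 0 unused).
        static int[] encode(int[] d) {
            int[] c = new int[8];
            c[3] = d[0]; c[5] = d[1]; c[6] = d[2]; c[7] = d[3];
            c[1] = c[3] ^ c[5] ^ c[7]; // parity over positions whose index has bit 0 set
            c[2] = c[3] ^ c[6] ^ c[7]; // parity over positions whose index has bit 1 set
            c[4] = c[5] ^ c[6] ^ c[7]; // parity over positions whose index has bit 2 set
            return c;
        }

        // Compute the syndrome, correct a single-bit error in place, return the data bits.
        static int[] decode(int[] c) {
            int s = (c[1] ^ c[3] ^ c[5] ^ c[7])
                  | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1
                  | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2;
            if (s != 0) c[s] ^= 1; // a nonzero syndrome names the flipped position
            return new int[] { c[3], c[5], c[6], c[7] };
        }

        public static void main(String[] args) {
            int[] code = encode(new int[] { 1, 0, 1, 1 });
            code[6] ^= 1; // simulate a soft error: one bit flips in storage
            System.out.println(Arrays.toString(decode(code))); // prints [1, 0, 1, 1]
        }
    }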

Critical charge

Whether or not a circuit experiences a soft error depends on the energy of the incoming particle, the geometry of the impact, the location of the strike, and the design of the logic circuit. Logic circuits with higher capacitance and higher logic voltages are less likely to suffer an error. This combination of capacitance and voltage is described by the critical charge parameter, Qcrit, the minimum electron charge disturbance needed to change the logic level. A higher Qcrit means fewer soft errors. Unfortunately, a higher Qcrit also means a slower logic gate and a higher power dissipation. Reduction in chip feature size and supply voltage, desirable for many reasons, decreases Qcrit. Thus, the importance of soft errors increases as chip technology advances.

In a logic circuit, Qcrit is defined as the minimum amount of induced charge required at a circuit node to cause a voltage pulse to propagate from that node to the output and be of sufficient duration and magnitude to be reliably latched. Since a logic circuit contains many nodes that may be struck, and each node may be of unique capacitance and distance from output, Qcrit is typically characterized on a per-node basis.
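To first order (a common textbook approximation, not a figure from this article's sources), the critical charge scales with the node capacitance and the supply voltage:

    Q_{\mathrm{crit}} \approx C_{\mathrm{node}} \cdot V_{\mathrm{DD}}

For example, a hypothetical 1 fF node at 1 V gives a Q_crit of roughly 1 fC, on the order of 6,000 electrons, which illustrates why shrinking feature sizes and supply voltages make upsets easier to trigger.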

Causes of soft errors

Alpha particles from package decay

Soft errors became widely known with the introduction of dynamic RAM in the 1970s. In these early devices, ceramic chip packaging materials contained small amounts of radioactive contaminants. Very low decay rates are needed to avoid excess soft errors, and chip companies have occasionally suffered problems with contamination ever since. It is extremely hard to maintain the material purity needed. Controlling alpha particle emission rates for critical packaging materials to less than a level of 0.001 counts per hour per cm2 (cph/cm2) is required for reliable performance of most circuits. For comparison, the count rate of a typical shoe's sole is between 0.1 and 10 cph/cm2.

Package radioactive decay usually causes a soft error by alpha particle emission. The positively charged alpha particle travels through the semiconductor and disturbs the distribution of electrons there. If the disturbance is large enough, a digital signal can change from a 0 to a 1 or vice versa. In combinational logic, this effect is transient, perhaps lasting a fraction of a nanosecond, which is why soft errors in combinational logic have mostly gone unnoticed. In sequential logic such as latches and RAM, even this transient upset can become stored for an indefinite time, to be read out later. Thus, designers are usually much more aware of the problem in storage circuits.

A 2011 Black Hat paper discusses the real-life security implications of such bit flips in the Internet's DNS system. The paper found up to 3,434 incorrect requests per day due to bit-flip changes for various common domains. Many of these bit flips would probably be attributable to hardware problems, but some could be attributed to alpha particles. These bit-flip errors may be taken advantage of by malicious actors in the form of bitsquatting.

Isaac Asimov received a letter congratulating him on an accidental prediction of alpha-particle RAM errors in a 1950s novel.

Cosmic rays creating energetic neutrons and protons

Once the electronics industry had determined how to control package contaminants, it became clear that other causes were also at work. James F. Ziegler led a program of work at IBM which culminated in the publication of a number of papers (Ziegler and Lanford, 1979) demonstrating that cosmic rays could also cause soft errors. Indeed, in modern devices, cosmic rays may be the predominant cause. Although the primary particle of a cosmic ray does not generally reach the Earth's surface, it creates a shower of energetic secondary particles. At the Earth's surface, approximately 95% of the particles capable of causing soft errors are energetic neutrons, with the remainder composed of protons and pions. IBM estimated in 1996 that one error per month per 256 MiB of RAM was expected for a desktop computer. This flux of energetic neutrons is typically referred to as "cosmic rays" in the soft error literature. Neutrons are uncharged and cannot disturb a circuit on their own, but they undergo neutron capture by the nucleus of an atom in a chip. This process may result in the production of charged secondaries, such as alpha particles and oxygen nuclei, which can then cause soft errors.

Cosmic ray flux depends on altitude. For the common reference location of 40.7° N, 74° W at sea level (New York City, NY, USA) the flux is approximately 14 neutrons/cm²/hour. Burying a system in a cave reduces the rate of cosmic-ray-induced soft errors to a negligible level. In the lower levels of the atmosphere, the flux increases by a factor of about 2.2 for every 1000 m (1.3 for every 1000 ft) of altitude above sea level. Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level, and the rate of upsets in aircraft may be more than 300 times the sea-level rate. This is in contrast to package-decay-induced soft errors, which do not change with location. As chip density increases, Intel expects the errors caused by cosmic rays to increase and become a limiting factor in design.
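
A minimal sketch of this altitude scaling, assuming the quoted sea-level reference of about 14 neutrons/cm²/hour and the factor of 2.2 per 1000 m (the model and names are illustrative):

    # Approximate cosmic-ray neutron flux vs. altitude, per the figures above.
    SEA_LEVEL_FLUX = 14.0   # neutrons / cm^2 / hour at sea level (New York City)
    FACTOR_PER_KM = 2.2     # quoted increase per 1000 m in the lower atmosphere

    def neutron_flux(altitude_m: float) -> float:
        """Estimated neutron flux (n/cm^2/h) at the given altitude in metres."""
        return SEA_LEVEL_FLUX * FACTOR_PER_KM ** (altitude_m / 1000.0)

    print(neutron_flux(0))      # ~14 at sea level
    print(neutron_flux(3000))   # ~149 on a 3000 m peak, roughly 10x sea level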

The average rate of cosmic-ray soft errors is inversely proportional to sunspot activity. That is, the average number of cosmic-ray soft errors decreases during the active portion of the sunspot cycle and increases during the quiet portion. This counter-intuitive result occurs for two reasons. First, the Sun does not generally produce cosmic-ray particles with energy above 1 GeV that are capable of penetrating to the Earth's upper atmosphere and creating particle showers, so the changes in the solar flux do not directly influence the number of errors. Second, the increase in the solar flux during an active sun period reshapes the Earth's magnetic field, providing some additional shielding against higher-energy cosmic rays and resulting in a decrease in the number of particles creating showers. The effect is fairly small in any case, resulting in a ±7% modulation of the energetic neutron flux in New York City. Other locations are similarly affected.

One experiment measured the soft error rate at sea level to be 5,950 failures in time (FIT, failures per billion device-hours) per DRAM chip. When the same test setup was moved to an underground vault, shielded by over 50 feet (15 m) of rock that effectively eliminated all cosmic rays, zero soft errors were recorded. In this test, all other causes of soft errors were too small to measure compared to the error rate caused by cosmic rays.

Energetic neutrons produced by cosmic rays may lose most of their kinetic energy and reach thermal equilibrium with their surroundings as they are scattered by materials. The resulting neutrons are referred to as thermal neutrons and have an average kinetic energy of about 25 milli-electron-volts at 25 °C. Thermal neutrons are also produced by environmental radiation sources such as the decay of naturally occurring uranium or thorium. The thermal neutron flux from sources other than cosmic-ray showers may still be noticeable in an underground location and can be an important contributor to soft errors for some circuits.
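
The quoted figure of about 25 meV follows from the thermal energy relation E ≈ kB·T. A minimal check in Python (the constant is the standard Boltzmann constant in eV/K):

    # Average thermal energy at room temperature: E ~ kB * T.
    KB_EV_PER_K = 8.617e-5            # Boltzmann constant in eV/K
    t_kelvin = 25 + 273.15            # 25 degrees Celsius
    energy_mev = KB_EV_PER_K * t_kelvin * 1000   # in milli-electron-volts
    print(f"{energy_mev:.1f} meV")    # ~25.7 meV, matching the figure above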

Thermal neutrons

Neutrons that have lost kinetic energy until they are in thermal equilibrium with their surroundings are an important cause of soft errors for some circuits. At low energies many neutron capture reactions become much more probable and result in fission of certain materials, creating charged secondaries as fission by-products. For some circuits the capture of a thermal neutron by the nucleus of the 10B isotope of boron is particularly important. This nuclear reaction is an efficient producer of an alpha particle, a 7Li nucleus, and a gamma ray. Either of the charged particles (alpha or 7Li) may cause a soft error if produced in very close proximity, approximately 5 µm, to a critical circuit node. The capture cross section of 11B is six orders of magnitude smaller, so 11B does not contribute to soft errors.
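
In its dominant branch, the reaction can be written as follows; the approximate particle energies are standard nuclear data, quoted here for illustration:

    10B + n (thermal) → 7Li (≈0.84 MeV) + 4He (alpha, ≈1.47 MeV) + γ (≈0.48 MeV)

Both charged products deposit their energy within a few micrometres of the capture site, which is why only captures very close to a circuit node cause upsets.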

Boron has been used in BPSG (borophosphosilicate glass), the insulator in the interconnection layers of integrated circuits, particularly in the lowest one. The inclusion of boron lowers the melting temperature of the glass, providing better reflow and planarization characteristics. In this application the glass is formulated with a boron content of 4% to 5% by weight. Naturally occurring boron is 20% 10B, with the remainder being the 11B isotope. Soft errors are caused by the high level of 10B in this critical lower layer of some older integrated-circuit processes. Boron-11, used at low concentrations as a p-type dopant, does not contribute to soft errors. Integrated circuit manufacturers eliminated borated dielectrics by the time individual circuit components decreased in size to 150 nm, largely because of this problem.

In critical designs, depleted boron, consisting almost entirely of boron-11, is used to avoid this effect and therefore reduce the soft error rate. Boron-11 is a by-product of the nuclear industry.

For applications in medical electronic devices this soft error mechanism may be extremely important. Neutrons are produced during high-energy cancer radiation therapy using photon beam energies above 10 MeV. These neutrons are moderated as they scatter off the equipment and walls of the treatment room, resulting in a thermal neutron flux that is about 40 × 10⁶ times higher than the normal environmental neutron flux. This high thermal neutron flux will generally result in a very high rate of soft errors and consequent circuit upsets.

Other causes

Soft errors can also be caused by random noise or signal integrity problems, such as inductive or capacitive crosstalk. However, in general, these sources represent a small contribution to the overall soft error rate when compared to radiation effects.

Some tests have concluded that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Because of the high cell density in modern memory, repeatedly accessing data stored in DRAM can cause memory cells to leak their charges and interact electrically, altering the contents of nearby memory rows that were not addressed in the original memory access. This effect is known as row hammer, and it has been used in some privilege-escalation computer security exploits.

Designing around soft errors

Soft error mitigation

A designer can attempt to minimize the rate of soft errors by judicious device design, choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The susceptibility of devices to upsets is described in the industry using the JEDEC JESD-89 standard. 

One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This involves increasing the capacitance at selected circuit nodes in order to increase their effective Qcrit values, which reduces the range of particle energies that can upset the node's logic value. Radiation hardening is often accomplished by increasing the size of transistors that share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive to a design, the technique is often applied selectively to nodes that are predicted to have the highest probability of causing soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.

Detecting soft errors

There has been work addressing soft errors in processor and memory resources using both hardware and software techniques. Several research efforts have addressed soft errors by proposing error detection and recovery via hardware-based redundant multi-threading. These approaches use special hardware to replicate an application's execution and identify errors in the output, which increases hardware design complexity and cost and incurs high performance overhead. Software-based soft-error-tolerant schemes, on the other hand, are flexible and can be applied on commercial off-the-shelf microprocessors. Many works propose compiler-level instruction replication and result checking for soft error detection.
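
A minimal Python sketch of the software-only idea, replicating a whole computation and checking results rather than replicating individual instructions as a compiler would; the decorator and names are illustrative, not from any specific tool:

    # Illustrative software-level redundancy: run a computation twice and
    # compare the results; a mismatch indicates a possible soft error.
    def redundant(fn):
        def wrapper(*args, **kwargs):
            first = fn(*args, **kwargs)
            second = fn(*args, **kwargs)
            if first != second:
                raise RuntimeError("soft error detected: replica results differ")
            return first
        return wrapper

    @redundant
    def checksum(data):
        return sum(data) & 0xFFFFFFFF

    print(checksum([1, 2, 3]))  # returns 6; raises if the replicas disagree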

Correcting soft errors

Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and correction to recover gracefully. Typically, a semiconductor memory design might use forward error correction, incorporating redundant data into each word to create an error correcting code. Alternatively, roll-back error correction can be used, detecting the soft error with an error-detecting code such as parity and rewriting correct data from another source. This technique is often used for write-through cache memories.
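
A minimal sketch of the roll-back scheme under the stated write-through assumption, i.e., a clean copy of every word is always available in the backing store; all names and values here are illustrative:

    # Roll-back correction sketch: detect with parity, recover by re-reading
    # the clean copy from the backing store (as in a write-through cache).
    def parity(word: int) -> int:
        return bin(word).count("1") & 1

    backing_store = {0x10: 0b1011}               # always-correct copy
    cache = {0x10: (0b1011, parity(0b1011))}     # (data, stored parity bit)

    def read(addr: int) -> int:
        data, stored_parity = cache[addr]
        if parity(data) != stored_parity:        # soft error detected
            data = backing_store[addr]           # roll back: re-fetch clean data
            cache[addr] = (data, parity(data))
        return data

    cache[0x10] = (0b1010, cache[0x10][1])   # simulate a single-bit flip
    print(bin(read(0x10)))                   # recovers 0b1011 from the store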

Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault-tolerant design. These often include the use of redundant circuitry or computation of data, and typically come at the cost of circuit area, decreased performance, and/or higher power consumption. The concept of triple modular redundancy (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel and their outputs are fed into majority voting logic, which returns the value that occurred in at least two of the three cases. In this way, the failure of one circuit due to a soft error is discarded, assuming the other two circuits operated correctly. In practice, however, few designers can afford the greater than 200% circuit area and power overhead required, so it is usually applied only selectively. Another common concept for correcting soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and compares subsequent evaluations for consistency. This approach often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.
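
A minimal sketch of the majority voter at the heart of TMR, assuming at most one of the three copies has failed (the function name is illustrative):

    # Triple modular redundancy: three copies compute in parallel; the voter
    # returns the value produced by at least two of the three copies.
    def tmr_vote(a, b, c):
        # If a agrees with either other copy, a is the majority value;
        # otherwise a is the (single) faulty copy and b == c.
        return a if a == b or a == c else b

    # A soft error in one copy is outvoted by the other two:
    print(tmr_vote(42, 42, 7))   # -> 42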

Traditionally, DRAM has received the most attention in the quest to reduce or work around soft errors, because DRAM has comprised the majority of susceptible device surface area in desktop and server computer systems (hence the prevalence of ECC RAM in server computers). Hard figures for DRAM susceptibility are difficult to come by and vary considerably across designs, fabrication processes, and manufacturers. In 1980s-technology 256-kilobit DRAMs, a single alpha particle could flip a cluster of five or six bits. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.

The design of error detection and correction circuits is helped by the fact that soft errors usually are localised to a very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip. So, even a multi-cell upset leads to only a number of separate single-bit upsets in multiple correction words, rather than a multi-bit upset in a single correction word. So, an error correcting code needs only to cope with a single bit in error in each correction word in order to cope with all likely soft errors. The term 'multi-cell' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in. 'Multi-bit' is used when multiple bits in a single correction word are in error.
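
A minimal sketch of this layout idea: with N-way interleaving, physically adjacent columns map to different correction words, so a multi-cell upset appears as single-bit errors in several words (the mapping is illustrative):

    # Bit interleaving sketch: physical column i belongs to correction word
    # (i % N) at bit position (i // N), so adjacent cells hit by one particle
    # land in different correction words.
    N_WORDS = 4   # degree of interleaving (assumed example value)

    def cell_to_word(column: int) -> tuple[int, int]:
        return column % N_WORDS, column // N_WORDS   # (word index, bit index)

    # A multi-cell upset spanning columns 10-12 touches three different words:
    print([cell_to_word(c) for c in (10, 11, 12)])
    # -> [(2, 2), (3, 2), (0, 3)]: one bit in each of words 2, 3 and 0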

Soft errors in combinational logic

The three natural masking effects in combinational logic that determine whether a single-event upset (SEU) will propagate to become a soft error are electrical masking, logical masking, and temporal (or timing-window) masking. An SEU is logically masked if its propagation is blocked from reaching an output latch because off-path gate inputs prevent a logical transition of that gate's output. An SEU is electrically masked if the signal is attenuated by the electrical properties of gates on its propagation path such that the resulting pulse is of insufficient magnitude to be reliably latched. An SEU is temporally masked if the erroneous pulse reaches an output latch but does not arrive close enough to the moment when the latch is actually triggered to be held.
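
As a toy illustration of logical masking, consider a two-input AND gate whose off-path input is held at 0 (purely illustrative):

    # Logical masking: with the off-path input at 0, an SEU that flips the
    # other input of an AND gate cannot change the gate's output.
    def and_gate(a: int, b: int) -> int:
        return a & b

    off_path = 0
    print(and_gate(1, off_path))   # 0
    print(and_gate(0, off_path))   # still 0: the upset on 'a' is masked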

If all three masking effects fail to occur, the propagated pulse becomes latched and the output of the logic circuit will be an erroneous value. In the context of circuit operation, this erroneous output value may be considered a soft error event. However, from a microarchitectural standpoint, the affected result may not change the output of the currently executing program: the erroneous data could be overwritten before use, masked in subsequent logic operations, or simply never used. If erroneous data do not affect the output of a program, this is considered to be an example of microarchitectural masking.

Soft error rate

Soft error rate (SER) is the rate at which a device or system encounters or is predicted to encounter soft errors. It is typically expressed as either the number of failures in time (FIT) or mean time between failures (MTBF). One FIT is equivalent to one error per billion hours of device operation. MTBF is usually given in years of device operation; to put the two units into perspective, a device with a rate of 1 FIT has an MTBF of approximately 1,000,000,000 / (24 × 365.25) ≈ 114,077 years.
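
A minimal sketch of the conversion, reusing the 5,950 FIT per-chip figure from the sea-level DRAM experiment described earlier (the helper names are illustrative):

    # FIT = failures per 10^9 device-hours;
    # MTBF in years = 10^9 / (FIT * hours per year).
    HOURS_PER_YEAR = 24 * 365.25

    def fit_to_mtbf_years(fit: float) -> float:
        return 1e9 / (fit * HOURS_PER_YEAR)

    print(fit_to_mtbf_years(1))      # ~114,077 years for a 1 FIT device
    print(fit_to_mtbf_years(5950))   # ~19.2 years for the DRAM chip above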

While many electronic systems have an MTBF that exceeds the expected lifetime of the circuit, the SER may still be unacceptable to the manufacturer or customer. For instance, many failures per million circuits due to soft errors can be expected in the field if the system does not have adequate soft error protection. The failure of even a few products in the field, particularly if catastrophic, can tarnish the reputation of the product and the company that designed it. Also, in safety- or cost-critical applications where the cost of system failure far outweighs the cost of the system itself, a 1% chance of soft error failure per lifetime may be too high to be acceptable to the customer. Therefore, it is advantageous to design for low SER when manufacturing a system in high volume or requiring extremely high reliability.
