In electronics and computing, a soft error is a type of error in which a signal or datum is wrong without implying a defect, that is, a mistake in design or construction or a broken component. This distinguishes soft errors from hard errors, which are caused by such a defect. After observing a soft error, there is no implication that the system is any less reliable than before. One cause of soft errors is single event upsets from cosmic rays.
In a computer's memory system, a soft error changes an instruction in a program or a data value. Soft errors typically can be remedied by cold booting the computer. A soft error will not damage a system's hardware; the only damage is to the data that is being processed.
There are two types of soft errors: chip-level soft errors and system-level soft errors. Chip-level soft errors occur when particles hit the chip, e.g., when secondary particles from cosmic rays land on the silicon die. If a particle with certain properties hits a memory cell, it can cause the cell to change state to a different value. The nuclear reaction in this example is so small in scale that it does not damage the physical structure of the chip. System-level soft errors occur when the data being processed is hit by a noise phenomenon, typically when the data is on a data bus. The computer tries to interpret the noise as a data bit, which can cause errors in addressing or processing program code. The bad data bit can even be saved in memory and cause problems at a later time.
If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems use error correction to correct soft errors on the fly. However, in many systems, it may be impossible to determine the correct data, or even to discover that an error is present at all. In addition, before the correction can occur, the system may have crashed, in which case the recovery procedure must include a reboot. Soft errors involve changes to data—the electrons in a storage circuit, for example—but not changes to the physical circuit itself, the atoms. If the data is rewritten, the circuit will work perfectly again. Soft errors can occur on transmission lines, in digital logic, analog circuits, magnetic storage, and elsewhere, but are most commonly known in semiconductor storage.
Critical charge
Whether or not a circuit experiences a soft error depends on the energy of the incoming particle, the geometry of the impact, the location of the strike, and the design of the logic circuit. Logic circuits with higher capacitance and higher logic voltages are less likely to suffer an error. This combination of capacitance and voltage is described by the critical charge parameter, Qcrit, the minimum electron charge disturbance needed to change the logic level. A higher Qcrit means fewer soft errors. Unfortunately, a higher Qcrit also means a slower logic gate and higher power dissipation. Reduction in chip feature size and supply voltage, desirable for many reasons, decreases Qcrit. Thus, the importance of soft errors increases as chip technology advances.
In a logic circuit, Qcrit is defined as the minimum amount of induced charge required at a circuit node to cause a voltage pulse to propagate from that node to the output and be of sufficient duration and magnitude to be reliably latched. Since a logic circuit contains many nodes that may be struck, and each node may be of unique capacitance and distance from the output, Qcrit is typically characterized on a per-node basis.
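As a first-order illustration of the relationship, the critical charge of a node can be approximated as Q = C·V, and the charge deposited by a particle strike estimated from the energy it loses in silicon (about one electron-hole pair per 3.6 eV). The C sketch below compares the two using hypothetical node parameters; real Qcrit characterization is done per node with circuit simulation, as described above.

```c
#include <stdio.h>

/* First-order soft-error estimate for a single node.
 * Assumption: Qcrit ~ C_node * V_dd, and a particle depositing
 * E (MeV) in silicon creates one electron-hole pair per ~3.6 eV.
 * The node parameters below are hypothetical examples. */
int main(void) {
    const double c_node = 2e-15;       /* node capacitance: 2 fF */
    const double v_dd = 1.0;           /* supply voltage: 1.0 V  */
    const double e_dep_mev = 0.5;      /* energy deposited at the node */

    const double q_e = 1.602e-19;      /* electron charge, coulombs */
    double q_crit = c_node * v_dd;     /* critical charge, coulombs */
    /* pairs generated = E / 3.6 eV; charge = pairs * q_e */
    double q_dep = (e_dep_mev * 1e6 / 3.6) * q_e;

    printf("Qcrit = %.2f fC, deposited = %.2f fC -> %s\n",
           q_crit * 1e15, q_dep * 1e15,
           q_dep > q_crit ? "possible upset" : "no upset");
    return 0;
}
```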
Causes of soft errors
Alpha particles from package decay
Soft errors became widely known with the introduction of dynamic RAM in the 1970s. In these early devices, ceramic chip packaging materials contained small amounts of radioactive contaminants. Very low decay rates are needed to avoid excess soft errors, and chip companies have occasionally suffered problems with contamination ever since. It is extremely hard to maintain the material purity needed. Controlling alpha particle emission rates for critical packaging materials to less than a level of 0.001 counts per hour per cm² (cph/cm²) is required for reliable performance of most circuits. For comparison, the count rate of a typical shoe's sole is between 0.1 and 10 cph/cm².
Package radioactive decay usually causes a soft error by alpha particle emission. The positively charged alpha particle travels through the semiconductor and disturbs the distribution of electrons there. If the disturbance is large enough, a digital signal can change from a 0 to a 1 or vice versa. In combinational logic, this effect is transient, perhaps lasting a fraction of a nanosecond, and this has led to the challenge of soft errors in combinational logic mostly going unnoticed. In sequential logic such as latches and RAM, even this transient upset can become stored for an indefinite time, to be read out later. Thus, designers are usually much more aware of the problem in storage circuits.
A 2011 Black Hat paper discusses the real-life security implications of such bit-flips in the Internet's DNS system. The paper found up to 3,434 incorrect requests per day due to bit-flip changes for various common domains. Many of these bit-flips would probably be attributable to hardware problems, but some could be attributed to alpha particles. These bit-flip errors may be taken advantage of by malicious actors in the form of bitsquatting.
Isaac Asimov received a letter congratulating him on an accidental prediction of alpha-particle RAM errors in a 1950s novel.
Cosmic rays creating energetic neutrons and protons
Once the electronics industry had determined how to control package contaminants, it became clear that other causes were also at work. James F. Ziegler led a program of work at IBM which culminated in the publication of a number of papers (Ziegler and Lanford, 1979) demonstrating that cosmic rays also could cause soft errors. Indeed, in modern devices, cosmic rays may be the predominant cause. Although the primary particle of the cosmic ray does not generally reach the Earth's surface, it creates a shower of energetic secondary particles. At the Earth's surface approximately 95% of the particles capable of causing soft errors are energetic neutrons, with the remainder composed of protons and pions.
IBM estimated in 1996 that one error per month per 256 MiB of RAM was expected for a desktop computer.
This flux of energetic neutrons is typically referred to as "cosmic rays" in the soft error literature. Neutrons are uncharged and cannot disturb a circuit on their own, but may undergo neutron capture by the nucleus of an atom in a chip. This process may result in the production of charged secondaries, such as alpha particles and oxygen nuclei, which can then cause soft errors.
Cosmic ray flux depends on altitude. For the common reference location of 40.7° N, 74° W at sea level (New York City, NY, USA) the flux is approximately 14 neutrons/cm²/hour. Burying a system in a cave reduces the rate of cosmic-ray induced soft errors to a negligible level. In the lower levels of the atmosphere, the flux increases by a factor of about 2.2 for every 1000 m (1.3 for every 1000 ft) increase in altitude above sea level. Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The rate of upsets in aircraft may be more than 300 times the sea level upset rate. This is in contrast to package decay induced soft errors, which do not change with location.
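Taking the figures above at face value, the sea-level flux and the per-1000 m multiplier give a simple exponential model of neutron flux versus altitude. The C sketch below is a minimal illustration of that scaling for the lower atmosphere only, not a substitute for the detailed flux models used in industry.

```c
#include <stdio.h>
#include <math.h>

/* Neutron flux vs. altitude, using the figures quoted above:
 * ~14 neutrons/cm^2/hour at sea level (New York City reference),
 * increasing by a factor of ~2.2 per 1000 m in the lower atmosphere. */
double neutron_flux(double altitude_m) {
    const double sea_level_flux = 14.0;   /* neutrons/cm^2/hour */
    const double factor_per_km = 2.2;
    return sea_level_flux * pow(factor_per_km, altitude_m / 1000.0);
}

int main(void) {
    double altitudes[] = {0.0, 1000.0, 2000.0, 3000.0};
    for (int i = 0; i < 4; i++)
        printf("%6.0f m: %8.1f n/cm^2/h\n",
               altitudes[i], neutron_flux(altitudes[i]));
    return 0;
}
```

At 3000 m the model gives roughly 150 n/cm²/h, consistent with the order-of-magnitude increase quoted above for mountaintop operation.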
As chip density increases, Intel expects the errors caused by cosmic rays to increase and become a limiting factor in design.
The average rate of cosmic-ray soft errors is inversely proportional to sunspot activity. That is, the average number of cosmic-ray soft errors decreases during the active portion of the sunspot cycle and increases during the quiet portion. This counter-intuitive result occurs for two reasons. The Sun does not generally produce cosmic ray particles with energy above 1 GeV that are capable of penetrating to the Earth's upper atmosphere and creating particle showers, so the changes in the solar flux do not directly influence the number of errors. Further, the increase in the solar flux during an active sun period does have the effect of reshaping the Earth's magnetic field, providing some additional shielding against higher energy cosmic rays, resulting in a decrease in the number of particles creating showers. The effect is fairly small in any case, resulting in a ±7% modulation of the energetic neutron flux in New York City. Other locations are similarly affected.
One experiment measured the soft error rate at sea level to be 5,950 failures in time (FIT = failures per billion hours) per DRAM chip. When the same test setup was moved to an underground vault, shielded by over 50 feet (15 m) of rock that effectively eliminated all cosmic rays, zero soft errors were recorded. In this test, all other causes of soft errors were too small to be measured compared to the error rate caused by cosmic rays.
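To get a feel for what such a rate means in practice, a FIT value can be converted into an expected error count over a deployment. The sketch below applies the 5,950 FIT/chip figure above to a hypothetical memory module; the chip count and service period are illustrative assumptions.

```c
#include <stdio.h>

/* Expected soft errors from a FIT rate (failures per 10^9 device-hours).
 * The 5,950 FIT/chip figure is the sea-level measurement quoted above;
 * the module size and service period are hypothetical. */
int main(void) {
    const double fit_per_chip = 5950.0;
    const int chips = 16;               /* e.g., chips on one module */
    const double hours = 24.0 * 365.25; /* one year of operation */

    double errors_per_year = fit_per_chip * 1e-9 * chips * hours;
    printf("expected errors/module/year: %.2f\n", errors_per_year);
    /* ~0.83 errors per module-year at sea level for this example */
    return 0;
}
```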
Energetic neutrons produced by cosmic rays may lose most of their kinetic energy and reach thermal equilibrium with their surroundings as they are scattered by materials. The resulting neutrons are simply referred to as thermal neutrons and have an average kinetic energy of about 25 millielectron-volts at 25 °C. Thermal neutrons are also produced by environmental radiation sources such as the decay of naturally occurring uranium or thorium. The thermal neutron flux from sources other than cosmic-ray showers may still be noticeable in an underground location and an important contributor to soft errors for some circuits.
Thermal neutrons
Neutrons that have lost kinetic energy until they are in thermal equilibrium with their surroundings are an important cause of soft errors for some circuits. At low energies many neutron capture reactions become much more probable and result in fission of certain materials, creating charged secondaries as fission byproducts. For some circuits the capture of a thermal neutron by the nucleus of the ¹⁰B isotope of boron is particularly important. This nuclear reaction is an efficient producer of an alpha particle, a ⁷Li nucleus, and a gamma ray. Either of the charged particles (alpha or ⁷Li) may cause a soft error if produced in very close proximity, approximately 5 µm, to a critical circuit node. The capture cross section of ¹¹B is six orders of magnitude smaller and does not contribute to soft errors.
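For reference, the capture reaction can be written out explicitly. The product energies below are commonly cited values for the dominant branch and are supplied here as an assumption, not taken from the text:

```latex
% Thermal neutron capture on boron-10, dominant branch (~94%);
% the excited lithium nucleus de-excites by emitting a 0.48 MeV gamma.
n \;+\; {}^{10}\mathrm{B} \;\longrightarrow\;
  {}^{7}\mathrm{Li}\ (\approx 0.84\ \mathrm{MeV})
  \;+\; \alpha\ (\approx 1.47\ \mathrm{MeV})
  \;+\; \gamma\ (0.48\ \mathrm{MeV})
```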
Boron has been used in BPSG, the insulator in the interconnection layers of integrated circuits, particularly in the lowest one. The inclusion of boron lowers the melt temperature of the glass, providing better reflow and planarization characteristics. In this application the glass is formulated with a boron content of 4% to 5% by weight. Naturally occurring boron is 20% ¹⁰B, with the remainder the ¹¹B isotope. Soft errors are caused by the high level of ¹⁰B in this critical lower layer of some older integrated circuit processes. Boron-11, used at low concentrations as a p-type dopant, does not contribute to soft errors. Integrated circuit manufacturers eliminated borated dielectrics by the time individual circuit components decreased in size to 150 nm, largely due to this problem.
In critical designs, depleted boron, consisting almost entirely of boron-11, is used to avoid this effect and therefore to reduce the soft error rate. Boron-11 is a by-product of the nuclear industry.
For applications in medical electronic devices this soft error mechanism may be extremely important. Neutrons are produced during high-energy cancer radiation therapy using photon beam energies above 10 MeV. These neutrons are moderated as they are scattered from the equipment and walls in the treatment room, resulting in a thermal neutron flux that is about 40 × 10⁶ times higher than the normal environmental neutron flux. This high thermal neutron flux will generally result in a very high rate of soft errors and consequent circuit upset.
Other causes
Soft errors can also be caused by random noise or signal integrity problems, such as inductive or capacitive crosstalk. However, in general, these sources represent a small contribution to the overall soft error rate when compared to radiation effects.
Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, as a result of the high cell density in modern memory, altering the content of nearby memory rows that were not actually addressed in the original memory access. This effect is known as row hammer, and it has also been used in some privilege escalation computer security exploits.
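The access pattern at the heart of such tests is simple: repeatedly activate two rows of the same DRAM bank while bypassing the cache, then check a victim row for flipped bits. The C sketch below only illustrates that pattern on x86; the buffer is a placeholder, and in a real test the two addresses would have to be chosen, via knowledge of the DRAM address mapping, to land in different rows of one bank.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush */

/* Illustrative row-hammer access loop (x86). The two pointers are
 * assumed to map to different rows of the same DRAM bank; selecting
 * such addresses is outside the scope of this sketch. */
static void hammer(volatile uint8_t *row_a, volatile uint8_t *row_b,
                   long iterations) {
    for (long i = 0; i < iterations; i++) {
        (void)*row_a;                     /* activate row A */
        (void)*row_b;                     /* activate row B */
        _mm_clflush((const void *)row_a); /* force next access to DRAM */
        _mm_clflush((const void *)row_b);
    }
}

int main(void) {
    static uint8_t buf[1 << 20];  /* placeholder buffer; real tests
                                     target specific physical rows */
    hammer(&buf[0], &buf[1 << 19], 1000000L);
    return 0;
}
```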
Designing around soft errors
Soft error mitigation
A designer can attempt to minimize the rate of soft errors by judicious device design, choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The susceptibility of devices to upsets is described in the industry using the JEDEC JESD-89 standard.
One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This involves increasing the capacitance at selected circuit nodes in order to increase their effective Qcrit value. This reduces the range of particle energies that can upset the logic value of the node. Radiation hardening is often accomplished by increasing the size of transistors that share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive to design, the technique is often applied selectively to nodes which are predicted to have the highest probability of resulting in soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.
Detecting soft errors
There has been work addressing soft errors in processor and memory resources using both hardware and software techniques. Several research efforts have addressed soft errors by proposing error detection and recovery via hardware-based redundant multi-threading. These approaches used special hardware to replicate an application execution to identify errors in the output, which increased hardware design complexity and cost, including high performance overhead. Software-based soft error tolerant schemes, on the other hand, are flexible and can be applied on commercial off-the-shelf microprocessors. Many works propose compiler-level instruction replication and result checking for soft error detection.
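As a minimal sketch of the replicate-and-check idea, assuming no particular scheme from the literature: each computation is performed twice on independent copies of the inputs, and the results are compared before use; a mismatch signals a detected soft error.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal software-only soft error detection by replication:
 * compute twice on duplicated operands and compare. Real
 * compiler-level schemes duplicate at the instruction level,
 * protect the comparison itself, and must keep the compiler
 * from merging the redundant copies; this shows only the idea. */
static int checked_add(int a, int b) {
    int a2 = a, b2 = b;          /* duplicated operands */
    int r1 = a + b;              /* primary computation  */
    int r2 = a2 + b2;            /* shadow computation   */
    if (r1 != r2) {              /* results disagree: a bit flipped */
        fprintf(stderr, "soft error detected\n");
        abort();                 /* or trigger recovery/re-execution */
    }
    return r1;
}

int main(void) {
    printf("%d\n", checked_add(2, 3));
    return 0;
}
```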
Correcting soft errors
Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and correction to recover gracefully. Typically, a semiconductor memory design might use forward error correction, incorporating redundant data into each word to create an error correcting code. Alternatively, roll-back error correction can be used, detecting the soft error with an error-detecting code such as parity, and rewriting correct data from another source. This technique is often used for write-through cache memories.
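As an illustration of the roll-back approach, assuming nothing beyond ordinary even parity: a parity bit is stored with each word, and on read a mismatch flags the word so that a clean copy can be re-fetched.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Even parity over a 32-bit word: the stored parity bit records
 * whether the word has an odd number of 1 bits. A single flipped
 * bit changes the parity and is detected (but not located, so the
 * word must be re-fetched from a clean source, not corrected). */
static bool parity32(uint32_t w) {
    w ^= w >> 16; w ^= w >> 8; w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;
    return w & 1u;               /* true if odd number of 1 bits */
}

static bool check_word(uint32_t word, bool stored_parity) {
    return parity32(word) == stored_parity;
}

int main(void) {
    uint32_t word = 0x0000F00Du;
    bool p = parity32(word);     /* stored alongside the word */
    word ^= 1u << 7;             /* soft error flips bit 7 */
    puts(check_word(word, p) ? "ok" : "parity mismatch: re-fetch");
    return 0;
}
```

In a write-through cache, a failed check simply invalidates the line; the next access re-reads the unmodified copy from main memory.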
Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault-tolerant design. These often include the use of redundant circuitry or computation of data, and typically come at the cost of circuit area, decreased performance, and/or higher power consumption. The concept of triple modular redundancy (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel and outputs are fed into majority voting logic, returning the value that occurred in at least two of three cases. In this way, the failure of one circuit due to a soft error is discarded, assuming the other two circuits operated correctly. In practice, however, few designers can afford the greater than 200% circuit area and power overhead required, so it is usually only selectively applied. Another common concept to correct soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and compares subsequent evaluations for consistency. This approach, however, often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.
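A majority vote can be written directly from the TMR description above. The sketch below is a generic software illustration over three redundant 32-bit results; in hardware the same expression is realized per bit in gates.

```c
#include <stdint.h>
#include <stdio.h>

/* Bitwise 2-of-3 majority vote over three redundant copies of a
 * result. Each output bit takes the value held by at least two of
 * the inputs, so a soft error confined to any one copy is outvoted. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}

int main(void) {
    uint32_t x = 0xDEADBEEFu;
    /* copy b suffers a single-bit upset; the voter masks it */
    printf("%08X\n", (unsigned)tmr_vote(x, x ^ 0x10u, x));
    return 0;
}
```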
Traditionally, DRAM has had the most attention in the quest to reduce or work around soft errors, due to the fact that DRAM has comprised the majority share of susceptible device surface area in desktop and server computer systems (ref. the prevalence of ECC RAM in server computers). Hard figures for DRAM susceptibility are hard to come by, and vary considerably across designs, fabrication processes, and manufacturers. 1980s-technology 256-kilobit DRAMs could have clusters of five or six bits flip from a single alpha particle. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.
The design of error detection and correction circuits is helped by the fact that soft errors usually are localised to a very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip. So, even a multi-cell upset leads to only a number of separate single-bit upsets in multiple correction words, rather than a multi-bit upset in a single correction word. Therefore, an error correcting code needs only to cope with a single bit in error in each correction word in order to cope with all likely soft errors. The term 'multi-cell' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in. 'Multi-bit' is used when multiple bits in a single correction word are in error.
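The layout property described above is bit interleaving: physically adjacent cells are assigned to different correction words. A minimal sketch of one such mapping is shown below; the row length and word count are hypothetical, and real DRAM layouts are considerably more involved.

```c
#include <stdio.h>

/* Bit-interleaved layout sketch: cell i in a physical row belongs to
 * correction word (i % WORDS) at bit position (i / WORDS). Physically
 * adjacent cells therefore land in different correction words, so a
 * multi-cell upset becomes several single-bit (correctable) upsets.
 * WORDS and the row length are hypothetical values. */
#define WORDS 8

int main(void) {
    for (int cell = 0; cell < 16; cell++)
        printf("cell %2d -> word %d, bit %d\n",
               cell, cell % WORDS, cell / WORDS);
    return 0;
}
```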
Soft errors in combinational logic
The three natural masking effects in combinational logic that determine whether a single event upset (SEU) will propagate to become a soft error are electrical masking, logical masking, and temporal (or timing-window) masking. An SEU is logically masked if its propagation is blocked from reaching an output latch because off-path gate inputs prevent a logical transition of that gate's output. An SEU is electrically masked if the signal is attenuated by the electrical properties of gates on its propagation path such that the resulting pulse is of insufficient magnitude to be reliably latched. An SEU is temporally masked if the erroneous pulse reaches an output latch, but does not arrive close enough to the latch's triggering edge to be captured.
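Logical masking is easy to see in code. Taking a single AND gate as an assumed example: a transient flip on one input cannot propagate when the other (off-path) input already forces the output to 0.

```c
#include <stdio.h>

/* Logical masking at a single AND gate: a transient flip on input a
 * only propagates to the output when the off-path input b is 1. */
static int and_gate(int a, int b) { return a & b; }

int main(void) {
    int a = 0, a_flipped = 1;            /* SEU flips a from 0 to 1 */
    printf("b=1: %d -> %d (propagates)\n",
           and_gate(a, 1), and_gate(a_flipped, 1));
    printf("b=0: %d -> %d (masked)\n",
           and_gate(a, 0), and_gate(a_flipped, 0));
    return 0;
}
```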
If all three masking effects fail to occur, the propagated pulse becomes latched and the output of the logic circuit will be an erroneous value. In the context of circuit operation, this erroneous output value may be considered a soft error event. However, from a microarchitectural-level standpoint, the affected result may not change the output of the currently executing program. For instance, the erroneous data could be overwritten before use, masked in subsequent logic operations, or simply never be used. If erroneous data does not affect the output of a program, it is considered to be an example of microarchitectural masking.
Soft error rate
Soft error rate (SER) is the rate at which a device or system encounters or is predicted to encounter soft errors. It is typically expressed as either the number of failures in time (FIT) or the mean time between failures (MTBF). The unit adopted for quantifying failures in time is called FIT, which is equivalent to one error per billion hours of device operation. MTBF is usually given in years of device operation; to put it into perspective, one FIT corresponds to an MTBF of approximately 1,000,000,000 / (24 × 365.25) ≈ 114,077 years.
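The two units convert directly into one another. A minimal sketch of the conversion, using an arbitrary example rate:

```c
#include <stdio.h>

/* FIT <-> MTBF conversion. FIT is failures per 10^9 device-hours,
 * so MTBF(hours) = 1e9 / FIT. The 500 FIT input is an arbitrary
 * example value, not a figure from the text. */
int main(void) {
    const double hours_per_year = 24.0 * 365.25;
    double fit = 500.0;
    double mtbf_years = 1e9 / fit / hours_per_year;
    printf("%.0f FIT -> MTBF of %.0f years\n", fit, mtbf_years);
    /* and back: FIT = 1e9 / (MTBF_years * hours_per_year) */
    return 0;
}
```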
While many electronic systems have an MTBF that exceeds the expected lifetime of the circuit, the SER may still be unacceptable to the manufacturer or customer. For instance, many failures per million circuits due to soft errors can be expected in the field if the system does not have adequate soft error protection. The failure of even a few products in the field, particularly if catastrophic, can tarnish the reputation of the product and the company that designed it. Also, in safety- or cost-critical applications where the cost of system failure far outweighs the cost of the system itself, a 1% chance of soft error failure per lifetime may be too high to be acceptable to the customer. Therefore, it is advantageous to design for low SER when manufacturing a system in high volume or requiring extremely high reliability.
