A Medley of Potpourri

Thursday, September 1, 2022

Floating-point arithmetic

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Floating-point_arithmetic

An early electromechanical programmable computer, the Z3, included floating-point arithmetic (replica on display at Deutsches Museum in Munich).

In computing, floating-point arithmetic (FP) is arithmetic using formulaic representation of real numbers as an approximation to support a trade-off between range and precision. For this reason, floating-point computation is often used in systems with very small and very large real numbers that require fast processing times. In general, a floating-point number is represented approximately with a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:

$${\text{significand}}\times {\text{base}}^{\text{exponent}},$$

where significand is an integer, base is an integer greater than or equal to two, and exponent is also an integer. For example:

$$1.2345=\underbrace{12345}_{\text{significand}}\times \underbrace{10}_{\text{base}}{}^{\overbrace{-4}^{\text{exponent}}}.$$
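The form above can be checked with exact rational arithmetic. The following is a minimal sketch; the helper name `fp_value` is an assumption made for illustration and does not come from the article.

```python
from fractions import Fraction

def fp_value(significand: int, base: int, exponent: int) -> Fraction:
    """Exact value of significand * base**exponent as a rational number."""
    return Fraction(significand) * Fraction(base) ** exponent

# The decimal example from the text: 12345 * 10**-4
decimal_example = fp_value(12345, 10, -4)
# A binary example: 3 * 2**-1
binary_example = fp_value(3, 2, -1)

print(decimal_example, float(decimal_example))
print(binary_example, float(binary_example))
```

Using `Fraction` keeps the result exact, so the sketch is not itself affected by the rounding it is meant to illustrate.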

The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated by the exponent, and thus the floating-point representation can be thought of as a form of scientific notation.

A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude: e.g. the distance between galaxies or the diameter of an atomic nucleus can be expressed with the same unit of length. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers varies with the chosen scale.
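The non-uniform spacing can be observed directly. This sketch assumes Python 3.9 or later (for `math.ulp`) and a platform where Python's `float` is an IEEE 754 double, so the gaps shown are for double precision rather than the single precision of the figure.

```python
import math

# Sample magnitudes spanning many orders of magnitude (chosen arbitrarily).
samples = [1e-15, 1.0, 1e15, 2.0 ** 60]

# math.ulp(x) is the distance from x to the next representable double
# of larger magnitude.
gaps = [math.ulp(x) for x in samples]

for x, gap in zip(samples, gaps):
    print(f"x = {x:.3e}   gap to next double = {gap:.3e}")
```

The absolute gap grows with the magnitude of the number, while the relative gap stays roughly constant.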

Single-precision floating point numbers on a number line: the green lines mark representable values.

Augmented version above showing both signs of representable values

Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.

The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.

A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.

Overview

Floating-point numbers

A number representation specifies some way of encoding a number, usually as a string of digits.

There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might be to use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.

In scientific notation, the given number is scaled by a power of 10, so that it lies within a certain range—typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the orbital period of Jupiter's moon Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047×10⁵ seconds.

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

A signed (meaning positive or negative) digit string of a given length in a given base (or radix). This digit string is referred to as the significand, mantissa, or coefficient. The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.
A signed integer exponent (also referred to as the characteristic, or scale), which modifies the magnitude of the number.

To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.

Using base-10 (the familiar decimal notation) as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1,528,535,047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10⁵ to give 1.528535047×10⁵, or 152,853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.

Symbolically, this final value is:

$${\frac {s}{b^{\,p-1}}}\times b^{e},$$

where $s$ is the significand (ignoring any implied decimal point), $p$ is the precision (the number of digits in the significand), $b$ is the base (in our example, this is the number ten), and $e$ is the exponent.
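As a worked instance of this formula, the sketch below evaluates it for the orbital-period example. The function name `value_from_digits` is an assumption made for illustration.

```python
from fractions import Fraction

def value_from_digits(s: int, p: int, b: int, e: int) -> Fraction:
    """Evaluate (s / b**(p-1)) * b**e exactly.

    s: significand as an integer (implied radix point ignored)
    p: number of digits in the significand
    b: base
    e: exponent
    """
    return Fraction(s, b ** (p - 1)) * Fraction(b) ** e

# Significand 1,528,535,047 (ten digits), base 10, exponent 5.
io_period = value_from_digits(1528535047, 10, 10, 5)
print(float(io_period))
```

Dividing by $b^{p-1}$ places the radix point after the first digit; multiplying by $b^{e}$ then shifts it by the exponent.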

Historically, several number bases have been used for representing floating-point numbers, with base two (binary) being the most common, followed by base ten (decimal floating point), and other less common varieties, such as base sixteen (hexadecimal floating point), base eight (octal floating point), base four (quaternary floating point), base three (balanced ternary floating point) and even base 256 and base 65,536.

A floating-point number is a rational number, because it can be represented as one integer divided by another; for example 1.45×10³ is (145/100)×1000 or 145,000/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (0.2, or 2×10⁻¹). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in base 3, it is trivial (0.1 or 1×3⁻¹). The occasions on which infinite expansions occur depend on the base and its prime factors.
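The dependence on the base can be illustrated with Python's standard library. This sketch assumes that `float` is a binary (IEEE 754 double) format and uses `decimal.Decimal` with its default 28-digit context as a stand-in for a decimal floating-point format.

```python
from decimal import Decimal
from fractions import Fraction

one_fifth = Fraction(1, 5)

# Exact rational value of the binary double nearest to 0.2.
stored_binary = Fraction(0.2)
binary_is_exact = (stored_binary == one_fifth)

# The stored value is an integer over a power of two, so it cannot equal 1/5.
denominator = stored_binary.denominator

# In base ten, 0.2 is stored exactly.
decimal_is_exact = (Fraction(Decimal("0.2")) == one_fifth)

# 1/3 is inexact in both binary and decimal.
third_exact_binary = (Fraction(1 / 3) == Fraction(1, 3))
third_exact_decimal = (Fraction(Decimal(1) / Decimal(3)) == Fraction(1, 3))

print(binary_is_exact, decimal_is_exact, third_exact_binary, third_exact_decimal)
```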

The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation, $p=24$, and so the significand is a string of 24 bits. For instance, the first 33 bits of π are:

$$11001001\ 00001111\ 1101101{\underline {0}}\ 10100010\ 0.$$

In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, shown as the underlined bit 0 above. The next bit, at position 24, is called the round bit or rounding bit. It is used to round the 33-bit approximation to the nearest 24-bit number (there are specific rules for halfway values, which is not the case here). This bit, which is 1 in this example, is added to the integer formed by the leftmost 24 bits, yielding:

$$11001001\ 00001111\ 1101101{\underline {1}}.$$

When this is stored in memory using the IEEE 754 encoding, this becomes the significand $s$. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left to right as follows:

$$\begin{aligned}&\left(\sum_{n=0}^{p-1}\text{bit}_{n}\times 2^{-n}\right)\times 2^{e}\\={}&\left(1\times 2^{-0}+1\times 2^{-1}+0\times 2^{-2}+0\times 2^{-3}+1\times 2^{-4}+\cdots +1\times 2^{-23}\right)\times 2^{1}\\\approx {}&1.5707964\times 2\\\approx {}&3.1415928\end{aligned}$$

where $p$ is the precision (24 in this example), $n$ is the position of the bit of the significand from the left (starting at 0 and finishing at 23 here) and $e$ is the exponent (1 in this example).
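The rounding step described above can be reproduced from the 33-bit string given in the text. This is a sketch for this specific non-tie case only: it simply adds the round bit and does not handle halfway values or a carry out of the 24-bit significand. The variable names are assumptions for illustration.

```python
import math
import struct

# First 33 bits of pi, as given in the text (spaces removed).
PI_33_BITS = "110010010000111111011010" "10100010" "0"

head = int(PI_33_BITS[:24], 2)      # leftmost 24 bits (positions 0..23)
round_bit = int(PI_33_BITS[24])     # bit at position 24
significand = head + round_bit      # round to nearest (not a tie here)

# Binary point after the leftmost bit, exponent e = 1:
# value = significand * 2**-23 * 2**1
value = math.ldexp(significand, 1 - 23)

# Cross-check: round the double-precision pi to single precision.
float32_pi = struct.unpack(">f", struct.pack(">f", math.pi))[0]

print(f"{significand:024b}", value, float32_pi)
```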

It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called normalization. For binary formats (which use only the digits 0 and 1), this non-zero digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention, or the assumed bit convention.

Alternatives to floating-point numbers

The floating-point representation is by far the most common way of representing in computers an approximation to real numbers. However, there are alternatives:

Fixed-point representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
Logarithmic number systems (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (i.e., the graph of the logarithm function) is smooth (except at 0). Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The (symmetric) level-index arithmetic (LI and SLI) of Charles Clenshaw, Frank Olver and Peter Turner is a scheme based on a generalized logarithm representation.
Tapered floating-point representation, which does not appear to be used in practice.
Some simple rational numbers (e.g., 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (e.g., 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "bignum" arithmetic for the individual integers.
Interval arithmetic allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
Computer algebra systems such as Mathematica, Maxima, and Maple can often handle irrational numbers like $\pi$ or ${\sqrt {3}}$ in a completely "formal" way, without dealing with a specific encoding of the significand. Such a program can evaluate expressions like "$\sin(3\pi)$" exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.
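Two of the alternatives listed above, decimal fixed point and rational arithmetic, can be sketched with Python's standard library. The scale of four decimal places mirrors the "00012345" example given earlier; the names `SCALE`, `raw` and `fixed_value` are assumptions for illustration.

```python
from fractions import Fraction

# Decimal fixed point: an integer plus an agreed position for the point.
SCALE = 10 ** 4                 # four digits after the decimal point (assumed)
raw = int("00012345")           # the stored digit string, as an integer
fixed_value = Fraction(raw, SCALE)

# Rational arithmetic: numerator and denominator are arbitrary-size integers,
# so 1/3 and 1/10 are held exactly.
third = Fraction(1, 3)
tenth = Fraction(1, 10)
exact_sum_ok = (tenth + tenth + tenth == Fraction(3, 10))

# The same sum in binary floating point picks up a rounding error.
float_sum_ok = (0.1 + 0.1 + 0.1 == 0.3)

print(float(fixed_value), third * 3, exact_sum_ok, float_sum_ok)
```

Exactness comes at a cost: the integers inside a `Fraction` can grow without bound as a computation proceeds, which is why such packages rely on bignum arithmetic.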

History

Leonardo Torres y Quevedo, who proposed a form of floating point in 1914

In 1914, Leonardo Torres y Quevedo proposed a form of floating point in the course of discussing his design for a special-purpose electromechanical calculator. In 1938, Konrad Zuse of Berlin completed the Z1, the first binary, programmable mechanical computer; it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit. The more reliable relay-based Z3, completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as $^{1}/_{\infty }=0$, and it stops on undefined operations, such as $0\times \infty$.

Konrad Zuse, architect of the Z3 computer, which uses a 22-bit binary floating-point representation

Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes $\pm \infty$ and NaN representations, anticipating features of the IEEE Standard by four decades. In contrast, von Neumann recommended against floating-point numbers for the 1951 IAS machine, arguing that fixed-point arithmetic is preferable.

The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942–1945. In 1946, Bell Laboratories introduced the Mark V, which implemented decimal floating-point numbers.

The Pilot ACE has binary floating-point arithmetic, and it became operational in 1950 at the National Physical Laboratory, UK. Thirty-three were later sold commercially as the English Electric DEUCE. The arithmetic is actually implemented in software, but with a one megahertz clock rate, floating-point and fixed-point operations in this machine were initially faster than those of many competing computers.

The mass-produced IBM 704 followed in 1954; it introduced the use of a biased exponent. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computation" (SC) capability (see also Extensions for Scientific Computation (XSC)). It was not until the launch of the Intel i486 in 1989 that general-purpose personal computers had floating-point capability in hardware as a standard feature.

The UNIVAC 1100/2200 series, introduced in 1962, supported two floating-point representations:

Single precision: 36 bits, organized as a 1-bit sign, an 8-bit exponent, and a 27-bit significand.
Double precision: 72 bits, organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand.

The IBM 7094, also introduced in 1962, supported single-precision and double-precision representations, but with no relation to the UNIVAC's representations. Indeed, in 1964, IBM introduced hexadecimal floating-point representations in its System/360 mainframes; these same representations are still available for use in modern z/Architecture systems. In 1998, IBM implemented IEEE-compatible binary floating-point arithmetic in its mainframes; in 2005, IBM also added IEEE-compatible decimal floating-point arithmetic.

Initially, computers used many different representations for floating-point numbers. The lack of standardization at the mainframe level was an ongoing problem by the early 1970s for those writing and maintaining higher-level source code; these manufacturer floating-point standards differed in the word sizes, the representations, and the rounding behavior and general accuracy of operations. Floating-point compatibility across multiple computing systems was in desperate need of standardization by the early 1980s, leading to the creation of the IEEE 754 standard once the 32-bit (or 64-bit) word had become commonplace. This standard was significantly based on a proposal from Intel, which was designing the i8087 numerical coprocessor; Motorola, which was designing the 68000 around the same time, gave significant input as well.

In 1989, mathematician and computer scientist William Kahan was honored with the Turing Award for being the primary architect behind this proposal; he was aided by his student (Jerome Coonen) and a visiting professor (Harold Stone).

Among the x86 innovations are these:

A precisely specified floating-point representation at the bit-string level, so that all compliant computers interpret bit patterns the same way. This makes it possible to accurately and efficiently transfer floating-point numbers from one computer to another (after accounting for endianness).
A precisely specified behavior for the arithmetic operations: A result is required to be produced as if infinitely precise arithmetic were used to yield a value that is then rounded according to specific rules. This means that a compliant computer program would always produce the same result when given a particular input, thus mitigating the almost mystical reputation that floating-point computation had developed for its hitherto seemingly non-deterministic behavior.
The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled fashion.

Range of floating-point numbers

A floating-point number consists of two fixed-point components, whose range depends exclusively on the number of bits or digits in their representation. The floating-point range depends linearly on the range of the significand and exponentially on the range of the exponent, which gives the number a far wider range than either component has alone.

On a typical computer system, a double-precision (64-bit) binary floating-point number has a coefficient of 53 bits (including 1 implied bit), an exponent of 11 bits, and 1 sign bit. Since 2¹⁰ = 1024, the complete range of the positive normal floating-point numbers in this format is from 2⁻¹⁰²² ≈ 2 × 10⁻³⁰⁸ to approximately 2¹⁰²⁴ ≈ 2 × 10³⁰⁸.
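These limits can be inspected at run time. The sketch assumes a Python implementation whose `float` is an IEEE 754 double, which is the case on typical systems.

```python
import sys

info = sys.float_info

precision_bits = info.mant_dig      # 53, counting the implied bit
smallest_normal = info.min          # 2**-1022
largest_finite = info.max           # just below 2**1024

print(precision_bits, smallest_normal, largest_finite)
```

The largest finite value is $(2-2^{-52})\times 2^{1023}$, which is why the upper end of the range is only approximately $2^{1024}$.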

The number of normalized floating-point numbers in a system (B, P, L, U) where

B is the base of the system,
P is the precision of the significand (in base B),
L is the smallest exponent of the system,
U is the largest exponent of the system,

is $2\left(B-1\right)\left(B^{P-1}\right)\left(U-L+1\right)$.

There is a smallest positive normalized floating-point number,

Underflow level = UFL = $B^{L}$,

which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.

There is a largest floating-point number,

Overflow level = OFL = $\left(1-B^{-P}\right)\left(B^{U+1}\right)$,

which has B − 1 as the value for each digit of the significand and the largest possible value for the exponent.

In addition, there are representable values strictly between −UFL and UFL, namely positive and negative zeros as well as denormalized (subnormal) numbers.
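The count, UFL and OFL formulas can be checked by brute force on a toy system. The parameters B = 2, P = 3, L = −1, U = 1 are chosen here purely for illustration, and `normalized_values` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

def normalized_values(B: int, P: int, L: int, U: int) -> set:
    """Enumerate every normalized value of the system (B, P, L, U) exactly."""
    values = set()
    for e in range(L, U + 1):
        for digits in product(range(B), repeat=P):
            if digits[0] == 0:
                continue  # normalized: the leading digit must be non-zero
            # Radix point sits just after the leading digit.
            significand = sum(Fraction(d, B ** i) for i, d in enumerate(digits))
            magnitude = significand * Fraction(B) ** e
            values.update((magnitude, -magnitude))
    return values

B, P, L, U = 2, 3, -1, 1
values = normalized_values(B, P, L, U)

count = len(values)
ufl = min(v for v in values if v > 0)
ofl = max(values)
print(count, ufl, ofl)
```

For this toy system the enumeration gives 24 values, with UFL = 1/2 and OFL = 7/2, in agreement with the formulas.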

IEEE 754: floating point in modern computers


The IEEE standardized the computer representation for binary floating-point numbers in IEEE 754 (a.k.a. IEC 60559) in 1985. This first standard is followed by almost all modern machines. It was revised in 2008. IBM mainframes support IBM's own hexadecimal floating point format and IEEE 754-2008 decimal floating point in addition to the IEEE 754 binary format. The Cray T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.

The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats, and others are termed extended precision formats and extendable precision format. Three formats are especially widely used in computer hardware and languages:

Single precision (binary32), usually used to represent the "float" type in the C language family. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
Double precision (binary64), usually used to represent the "double" type in the C language family. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
Double extended, also ambiguously called "extended precision" format. This is a binary format that occupies at least 79 bits (80 if the hidden/implicit bit rule is not used) and its significand has a precision of at least 64 bits (about 19 decimal digits). The C99 and C11 standards of the C language family, in their annex F ("IEC 60559 floating-point arithmetic"), recommend such an extended format to be provided as "long double". A format satisfying the minimal requirements (64-bit significand precision, 15-bit exponent, thus fitting on 80 bits) is provided by the x86 architecture. Often on such processors, this format can be used with "long double", though extended precision is not available with MSVC. For alignment purposes, many tools store this 80-bit value in a 96-bit or 128-bit space. On other processors, "long double" may stand for a larger format, such as quadruple precision, or just double precision, if any form of extended precision is not available.

Increasing the precision of the floating-point representation generally reduces the amount of accumulated round-off error caused by intermediate calculations. Less common IEEE formats include:

Quadruple precision (binary128). This is a binary format that occupies 128 bits (16 bytes) and its significand has a precision of 113 bits (about 34 decimal digits).
Decimal64 and decimal128 floating-point formats. These formats, along with the decimal32 format, are intended for performing decimal rounding correctly.
Half precision, also called binary16, a 16-bit floating-point value. It is used in the NVIDIA Cg graphics language and in the OpenEXR standard.

Any integer with absolute value less than 2²⁴ can be exactly represented in the single-precision format, and any integer with absolute value less than 2⁵³ can be exactly represented in the double-precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double-precision floats but only 32-bit integers.
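The integer limits can be probed directly. This sketch assumes that `float` is an IEEE 754 double and uses `struct` with the `f` format to round a value to single precision; `to_float32` is a hypothetical helper.

```python
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Double precision: every integer up to 2**53 survives a round trip.
doubles_exact = all(int(float(n)) == n for n in range(2 ** 53 - 1000, 2 ** 53 + 1))
# 2**53 + 1 is the first integer that does not.
first_double_miss = float(2 ** 53 + 1)

# Single precision: the same happens at 2**24.
singles_exact = all(to_float32(float(n)) == n for n in range(2 ** 24 - 1000, 2 ** 24 + 1))
first_single_miss = to_float32(float(2 ** 24 + 1))

print(doubles_exact, first_double_miss, singles_exact, first_single_miss)
```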

The standard specifies some special values, and their representation: positive infinity ($+\infty$), negative infinity ($-\infty$), a negative zero (−0) distinct from ordinary ("positive") zero, and "not a number" values (NaNs).

Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself. All finite floating-point numbers are strictly smaller than $+\infty$ and strictly greater than $-\infty$, and they are ordered in the same way as their values (in the set of real numbers).
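These comparison rules can be observed in any language that exposes IEEE 754 comparisons; the following Python sketch is one illustration.

```python
import math

nan, inf = math.nan, math.inf

zeros_equal = (0.0 == -0.0)                 # the two zeros compare equal
nan_equals_itself = (nan == nan)            # NaN is unequal even to itself
nan_unequal_itself = (nan != nan)
# Every ordered comparison involving NaN is false.
nan_ordered = (nan < 1.0) or (nan > 1.0) or (nan == 1.0)
# Finite values lie strictly between the infinities.
finite_between = -inf < -1.7e308 < 1.7e308 < inf

print(zeros_equal, nan_equals_itself, nan_unequal_itself, nan_ordered, finite_between)
```

A practical consequence is that `x != x` is true only when `x` is a NaN, although `math.isnan(x)` states the intent more clearly.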

Internal representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand or mantissa, from left to right. For the IEEE 754 binary formats (basic and extended) which have extant hardware implementations, they are apportioned as follows:

Type	Sign	Exponent	Significand field	Total bits	Exponent bias	Bits precision	Number of decimal digits
Half (IEEE 754-2008)	1	5	10	16	15	11	~3.3
Single	1	8	23	32	127	24	~7.2
Double	1	11	52	64	1023	53	~15.9
x86 extended precision	1	15	64	80	16383	64	~19.2
Quad	1	15	112	128	16383	113	~34.0

While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and subnormal numbers; values of all 1s are reserved for the infinities and NaNs. The exponent range for normalized numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad. Normalized numbers exclude subnormal values, zeros, infinities, and NaNs.

In the IEEE binary interchange formats the leading 1 bit of a normalized significand is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, the single-precision format actually has a significand with 24 bits of precision, the double-precision format has 53, and quad has 113.
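The field layout, the bias and the hidden bit can be illustrated by taking a double apart and putting it back together. This is a sketch for normal numbers only; `decode_double` and `rebuild_normal` are hypothetical helper names.

```python
import math
import struct

def decode_double(x: float):
    """Split a double into (sign, biased exponent, stored fraction bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, biased_exponent, fraction

def rebuild_normal(sign: int, biased_exponent: int, fraction: int) -> float:
    """Reconstruct a normal double (biased exponent 1..2046) from its fields."""
    significand = (1 << 52) | fraction          # restore the hidden bit
    # Remove the bias (1023) and account for the 52 fraction bits.
    magnitude = math.ldexp(float(significand), biased_exponent - 1023 - 52)
    return -magnitude if sign else magnitude

print(decode_double(1.0))
print(decode_double(-2.5))
```

Biased exponents of 0 and 2047 are reserved for zeros and subnormals and for infinities and NaNs respectively, so `rebuild_normal` deliberately does not cover them.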

For example, it was shown above that π, rounded to 24 bits of precision, has:

sign = 0 ; e = 1 ; s = 110010010000111111011011 (including the hidden bit)

The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in the single-precision format as

0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB as a hexadecimal number.
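The encoding above can be reproduced with Python's `struct` module, which packs a value into IEEE 754 single precision when given the `f` format. The variable names are assumptions for illustration.

```python
import math
import struct

encoded = struct.pack(">f", math.pi)            # big-endian binary32
hex_word = encoded.hex().upper()                # '40490FDB'

bits = f"{int.from_bytes(encoded, 'big'):032b}"
sign_bit = bits[0]
exponent_bits = bits[1:9]       # biased exponent: 127 + 1 = 128
fraction_bits = bits[9:]        # 23 stored bits, hidden bit excluded

print(hex_word, sign_bit, exponent_bits, fraction_bits)
```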

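An example of a layout for 32-bit floating point (figure not reproduced here) places the sign in the most significant bit, followed by the 8-bit exponent field and the 23-bit significand field; the 64-bit layout is similar.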

Special values

Signed zero

In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0). In most run-time environments, positive zero is usually printed as "0" and the negative zero as "-0". The two values behave as equal in numerical comparisons, but some operations return different results for +0 and −0. For instance, 1/(−0) returns negative infinity, while 1/+0 returns positive infinity (so that the identity $1/(1/\pm\infty) = \pm\infty$ is maintained). Other common functions with a discontinuity at x=0 which might treat +0 and −0 differently include log(x), signum(x), and the principal square root of y + xi for any negative number y. As with any approximation scheme, operations involving "negative zero" can occasionally cause confusion. For example, in IEEE 754, x = y does not always imply 1/x = 1/y, as 0 = −0 but 1/0 ≠ 1/−0.
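The behavior of the two zeros can be sketched in Python. Note that Python's `/` operator raises `ZeroDivisionError` for a floating-point zero divisor rather than returning an infinity, so this sketch uses `math.copysign`, `math.atan2` and `cmath.sqrt` to expose the sign instead.

```python
import cmath
import math

pos_zero, neg_zero = 0.0, -0.0

# The two zeros compare equal...
zeros_compare_equal = (pos_zero == neg_zero)

# ...but the sign is still there and can be recovered.
sign_of_neg_zero = math.copysign(1.0, neg_zero)

# Functions with a discontinuity at zero can tell them apart.
angle_pos = math.atan2(pos_zero, -1.0)    # +pi
angle_neg = math.atan2(neg_zero, -1.0)    # -pi

# Principal square root of y + xi for negative y: the sign of the
# zero imaginary part selects the side of the branch cut.
root_pos = cmath.sqrt(complex(-4.0, pos_zero))
root_neg = cmath.sqrt(complex(-4.0, neg_zero))

print(zeros_compare_equal, sign_of_neg_zero, angle_pos, angle_neg, root_pos, root_neg)
```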

Subnormal numbers

Subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap. This is an improvement over the older practice of having only zero in the underflow gap, where underflowing results were replaced by zero (flush to zero).

Modern floating-point hardware usually handles subnormal values (as well as normal values), and does not require software emulation for subnormals.
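Gradual underflow can be observed as follows. The sketch assumes Python 3.9 or later, IEEE 754 doubles, and hardware that is not running in a flush-to-zero mode.

```python
import math
import sys

smallest_normal = sys.float_info.min       # 2**-1022
smallest_subnormal = math.ulp(0.0)         # 2**-1074

# Spacing just above the underflow gap equals the spacing inside it.
gap_at_smallest_normal = math.ulp(smallest_normal)

# Two distinct values near the bottom of the normal range:
x = 1.5 * smallest_normal
y = smallest_normal
difference = x - y                         # a subnormal, not zero

print(smallest_subnormal, gap_at_smallest_normal, difference)
```

With flush to zero, `difference` would be 0 even though `x != y`; with subnormals the difference is represented exactly.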

Infinities

The infinities of the extended real number line can be represented in IEEE floating-point datatypes, just like ordinary floating-point values like 1, 1.5, etc. They are not error values in any way, though they are often (depending on the rounding mode) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "$\infty$" if the programming language allows that syntax).

IEEE 754 requires infinities to be handled in a reasonable way, such as

$(+\infty) + (+7) = (+\infty)$
$(+\infty) \times (-2) = (-\infty)$
$(+\infty) \times 0 =$ NaN – there is no meaningful thing to do
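The three rules above, along with overflow to infinity, can be tried out in Python, whose `float` arithmetic follows IEEE 754 for these cases on typical platforms.

```python
import math

inf = math.inf

sum_result = inf + 7            # +inf
product_result = inf * -2       # -inf
undefined_result = inf * 0      # NaN
overflow_result = 1e308 * 10    # overflows to +inf under round-to-nearest

print(sum_result, product_result, undefined_result, overflow_result)
```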

NaNs

IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, $\infty\times0$, or sqrt(−1). In general, NaNs will be propagated, i.e. most operations involving a NaN will result in a NaN, although functions that would give some defined result for any given floating-point value will do so for NaNs as well, e.g. NaN ^ 0 = 1. There are two kinds of NaNs: the default quiet NaNs and, optionally, signaling NaNs. A signaling NaN in any arithmetic operation (including numerical comparisons) will cause an "invalid operation" exception to be signaled.
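NaN creation and propagation can be sketched in Python. Python does not follow the IEEE default everywhere: `0.0 / 0.0` raises `ZeroDivisionError` and `math.sqrt(-1.0)` raises `ValueError`, so the sketch produces a NaN from the invalid operation ∞ − ∞ instead.

```python
import math

# An invalid operation that Python evaluates to a quiet NaN.
nan = math.inf - math.inf

# NaNs propagate through ordinary arithmetic.
propagated = (nan + 1.0) * 2.0

# An operation whose result is the same for every input ignores the NaN.
power_zero = nan ** 0

print(nan, propagated, power_zero)
```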

The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type or source of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to flag uninitialized variables, or extend the floating-point numbers with other special values without slowing down the computations with ordinary values, although such extensions are not common.

IEEE 754 design rationale

William Kahan. A primary architect of the Intel 80x87 floating-point coprocessor and IEEE 754 floating-point standard.

It is a common misconception that the more esoteric features of the IEEE 754 standard discussed here, such as extended formats, NaN, infinities, subnormals etc., are only of interest to numerical analysts, or for advanced numerical applications. In fact, the opposite is true: these features are designed to give safe, robust defaults for numerically unsophisticated programmers, in addition to supporting sophisticated numerical libraries by experts. The key designer of IEEE 754, William Kahan, notes that it is incorrect to "... [deem] features of IEEE Standard 754 for Binary Floating-Point Arithmetic that ...[are] not appreciated to be features usable by none but numerical experts. The facts are quite the opposite. In 1977 those features were designed into the Intel 8087 to serve the widest possible market... Error-analysis tells us how to design floating-point arithmetic, like IEEE Standard 754, moderately tolerant of well-meaning ignorance among programmers".

The special values such as infinity and NaN ensure that the floating-point arithmetic is algebraically complete: every floating-point operation produces a well-defined result and will not, by default, throw a machine interrupt or trap. Moreover, the choices of special values returned in exceptional cases were designed to give the correct answer in many cases. For instance, under IEEE 754 arithmetic, continued fractions such as R(z) := 7 − 3/[z − 2 − 1/(z − 7 + 10/[z − 2 − 2/(z − 3)])] will give the correct answer on all inputs, as the potential divide by zero, e.g. for z = 3, is correctly handled by giving +infinity, and so such exceptions can be safely ignored. As noted by Kahan, the unhandled trap following a floating-point to 16-bit integer conversion overflow that caused the loss of an Ariane 5 rocket would not have happened under the default IEEE 754 floating-point policy.
Subnormal numbers ensure that for finite floating-point numbers x and y, x − y = 0 if and only if x = y, as expected; this did not hold under earlier floating-point representations.
On the design rationale of the x87 80-bit format, Kahan notes: "This Extended format is designed to be used, with negligible loss of speed, for all but the simplest arithmetic with float and double operands. For example, it should be used for scratch variables in loops that implement recurrences like polynomial evaluation, scalar products, partial and continued fractions. It often averts premature Over/Underflow or severe local cancellation that can spoil simple algorithms". Computing intermediate results in an extended format with high precision and extended exponent has precedents in the historical practice of scientific calculation and in the design of scientific calculators; for example, Hewlett-Packard's financial calculators performed arithmetic and financial functions to three more significant decimals than they stored or displayed. The implementation of extended precision enabled standard elementary function libraries to be readily developed that normally gave double precision results within one unit in the last place (ULP) at high speed.
Correct rounding of values to the nearest representable value avoids systematic biases in calculations and slows the growth of errors. Rounding ties to even removes the statistical bias that can occur in adding similar figures.
Directed rounding was intended as an aid with checking error bounds, for instance in interval arithmetic. It is also used in the implementation of some functions.
The mathematical basis of the operations, in particular correct rounding, allows one to prove mathematical properties and design floating-point algorithms such as 2Sum, Fast2Sum and Kahan summation algorithm, e.g. to improve accuracy or implement multiple-precision arithmetic subroutines relatively easily.
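
As an illustration of the kind of algorithm that correct rounding makes provable, the following is a minimal sketch of 2Sum in C. The type and function names (two_sum_result, two_sum) are chosen here for illustration; the sketch assumes round-to-nearest, no overflow, and a compiler that evaluates double expressions in double precision without reassociating them.

```c
/* Result of an error-free addition: a + b == sum + err exactly,
   under the assumptions stated above. */
typedef struct {
    double sum; /* the correctly rounded floating-point sum */
    double err; /* the rounding error that the sum dropped */
} two_sum_result;

/* 2Sum: recovers the rounding error of a + b with six floating-point
   operations and no comparison of the operands' magnitudes. */
two_sum_result two_sum(double a, double b)
{
    two_sum_result r;
    r.sum = a + b;
    double a_virtual = r.sum - b;          /* the part of sum that came from a */
    double b_virtual = r.sum - a_virtual;  /* the part of sum that came from b */
    double a_roundoff = a - a_virtual;
    double b_roundoff = b - b_virtual;
    r.err = a_roundoff + b_roundoff;
    return r;
}
```

For instance, two_sum(1e16, 1.0) returns a sum of 1e16 (the 1.0 is absorbed by rounding) together with an err of 1.0, so no information is lost across the pair.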

A property of the single- and double-precision formats is that their encoding allows one to easily sort them without using floating-point hardware. Their bits interpreted as a two's-complement integer already sort the positives correctly, with the negatives reversed. With an xor to flip the sign bit for positive values and all bits for negative values, all the values become sortable as unsigned integers (with −0 < +0). It is unclear whether this property is intended.
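
The xor transformation described above can be sketched in C for the single-precision format. The function name float_sort_key is hypothetical, and the sketch assumes that float is the 32-bit IEEE 754 binary32 format; NaNs are not given any special treatment.

```c
#include <stdint.h>
#include <string.h>

/* Maps a float to an unsigned key whose integer order matches the
   numerical order of the floats (with -0 sorting before +0). */
uint32_t float_sort_key(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* reinterpret the encoding safely */
    /* Negative values: flip every bit. Positive values: flip the sign bit. */
    uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
    return bits ^ mask;
}
```

Sorting an array of such keys with any unsigned integer sort then orders the original values without a single floating-point comparison.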

Other notable floating-point formats

In addition to the widely used IEEE 754 standard formats, other floating-point formats are used, or have been used, in certain domain-specific areas.

The Microsoft Binary Format (MBF) was developed for the Microsoft BASIC language products, including Microsoft's first-ever product, Altair BASIC (1975), TRS-80 LEVEL II, CP/M's MBASIC, IBM PC 5150's BASICA, MS-DOS's GW-BASIC and QuickBASIC prior to version 4.00. QuickBASIC versions 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command option. MBF was designed and developed on a simulated Intel 8080 by Monte Davidoff, a dormmate of Bill Gates, during spring of 1975 for the MITS Altair 8800. The initial release of July 1975 supported a single-precision (32 bits) format because of the cost of the MITS Altair 8800's 4-kilobyte memory. In December 1975, the 8-kilobyte version added a double-precision (64 bits) format. A single-precision (40 bits) variant format was adopted for other CPUs, notably the MOS 6502 (Apple //, Commodore PET, Atari), Motorola 6800 (MITS Altair 680) and Motorola 6809 (TRS-80 Color Computer). All Microsoft language products from 1975 through 1987 used the Microsoft Binary Format; Microsoft then adopted the IEEE 754 standard format in all its products, starting in 1988 and continuing through their current releases. MBF consists of the MBF single-precision format (32 bits, "6-digit BASIC"), the MBF extended-precision format (40 bits, "9-digit BASIC"), and the MBF double-precision format (64 bits); each of them is represented with an 8-bit exponent, followed by a sign bit, followed by a significand of respectively 23, 31, and 55 bits.
The Bfloat16 format requires the same amount of memory (16 bits) as the IEEE 754 half-precision format, but allocates 8 bits to the exponent instead of 5, thus providing the same range as an IEEE 754 single-precision number. The tradeoff is a reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of machine learning models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
The TensorFloat-32 format combines the 8 bits of exponent of the Bfloat16 with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. This format was introduced by Nvidia, which provides hardware support for it in the Tensor Cores of its GPUs based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.
The Hopper architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).

Bfloat16, TensorFloat-32 and the two FP8 formats specifications, compared with IEEE 754 half-precision and single-precision standard formats
Type	Sign	Exponent	Trailing significand field	Total bits
FP8 (E4M3)	1	4	3	8
FP8 (E5M2)	1	5	2	8
Half-precision	1	5	10	16
Bfloat16	1	8	7	16
TensorFloat-32	1	8	10	19
Single-precision	1	8	23	32

Representable numbers, conversion and rounding

By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the decimal number 123456789 cannot be exactly represented if only eight decimal digits of precision are available (it would be rounded to one of the two straddling representable values, 12345678 × 10¹ or 12345679 × 10¹). The same applies to non-terminating digits (0.555..., with the 5 repeating, would be rounded to either .55555555 or .55555556).

When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the rounded value.

Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers that appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:

e = −4; s = 1100110011001100110011001100110011...,

where, as previously, s is the significand and e is the exponent.

When rounded to 24 bits this becomes

e = −4; s = 110011001100110011001101,

which is actually 0.100000001490116119384765625 in decimal.
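
The rounded significand and exponent quoted above can be recovered from a stored value with the standard C functions frexpf and ldexpf. This is a small sketch; the helper name significand24 is invented for illustration, and it assumes that float is IEEE 754 binary32 and that the argument is a positive normal number.

```c
#include <math.h>

/* Returns the 24-bit significand of x as an integer and stores the
   exponent that applies when the significand is read as 1.xxx... in binary. */
long significand24(float x, int *exponent)
{
    int e;
    float fraction = frexpf(x, &e);     /* x == fraction * 2^e, 0.5 <= fraction < 1 */
    *exponent = e - 1;                  /* shift to a significand in [1, 2) */
    return (long)ldexpf(fraction, 24);  /* fraction * 2^24 is an exact integer */
}
```

For 0.1f this yields the integer 13421773, which is 110011001100110011001101 in binary, with exponent −4, matching the rounded value shown above.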

As a further example, the real number π, represented in binary as an infinite sequence of bits is

11.0010010000111111011010101000100010000101101000110000100011010011...

but is

11.0010010000111111011011

when approximated by rounding to a precision of 24 bits.

In binary single-precision floating-point, this is represented as s = 1.10010010000111111011011 with e = 1. This has a decimal value of

3.1415927410125732421875,

whereas a more accurate approximation of the true value of π is

3.14159265358979323846264338327950...

The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error and is limited by the machine epsilon.

The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a unit in the last place (ULP). For example, if there is no representable number lying between the representable numbers 1.45a70c22_hex and 1.45a70c24_hex, the ULP is 2×16⁻⁸, or 2⁻³¹. For numbers with a base-2 exponent part of 0, i.e. numbers with an absolute value higher than or equal to 1 but lower than 2, a ULP is exactly 2⁻²³ or about 10⁻⁷ in single precision, and exactly 2⁻⁵² or about 2 × 10⁻¹⁶ in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of a ULP.

Rounding modes

Rounding is used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more digits than there are digits in the significand. IEEE 754 requires correct rounding: that is, the rounded result is as if infinitely precise arithmetic was used to compute the value and then rounded (although in implementation only three extra bits are needed to ensure this). There are several different rounding schemes (or rounding modes). Historically, truncation was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even, sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result. In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. It means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)

Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:

round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
round up (toward +∞; negative results thus round toward zero)
round down (toward −∞; negative results thus round away from zero)
round toward zero (truncation; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3 and 3.9 to 3)

Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error are multi-precision floating-point, and interval arithmetic. The alternative rounding modes are also useful in diagnosing numerical instability: if the results of a subroutine vary substantially between rounding to + and − infinity then it is likely numerically unstable and affected by round-off error.

Binary-to-decimal conversion with minimal number of digits

Converting a double-precision binary floating-point number to a decimal string is a common operation, but an algorithm producing results that are both accurate and minimal did not appear in print until 1990, with Steele and White's Dragon4. Some of the improvements since then include:

David M. Gay's dtoa.c, a practical open-source implementation of many ideas in Dragon4.
Grisu3, with a 4× speedup as it removes the use of bignums. Must be used with a fallback, as it fails for ~0.5% of cases.
Errol3, an always-succeeding algorithm similar to, but slower than, Grisu3. Apparently not as good as an early-terminating Grisu with fallback.
Ryū, an always-succeeding algorithm that is faster and simpler than Grisu3.

Many modern language runtimes use Grisu3 with a Dragon4 fallback.
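
The goal of these algorithms, the shortest decimal string that converts back to the same double, can be illustrated with a deliberately naive search. This sketch is not Dragon4, Grisu or Ryū: it simply tries increasing precisions with the C library's own snprintf and strtod, and it assumes both are correctly rounded. The function name shortest_round_trip is invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes into buf the shortest "%.*g" rendering of x (1 to 17 significant
   digits) that strtod converts back to exactly x, and returns the number
   of significant digits used. */
int shortest_round_trip(double x, char *buf, size_t size)
{
    for (int digits = 1; digits <= 17; digits++) {
        snprintf(buf, size, "%.*g", digits, x);
        if (strtod(buf, NULL) == x)
            return digits;
    }
    return 17; /* reached only for NaN, which never compares equal */
}
```

With a buffer of 32 bytes, 0.1 comes back as "0.1" after one digit, while the double nearest to 0.1 + 0.2 needs all 17. The dedicated algorithms produce the same kind of output without the repeated formatting and parsing.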

Decimal-to-binary conversion

The problem of parsing a decimal string into a binary FP representation is complex, with an accurate parser not appearing until Clinger's 1990 work (implemented in dtoa.c). Further work has likewise progressed in the direction of faster parsing.

Floating-point operations

For ease of presentation and understanding, decimal radix with 7 digit precision will be used in the examples, as in the IEEE 754 decimal32 format. The fundamental principles are the same in any radix or precision, except that normalization is optional (it does not affect the numerical value of the result). Here, s denotes the significand and e denotes the exponent.

Addition and subtraction

A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits, and one then proceeds with the usual addition method:

  123456.7 = 1.234567 × 10^5
  101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5

  Hence:
  123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
                      = (1.234567 × 10^5) + (0.001017654 × 10^5)
                      = (1.234567 + 0.001017654) × 10^5
                      =  1.235584654 × 10^5

In detail:

  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)

This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is

  e=5;  s=1.235585    (final sum: 123558.5)

The lowest three digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:

  e=5;  s=1.234567
+ e=−3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
----------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding and normalization)

In the above conceptual examples it would appear that a large number of extra digits would need to be provided by the adder to ensure correct rounding; however, for binary addition or subtraction using careful implementation techniques only a guard bit, a rounding bit and one extra sticky bit need to be carried beyond the precision of the operands.

Another problem of loss of significance occurs when approximations to two nearly equal numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are approximations to the rationals 123457.1467 and 123456.659.

  e=5;  s=1.234571
− e=5;  s=1.234567
----------------
  e=5;  s=0.000004
  e=−1; s=4.000000 (after rounding and normalization)

The floating-point difference is computed exactly because the numbers are close; the Sterbenz lemma guarantees this, even in case of underflow when gradual underflow is supported. Despite this, the difference of the original numbers is e = −1; s = 4.877000, which differs by more than 20% from the difference e = −1; s = 4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost. This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis; see also Accuracy problems.
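
The same effect appears in binary arithmetic. The sketch below rounds the two rationals from the example to single precision, subtracts them, and reports how far the result is from the difference computed in double precision. The function name is made up for illustration, and float is assumed to be IEEE 754 binary32.

```c
/* Relative error of the single-precision difference a - b, measured
   against the same difference carried out in double precision. */
double cancellation_relative_error(double a, double b)
{
    volatile float fa = (float)a;     /* each operand is rounded once, to 24 bits */
    volatile float fb = (float)b;
    volatile float fdiff = fa - fb;   /* exact by the Sterbenz lemma when fa and fb are close */
    double reference = a - b;
    double err = ((double)fdiff - reference) / reference;
    return err < 0 ? -err : err;
}
```

For 123457.1467 and 123456.659 the relative error is close to 1%, several orders of magnitude larger than the roughly 6 × 10⁻⁸ rounding error of either operand, even though the subtraction itself introduces no error.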

Multiplication and division

To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)

Similarly, division is accomplished by subtracting the divisor's exponent from the dividend's exponent, and dividing the dividend's significand by the divisor's significand.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed in succession. In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm and Division algorithm). For a fast, simple method, see the Horner method.

Literal syntax

The syntax of floating-point literals depends on the language. Literals typically use e or E to denote scientific notation. The C programming language and the IEEE 754 standard also define a hexadecimal literal syntax with a base-2 exponent instead of 10. In languages like C, when the decimal exponent is omitted, a decimal point is needed to differentiate them from integers. Other languages do not have an integer type (such as JavaScript), or allow overloading of numeric types (such as Haskell). In these cases, digit strings such as 123 may also be floating-point literals.

Examples of floating-point literals are:

99.9
-5000.12
6.02e23
-3e-45
0x1.fffffep+127 in C and IEEE 754

Dealing with exceptional cases

Floating-point computation in a computer can run into three kinds of problems:

An operation can be mathematically undefined, such as ∞/∞, or division by zero.
An operation can be legal in principle, but not supported by the specific format, for example, calculating the square root of −1 or the inverse sine of 2 (both of which result in complex numbers).
An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large), underflow (exponent too small) or denormalization (precision loss).

Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of trap that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable. (The term "exception" as used in IEEE 754 is a general term meaning an exceptional condition, which is not necessarily an error, and is a different usage from that typically defined in programming languages such as C++ or Java, in which an "exception" is an alternative flow of control, closer to what is termed a "trap" in IEEE 754 terminology.)

Here, the required default method of handling exceptions according to IEEE 754 is discussed (the IEEE 754 optional trapping and other "alternate exception handling" modes are not discussed). Arithmetic exceptions are (by default) required to be recorded in "sticky" status flag bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. The use of "sticky" flags thus allows for testing of exceptional conditions to be delayed until after a full floating-point expression or subroutine: without them exceptional conditions that could not be otherwise ignored would require explicit testing immediately after every floating-point operation. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero flag bit (this default of ∞ is designed to often return a finite result when used in subsequent operations and so be safely ignored).

The original IEEE 754 standard, however, failed to recommend operations to handle such sets of arithmetic exception flag bits. So while these were implemented in hardware, initially programming language implementations typically did not provide a means to access them (apart from assembler). Over time some programming language standards (e.g., C99/C11 and Fortran) have been updated to specify methods to access and change status flag bits. The 2008 version of the IEEE 754 standard now specifies a few operations for accessing and handling the arithmetic flag bits. The programming model is based on a single thread of execution and use of them by multiple threads has to be handled by a means outside of the standard (e.g. C11 specifies that the flags have thread-local storage).

IEEE 754 specifies five arithmetic exceptions that are to be recorded in the status flags ("sticky bits"):

inexact, set if the rounded (and returned) value is different from the mathematically exact result of the operation.
underflow, set if the rounded value is tiny (as specified in IEEE 754) and inexact (or, as per the 1984 version of IEEE 754, possibly only if it has denormalization loss), returning a subnormal value including the zeros.
overflow, set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
divide-by-zero, set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
invalid, set if a real-valued result cannot be returned e.g. sqrt(−1) or 0/0, returning a quiet NaN.

Fig. 1: resistances in parallel, with total resistance $R_{\text{tot}}$

The default return value for each of the exceptions is designed to give the correct result in the majority of cases such that the exceptions can be ignored in most code. inexact returns a correctly rounded result, and underflow returns a denormalized small value and so can almost always be ignored. divide-by-zero returns infinity exactly, which will typically then divide a finite number and so give zero, or else will give an invalid exception subsequently if not, and so can also typically be ignored. For example, the effective resistance of n resistors in parallel (see fig. 1) is given by $R_{\text{tot}}=1/(1/R_{1}+1/R_{2}+\cdots +1/R_{n})$. If a short-circuit develops with $R_{1}$ set to 0, $1/R_{1}$ will return +infinity which will give a final $R_{\text{tot}}$ of 0, as expected (see the continued fraction example of IEEE 754 design rationale for another example).
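
The parallel-resistance formula translates directly into C, and the short-circuit case needs no special handling. This is a minimal sketch with a hypothetical function name; it assumes IEEE 754 semantics for division by zero (Annex F of the C standard) and non-negative resistances.

```c
#include <stddef.h>

/* Total resistance of n resistors in parallel: 1 / (1/R1 + ... + 1/Rn). */
double parallel_resistance(const double *r, size_t n)
{
    double conductance = 0.0;
    for (size_t i = 0; i < n; i++)
        conductance += 1.0 / r[i];   /* 1/0 yields +infinity */
    return 1.0 / conductance;        /* 1/infinity yields 0 */
}
```

With resistances of 0, 100 and 220 ohms the function returns 0, the physically expected value for a short circuit, without any test for a zero divisor.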

Overflow and invalid exceptions can typically not be ignored, but do not necessarily represent errors: for example, a root-finding routine, as part of its normal operation, may evaluate a passed-in function at values outside of its domain, returning NaN and an invalid exception flag to be ignored until finding a useful start point.

Accuracy problems

The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.

For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as $e = -4$ ; $s = 110011001100110011001101$ , which is

0.100000001490116119384765625 exactly.

Squaring this number gives

0.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with single-precision floating-point hardware (with rounding) gives

0.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

0.009999999776482582092285156250 exactly.
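
This can be checked with a few lines of C. The helper name square_single is illustrative, float is assumed to be IEEE 754 binary32, and the volatile store forces the product to be rounded to single precision even on platforms that evaluate float expressions in a wider format.

```c
/* Squares x and rounds the product to single precision. */
float square_single(float x)
{
    volatile float product = x * x;  /* rounded to 24 bits on assignment */
    return product;
}
```

The comparison square_single(0.1f) == 0.01f is false: the computed square is the representable neighbour just above the single-precision value nearest to 0.01.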

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

/* Enough digits to be sure we get the correct approximation. */
double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);

will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225×10⁻¹⁵ in double precision, or −0.8742×10⁻⁷ in single precision.

While floating-point addition and multiplication are both commutative ( $a + b = b + a$ and $a \times b = b \times a$ ), they are not necessarily associative. That is, $(a + b) + c$ is not necessarily equal to $a + (b + c)$ . Using 7-digit significand decimal arithmetic:

 a = 1234.567, b = 45.67834, c = 0.0004

 (a + b) + c:
     1234.567   (a)
   +   45.67834 (b)
   ____________
     1280.24534   rounds to   1280.245

    1280.245  (a + b)
   +   0.0004 (c)
   ____________
    1280.2454   rounds to   1280.245  ← (a + b) + c

 a + (b + c):
   45.67834 (b)
 +  0.0004  (c)
 ____________
   45.67874

   1234.567   (a)
 +   45.67874   (b + c)
 ____________
   1280.24574   rounds to   1280.246 ← a + (b + c)

They are also not necessarily distributive. That is, $(a + b) \times c$ may not be the same as $a \times c + b \times c$ :

 1234.567 × 3.333333 = 4115.223
 1.234567 × 3.333333 = 4.115223
                       4115.223 + 4.115223 = 4119.338
 but
 1234.567 + 1.234567 = 1235.802
                       1235.802 × 3.333333 = 4119.340
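
The failure of associativity shown earlier is just as easy to reproduce in binary. The sketch below uses single precision rather than the 7-digit decimal arithmetic of the worked example; the function names are invented, float is assumed to be IEEE 754 binary32, and volatile intermediates force each partial sum to be rounded to float.

```c
/* Computes (a + b) + c, rounding after each addition. */
float sum_left(float a, float b, float c)
{
    volatile float ab = a + b;
    volatile float result = ab + c;
    return result;
}

/* Computes a + (b + c), rounding after each addition. */
float sum_right(float a, float b, float c)
{
    volatile float bc = b + c;
    volatile float result = a + bc;
    return result;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, sum_left returns 1 while sum_right returns 0, because −10⁸ + 1 is not representable in 24 bits and rounds back to −10⁸.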

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:

Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. When we subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with just the insignificant, and most erroneous, digits. For example, when determining a derivative of a function the following formula is used:
$Q(h)={\frac {f(a+h)-f(a)}{h}}.$
Intuitively one would want an $h$ very close to zero; however, when using floating-point operations, the smallest number will not give the best approximation of a derivative. As $h$ grows smaller, the difference between $f (a + h)$ and $f (a)$ grows smaller, cancelling out the most significant and least erroneous digits and making the most erroneous digits more important. As a result, the smallest possible value of $h$ gives a less accurate approximation of the derivative than a somewhat larger value. This is perhaps the most common and serious accuracy problem.
Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and ceiling functions may produce answers which are off by one from the intuitively expected value.
Limited exponent range: results might overflow yielding infinity, or underflow yielding a subnormal number or zero. In these cases precision will be lost.
Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that a division will not overflow.
Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.

Incidents

On 25 February 1991, a loss of significance in a MIM-104 Patriot missile battery prevented it from intercepting an incoming Scud missile in Dhahran, Saudi Arabia, contributing to the death of 28 soldiers from the U.S. Army's 14th Quartermaster Detachment.

Machine precision and backward error analysis

Machine precision is a quantity that characterizes the accuracy of a floating-point system, and is used in backward error analysis of floating-point algorithms. It is also known as unit roundoff or machine epsilon. Usually denoted $\mathrm{E}_{\text{mach}}$, its value depends on the particular rounding being used.

With rounding to zero,

\mathrm {E} _{\text{mach}}=B^{1-P},\,

whereas with rounding to nearest,

\mathrm {E} _{\text{mach}}={\tfrac {1}{2}}B^{1-P},

where B is the base of the system and P is the precision of the significand (in base B).

This is important since it bounds the relative error in representing any non-zero real number $x$ within the normalized range of a floating-point system:

\left|{\frac {\operatorname {fl} (x)-x}{x}}\right|\leq \mathrm {E} _{\text{mach}}.
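
The two formulas can be evaluated directly and compared with the constants that C provides in float.h. This is a small sketch with invented function names; note that DBL_EPSILON and FLT_EPSILON are defined as the gap between 1 and the next representable number, which corresponds to the rounding-to-zero value B^(1−P), so the round-to-nearest value is half of each.

```c
#include <float.h>

/* E_mach = B^(1-P) for rounding to zero, and half of that for
   rounding to nearest. */
double machine_epsilon(int base, int precision, int round_to_nearest)
{
    double eps = 1.0;
    for (int i = 1; i < precision; i++)
        eps /= base;               /* accumulates B^(1-P) */
    return round_to_nearest ? eps / 2.0 : eps;
}

/* For double, B = FLT_RADIX and P = DBL_MANT_DIG, so B^(1-P) should
   equal DBL_EPSILON. Returns 1 if it does. */
int agrees_with_float_h(void)
{
    return machine_epsilon(FLT_RADIX, DBL_MANT_DIG, 0) == DBL_EPSILON;
}
```

For binary64 (B = 2, P = 53) this gives 2⁻⁵² with rounding to zero and 2⁻⁵³ with rounding to nearest.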

Backward error analysis, the theory of which was developed and popularized by James H. Wilkinson, can be used to establish that an algorithm implementing a numerical function is numerically stable. The basic approach is to show that although the calculated result, due to roundoff errors, will not be exactly correct, it is the exact solution to a nearby problem with slightly perturbed input data. If the perturbation required is small, on the order of the uncertainty in the input data, then the results are in some sense as accurate as the data "deserves". The algorithm is then defined as backward stable. Stability is a measure of the sensitivity to rounding errors of a given numerical procedure; by contrast, the condition number of a function for a given problem indicates the inherent sensitivity of the function to small perturbations in its input and is independent of the implementation used to solve the problem.

As a trivial example, consider a simple expression giving the inner product of (length two) vectors $x$ and $y$ , then

{\displaystyle {\begin{aligned}\operatorname {fl} (x\cdot y)&=\operatorname {fl} {\big (}\operatorname {fl} (x_{1}\cdot y_{1})+\operatorname {fl} (x_{2}\cdot y_{2}){\big )},{\text{ where }}\operatorname {fl} (){\text{ indicates correctly rounded floating-point arithmetic}}\\&=\operatorname {fl} {\big (}(x_{1}\cdot y_{1})(1+\delta _{1})+(x_{2}\cdot y_{2})(1+\delta _{2}){\big )},{\text{ where }}|\delta _{n}|\leq \mathrm {E} _{\text{mach}},{\text{ from above}}\\&={\big (}(x_{1}\cdot y_{1})(1+\delta _{1})+(x_{2}\cdot y_{2})(1+\delta _{2}){\big )}(1+\delta _{3})\\&=(x_{1}\cdot y_{1})(1+\delta _{1})(1+\delta _{3})+(x_{2}\cdot y_{2})(1+\delta _{2})(1+\delta _{3}),\end{aligned}}}

and so

\operatorname {fl} (x\cdot y)={\hat {x}}\cdot {\hat {y}},

where

{\displaystyle {\begin{aligned}{\hat {x}}_{1}&=x_{1}(1+\delta _{1});\quad {\hat {x}}_{2}=x_{2}(1+\delta _{2});\\{\hat {y}}_{1}&=y_{1}(1+\delta _{3});\quad {\hat {y}}_{2}=y_{2}(1+\delta _{3}),\\\end{aligned}}}

where

|\delta _{n}|\leq \mathrm {E} _{\text{mach}}

by definition, which is the sum of two slightly perturbed (on the order of Ε_mach) input data, and so is backward stable. For more realistic examples in numerical linear algebra, see Higham 2002 and other references below.

Minimizing the effect of accuracy problems

Although, as noted previously, individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a ULP, more complicated formulae can suffer from larger errors due to round-off. The loss of accuracy can be substantial if a problem or its data are ill-conditioned, meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm numerically unstable for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as numerical analysis. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires, which can remove, or reduce by orders of magnitude, such risk: IEEE 754 quadruple precision and extended precision are designed for this purpose when computing at double precision.

For example, the following algorithm is a direct implementation to compute the function $A (x) = (x -1) / (exp(x -1) - 1)$, which is well-conditioned at 1.0. However, it can be shown to be numerically unstable and to lose up to half the significant digits carried by the arithmetic when computed near 1.0.

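#include <math.h>  /* exp */

/* Direct evaluation of A(x) = (x - 1) / (exp(x - 1) - 1). */
double A(double X)
{
        double Y, Z;  // [1]
        Y = X - 1.0;
        Z = exp(Y);
        if (Z != 1.0)
                Z = Y / (Z - 1.0); // [2]
        return Z;  /* Z is still 1.0 here when exp(Y) rounded to 1 */
}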

If, however, intermediate computations are all performed in extended precision (e.g. by setting line [1] to C99 long double), then up to full precision in the final double result can be maintained. Alternatively, a numerical analysis of the algorithm reveals that if the following non-obvious change to line [2] is made:

Z = log(Z) / (Z - 1.0);

then the algorithm becomes numerically stable and can compute to full double precision.

To maintain the properties of such carefully constructed numerically stable programs, careful handling by the compiler is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references at the bottom of this article. Kahan suggests several rules of thumb that can decrease by orders of magnitude the risk of numerical anomalies, in addition to, or in lieu of, a more careful numerical analysis. These include: as noted above, computing all expressions and intermediate results in the highest precision supported in hardware (a common rule of thumb is to carry twice the precision of the desired result, i.e. compute in double precision for a final single-precision result, or in double extended or quad precision for up to double-precision results); and rounding input data and results to only the precision required and supported by the input data (carrying excess precision in the final result beyond that required and supported by the input data can be misleading, increases storage cost and decreases speed, and the excess bits can affect convergence of numerical procedures: notably, the first form of the iterative example given below converges correctly when using this rule of thumb). Brief descriptions of several additional issues and techniques follow.

As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. The "decimal" data type of the C# and Python programming languages, and the decimal formats of the IEEE 754-2008 standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.
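The problem with exact decimal quantities can be illustrated with a small sketch. Standard C has no portable counterpart to the "decimal" types mentioned above, so this example substitutes a different technique, keeping money as an integer count of cents; the function names are hypothetical.

```c
#include <stdint.h>

// Adds 0.10 to a binary double `count` times. 0.10 has no exact binary
// representation, so each addition works with a slightly wrong value.
double sum_dimes_binary(int count)
{
        double total = 0.0;
        for (int i = 0; i < count; ++i)
                total += 0.10;
        return total;
}

// Keeps the same total as an integer number of cents, which is exact
// as long as the total fits in 64 bits.
int64_t sum_dimes_cents(int count)
{
        int64_t total = 0;
        for (int i = 0; i < count; ++i)
                total += 10;
        return total;
}
```

With IEEE 754 double arithmetic and round-to-nearest, ten additions of 0.10 yield a value just below 1.0, whereas the integer version yields exactly 100 cents.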

Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that $(x+y)(x-y)=x^{2}-y^{2}$ and that $\sin ^{2}{\theta }+\cos ^{2}{\theta }=1$; however, these facts cannot be relied on when the quantities involved are the result of floating-point computation.
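A minimal sketch of the first identity follows; the function names are chosen here for illustration. For $x = 1 + 2^{-29}$ and $y = 1$, the factored form is exact in double precision, while the squared form rounds $x^{2}$ before subtracting and loses the $2^{-58}$ term, unless the compiler fuses the multiplication and subtraction into a single operation.

```c
// x^2 - y^2 computed from the squares: each square is rounded
// to double precision before the subtraction takes place.
double diff_of_squares(double x, double y)
{
        return x * x - y * y;
}

// The same quantity in factored form: for nearby x and y the
// subtraction x - y is exact, so less information is lost.
double diff_factored(double x, double y)
{
        return (x + y) * (x - y);
}
```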

The use of the equality test (if (x==y) ...) requires care when dealing with floating-point numbers. Even simple expressions like 0.6/0.2-3==0 will, on most computers, fail to be true (in IEEE 754 double precision, for example, 0.6/0.2 - 3 is approximately equal to -4.44089209850063e-16). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ...), where epsilon is sufficiently small and tailored to the application, such as 1.0E−13. The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon. Values derived from the primary data representation, and comparisons between them, should be computed in a wider, extended precision to minimize the risk of such inconsistencies due to round-off errors. It is often better to organize the code in such a way that such tests are unnecessary. For example, in computational geometry, exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.

Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as iterative refinement, if they are to work well.

Summation of a vector of floating-point values is a basic algorithm in scientific computing, and so an awareness of when loss of significance can occur is essential. For example, when a very large number of values is added, each individual addend is very small compared with the running sum, which can lead to loss of significance. A typical addition would then be something like

  3253.671
+    3.141276
-------------
  3256.812

The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The Kahan summation algorithm may be used to reduce the errors.
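A minimal sketch of Kahan (compensated) summation next to a plain running sum is shown below. The function names are chosen here for illustration, and the sketch assumes the compiler is not allowed to reassociate floating-point operations, since that would optimize the compensation away.

```c
#include <stddef.h>

// Plain running sum: low-order digits of small addends are lost
// once the running sum has grown large.
double naive_sum(const double *x, size_t n)
{
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i)
                sum += x[i];
        return sum;
}

// Kahan summation: c carries the rounding error of the previous
// addition and feeds it back into the next addend.
double kahan_sum(const double *x, size_t n)
{
        double sum = 0.0;
        double c = 0.0;                 // running compensation
        for (size_t i = 0; i < n; ++i) {
                double y = x[i] - c;    // addend corrected by the stored error
                double t = sum + y;     // low-order digits of y may be lost here
                c = (t - sum) - y;      // recovers (the negative of) what was lost
                sum = t;
        }
        return sum;
}
```

For the input {1, 2^-53, 2^-53, 2^-53, 2^-53} in IEEE 754 double precision with round-to-nearest-even, the plain sum stays at 1.0 because every addend is rounded away, while the compensated sum returns the exact result 1 + 2^-51.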

Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error (numerical analysis). Two forms of the recurrence formula for the circumscribed polygon are:

Initial value: ${\textstyle t_{0}=1/{\sqrt {3}}}$
First form: ${\textstyle t_{i+1}=({\sqrt {t_{i}^{2}+1}}-1)/{t_{i}}}$
Second form: ${\textstyle t_{i+1}={t_{i}}/({\sqrt {t_{i}^{2}+1}}+1)}$
$\pi \sim 6\times 2^{i}\times t_{i}$, converging as $i\rightarrow \infty$

Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:

 i   6 × 2ⁱ × t_i, first form    6 × 2ⁱ × t_i, second form
---------------------------------------------------------
 0   3.4641016151377543863      3.4641016151377543863
 1   3.2153903091734710173      3.2153903091734723496
 2   3.1596599420974940120      3.1596599420975006733
 3   3.1460862151314012979      3.1460862151314352708
 4   3.1427145996453136334      3.1427145996453689225
 5   3.1418730499801259536      3.1418730499798241950
 6   3.1416627470548084133      3.1416627470568494473
 7   3.1416101765997805905      3.1416101766046906629
 8   3.1415970343230776862      3.1415970343215275928
 9   3.1415937488171150615      3.1415937487713536668
10   3.1415929278733740748      3.1415929273850979885
11   3.1415927256228504127      3.1415927220386148377
12   3.1415926717412858693      3.1415926707019992125
13   3.1415926189011456060      3.1415926578678454728
14   3.1415926717412858693      3.1415926546593073709
15   3.1415919358822321783      3.1415926538571730119
16   3.1415926717412858693      3.1415926536566394222
17   3.1415810075796233302      3.1415926536065061913
18   3.1415926717412858693      3.1415926535939728836
19   3.1414061547378810956      3.1415926535908393901
20   3.1405434924008406305      3.1415926535900560168
21   3.1400068646912273617      3.1415926535898608396
22   3.1349453756585929919      3.1415926535898122118
23   3.1400068646912273617      3.1415926535897995552
24   3.2245152435345525443      3.1415926535897968907
25                              3.1415926535897962246
26                              3.1415926535897962246
27                              3.1415926535897962246
28                              3.1415926535897962246
              The true value is 3.14159265358979323846264338327...

While the two forms of the recurrence formula are clearly mathematically equivalent, the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of significant digits. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.

"Fast math" optimization

The aforementioned lack of associativity of floating-point operations in general means that compilers cannot as effectively reorder arithmetic expressions as they could with integer and fixed-point arithmetic, presenting a roadblock in optimizations such as common subexpression elimination and auto-vectorization. The "fast math" option on many compilers (ICC, GCC, Clang, MSVC...) turns on reassociation along with unsafe assumptions such as a lack of NaN and infinite numbers in IEEE 754. Some compilers also offer more granular options to only turn on reassociation. In either case, the programmer is exposed to many of the precision pitfalls mentioned above for the portion of the program using "fast" math.
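The lack of associativity that blocks such reordering can be seen in a two-line sketch. The function names are hypothetical, and the result described assumes IEEE 754 double arithmetic compiled without "fast math" style options.

```c
// Sums three values left to right.
double sum_left(double a, double b, double c)
{
        return (a + b) + c;
}

// Sums the same three values right to left. A compiler allowed to
// reassociate may silently turn one of these forms into the other.
double sum_right(double a, double b, double c)
{
        return a + (b + c);
}
```

For a = 1e16, b = -1e16 and c = 1.0, `sum_left` returns 1.0, while `sum_right` returns 0.0, because -1e16 + 1.0 rounds back to -1e16.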

In some compilers (GCC and Clang), turning on "fast" math may cause the program to disable subnormal floats at startup, affecting the floating-point behavior of not only the generated code, but also any program using such code as a library.

In most Fortran compilers, as allowed by the ISO/IEC 1539-1:2004 Fortran standard, reassociation is the default, with breakage largely prevented by the "protect parens" setting (also on by default). This setting stops the compiler from reassociating beyond the boundaries of parentheses. Intel Fortran Compiler is a notable outlier.

A common problem in "fast" math is that subexpressions may not be optimized identically from place to place, leading to unexpected differences. One interpretation of the issue is that "fast" math as currently implemented has poorly defined semantics. One attempt at formalizing "fast" math optimizations is seen in Icing, a verified compiler.

Photoinhibition

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Photoinhibition

Photoinhibition of Photosystem II (PSII) leads to loss of PSII electron transfer activity. PSII is continuously repaired via degradation and synthesis of the D1 protein. Lincomycin can be used to block protein synthesis.

Photoinhibition is light-induced reduction in the photosynthetic capacity of a plant, alga, or cyanobacterium. Photosystem II (PSII) is more sensitive to light than the rest of the photosynthetic machinery, and most researchers define the term as light-induced damage to PSII. In living organisms, photoinhibited PSII centres are continuously repaired via degradation and synthesis of the D1 protein of the photosynthetic reaction center of PSII. Photoinhibition is also used in a wider sense, as dynamic photoinhibition, to describe all reactions that decrease the efficiency of photosynthesis when plants are exposed to light.

History

The first measurements of photoinhibition were published in 1956 by Bessel Kok. Even in the very first studies, it was obvious that plants have a repair mechanism that continuously repairs photoinhibitory damage. In 1966, Jones and Kok measured the action spectrum of photoinhibition and found that ultraviolet light is highly photoinhibitory. The visible-light part of the action spectrum was found to have a peak in the red-light region, suggesting that chlorophylls act as photoreceptors of photoinhibition. In the 1980s, photoinhibition became a popular topic in photosynthesis research, and the concept of a damaging reaction counteracted by a repair process was re-invented. Research was stimulated by a paper by Kyle, Ohad and Arntzen in 1984, showing that photoinhibition is accompanied by selective loss of a 32-kDa protein, later identified as the PSII reaction center protein D1. The photosensitivity of PSII in which the oxygen-evolving complex had been inactivated by chemical treatment was studied in the 1980s and early 1990s. A paper by Imre Vass and colleagues in 1992 described the acceptor-side mechanism of photoinhibition. Measurements of production of singlet oxygen by photoinhibited PSII provided further evidence for an acceptor-side-type mechanism. The concept of a repair cycle that continuously repairs photoinhibitory damage evolved and was reviewed by Aro et al. in 1993. Many details of the repair cycle, including the finding that the FtsH protease plays an important role in the degradation of the D1 protein, have been discovered since. In 1996, a paper by Tyystjärvi and Aro showed that the rate constant of photoinhibition is directly proportional to light intensity, a result that opposed the former assumption that photoinhibition is caused by the fraction of light energy that exceeds the maximum capability of photosynthesis. The following year, laser pulse photoinhibition experiments done by Itzhak Ohad's group led to the suggestion that charge recombination reactions may be damaging because they can lead to production of singlet oxygen. The molecular mechanism(s) of photoinhibition are constantly under discussion. The newest candidate is the manganese mechanism suggested in 2005 by the group of Esa Tyystjärvi. A similar mechanism was suggested by the group of Norio Murata, also in 2005.

What is inhibited

Cyanobacteria photosystem II, dimer, PDB 2AXT

Photoinhibition occurs in all organisms capable of oxygenic photosynthesis, from vascular plants to cyanobacteria. In both plants and cyanobacteria, blue light causes photoinhibition more efficiently than other wavelengths of visible light, and all wavelengths of ultraviolet light are more efficient than wavelengths of visible light. Photoinhibition is a series of reactions that inhibit different activities of PSII, but there is no consensus on what these steps are. The activity of the oxygen-evolving complex of PSII is often found to be lost before the rest of the reaction centre loses activity. However, inhibition of PSII membranes under anaerobic conditions leads primarily to inhibition of electron transfer on the acceptor side of PSII. Ultraviolet light causes inhibition of the oxygen-evolving complex before the rest of PSII becomes inhibited. Photosystem I (PSI) is less susceptible to light-induced damage than PSII, but slow inhibition of this photosystem has been observed. Photoinhibition of PSI occurs in chilling-sensitive plants and the reaction depends on electron flow from PSII to PSI.

How often does damage occur?

Photosystem II is damaged by light irrespective of light intensity. The quantum yield of the damaging reaction in typical leaves of higher plants exposed to visible light, as well as in isolated thylakoid membrane preparations, is in the range of 10⁻⁸ to 10⁻⁷ and independent of the intensity of light. This means that one PSII complex is damaged for every 10-100 million photons that are intercepted. Therefore, photoinhibition occurs at all light intensities and the rate constant of photoinhibition is directly proportional to light intensity. Some measurements suggest that dim light causes damage more efficiently than strong light.
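The photon count quoted above follows directly from the quantum yield. In the worked arithmetic below, the symbols $\Phi$ (quantum yield of damage), $N$ (intercepted photons per damaged PSII complex), $k_{PI}$ (rate constant of photoinhibition) and $I$ (light intensity) are notation introduced here for illustration.

$$N = \frac{1}{\Phi}, \qquad \Phi = 10^{-7} \Rightarrow N = 10^{7}, \qquad \Phi = 10^{-8} \Rightarrow N = 10^{8}$$

Because $\Phi$ does not depend on light intensity, the rate of damage scales with the rate at which photons are intercepted, which is the sense in which $k_{PI} \propto I$.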

Molecular mechanism(s)

The mechanism(s) of photoinhibition are under debate, and several mechanisms have been suggested. Reactive oxygen species, especially singlet oxygen, have a role in the acceptor-side, singlet oxygen, and low-light mechanisms. In the manganese mechanism and the donor-side mechanism, reactive oxygen species do not play a direct role. Photoinhibited PSII produces singlet oxygen, and reactive oxygen species inhibit the repair cycle of PSII by inhibiting protein synthesis in the chloroplast.

Acceptor-side photoinhibition

Strong light causes the reduction of the plastoquinone pool, which leads to protonation and double reduction (and double protonation) of the Q_A electron acceptor of Photosystem II. The protonated and double-reduced forms of Q_A do not function in electron transport. Furthermore, charge recombination reactions in inhibited Photosystem II are expected to lead to the triplet state of the primary donor (P₆₈₀) with a higher probability than the same reactions in active PSII. Triplet P₆₈₀ may react with oxygen to produce harmful singlet oxygen.

Donor-side photoinhibition

If the oxygen-evolving complex is chemically inactivated, then the remaining electron transfer activity of PSII becomes very sensitive to light. It has been suggested that even in a healthy leaf, the oxygen-evolving complex does not always function in all PSII centers, and that the centers in which it does not function are prone to rapid irreversible photoinhibition.

Manganese mechanism

A photon absorbed by the manganese ions of the oxygen-evolving complex triggers inactivation of the oxygen-evolving complex. Further inhibition of the remaining electron transport reactions occurs as in the donor-side mechanism. The mechanism is supported by the action spectrum of photoinhibition.

Singlet oxygen mechanisms

Inhibition of PSII is caused by singlet oxygen produced either by weakly coupled chlorophyll molecules or by cytochromes or iron–sulfur centers.

Low-light mechanism

Charge recombination reactions of PSII cause the production of triplet P₆₈₀ and, as a consequence, singlet oxygen. Charge recombination is more probable under dim light than under higher light intensities.

Kinetics and action spectrum

Photoinhibition follows simple first-order kinetics if measured from a lincomycin-treated leaf, cyanobacterial or algal cells, or isolated thylakoid membranes in which concurrent repair does not disturb the kinetics. Data from the group of W. S. Chow indicate that in leaves of pepper (Capsicum annuum), the first-order pattern is replaced by a pseudo-equilibrium even if the repair reaction is blocked. The deviation has been explained by assuming that photoinhibited PSII centers protect the remaining active ones. Both visible and ultraviolet light cause photoinhibition, ultraviolet wavelengths being much more damaging. Some researchers consider ultraviolet and visible light induced photoinhibition as two different reactions, while others stress the similarities between the inhibition reactions occurring under different wavelength ranges.
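First-order kinetics can be written out as a short sketch. Here $A(t)$ denotes the fraction of PSII centers still active after illumination time $t$, $A_{0}$ the initial fraction, and $k_{PI}$ the rate constant of photoinhibition; these symbols are introduced for illustration and assume that repair is fully blocked.

$$\frac{dA}{dt} = -k_{PI}\,A \quad\Longrightarrow\quad A(t) = A_{0}\,e^{-k_{PI}t}, \qquad t_{1/2} = \frac{\ln 2}{k_{PI}}$$

Under this model a plot of $\ln A$ against time is a straight line with slope $-k_{PI}$; a pseudo-equilibrium of the kind reported for pepper leaves would show up as a departure from that line.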

PSII repair cycle

Photoinhibition occurs continuously when plants or cyanobacteria are exposed to light, and the photosynthesizing organism must, therefore, continuously repair the damage. The PSII repair cycle, occurring in chloroplasts and in cyanobacteria, consists of degradation and synthesis of the D1 protein of the PSII reaction centre, followed by activation of the reaction center. Due to the rapid repair, most PSII reaction centers are not photoinhibited even if a plant is grown in strong light. However, environmental stresses, for example, extreme temperatures, salinity, and drought, limit the supply of carbon dioxide for use in carbon fixation, which decreases the rate of repair of PSII.

In photoinhibition studies, repair is often stopped by applying an antibiotic (lincomycin or chloramphenicol) to plants or cyanobacteria, which blocks protein synthesis in the chloroplast. Protein synthesis occurs only in an intact sample, so lincomycin is not needed when photoinhibition is measured from isolated membranes. The repair cycle of PSII recirculates other subunits of PSII (except for the D1 protein) from the inhibited unit to the repaired one.

Protective mechanisms

The xanthophyll cycle is important in protecting plants from photoinhibition

Plants have mechanisms that protect against adverse effects of strong light. The most studied biochemical protective mechanism is non-photochemical quenching of excitation energy. Visible-light-induced photoinhibition is ~25% faster in an Arabidopsis thaliana mutant lacking non-photochemical quenching than in the wild type. It is also apparent that turning or folding of leaves, as occurs, e.g., in Oxalis species in response to exposure to high light, protects against photoinhibition.

The PsbS protein

Because there are a limited number of photosystems in the electron transport chain, photosynthetic organisms must find a way to combat excess light and prevent photo-oxidative stress and, with it, photoinhibition. To avoid damage to the D1 subunit of PSII and the subsequent formation of reactive oxygen species (ROS), the plant cell employs accessory proteins, namely the PsbS protein, to carry the excess excitation energy from incoming sunlight. Plants have developed a rapid response, elicited by a relatively low luminal pH, by which excess energy is given off as heat and damage is reduced.

The studies of Tibiletti et al. (2016) found that PsbS is the main protein involved in sensing the changes in pH and can therefore rapidly accumulate in the presence of high light. This was determined by performing SDS-PAGE and immunoblot assays, locating PsbS itself in the green alga Chlamydomonas reinhardtii. They concluded from their data that the PsbS protein belongs to a multigene family termed LhcSR proteins, including the proteins that catalyze the conversion of violaxanthin to zeaxanthin, as previously mentioned. PsbS is involved in changing the orientation of the photosystems at times of high light to prompt the arrangement of a quenching site in the light-harvesting complex.

Additionally, studies conducted by Glowacka et al. (2018) show that a higher concentration of PsbS is directly correlated with inhibition of stomatal aperture. It does so without affecting CO₂ intake, and it increases the water use efficiency of the plant. This was determined by controlling the expression of PsbS in Nicotiana tabacum through a series of genetic modifications to the plant in order to test for PsbS levels and activity, including DNA transformation and transcription followed by protein expression. Research shows that stomatal conductance is heavily dependent on the presence of the PsbS protein. Thus, when PsbS was overexpressed in a plant, water use efficiency was seen to improve significantly, resulting in new methods for prompting higher, more productive crop yields.

These recent discoveries tie together two of the largest mechanisms in phytobiology: the influences that the light reactions have upon stomatal aperture via the Calvin-Benson cycle. To elaborate, the Calvin-Benson cycle, occurring in the stroma of the chloroplast, obtains its CO₂ from the atmosphere, which enters upon stomatal opening. The energy to drive the Calvin-Benson cycle is a product of the light reactions. Thus, the relationship has been discovered as such: when PsbS is silenced, as expected, the excitation pressure at PSII is increased. This in turn results in an activation of the redox state of Quinone A, and there is no change in the concentration of carbon dioxide in the intercellular airspaces of the leaf, ultimately increasing stomatal conductance. The inverse relationship also holds true: when PsbS is overexpressed, there is a decreased excitation pressure at PSII. Thus, the redox state of Quinone A is no longer active and there is, again, no change in the concentration of carbon dioxide in the intercellular airspaces of the leaf. All these factors work to produce a net decrease of stomatal conductance.

Measurement

Effect of illumination on the ratio of variable to maximum fluorescence (F_V/F_M) of ground-ivy (Glechoma hederacea) leaves. Photon flux density was 1000 µmol m⁻²s⁻¹, corresponding to half of full sunlight. Photoinhibition damages PSII at the same rate whether the leaf stalk is in water or lincomycin, but, in the "leaf stalk in water" sample, repair is so rapid that no net decrease in F_V/F_M occurs.

Photoinhibition can be measured from isolated thylakoid membranes or their subfractions, or from intact cyanobacterial cells by measuring the light-saturated rate of oxygen evolution in the presence of an artificial electron acceptor (quinones and dichlorophenol-indophenol have been used).

The degree of photoinhibition in intact leaves can be measured using a fluorimeter to measure the ratio of variable to maximum value of chlorophyll a fluorescence (F_V/F_M). This ratio can be used as a proxy of photoinhibition because more energy is emitted as fluorescence from Chlorophyll a when many excited electrons from PSII are not captured by the acceptor and decay back to their ground state.

When measuring F_V/F_M, the leaf must be incubated in the dark for at least 10 minutes, preferably longer, before the measurement, in order to let non-photochemical quenching relax.
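For reference, the ratio can be written out explicitly. The symbol $F_{0}$, the minimal fluorescence of the dark-adapted leaf, does not appear in the text above and is introduced here following common usage in chlorophyll fluorescence work.

$$\frac{F_{V}}{F_{M}} = \frac{F_{M} - F_{0}}{F_{M}}$$

A decrease in this ratio after illumination, measured once non-photochemical quenching has relaxed in the dark, is what is read as photoinhibition.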

Flashing light

Photoinhibition can also be induced with short flashes of light using either a pulsed laser or a xenon flash lamp. When very short flashes are used, the photoinhibitory efficiency of the flashes depends on the time difference between the flashes. This dependence has been interpreted to indicate that the flashes cause photoinhibition by inducing recombination reactions in PSII, with subsequent production of singlet oxygen. The interpretation has been criticized by noting that the photoinhibitory efficiency of xenon flashes depends on the energy of the flashes even if such strong flashes are used that they would saturate the formation of the substrate of the recombination reactions.

Dynamic photoinhibition

Some researchers prefer to define the term “photoinhibition” so that it contains all reactions that lower the quantum yield of photosynthesis when a plant is exposed to light. In this case, the term "dynamic photoinhibition" comprises phenomena that reversibly down-regulate photosynthesis in the light and the term "photodamage" or "irreversible photoinhibition" covers the concept of photoinhibition used by other researchers. The main mechanism of dynamic photoinhibition is non-photochemical quenching of excitation energy absorbed by PSII. Dynamic photoinhibition is acclimation to strong light rather than light-induced damage, and therefore "dynamic photoinhibition" may actually protect the plant against "photoinhibition".

Ecology of photoinhibition

Photoinhibition may cause coral bleaching.

Fertility and intelligence

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Fertility_and_intelligence

The relationship between fertility and intelligence has been investigated in many demographic studies. There is evidence that, on a population level, intelligence is negatively correlated with fertility rate and positively correlated with survival rate of offspring. Proponents of dysgenics postulate that, if the inverse correlation of IQ with fertility rate is stronger than the correlation of IQ with survival rate, and if the correlation between IQ and fertility can be linked to genetic factors, then the hereditary component of IQ will decrease with every new generation, eventually giving rise to a 'reversed Flynn effect', as has been observed in Norway, Denmark, Australia, Britain, the Netherlands, Sweden, Finland, France and German-speaking countries, where a slow decline in average IQ scores has been noted since the 1990s. However, detractors point out that genetic studies have shown no evidence for dysgenic effects in human populations, and they note the theory's strong association with scientific racism and eugenics. They also note that the Flynn effect demonstrates an increase in phenotypic IQ scores over time in most other countries. Additionally, any assessment of decreases in intelligence over time is complicated by the reliance on IQ as an unbiased measure of intelligence, which has been criticised by some scientists such as Stephen Jay Gould. Other correlates of IQ include income and educational attainment, which are also fertility factors, are inversely correlated with fertility rate, and are to some degree heritable.

Although fertility measures offspring per woman, if one needs to predict population-level changes, the average age of motherhood also needs to be considered, with lower age of motherhood potentially having a greater effect than fertility rate. For example, a subpopulation with a fertility rate of four and an average age of reproduction of 40 years will, generally speaking, have relatively less genotypical growth than a subpopulation with a fertility rate of three and an average age of reproduction of 20 years.
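The example can be checked with a rough calculation. The sketch below makes simplifying assumptions that are not in the text: every child survives to reproduce, two parents are replaced by $f$ children so that each generation multiplies the subpopulation by $f/2$, and the generation time $T$ is fixed. The symbols $f$, $T$ and $g$ (growth factor per year) are introduced here for illustration.

$$g = \left(\frac{f}{2}\right)^{1/T}: \qquad g_{f=4,\,T=40} = 2^{1/40} \approx 1.0175, \qquad g_{f=3,\,T=20} = 1.5^{1/20} \approx 1.0205$$

Over 40 years the first subpopulation doubles once ($\times 2$), while the second passes through two generations ($1.5^{2} = 2.25$), so under these assumptions the lower-fertility, earlier-reproducing group grows faster.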

Early views and research

The negative correlation between fertility and intelligence (as measured by IQ) has been argued to have existed in many parts of the world. Early studies, however, were "superficial and illusory" and not clearly supported by the limited data they collected.

Some of the first studies into the subject were carried out on individuals living before the advent of IQ testing, in the late 19th century, by looking at the fertility of men listed in Who's Who, these individuals being presumably of high intelligence. These men, taken as a whole, had few children, implying a correlation.

More rigorous studies carried out on Americans alive after the Second World War returned different results suggesting a slight positive correlation with respect to intelligence. The findings from these investigations were consistent enough for Osborn and Bajema, writing as late as 1972, to conclude that fertility patterns were eugenic, and that "the reproductive trend toward an increase in the frequency of genes associated with higher IQ... will probably continue in the foreseeable future in the United States and will be found also in other industrial welfare-state democracies."

Several reviewers considered the findings premature, arguing that the samples were nationally unrepresentative, generally being confined to white people born between 1910 and 1940 in the Great Lakes States. Other researchers began to report a negative correlation in the 1960s after two decades of neutral or positive fertility.

In 1982, Daniel R. Vining, Jr. sought to address these issues in a large study on the fertility of over 10,000 individuals throughout the United States, who were then aged 25 to 34. The average fertility in his study was correlated at −0.86 with IQ for white women and −0.96 for black women. Vining argued that this indicated a drop in the genotypic average IQ of 1.6 points per generation for the white population, and 2.4 points per generation for the black population. In considering these results along with those from earlier researchers, Vining wrote that "in periods of rising birth rates, persons with higher intelligence tend to have fertility equal to, if not exceeding, that of the population as a whole," but, "The recent decline in fertility thus seems to have restored the dysgenic trend observed for a comparable period of falling fertility between 1850 and 1940." To address the concern that the fertility of this sample could not be considered complete, Vining carried out a follow-up study for the same sample 18 years later, reporting the same, though slightly decreased, negative correlation between IQ and fertility. Critics note Vining's involvement with the eugenicist journal Mankind Quarterly and his acceptance of grants from the Pioneer Fund.

Later research

In a 1988 study, Retherford and Sewell examined the association between the measured intelligence and fertility of over 9,000 high school graduates in Wisconsin in 1957, and confirmed the inverse relationship between IQ and fertility for both sexes, but much more so for females. If children had, on average, the same IQ as their parents, IQ would decline by .81 points per generation. Taking .71 for the additive heritability of IQ as given by Jinks and Fulker, they calculated a dysgenic decline of .57 IQ points per generation.
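The step from .81 to .57 is a multiplication by the heritability. In the notation below, which is introduced here for illustration, $\Delta_{P}$ is the decline expected if children simply matched their parents' IQ and $h^{2}$ is the additive heritability.

$$\Delta_{G} \approx h^{2}\,\Delta_{P} = 0.71 \times 0.81 \approx 0.575$$

This agrees with the reported decline of .57 points per generation to within rounding of the inputs.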

Another way of checking the negative relationship between IQ and fertility is to consider the relationship which educational attainment has to fertility, since education is known to be a reasonable proxy for IQ, correlating with IQ at .55. In a 1999 study examining the relationship between IQ and education in a large national sample, David Rowe and others found that achieved education had a high heritability (.68) and that half of the variance in education was explained by an underlying genetic component shared by IQ, education, and SES. One study investigating fertility and education carried out in 1991 found that high school dropouts in the United States had the most children (2.5 on average), with high school graduates having fewer children, and college graduates having the fewest children (1.56 on average).

The Bell Curve (1994) argued that the average genotypic IQ of the United States was declining due to both dysgenic fertility and large-scale immigration of groups with low average IQ.

In a 1999 study, Richard Lynn examined the relationship between the intelligence of adults aged 40 and above and the numbers of children and siblings they had. Data was collected from a 1994 National Opinion Research Center survey among a representative sample of 2992 English-speaking individuals aged 18 years. He found negative correlations between the intelligence of American adults and the number of children and siblings that they had, but only for females. He also reported that there was virtually no correlation between women's intelligence and the number of children they considered ideal.

In 2004 Lynn and Marian Van Court attempted a straightforward replication of Vining's work. Their study returned similar results, with the genotypic decline measuring at 0.9 IQ points per generation for the total sample and 0.75 IQ points for whites only.

Boutwell et al. (2013) reported a strong negative association between county-level IQ and county-level fertility rates in the United States.

A 2014 study by Satoshi Kanazawa using data from the National Child Development Study found that more intelligent women and men were more likely to want to be childless, but that only more intelligent women – not men – were more likely to actually be childless.

International research

Map of countries by fertility rate (2020), according to the Population Reference Bureau

Although much of the research into intelligence and fertility has been restricted to individuals within a single nation (usually the United States), Steven Shatz (2008) extended the research internationally; he found that "There is a strong tendency for countries with lower national IQ scores to have higher fertility rates and for countries with higher national IQ scores to have lower fertility rates."

Lynn and Harvey (2008) found a correlation of −0.73 between national IQ and fertility. They estimated that the effect had been "a decline in the world's genotypic IQ of 0.86 IQ points for the years 1950–2000. A further decline of 1.28 IQ points in the world's genotypic IQ is projected for the years 2000–2050." In the first period, this effect had been compensated for by the Flynn effect, which caused a rise in phenotypic IQ, but recent studies in four developed nations had found that the Flynn effect had ceased or gone into reverse. They thought it probable that both genotypic and phenotypic IQ would gradually start to decline for the whole world.

Possible causes

Income

One theory to explain the fertility-intelligence relationship is that, while income and IQ are positively correlated, income is itself a factor that correlates inversely with fertility: the higher the income, the lower the fertility rate, and vice versa. There is thus an inverse correlation between income and fertility within and between nations. The higher the level of education and GDP per capita of a human population, sub-population or social stratum, the fewer children are born. At a 1974 UN population conference in Bucharest, Karan Singh, a former minister of population in India, encapsulated this relationship by stating "Development is the best contraceptive".

Education

In most countries, education is inversely correlated to childbearing. People often delay childbearing in order to spend more time getting education, and thus have fewer children. Conversely, early childbearing can interfere with education, so people with early or frequent childbearing are likely to be less educated. While education and childbearing place competing demands on a person's resources, education is positively correlated with IQ.

While there is less research into men's fertility and education, evidence from developed countries suggests that highly educated men have more children than less-educated men.

As a country becomes more developed, education rates increase and fertility rates decrease for both men and women. Fertility has fallen faster for less-educated men and women than it has for highly educated men and women. In the Nordic countries of Denmark, Norway, and Sweden, fertility for less-educated women has fallen far enough that childlessness is now highest among the least educated women, just as it is for men.

Birth control and intelligence

Among a sample of women using birth control methods of comparable theoretical effectiveness, success rates were related to IQ, with the percentages of high, medium and low IQ women having unwanted births during a three-year interval being 3%, 8% and 11%, respectively. Since the effectiveness of many methods of birth control is directly correlated with proper usage, an alternative interpretation of the data would indicate lower IQ women were less likely to use birth control consistently and correctly. Another study found that after an unwanted pregnancy has occurred, higher IQ couples are more likely to obtain abortions; and unmarried teenage girls who become pregnant are found to be more likely to carry their babies to term if they are doing poorly in school.

Conversely, while desired family size in the United States is apparently the same for women of all IQ levels, highly educated women are found to be more likely to say that they desire more children than they have, indicating a "deficit fertility" in the highly intelligent. In her review of reproductive trends in the United States, Van Court argues that "each factor – from initially employing some form of contraception, to successful implementation of the method, to termination of an accidental pregnancy when it occurs – involves selection against intelligence."

Criticisms

While it may seem obvious that such differences in fertility would result in a progressive change of IQ, Preston and Campbell (1993) argued that this is a mathematical fallacy that applies only when looking at closed subpopulations. In their mathematical model, with constant differences in fertility, since children's IQ can be more or less than that of their parents, a steady-state equilibrium is argued to be established between different subpopulations with different IQ. The mean IQ will not change in the absence of a change of the fertility differences. The steady-state IQ distribution will be lower for negative differential fertility than for positive, but these differences are small. For the extreme and unrealistic assumption of endogamous mating in IQ subgroups, a differential fertility change of 2.5/1.5 to 1.5/2.5 (high IQ/low IQ) causes a maximum shift of four IQ points. For random mating, the shift is less than one IQ point. James S. Coleman, however, argues that Preston and Campbell's model depends on assumptions which are unlikely to be true.
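The steady-state argument can be illustrated with a deliberately simplified toy model. The sketch below is not Preston and Campbell's model: the two classes, the fertility values (borrowed from the 2.5/1.5 ratio mentioned above), and the transition probabilities are all illustrative assumptions, and the function names are hypothetical. It shows only that when children can end up in a different class from their parents, constant differential fertility drives the class shares toward a fixed point rather than a continuing drift.

```python
def step(pop, fertility, transition):
    """Advance the population by one generation.

    pop[i]           -- share of parents in class i (0 = lower IQ, 1 = higher IQ)
    fertility[i]     -- children per parent in class i
    transition[i][j] -- probability that a child of a class-i parent is in class j
    """
    k = len(pop)
    nxt = [0.0] * k
    for i in range(k):
        children = pop[i] * fertility[i]
        for j in range(k):
            nxt[j] += children * transition[i][j]
    return nxt


def high_share_history(pop, fertility, transition, generations):
    """Return the share of the higher-IQ class after each generation."""
    shares = []
    for _ in range(generations):
        pop = step(pop, fertility, transition)
        total = sum(pop)
        pop = [n / total for n in pop]  # renormalise so only shares are tracked
        shares.append(pop[1])
    return shares


# Assumed inputs: the lower class has more children (2.5 vs 1.5), and a child
# has a 20% chance of ending up in the class its parent does not belong to.
fertility = [2.5, 1.5]
transition = [[0.8, 0.2], [0.2, 0.8]]

shares = high_share_history([0.5, 0.5], fertility, transition, 60)
other = high_share_history([0.1, 0.9], fertility, transition, 60)
print(shares[-1], other[-1])
```

Under these assumptions both starting populations settle on the same share for the higher class, and the share stops changing once that point is reached. The value it settles on depends entirely on the assumed fertility and transition figures.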

The general increase in IQ test scores, the Flynn effect, has been argued to be evidence against dysgenic arguments. Geneticist Steve Connor wrote that Lynn, writing in Dysgenics: Genetic Deterioration in Modern Populations, "misunderstood modern ideas of genetics." "A flaw in his argument of genetic deterioration in intelligence was the widely accepted fact that intelligence as measured by IQ tests has actually increased over the past 50 years." If the genes causing IQ have been adversely affected, IQ scores should reasonably be expected to change in the same direction, yet the reverse has occurred. However, it has been argued that genotypic IQ may decrease even while phenotypic IQ rises throughout the population due to environmental effects such as better nutrition and education. The Flynn effect may now have ended or reversed in some developed nations.

Some of the studies looking at the relation between IQ and fertility cover the fertility of individuals who have attained a particular age, thereby ignoring the positive correlation between IQ and survival. To draw conclusions about effects on the IQ of future populations, such effects would have to be taken into account.

Recent research has shown that education and socioeconomic status are better indicators of fertility and suggests that the relationship between intelligence and number of children may be spurious. When controlling for education and socioeconomic status, the relationship between intelligence and number of children, intelligence and number of siblings, and intelligence and ideal number of children reduces to statistical insignificance. Among women, a post-hoc analysis revealed that the lowest and highest intelligence scores did not differ significantly by number of children. However, socioeconomic status and (obviously) education are themselves not independent of intelligence.

Most research involves studying female fertility, while male fertility is ignored. When male fertility rates are compared with educational attainment, men with more education father more children.

Other research suggests that siblings born further apart achieve higher educational outcomes. Therefore, sibling density, not number of siblings, may explain the negative association between IQ and number of siblings.

Other traits

A study by the Institute of Psychiatry determined that men with higher IQs tend to have better-quality sperm than men with lower IQs, even when age and lifestyle are taken into account, stating that the genes underlying intelligence may be multi-factored.
