Sunday, August 6, 2023

Entropy (information theory)


In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable X, which takes values in the alphabet 𝒳 and is distributed according to p : 𝒳 → [0, 1], the entropy is

Η(X) := −Σx∈𝒳 p(x) log p(x),

where Σ denotes the sum over the variable's possible values. The choice of base for log, the logarithm, varies for different applications. Base 2 gives the unit of bits (or "shannons"), while base e gives "natural units" nat, and base 10 gives units of "dits", "bans", or "hartleys". An equivalent definition of entropy is the expected value of the self-information of a variable.

Two bits of entropy: In the case of two fair coin tosses, the information entropy in bits is the base-2 logarithm of the number of possible outcomes; with two coins there are four possible outcomes, and two bits of entropy. Generally, information entropy is the average amount of information conveyed by an event, when considering all possible outcomes.
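
As a concrete illustration (added here, not part of the original article), the following minimal Python sketch computes the entropy of a discrete distribution in bits; the function name entropy and the example distributions are illustrative choices.

    from math import log2

    def entropy(probs):
        """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
        return -sum(p * log2(p) for p in probs if p > 0)  # the summand 0*log(0) is taken to be 0

    # A single fair coin: two equally likely outcomes -> 1 bit.
    print(entropy([0.5, 0.5]))                 # 1.0

    # Two independent fair coin tosses: four equally likely outcomes -> 2 bits.
    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0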

The concept of information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication", and is also referred to as Shannon entropy. Shannon's theory defines a data communication system composed of three elements: a source of data, a communication channel, and a receiver. The "fundamental problem of communication" – as expressed by Shannon – is for the receiver to be able to identify what data was generated by the source, based on the signal it receives through the channel. Shannon considered various ways to encode, compress, and transmit messages from a data source, and proved in his famous source coding theorem that the entropy represents an absolute mathematical limit on how well data from the source can be losslessly compressed onto a perfectly noiseless channel. Shannon strengthened this result considerably for noisy channels in his noisy-channel coding theorem.

Entropy in information theory is directly analogous to the entropy in statistical thermodynamics. The analogy results when the values of the random variable designate energies of microstates, so the Gibbs formula for the entropy is formally identical to Shannon's formula. Entropy has relevance to other areas of mathematics such as combinatorics and machine learning. The definition can be derived from a set of axioms establishing that entropy should be a measure of how informative the average outcome of a variable is. For a continuous random variable, differential entropy is analogous to entropy.

Introduction

The core idea of information theory is that the "informational value" of a communicated message depends on the degree to which the content of the message is surprising. If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative. For instance, the knowledge that some particular number will not be the winning number of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that a particular number will win a lottery has high informational value because it communicates the outcome of a very low probability event.

The information content, also called the surprisal or self-information, of an event E is a function that increases as the probability p(E) of the event decreases. When p(E) is close to 1, the surprisal of the event is low, but if p(E) is close to 0, the surprisal of the event is high. This relationship is described by the function

log(1/p(E)),

where log is the logarithm, which gives 0 surprise when the probability of the event is 1. In fact, log is the only function that satisfies a specific set of conditions defined in the section § Characterization.

Hence, we can define the information, or surprisal, of an event E by

I(E) = −log(p(E)),

or equivalently,

I(E) = log(1/p(E)).

Entropy measures the expected (i.e., average) amount of information conveyed by identifying the outcome of a random trial. This implies that casting a die has higher entropy than tossing a coin because each outcome of a die toss has smaller probability (p = 1/6) than each outcome of a coin toss (p = 1/2).

Consider a coin with probability p of landing on heads and probability 1 − p of landing on tails. The maximum surprise is when p = 1/2, for which one outcome is not expected over the other. In this case a coin flip has an entropy of one bit. (Similarly, one trit with equiprobable values contains log2 3 (about 1.58496) bits of information because it can have one of three values.) The minimum surprise is when p = 0 or p = 1, when the event outcome is known ahead of time, and the entropy is zero bits. When the entropy is zero bits, this is sometimes referred to as unity, where there is no uncertainty at all – no freedom of choice – no information. Other values of p give entropies between zero and one bits.
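
A short Python sketch of the binary entropy function discussed above (added for illustration); the values p = 1/2, p = 0 and the trit example come from the text, everything else is an illustrative choice.

    from math import log2

    def binary_entropy(p):
        """Entropy in bits of a coin that lands heads with probability p."""
        if p in (0.0, 1.0):
            return 0.0  # outcome known in advance: zero bits
        return -p * log2(p) - (1 - p) * log2(1 - p)

    print(binary_entropy(0.5))   # 1.0 bit: maximum surprise
    print(binary_entropy(0.0))   # 0.0 bits: no uncertainty at all
    print(log2(3))               # ~1.585 bits: one trit with equiprobable values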

Information theory is useful to calculate the smallest amount of information required to convey a message, as in data compression. For example, consider the transmission of sequences comprising the 4 characters 'A', 'B', 'C', and 'D' over a binary channel. If all 4 letters are equally likely (25%), one cannot do better than using two bits to encode each letter. 'A' might code as '00', 'B' as '01', 'C' as '10', and 'D' as '11'. However, if the probabilities of each letter are unequal, say 'A' occurs with 70% probability, 'B' with 26%, and 'C' and 'D' with 2% each, one could assign variable length codes. In this case, 'A' would be coded as '0', 'B' as '10', 'C' as '110', and 'D' as '111'. With this representation, 70% of the time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On average, fewer than 2 bits are required since the entropy is lower (owing to the high prevalence of 'A' followed by 'B' – together 96% of characters). The calculation of the sum of probability-weighted log probabilities measures and captures this effect. English text, treated as a string of characters, has fairly low entropy; i.e. it is fairly predictable. We can be fairly certain that, for example, 'e' will be far more common than 'z', that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of the word. English text has between 0.6 and 1.3 bits of entropy per character of the message.
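
The following Python sketch (an illustration added here, not from the source) checks the numbers in this example: it computes the entropy of the skewed letter distribution and the average length of the variable-length code given above.

    from math import log2

    probs = {'A': 0.70, 'B': 0.26, 'C': 0.02, 'D': 0.02}
    code  = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}  # the prefix-free code from the text

    entropy = -sum(p * log2(p) for p in probs.values())
    avg_len = sum(probs[s] * len(code[s]) for s in probs)

    print(f"entropy             ≈ {entropy:.3f} bits/symbol")   # ≈ 1.09
    print(f"average code length ≈ {avg_len:.3f} bits/symbol")   # 1.34, below the 2 bits of a fixed-length code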

Definition

Named after Boltzmann's Η-theorem, Shannon defined the entropy Η (Greek capital letter eta) of a discrete random variable X, which takes values in the alphabet 𝒳 and is distributed according to p : 𝒳 → [0, 1] such that p(x) := P[X = x]:

Η(X) = E[I(X)] = E[−log p(X)].

Here E is the expected value operator, and I is the information content of X. I(X) is itself a random variable.

The entropy can explicitly be written as:

Η(X) = −Σx∈𝒳 p(x) logb(p(x)),

where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the corresponding units of entropy are bits for b = 2, nats for b = e, and bans for b = 10.

In the case of p(x) = 0 for some x ∈ 𝒳, the value of the corresponding summand 0 logb(0) is taken to be 0, which is consistent with the limit:

lim_{p→0+} p logb(p) = 0.

One may also define the conditional entropy of two variables X and Y taking values from sets 𝒳 and 𝒴 respectively, as:

Η(X|Y) = −Σy∈𝒴 Σx∈𝒳 pX,Y(x, y) log( pX,Y(x, y) / pY(y) ),

where pX,Y(x, y) := P[X = x, Y = y] and pY(y) := P[Y = y]. This quantity should be understood as the remaining randomness in the random variable X given the random variable Y.
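
A small Python sketch (added here) of the conditional entropy formula above; the joint distribution used is a made-up example.

    from math import log2
    from collections import defaultdict

    # Hypothetical joint distribution p(x, y), used purely for illustration.
    joint = {('sunny', 'warm'): 0.4, ('sunny', 'cold'): 0.1,
             ('rainy', 'warm'): 0.1, ('rainy', 'cold'): 0.4}

    def conditional_entropy(joint):
        """H(X|Y) = -sum over (x, y) of p(x, y) * log2( p(x, y) / p(y) ), in bits."""
        p_y = defaultdict(float)
        for (x, y), p in joint.items():
            p_y[y] += p
        return -sum(p * log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

    print(conditional_entropy(joint))  # remaining uncertainty about X once Y is known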

Measure theory

Entropy can be formally defined in the language of measure theory as follows:[11] Let (X, Σ, μ) be a probability space. Let A ∈ Σ be an event. The surprisal of A is

σμ(A) = −ln μ(A).

The expected surprisal of A is

hμ(A) = μ(A) σμ(A).

A μ-almost partition is a set family P ⊆ 𝒫(X) such that μ(∪P) = 1 and μ(A ∩ B) = 0 for all distinct A, B ∈ P. (This is a relaxation of the usual conditions for a partition.) The entropy of P is

Ημ(P) = ΣA∈P hμ(A).

Let M be a sigma-algebra on X. The entropy of M is

Ημ(M) = sup_{P⊆M} Ημ(P).

Finally, the entropy of the probability space is Ημ(Σ), that is, the entropy with respect to μ of the sigma-algebra Σ of all measurable subsets of X.

Ellerman definition

David Ellerman wanted to explain why conditional entropy and other functions had properties similar to functions in probability theory. He claims that previous definitions based on measure theory only worked with powers of 2.

Ellerman created a "logic of partitions" that is the dual of the logic of subsets of a universal set. Information is quantified as "dits" (distinctions), a measure on partitions. "Dits" can be converted into Shannon's bits, to get the formulas for conditional entropy, etc.

Example

Entropy Η(X) (i.e. the expected surprisal) of a coin flip, measured in bits, graphed versus the bias of the coin Pr(X = 1), where X = 1 represents a result of heads.

Here, the entropy is at most 1 bit, and to communicate the outcome of a coin flip (2 possible values) will require an average of at most 1 bit (exactly 1 bit for a fair coin). The result of a fair die (6 possible values) would have entropy log2 6 (about 2.585) bits.

Consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails; this can be modelled as a Bernoulli process.

The entropy of the unknown result of the next toss of the coin is maximized if the coin is fair (that is, if heads and tails both have equal probability 1/2). This is the situation of maximum uncertainty as it is most difficult to predict the outcome of the next toss; the result of each toss of the coin delivers one full bit of information. This is because

Η(X) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit.

However, if we know the coin is not fair, but comes up heads or tails with probabilities p and q, where p ≠ q, then there is less uncertainty. Every time it is tossed, one side is more likely to come up than the other. The reduced uncertainty is quantified in a lower entropy: on average each toss of the coin delivers less than one full bit of information. For example, if p = 0.7, then

Η(X) = −0.7 log2(0.7) − 0.3 log2(0.3) ≈ 0.8813 bits.

Uniform probability yields maximum uncertainty and therefore maximum entropy. Entropy, then, can only decrease from the value associated with uniform probability. The extreme case is that of a double-headed coin that never comes up tails, or a double-tailed coin that never results in a head. Then there is no uncertainty. The entropy is zero: each toss of the coin delivers no new information as the outcome of each coin toss is always certain.

Entropy can be normalized by dividing it by information length. This ratio is called metric entropy and is a measure of the randomness of the information.

Characterization

To understand the meaning of −Σ pi log(pi), first define an information function I in terms of an event i with probability pi. The amount of information acquired due to the observation of event i follows from Shannon's solution of the fundamental properties of information:

  1. I(p) is monotonically decreasing in p: an increase in the probability of an event decreases the information from an observed event, and vice versa.
  2. I(p) ≥ 0: information is a non-negative quantity.
  3. I(1) = 0: events that always occur do not communicate information.
  4. I(p1·p2) = I(p1) + I(p2): the information learned from independent events is the sum of the information learned from each event.

Given two independent events, if the first event can yield one of n equiprobable outcomes and another has one of m equiprobable outcomes then there are mn equiprobable outcomes of the joint event. This means that if log2(n) bits are needed to encode the first value and log2(m) to encode the second, one needs log2(mn) = log2(m) + log2(n) to encode both.

Shannon discovered that a suitable choice of I is given by:

I(p) = log(1/p) = −log(p).

In fact, the only possible values of I are I(u) = k log(u) for k < 0. Additionally, choosing a value for k is equivalent to choosing a value x > 1 for k = −1/log(x), so that x corresponds to the base for the logarithm. Thus, entropy is characterized by the above four properties.

The different units of information (bits for the binary logarithm log2, nats for the natural logarithm ln, bans for the decimal logarithm log10 and so on) are constant multiples of each other. For instance, in case of a fair coin toss, heads provides log2(2) = 1 bit of information, which is approximately 0.693 nats or 0.301 decimal digits. Because of additivity, n tosses provide n bits of information, which is approximately 0.693n nats or 0.301n decimal digits.
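
A brief Python illustration (added here) of how the same self-information is expressed in bits, nats and bans for the fair-coin example above.

    from math import log, log2, log10

    p = 0.5  # probability of heads in a fair coin toss
    bits = -log2(p)    # 1.0 bit (shannon)
    nats = -log(p)     # ~0.693 nats
    bans = -log10(p)   # ~0.301 bans / decimal digits

    # The units are constant multiples of one another: 1 bit = ln(2) nats = log10(2) bans.
    print(bits, nats, bans)
    print(bits * log(2), bits * log10(2))  # the same values obtained via conversion factors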

The meaning of the events observed (the meaning of messages) does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves.

Alternative characterization

Another characterization of entropy uses the following properties. We denote pi = Pr(X = xi) and Ηn(p1, ..., pn) = Η(X).

  1. Continuity: H should be continuous, so that changing the values of the probabilities by a very small amount should only change the entropy by a small amount.
  2. Symmetry: H should be unchanged if the outcomes xi are re-ordered. That is, Ηn(p1, p2, ..., pn) = Ηn(pσ(1), pσ(2), ..., pσ(n)) for any permutation σ of {1, ..., n}.
  3. Maximum: Ηn should be maximal if all the outcomes are equally likely, i.e. Ηn(p1, ..., pn) ≤ Ηn(1/n, ..., 1/n).
  4. Increasing number of outcomes: for equiprobable events, the entropy should increase with the number of outcomes, i.e. Ηn(1/n, ..., 1/n) < Ηn+1(1/(n+1), ..., 1/(n+1)).
  5. Additivity: given an ensemble of n uniformly distributed elements that are divided into k boxes (sub-systems) with b1, ..., bk elements each, the entropy of the whole ensemble should be equal to the sum of the entropy of the system of boxes and the individual entropies of the boxes, each weighted with the probability of being in that particular box.

The rule of additivity has the following consequences: for positive integers bi where b1 + ... + bk = n,

Ηn(1/n, ..., 1/n) = Ηk(b1/n, ..., bk/n) + Σi=1..k (bi/n) Ηbi(1/bi, ..., 1/bi).

Choosing k = n, b1 = ... = bn = 1 this implies that the entropy of a certain outcome is zero: Η1(1) = 0. This implies that the efficiency of a source alphabet with n symbols can be defined simply as being equal to its n-ary entropy. See also Redundancy (information theory).

Alternative characterization via additivity and subadditivity

Another succinct axiomatic characterization of Shannon entropy was given by Aczél, Forte and Ng, via the following properties:

  1. Subadditivity: Η(X, Y) ≤ Η(X) + Η(Y) for jointly distributed random variables X, Y.
  2. Additivity: Η(X, Y) = Η(X) + Η(Y) when the random variables X, Y are independent.
  3. Expansibility: Ηn+1(p1, ..., pn, 0) = Ηn(p1, ..., pn), i.e., adding an outcome with probability zero does not change the entropy.
  4. Symmetry: Ηn(p1, ..., pn) is invariant under permutation of p1, ..., pn.
  5. Small for small probabilities: lim_{q→0+} Η2(1 − q, q) = 0.

It was shown that any function Η satisfying the above properties must be a constant multiple of Shannon entropy, with a non-negative constant. Compared to the previously mentioned characterizations of entropy, this characterization focuses on the properties of entropy as a function of random variables (subadditivity and additivity), rather than the properties of entropy as a function of the probability vector (p1, ..., pn).

It is worth noting that if we drop the "small for small probabilities" property, then Η must be a non-negative linear combination of the Shannon entropy and the Hartley entropy.

Further properties

The Shannon entropy satisfies the following properties, for some of which it is useful to interpret entropy as the expected amount of information learned (or uncertainty eliminated) by revealing the value of a random variable X:

  • Adding or removing an event with probability zero does not contribute to the entropy:
Ηn+1(p1, ..., pn, 0) = Ηn(p1, ..., pn).
  • The maximal entropy of a variable with n outcomes is logb(n). It can be confirmed using Jensen's inequality that
Η(X) = E[logb(1/p(X))] ≤ logb(E[1/p(X)]) = logb(n).
This maximal entropy of logb(n) is effectively attained by a source alphabet having a uniform probability distribution: uncertainty is maximal when all possible events are equiprobable.
  • The entropy or the amount of information revealed by evaluating (X,Y) (that is, evaluating X and Y simultaneously) is equal to the information revealed by conducting two consecutive experiments: first evaluating the value of Y, then revealing the value of X given that you know the value of Y. This may be written as:
Η(X, Y) = Η(X|Y) + Η(Y) = Η(Y|X) + Η(X).
  • If Y = f(X) where f is a function, then Η(f(X)|X) = 0. Applying the previous formula to Η(X, f(X)) yields
Η(X) + Η(f(X)|X) = Η(f(X)) + Η(X|f(X)),
so Η(f(X)) ≤ Η(X): the entropy of a variable can only decrease when the latter is passed through a function.
  • If X and Y are two independent random variables, then knowing the value of Y doesn't influence our knowledge of the value of X (since the two don't influence each other by independence):
Η(X|Y) = Η(X).
  • More generally, for any random variables X and Y, we have
Η(X|Y) ≤ Η(X).
  • The entropy of two simultaneous events is no more than the sum of the entropies of each individual event, i.e., Η(X, Y) ≤ Η(X) + Η(Y), with equality if and only if the two events are independent.
  • The entropy is concave in the probability mass function p, i.e.
Η(λp1 + (1 − λ)p2) ≥ λΗ(p1) + (1 − λ)Η(p2)
for all probability mass functions p1 and p2 and all 0 ≤ λ ≤ 1.

Aspects

Relationship to thermodynamic entropy

The inspiration for adopting the word entropy in information theory came from the close resemblance between Shannon's formula and very similar known formulae from statistical mechanics.

In statistical thermodynamics the most general formula for the thermodynamic entropy S of a thermodynamic system is the Gibbs entropy

S = −kB Σi pi ln(pi),

where kB is the Boltzmann constant, and pi is the probability of a microstate. The Gibbs entropy was defined by J. Willard Gibbs in 1878 after earlier work by Boltzmann (1872).

The Gibbs entropy translates over almost unchanged into the world of quantum physics to give the von Neumann entropy, introduced by John von Neumann in 1927,

S = −kB Tr(ρ ln ρ),

where ρ is the density matrix of the quantum mechanical system and Tr is the trace.

At an everyday practical level, the links between information entropy and thermodynamic entropy are not evident. Physicists and chemists are apt to be more interested in changes in entropy as a system spontaneously evolves away from its initial conditions, in accordance with the second law of thermodynamics, rather than an unchanging probability distribution. As the minuteness of the Boltzmann constant kB indicates, the changes in S / kB for even tiny amounts of substances in chemical and physical processes represent amounts of entropy that are extremely large compared to anything in data compression or signal processing. In classical thermodynamics, entropy is defined in terms of macroscopic measurements and makes no reference to any probability distribution, which is central to the definition of information entropy.

The connection between thermodynamics and what is now known as information theory was first made by Ludwig Boltzmann and expressed by his famous equation:

S = kB ln(W),

where S is the thermodynamic entropy of a particular macrostate (defined by thermodynamic parameters such as temperature, volume, energy, etc.), W is the number of microstates (various combinations of particles in various energy states) that can yield the given macrostate, and kB is the Boltzmann constant. It is assumed that each microstate is equally likely, so that the probability of a given microstate is pi = 1/W. When these probabilities are substituted into the above expression for the Gibbs entropy (or equivalently kB times the Shannon entropy), Boltzmann's equation results. In information theoretic terms, the information entropy of a system is the amount of "missing" information needed to determine a microstate, given the macrostate.

In the view of Jaynes (1957), thermodynamic entropy, as explained by statistical mechanics, should be seen as an application of Shannon's information theory: the thermodynamic entropy is interpreted as being proportional to the amount of further Shannon information needed to define the detailed microscopic state of the system, that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics, with the constant of proportionality being just the Boltzmann constant. Adding heat to a system increases its thermodynamic entropy because it increases the number of possible microscopic states of the system that are consistent with the measurable values of its macroscopic variables, making any complete state description longer. (See article: maximum entropy thermodynamics). Maxwell's demon can (hypothetically) reduce the thermodynamic entropy of a system by using information about the states of individual molecules; but, as Landauer (from 1961) and co-workers have shown, to function the demon himself must increase thermodynamic entropy in the process, by at least the amount of Shannon information he proposes to first acquire and store; and so the total thermodynamic entropy does not decrease (which resolves the paradox). Landauer's principle imposes a lower bound on the amount of heat a computer must generate to process a given amount of information, though modern computers are far less efficient.

Data compression

Shannon's definition of entropy, when applied to an information source, can determine the minimum channel capacity required to reliably transmit the source as encoded binary digits. Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc. The minimum channel capacity can be realized in theory by using the typical set or in practice using Huffman, Lempel–Ziv or arithmetic coding. (See also Kolmogorov complexity.) In practice, compression algorithms deliberately include some judicious redundancy in the form of checksums to protect against errors. The entropy rate of a data source is the average number of bits per symbol needed to encode it. Shannon's experiments with human predictors show an information rate between 0.6 and 1.3 bits per character in English; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character in English text.

If a compression scheme is lossless – one in which you can always recover the entire original message by decompression – then a compressed message has the same quantity of information as the original but communicated in fewer characters. It has more information (higher entropy) per character. A compressed message has less redundancy. Shannon's source coding theorem states a lossless compression scheme cannot compress messages, on average, to have more than one bit of information per bit of message, but that any value less than one bit of information per bit of message can be attained by employing a suitable coding scheme. The entropy of a message per bit multiplied by the length of that message is a measure of how much total information the message contains. Shannon's theorem also implies that no lossless compression scheme can shorten all messages. If some messages come out shorter, at least one must come out longer due to the pigeonhole principle. In practical use, this is generally not a problem, because one is usually only interested in compressing certain types of messages, such as a document in English, as opposed to gibberish text, or digital photographs rather than noise, and it is unimportant if a compression algorithm makes some unlikely or uninteresting sequences larger.

A 2011 study in Science estimates the world's technological capacity to store and communicate optimally compressed information normalized on the most effective compression algorithms available in the year 2007, therefore estimating the entropy of the technologically available sources.

All figures in entropically compressed exabytes
Type of Information 1986 2007
Storage 2.6 295
Broadcast 432 1900
Telecommunications 0.281 65

The authors estimate humankind's technological capacity to store information (fully entropically compressed) in 1986 and again in 2007. They break the information into three categories—to store information on a medium, to receive information through one-way broadcast networks, or to exchange information through two-way telecommunication networks.

Entropy as a measure of diversity

Entropy is one of several ways to measure biodiversity, and is applied in the form of the Shannon index. A diversity index is a quantitative statistical measure of how many different types exist in a dataset, such as species in a community, accounting for ecological richness, evenness, and dominance. Specifically, Shannon entropy is the logarithm of ¹D, the true diversity index with parameter equal to 1. The Shannon index is related to the proportional abundances of types.

Limitations of entropy

There are a number of entropy-related concepts that mathematically quantify information content in some way:

  • the self-information of an individual message or symbol taken from a given probability distribution,
  • the entropy of a given probability distribution of messages or symbols, and
  • the entropy rate of a stochastic process.

(The "rate of self-information" can also be defined for a particular sequence of messages or symbols generated by a given stochastic process: this will always be equal to the entropy rate in the case of a stationary process.) Other quantities of information are also used to compare or relate different sources of information.

It is important not to confuse the above concepts. Often it is only clear from context which one is meant. For example, when someone says that the "entropy" of the English language is about 1 bit per character, they are actually modeling the English language as a stochastic process and talking about its entropy rate. Shannon himself used the term in this way.

If very large blocks are used, the estimate of per-character entropy rate may become artificially low because the probability distribution of the sequence is not known exactly; it is only an estimate. If one considers the text of every book ever published as a sequence, with each symbol being the text of a complete book, and if there are N published books, and each book is only published once, the estimate of the probability of each book is 1/N, and the entropy (in bits) is −log2(1/N) = log2(N). As a practical code, this corresponds to assigning each book a unique identifier and using it in place of the text of the book whenever one wants to refer to the book. This is enormously useful for talking about books, but it is not so useful for characterizing the information content of an individual book, or of language in general: it is not possible to reconstruct the book from its identifier without knowing the probability distribution, that is, the complete text of all the books. The key idea is that the complexity of the probabilistic model must be considered. Kolmogorov complexity is a theoretical generalization of this idea that allows the consideration of the information content of a sequence independent of any particular probability model; it considers the shortest program for a universal computer that outputs the sequence. A code that achieves the entropy rate of a sequence for a given model, plus the codebook (i.e. the probabilistic model), is one such program, but it may not be the shortest.

The Fibonacci sequence is 1, 1, 2, 3, 5, 8, 13, .... Treating the sequence as a message and each number as a symbol, there are almost as many symbols as there are characters in the message, giving an entropy of approximately log2(n). The first 128 symbols of the Fibonacci sequence have an entropy of approximately 7 bits/symbol, but the sequence can be expressed using a formula [F(n) = F(n−1) + F(n−2) for n = 3, 4, 5, ..., F(1) = 1, F(2) = 1] and this formula has a much lower entropy and applies to any length of the Fibonacci sequence.
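
A Python sketch (illustrative, not from the source) that reproduces the rough figure above by treating each of the first 128 Fibonacci numbers as a symbol and computing the empirical entropy.

    from math import log2
    from collections import Counter

    # Treat each of the first 128 Fibonacci numbers as one symbol of a message.
    fib = [1, 1]
    while len(fib) < 128:
        fib.append(fib[-1] + fib[-2])

    counts = Counter(fib)
    n = len(fib)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    print(entropy)  # ≈ 6.98 bits/symbol, close to log2(128) = 7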

Limitations of entropy in cryptography

In cryptanalysis, entropy is often roughly used as a measure of the unpredictability of a cryptographic key, though its real uncertainty is unmeasurable. For example, a 128-bit key that is uniformly and randomly generated has 128 bits of entropy. It also takes (on average) 2^127 guesses to break by brute force. Entropy fails to capture the number of guesses required if the possible keys are not chosen uniformly. Instead, a measure called guesswork can be used to measure the effort required for a brute force attack.

Other problems may arise from non-uniform distributions used in cryptography. Consider, for example, a 1,000,000-digit binary one-time pad combined with the plaintext using exclusive or. If the pad has 1,000,000 bits of entropy, it is perfect. If the pad has 999,999 bits of entropy, evenly distributed (each individual bit of the pad having 0.999999 bits of entropy), it may provide good security. But if the pad has 999,999 bits of entropy, where the first bit is fixed and the remaining 999,999 bits are perfectly random, the first bit of the ciphertext will not be encrypted at all.

Data as a Markov process

A common way to define entropy for text is based on the Markov model of text. For an order-0 source (each character is selected independently of the preceding characters), the binary entropy is:

Η(S) = −Σi pi log2(pi),

where pi is the probability of i. For a first-order Markov source (one in which the probability of selecting a character is dependent only on the immediately preceding character), the entropy rate is:

Η(S) = −Σi pi Σj pi(j) log2(pi(j)),

where i is a state (certain preceding characters) and pi(j) is the probability of j given i as the previous character.

For a second-order Markov source, the entropy rate is

Η(S) = −Σi pi Σj pi(j) Σk pi,j(k) log2(pi,j(k)).
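
A Python sketch of the first-order entropy-rate formula above; the two-character source, its stationary probabilities and its transition probabilities are made-up values chosen only for illustration.

    from math import log2

    # Hypothetical first-order Markov source over two characters.
    stationary = {'a': 0.75, 'b': 0.25}
    transition = {'a': {'a': 0.9, 'b': 0.1},
                  'b': {'a': 0.3, 'b': 0.7}}

    # Entropy rate: H = - sum over i of p_i * sum over j of p_i(j) * log2(p_i(j))
    rate = -sum(p_i * sum(p_ij * log2(p_ij) for p_ij in transition[i].values() if p_ij > 0)
                for i, p_i in stationary.items())
    print(rate, "bits per character")  # ≈ 0.57 for these made-up probabilities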

Efficiency (normalized entropy)

A source alphabet with non-uniform distribution will have less entropy than if those symbols had uniform distribution (i.e. the "optimized alphabet"). This deficiency in entropy can be expressed as a ratio called efficiency:

η(X) = Η(X) / Ηmax = (−Σx p(x) logb(p(x))) / logb(n).

Applying the basic properties of the logarithm, this quantity can also be expressed as:

η(X) = −Σx p(x) logn(p(x)).

Efficiency has utility in quantifying the effective use of a communication channel. This formulation is also referred to as the normalized entropy, as the entropy is divided by the maximum entropy logb(n). Furthermore, the efficiency is indifferent to the choice of (positive) base b, as indicated by the insensitivity of the final logarithm above to that choice.
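
A minimal Python sketch (added here) of the efficiency, or normalized entropy, defined above, applied to a uniform alphabet and to the skewed alphabet from the earlier compression example.

    from math import log2

    def efficiency(probs):
        """Normalized entropy: H(X) divided by the maximum entropy log2(n)."""
        n = len(probs)
        h = -sum(p * log2(p) for p in probs if p > 0)
        return h / log2(n)

    print(efficiency([0.25, 0.25, 0.25, 0.25]))   # 1.0: uniform alphabet, no deficiency
    print(efficiency([0.70, 0.26, 0.02, 0.02]))   # ≈ 0.55: the skewed alphabet from the compression example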

Entropy for continuous random variables

Differential entropy

The Shannon entropy is restricted to random variables taking discrete values. The corresponding formula for a continuous random variable with probability density function f(x) with finite or infinite support on the real line is defined by analogy, using the above form of the entropy as an expectation:

h[f] = E[−log(f(X))] = −∫ f(x) log(f(x)) dx.

This is the differential entropy (or continuous entropy). A precursor of the continuous entropy h[f] is the expression for the functional Η in the H-theorem of Boltzmann.

Although the analogy between both functions is suggestive, the following question must be set: is the differential entropy a valid extension of the Shannon discrete entropy? Differential entropy lacks a number of properties that the Shannon discrete entropy has – it can even be negative – and corrections have been suggested, notably limiting density of discrete points.

To answer this question, a connection must be established between the two functions:

In order to obtain a generally finite measure as the bin size goes to zero. In the discrete case, the bin size is the (implicit) width of each of the n (finite or infinite) bins whose probabilities are denoted by pn. As the continuous domain is generalized, the width must be made explicit.

To do this, start with a continuous function f discretized into bins of size Δ. By the mean-value theorem there exists a value xi in each bin such that

f(xi) Δ = ∫_{iΔ}^{(i+1)Δ} f(x) dx,

and the integral of the function f can be approximated (in the Riemannian sense) by

∫_{−∞}^{∞} f(x) dx = lim_{Δ→0} Σi f(xi) Δ,

where this limit and "bin size goes to zero" are equivalent.

We will denote

ΗΔ := −Σi f(xi) Δ log(f(xi) Δ),

and expanding the logarithm, we have

ΗΔ = −Σi f(xi) Δ log(f(xi)) − Σi f(xi) Δ log(Δ).

As Δ → 0, we have

Σi f(xi) Δ → ∫ f(x) dx = 1 and
Σi f(xi) Δ log(f(xi)) → ∫ f(x) log(f(x)) dx.

Note: since log(Δ) → −∞ as Δ → 0, a special definition of the differential or continuous entropy is required:

h[f] = lim_{Δ→0} (ΗΔ + log Δ) = −∫ f(x) log(f(x)) dx,

which is, as said before, referred to as the differential entropy. This means that the differential entropy is not a limit of the Shannon entropy for n → ∞. Rather, it differs from the limit of the Shannon entropy by an infinite offset (see also the article on information dimension).
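
As an illustration (not part of the original), the following Python sketch approximates the differential entropy of a Gaussian density by a Riemann sum and compares it with the closed-form value 0.5·ln(2πeσ²); the bin size and integration range are arbitrary choices.

    from math import pi, e, log, exp, sqrt

    # Differential entropy of a Gaussian N(0, sigma^2), in nats.
    sigma = 2.0

    def f(x):
        return exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))

    # Crude Riemann-sum approximation of -integral of f(x) ln f(x) dx over a wide interval.
    dx = 0.001
    xs = [i * dx for i in range(-20000, 20001)]
    h_numeric = -sum(f(x) * log(f(x)) * dx for x in xs)

    h_closed = 0.5 * log(2 * pi * e * sigma * sigma)
    print(h_numeric, h_closed)  # both ≈ 2.11 nats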

Limiting density of discrete points

It turns out as a result that, unlike the Shannon entropy, the differential entropy is not in general a good measure of uncertainty or information. For example, the differential entropy can be negative; also it is not invariant under continuous co-ordinate transformations. This problem may be illustrated by a change of units when x is a dimensioned variable. f(x) will then have the units of 1/x. The argument of the logarithm must be dimensionless, otherwise it is improper, so that the differential entropy as given above will be improper. If Δ is some "standard" value of x (i.e. "bin size") and therefore has the same units, then a modified differential entropy may be written in proper form as:

Η = −∫ f(x) log(f(x) Δ) dx,

and the result will be the same for any choice of units for x. In fact, the limit of discrete entropy as N → ∞ would also include a term of log(N), which would in general be infinite. This is expected: continuous variables would typically have infinite entropy when discretized. The limiting density of discrete points is really a measure of how much easier a distribution is to describe than a distribution that is uniform over its quantization scheme.

Relative entropy

Another useful measure of entropy that works equally well in the discrete and the continuous case is the relative entropy of a distribution. It is defined as the Kullback–Leibler divergence from the distribution to a reference measure m as follows. Assume that a probability distribution p is absolutely continuous with respect to a measure m, i.e. is of the form p(dx) = f(x)m(dx) for some non-negative m-integrable function f with m-integral 1; then the relative entropy can be defined as

DKL(p‖m) = ∫ log(f(x)) p(dx) = ∫ f(x) log(f(x)) m(dx).

In this form the relative entropy generalizes (up to change in sign) both the discrete entropy, where the measure m is the counting measure, and the differential entropy, where the measure m is the Lebesgue measure. If the measure m is itself a probability distribution, the relative entropy is non-negative, and zero if p = m as measures. It is defined for any measure space, hence coordinate independent and invariant under co-ordinate reparameterizations if one properly takes into account the transformation of the measure m. The relative entropy, and (implicitly) entropy and differential entropy, do depend on the "reference" measure m.
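
A small Python sketch (added here) of the discrete Kullback–Leibler divergence; the distributions p and q are made-up examples, with the uniform q playing the role of the reference measure.

    from math import log2

    def kl_divergence(p, q):
        """D_KL(p || q) in bits for two discrete distributions over the same alphabet."""
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.7, 0.2, 0.1]
    q = [1/3, 1/3, 1/3]           # uniform reference distribution
    print(kl_divergence(p, q))    # ≈ 0.43 bits; always ≥ 0, and 0 only when p equals q
    print(kl_divergence(p, p))    # 0.0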

Use in combinatorics

Entropy has become a useful quantity in combinatorics.

Loomis–Whitney inequality

A simple example of this is an alternative proof of the Loomis–Whitney inequality: for every subset A ⊆ Z^d, we have

|A|^{d−1} ≤ ∏_{i=1}^{d} |Pi(A)|,

where Pi is the orthogonal projection in the ith coordinate:

Pi(A) = {(x1, ..., x_{i−1}, x_{i+1}, ..., xd) : (x1, ..., xd) ∈ A}.

The proof follows as a simple corollary of Shearer's inequality: if X1, ..., Xd are random variables and S1, ..., Sn are subsets of {1, ..., d} such that every integer between 1 and d lies in exactly r of these subsets, then

Η[(X1, ..., Xd)] ≤ (1/r) Σ_{i=1}^{n} Η[(Xj)_{j∈Si}],

where (Xj)_{j∈Si} is the Cartesian product of random variables Xj with indexes j in Si (so the dimension of this vector is equal to the size of Si).

We sketch how Loomis–Whitney follows from this: Indeed, let X be a uniformly distributed random variable with values in A, so that each point in A occurs with equal probability. Then (by the further properties of entropy mentioned above) Η(X) = log|A|, where |A| denotes the cardinality of A. Let Si = {1, 2, ..., i−1, i+1, ..., d}. The range of (Xj)_{j∈Si} is contained in Pi(A) and hence Η[(Xj)_{j∈Si}] ≤ log|Pi(A)|. Now use this to bound the right side of Shearer's inequality and exponentiate both sides of the resulting inequality.

Approximation to binomial coefficient

For integers 0 < k < n let q = k/n. Then

2^{nΗ(q)}/(n + 1) ≤ (n choose k) ≤ 2^{nΗ(q)},

where

Η(q) = −q log2(q) − (1 − q) log2(1 − q).

A nice interpretation of this is that the number of binary strings of length n with exactly k many 1's is approximately 2^{nΗ(k/n)}.
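
A quick numerical check in Python (illustrative) of the bound above for n = 100 and k = 30.

    from math import comb, log2

    n, k = 100, 30
    q = k / n
    H = -q * log2(q) - (1 - q) * log2(1 - q)

    print(log2(comb(n, k)))      # ≈ 84.6: exact value of log2 of the binomial coefficient
    print(n * H)                 # ≈ 88.1: exponent of the upper bound 2^(nH)
    print(n * H - log2(n + 1))   # ≈ 81.5: exponent of the lower bound 2^(nH)/(n+1)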

Use in machine learning

Machine learning techniques arise largely from statistics and also information theory. In general, entropy is a measure of uncertainty and the objective of machine learning is to minimize uncertainty.

Decision tree learning algorithms use relative entropy to determine the decision rules that govern the data at each node. The information gain in decision trees IG(Y, X), which is equal to the difference between the entropy of Y and the conditional entropy of Y given X, quantifies the expected information, or the reduction in entropy, from additionally knowing the value of an attribute X. The information gain is used to identify which attributes of the dataset provide the most information and should be used to split the nodes of the tree optimally.
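
A minimal Python sketch (added here, not the implementation of any particular library) of information gain for a single attribute; the toy attribute and label data are invented for illustration.

    from math import log2
    from collections import Counter, defaultdict

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(attribute, labels):
        """IG = H(labels) - H(labels | attribute): expected reduction in entropy."""
        n = len(labels)
        groups = defaultdict(list)
        for a, y in zip(attribute, labels):
            groups[a].append(y)
        h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - h_cond

    # Toy data: does the attribute "windy" help predict the label "play"?
    windy = ['yes', 'yes', 'no', 'no', 'no', 'yes']
    play  = ['no',  'no',  'yes', 'yes', 'yes', 'no']
    print(information_gain(windy, play))  # 1.0 bit: here the attribute fully determines the label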

Bayesian inference models often apply the Principle of maximum entropy to obtain Prior probability distributions. The idea is that the distribution that best represents the current state of knowledge of a system is the one with the largest entropy, and is therefore suitable to be the prior.

Classification in machine learning performed by logistic regression or artificial neural networks often employs a standard loss function, called cross entropy loss, that minimizes the average cross entropy between ground truth and predicted distributions. In general, cross entropy is a measure of the differences between two datasets similar to the KL divergence (also known as relative entropy).
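
A short Python sketch (illustrative) of cross-entropy for a single one-hot ground-truth example; the predicted distributions are made-up values.

    from math import log

    def cross_entropy(true_dist, predicted_dist):
        """Average negative log-likelihood of the true outcomes under the predicted distribution (in nats)."""
        return -sum(t * log(p) for t, p in zip(true_dist, predicted_dist) if t > 0)

    # One-hot ground truth for a 3-class example versus two candidate predictions.
    truth = [0.0, 1.0, 0.0]
    print(cross_entropy(truth, [0.10, 0.80, 0.10]))  # ≈ 0.22: confident and correct, low loss
    print(cross_entropy(truth, [0.40, 0.30, 0.30]))  # ≈ 1.20: spread-out prediction, higher loss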

Inquiry


An inquiry (also spelled as enquiry in British English) is any process that has the aim of augmenting knowledge, resolving doubt, or solving a problem. A theory of inquiry is an account of the various types of inquiry and a treatment of the ways that each type of inquiry achieves its aim.

Inquiry theories

Deduction

When three terms are so related to one another that the last is wholly contained in the middle and the middle is wholly contained in or excluded from the first, the extremes must admit of perfect syllogism. By 'middle term' I mean that which both is contained in another and contains another in itself, and which is the middle by its position also; and by 'extremes' (a) that which is contained in another, and (b) that in which another is contained. For if A is predicated of all B, and B of all C, A must necessarily be predicated of all C. ... I call this kind of figure the First. (Aristotle, Prior Analytics, 1.4)

Induction

Inductive reasoning consists in establishing a relation between one extreme term and the middle term by means of the other extreme; for example, if B is the middle term of A and C, in proving by means of C that A applies to B; for this is how we effect inductions. (Aristotle, Prior Analytics, 2.23)

Abduction

The locus classicus for the study of abductive reasoning is found in Aristotle's Prior Analytics, Book 2, Chapt. 25. It begins this way:

We have Reduction (απαγωγη, abduction):

  1. When it is obvious that the first term applies to the middle, but that the middle applies to the last term is not obvious, yet is nevertheless more probable or not less probable than the conclusion;
  2. Or if there are not many intermediate terms between the last and the middle;

For in all such cases the effect is to bring us nearer to knowledge.

By way of explanation, Aristotle supplies two very instructive examples, one for each of the two varieties of abductive inference steps that he has just described in the abstract:

  1. For example, let A stand for "that which can be taught", B for "knowledge", and C for "morality". Then that knowledge can be taught is evident; but whether virtue is knowledge is not clear. Then if BC is not less probable or is more probable than AC, we have reduction; for we are nearer to knowledge for having introduced an additional term, whereas before we had no knowledge that AC is true.
  2. Or again we have reduction if there are not many intermediate terms between B and C; for in this case too we are brought nearer to knowledge. For example, suppose that D is "to square", E "rectilinear figure", and F "circle". Assuming that between E and F there is only one intermediate term — that the circle becomes equal to a rectilinear figure by means of lunules — we should approximate to knowledge. (Aristotle, "Prior Analytics", 2.25, with minor alterations)

Aristotle's latter variety of abductive reasoning, though it will take some explaining in the sequel, is well worth our contemplation, since it hints already at streams of inquiry that course well beyond the syllogistic source from which they spring, and into regions that Peirce will explore more broadly and deeply.

Inquiry in the pragmatic paradigm

In the pragmatic philosophies of Charles Sanders Peirce, William James, John Dewey, and others, inquiry is closely associated with the normative science of logic. In its inception, the pragmatic model or theory of inquiry was extracted by Peirce from its raw materials in classical logic, with a little bit of help from Kant, and refined in parallel with the early development of symbolic logic by Boole, De Morgan, and Peirce himself to address problems about the nature and conduct of scientific reasoning. Borrowing a brace of concepts from Aristotle, Peirce examined three fundamental modes of reasoning that play a role in inquiry, commonly known as abductive, deductive, and inductive inference.

In rough terms, abduction is what we use to generate a likely hypothesis or an initial diagnosis in response to a phenomenon of interest or a problem of concern, while deduction is used to clarify, to derive, and to explicate the relevant consequences of the selected hypothesis, and induction is used to test the sum of the predictions against the sum of the data. It needs to be observed that the classical and pragmatic treatments of the types of reasoning, dividing the generic territory of inference as they do into three special parts, arrive at a different characterization of the environs of reason than do those accounts that count only two.

These three processes typically operate in a cyclic fashion, systematically operating to reduce the uncertainties and the difficulties that initiated the inquiry in question, and in this way, to the extent that inquiry is successful, leading to an increase in knowledge or in skills.

In the pragmatic way of thinking everything has a purpose, and the purpose of each thing is the first thing we should try to note about it. The purpose of inquiry is to reduce doubt and lead to a state of belief, which a person in that state will usually call knowledge or certainty. As they contribute to the end of inquiry, we should appreciate that the three kinds of inference describe a cycle that can be understood only as a whole, and none of the three makes complete sense in isolation from the others. For instance, the purpose of abduction is to generate guesses of a kind that deduction can explicate and that induction can evaluate. This places a mild but meaningful constraint on the production of hypotheses, since it is not just any wild guess at explanation that submits itself to reason and bows out when defeated in a match with reality. In a similar fashion, each of the other types of inference realizes its purpose only in accord with its proper role in the whole cycle of inquiry. No matter how much it may be necessary to study these processes in abstraction from each other, the integrity of inquiry places strong limitations on the effective modularity of its principal components.

In Logic: The Theory of Inquiry, John Dewey defined inquiry as "the controlled or directed transformation of an indeterminate situation into one that is so determinate in its constituent distinctions and relations as to convert the elements of the original situation into a unified whole". Dewey's and Peirce's conception of inquiry extended beyond a system of thinking and incorporated the social nature of inquiry. These ideas are summarized in the notion of a community of inquiry.

Art and science of inquiry

For our present purposes, the first feature to note in distinguishing the three principal modes of reasoning from each other is whether each of them is exact or approximate in character. In this light, deduction is the only one of the three types of reasoning that can be made exact, in essence, always deriving true conclusions from true premises, while abduction and induction are unavoidably approximate in their modes of operation, involving elements of fallible judgment in practice and inescapable error in their application.

The reason for this is that deduction, in the ideal limit, can be rendered a purely internal process of the reasoning agent, while the other two modes of reasoning essentially demand a constant interaction with the outside world, a source of phenomena and problems that will no doubt continue to exceed the capacities of any finite resource, human or machine, to master. Situated in this larger reality, approximations can be judged appropriate only in relation to their context of use and can be judged fitting only with regard to a purpose in view.

A parallel distinction that is often made in this connection is to call deduction a demonstrative form of inference, while abduction and induction are classed as non-demonstrative forms of reasoning. Strictly speaking, the latter two modes of reasoning are not properly called inferences at all. They are more like controlled associations of words or ideas that just happen to be successful often enough to be preserved as useful heuristic strategies in the repertoire of the agent. But non-demonstrative ways of thinking are inherently subject to error, and must be constantly checked out and corrected as needed in practice.

In classical terminology, forms of judgment that require attention to the context and the purpose of the judgment are said to involve an element of "art", in a sense that is judged to distinguish them from "science", and in their renderings as expressive judgments to implicate arbiters in styles of rhetoric, as contrasted with logic.

In a figurative sense, this means that only deductive logic can be reduced to an exact theoretical science, while the practice of any empirical science will always remain to some degree an art.

Zeroth order inquiry

Many aspects of inquiry can be recognized and usefully studied in very basic logical settings, even simpler than the level of syllogism, for example, in the realm of reasoning that is variously known as Boolean algebra, propositional calculus, sentential calculus, or zeroth-order logic. By way of approaching the learning curve on the gentlest availing slope, we may well begin at the level of zeroth-order inquiry, in effect, taking the syllogistic approach to inquiry only so far as the propositional or sentential aspects of the associated reasoning processes are concerned. One of the bonuses of doing this in the context of Peirce's logical work is that it provides us with doubly instructive exercises in the use of his logical graphs, taken at the level of his so-called "alpha graphs".

In the case of propositional calculus or sentential logic, deduction comes down to applications of the transitive law for conditional implications and the approximate forms of inference hang on the properties that derive from these. In describing the various types of inference I will employ a few old "terms of art" from classical logic that are still of use in treating these kinds of simple problems in reasoning.

Deduction takes a Case, the minor premise, of the form X → Y,
and combines it with a Rule, the major premise, of the form Y → Z,
to arrive at a Fact, the demonstrative conclusion, of the form X → Z.
Induction takes a Case of the form X → Y
and matches it with a Fact of the form X → Z
to infer a Rule of the form Y → Z.
Abduction takes a Fact of the form X → Z
and matches it with a Rule of the form Y → Z
to infer a Case of the form X → Y.

For ease of reference, Figure 1 and the Legend beneath it summarize the classical terminology for the three types of inference and the relationships among them.

o-------------------------------------------------o
|                                                 |
|                   Z                             |
|                   o                             |
|                   |\                            |
|                   | \                           |
|                   |  \                          |
|                   |   \                         |
|                   |    \                        |
|                   |     \   R U L E             |
|                   |      \                      |
|                   |       \                     |
|               F   |        \                    |
|                   |         \                   |
|               A   |          \                  |
|                   |           o Y               |
|               C   |          /                  |
|                   |         /                   |
|               T   |        /                    |
|                   |       /                     |
|                   |      /                      |
|                   |     /   C A S E             |
|                   |    /                        |
|                   |   /                         |
|                   |  /                          |
|                   | /                           |
|                   |/                            |
|                   o                             |
|                   X                             |
|                                                 |
| Deduction takes a Case of the form X → Y,       |
| matches it with a Rule of the form Y → Z,       |
| then adverts to a Fact of the form X → Z.       |
|                                                 |
| Induction takes a Case of the form X → Y,       |
| matches it with a Fact of the form X → Z,       |
| then adverts to a Rule of the form Y → Z.       |
|                                                 |
| Abduction takes a Fact of the form X → Z,       |
| matches it with a Rule of the form Y → Z,       |
| then adverts to a Case of the form X → Y.       |
|                                                 |
| Even more succinctly:                           |
|                                                 |
|           Abduction Deduction Induction         |
|                                                 |
| Premise:     Fact Case Case                     |
| Premise:     Rule Rule Fact                     |
| Outcome:     Case Fact Rule                     |
|                                                 |
o-------------------------------------------------o
Figure 1.  Elementary Structure and Terminology

In its original usage a statement of Fact has to do with a deed done or a record made, that is, a type of event that is openly observable and not riddled with speculation as to its very occurrence. In contrast, a statement of Case may refer to a hidden or a hypothetical cause, that is, a type of event that is not immediately observable to all concerned. Obviously, the distinction is a rough one and the question of which mode applies can depend on the points of view that different observers adopt over time. Finally, a statement of a Rule is called that because it states a regularity or a regulation that governs a whole class of situations, and not because of its syntactic form. So far in this discussion, all three types of constraint are expressed in the form of conditional propositions, but this is not a fixed requirement. In practice, these modes of statement are distinguished by the roles that they play within an argument, not by their style of expression. When the time comes to branch out from the syllogistic framework, we will find that propositional constraints can be discovered and represented in arbitrary syntactic forms.

Example of inquiry

Examples of inquiry, that illustrate the full cycle of its abductive, deductive, and inductive phases, and yet are both concrete and simple enough to be suitable for a first (or zeroth) exposition, are somewhat rare in Peirce's writings, and so let us draw one from the work of fellow pragmatician John Dewey, analyzing it according to the model of zeroth-order inquiry that we developed above.

A man is walking on a warm day. The sky was clear the last time he observed it; but presently he notes, while occupied primarily with other things, that the air is cooler. It occurs to him that it is probably going to rain; looking up, he sees a dark cloud between him and the sun, and he then quickens his steps. What, if anything, in such a situation can be called thought? Neither the act of walking nor the noting of the cold is a thought. Walking is one direction of activity; looking and noting are other modes of activity. The likelihood that it will rain is, however, something suggested. The pedestrian feels the cold; he thinks of clouds and a coming shower. (John Dewey, How We Think, 1910, pp. 6-7).

Once over quickly

Let's first give Dewey's example of inquiry in everyday life the quick once over, hitting just the high points of its analysis into Peirce's three kinds of reasoning.

Abductive phase

In Dewey's "Rainy Day" or "Sign of Rain" story, we find our peripatetic hero presented with a surprising Fact:

  • Fact: C → A, In the Current situation the Air is cool.

Responding to an intellectual reflex of puzzlement about the situation, his resource of common knowledge about the world is impelled to seize on an approximate Rule:

  • Rule: B → A, Just Before it rains, the Air is cool.

This Rule can be recognized as having a potential relevance to the situation because it matches the surprising Fact, C → A, in its consequential feature A.

All of this suggests that the present Case may be one in which it is just about to rain:

  • Case: C → B, The Current situation is just Before it rains.

The whole mental performance, however automatic and semi-conscious it may be, that leads up from a problematic Fact and a previously settled knowledge base of Rules to the plausible suggestion of a Case description, is what we are calling an abductive inference.

Deductive phase

The next phase of inquiry uses deductive inference to expand the implied consequences of the abductive hypothesis, with the aim of testing its truth. For this purpose, the inquirer needs to think of other things that would follow from the consequence of his precipitate explanation. Thus, he now reflects on the Case just assumed:

  • Case: C → B, The Current situation is just Before it rains.

He looks up to scan the sky, perhaps in a random search for further information, but since the sky is a logical place to look for details of an imminent rainstorm, symbolized in our story by the letter B, we may safely suppose that our reasoner has already detached the consequence of the abduced Case, C → B, and has begun to expand on its further implications. So let us imagine that our up-looker has a more deliberate purpose in mind, and that his search for additional data is driven by the new-found, determinate Rule:

  • Rule: B → D, Just Before it rains, Dark clouds appear.

Contemplating the assumed Case in combination with this new Rule leads him by an immediate deduction to predict an additional Fact:

  • Fact: C → D, In the Current situation Dark clouds appear.

The reconstructed picture of reasoning assembled in this second phase of inquiry is true to the pattern of deductive inference.

Inductive phase

Whatever the case, our subject observes a Dark cloud, just as he would expect on the basis of the new hypothesis. The explanation of imminent rain removes the discrepancy between observations and expectations and thereby reduces the shock of surprise that made this process of inquiry necessary.

Looking more closely

Seeding hypotheses

Figure 4 gives a graphical illustration of Dewey's example of inquiry, isolating for the purposes of the present analysis the first two steps in the more extended proceedings that go to make up the whole inquiry.

o-----------------------------------------------------------o
|                                                           |
|     A                                               D     |
|      o                                             o      |
|       \ *                                       * /       |
|        \  *                                   *  /        |
|         \   *                               *   /         |
|          \    *                           *    /          |
|           \     *                       *     /           |
|            \   R u l e             R u l e   /            |
|             \       *               *       /             |
|              \        *           *        /              |
|               \         *       *         /               |
|                \          * B *          /                |
|              F a c t        o        F a c t              |
|                  \          *          /                  |
|                   \         *         /                   |
|                    \        *        /                    |
|                     \       *       /                     |
|                      \   C a s e   /                      |
|                       \     *     /                       |
|                        \    *    /                        |
|                         \   *   /                         |
|                          \  *  /                          |
|                           \ * /                           |
|                            \*/                            |
|                             o                             |
|                             C                             |
|                                                           |
| A  =  the Air is cool                                     |
| B  =  just Before it rains                                |
| C  =  the Current situation                               |
| D  =  a Dark cloud appears                                |
|                                                           |
| A is a major term                                         |
| B is a middle term                                        |
| C is a minor term                                         |
| D is a major term, associated with A                      |
|                                                           |
o-----------------------------------------------------------o
Figure 4.  Dewey's 'Rainy Day' Inquiry

In this analysis of the first steps of Inquiry, we have a complex or a mixed form of inference that can be seen as taking place in two steps:

  • The first step is an Abduction that abstracts a Case from the consideration of a Fact and a Rule.
      Fact: C → A, In the Current situation the Air is cool.
      Rule: B → A, Just Before it rains, the Air is cool.
      Case: C → B, The Current situation is just Before it rains.
  • The final step is a Deduction that admits this Case to another Rule and so arrives at a novel Fact.
      Case: C → B, The Current situation is just Before it rains.
      Rule: B → D, Just Before it rains, a Dark cloud will appear.
      Fact: C → D, In the Current situation, a Dark cloud will appear.
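
The pattern of these two steps can be sketched in code. The following Python fragment is only an illustration added here, not part of the original analysis: it treats each proposition as an implication between the letters A, B, C, D, implements abduction as the (fallible) guessing of a Case from a Fact and a Rule that share a consequent, and implements deduction as the chaining of a Case with a Rule that share a middle term.

# Illustrative sketch (not from the source): propositions as implications X -> Y,
# with abduction and deduction as operations on those implications.

from typing import Optional, Tuple

Implication = Tuple[str, str]  # (antecedent, consequent), read "antecedent -> consequent"

def abduce_case(fact: Implication, rule: Implication) -> Optional[Implication]:
    """From Fact C -> A and Rule B -> A, guess the Case C -> B (plausible, not certain)."""
    (c, a1), (b, a2) = fact, rule
    return (c, b) if a1 == a2 else None

def deduce_fact(case: Implication, rule: Implication) -> Optional[Implication]:
    """From Case C -> B and Rule B -> D, conclude the Fact C -> D (necessary)."""
    (c, b1), (b2, d) = case, rule
    return (c, d) if b1 == b2 else None

# Dewey's "Rainy Day" example:
fact = ("C", "A")                     # In the Current situation the Air is cool.
rule_1 = ("B", "A")                   # Just Before it rains, the Air is cool.
case = abduce_case(fact, rule_1)      # ("C", "B"): the Current situation is just Before it rains.

rule_2 = ("B", "D")                   # Just Before it rains, a Dark cloud will appear.
new_fact = deduce_fact(case, rule_2)  # ("C", "D"): a Dark cloud will appear.

print(case, new_fact)                 # ('C', 'B') ('C', 'D')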

This is nowhere near a complete analysis of the Rainy Day inquiry, even insofar as it might be carried out within the constraints of the syllogistic framework, and it covers only the first two steps of the relevant inquiry process, but maybe it will do for a start.

One other thing needs to be noticed here, the formal duality between this expansion phase of inquiry and the argument from analogy. This can be seen most clearly in the propositional lattice diagrams shown in Figures 3 and 4, where analogy exhibits a rough "A" shape and the first two steps of inquiry exhibit a rough "V" shape, respectively. Since we find ourselves repeatedly referring to this expansion phase of inquiry as a unit, let's give it a name that suggests its duality with analogy—"catalogy" will do for the moment. This usage is apt enough if one thinks of a catalogue entry for an item as a text that lists its salient features. Notice that analogy has to do with the examples of a given quality, while catalogy has to do with the qualities of a given example. Peirce noted similar forms of duality in many of his early writings, leading to the consummate treatment in his 1867 paper "On a New List of Categories" (CP 1.545-559, W 2, 49-59).

Weeding hypotheses

In order to comprehend the bearing of inductive reasoning on the closing phases of inquiry there are a couple of observations that we need to make:

  • First, we need to recognize that smaller inquiries are typically woven into larger inquiries, whether we view the whole pattern of inquiry as carried on by a single agent or by a complex community.
  • Further, we need to consider the different ways in which the particular instances of inquiry can be related to ongoing inquiries at larger scales. Three modes of inductive interaction between the micro-inquiries and the macro-inquiries that are salient here can be described under the headings of the "Learning", the "Transfer", and the "Testing" of rules.

Analogy of experience

Throughout inquiry the reasoner makes use of rules that have to be transported across intervals of experience, from the masses of experience where they are learned to the moments of experience where they are applied. Inductive reasoning is involved in the learning and the transfer of these rules, both in accumulating a knowledge base and in carrying it through the times between acquisition and application.

  • Learning. The principal way that induction contributes to an ongoing inquiry is through the learning of rules, that is, by creating each of the rules that goes into the knowledge base, or ever gets used along the way.
  • Transfer. The continuing way that induction contributes to an ongoing inquiry is through the exploitation of analogy, a two-step combination of induction and deduction that serves to transfer rules from one context to another.
  • Testing. Finally, every inquiry that makes use of a knowledge base constitutes a "field test" of its accumulated contents. If the knowledge base fails to serve any live inquiry in a satisfactory manner, then there is a prima facie reason to reconsider and possibly to amend some of its rules.

Let's now consider how these principles of learning, transfer, and testing apply to John Dewey's "Sign of Rain" example.

Learning

Rules in a knowledge base, as far as their effective content goes, can be obtained by any mode of inference.

For example, a rule like:

  • Rule: B → A, Just Before it rains, the Air is cool,

is usually induced from a consideration of many past events, in a manner that can be rationally reconstructed as follows:

  • Case: C → B, In Certain events, it is just Before it rains,
  • Fact: C → A, In Certain events, the Air is cool,
------------------------------------------------------------------------------------------
  • Rule: B → A, Just Before it rains, the Air is cool.
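
A rough sketch of this kind of rule induction can be given in code. The records of past events below are invented purely for illustration; the point is only that the Rule is proposed on the strength of a finite sample and is never guaranteed by it.

# Illustrative sketch (hypothetical data): inducing the rule B -> A
# by checking that, in every recorded event where B held, A held as well.

past_events = [
    {"B": True,  "A": True},    # just before rain, the air was cool
    {"B": True,  "A": True},
    {"B": False, "A": True},    # cool air without imminent rain does not refute B -> A
    {"B": False, "A": False},
]

def induce_rule(events, antecedent, consequent):
    """Propose the rule antecedent -> consequent if no recorded event refutes it."""
    relevant = [e for e in events if e[antecedent]]
    return bool(relevant) and all(e[consequent] for e in relevant)

print(induce_rule(past_events, "B", "A"))   # True: propose "Just Before it rains, the Air is cool."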

However, the very same proposition could also be abduced as an explanation of a singular occurrence or deduced as a conclusion of a presumptive theory.

Transfer

What is it that gives a distinctively inductive character to the acquisition of a knowledge base? It is evidently the "analogy of experience" that underlies its useful application. Whenever we find ourselves prefacing an argument with the phrase "If past experience is any guide..." then we can be sure that this principle has come into play. We are invoking an analogy between past experience, considered as a totality, and present experience, considered as a point of application. What we mean in practice is this: "If past experience is a fair sample of possible experience, then the knowledge gained in it applies to present experience". This is the mechanism that allows a knowledge base to be carried across gulfs of experience that are indifferent to the effective contents of its rules.

Here are the details of how this notion of transfer works out in the case of the "Sign of Rain" example:

Let K(pres) be a portion of the reasoner's knowledge base that is logically equivalent to the conjunction of two rules, as follows:

  • K(pres) = (B → A) and (B → D).

K(pres) is the present knowledge base, expressed in the form of a logical constraint on the present universe of discourse.

It is convenient to have the option of expressing all logical statements in terms of their logical models, that is, in terms of the primitive circumstances or the elements of experience over which they hold true.

  • Let E(past) be the chosen set of experiences, or the circumstances that we have in mind when we refer to "past experience".
  • Let E(poss) be the collective set of experiences, or the projective total of possible circumstances.
  • Let E(pres) be the present experience, or the circumstances that are present to the reasoner at the current moment.

If we think of the knowledge base K(pres) as referring to the "regime of experience" over which it is valid, then all of these sets of models can be compared by the simple relations of set inclusion or logical implication.

Figure 5 schematizes this way of viewing the "analogy of experience".

o-----------------------------------------------------------o
|                                                           |
|                          K(pres)                          |
|                             o                             |
|                            /|\                            |
|                           / | \                           |
|                          /  |  \                          |
|                         /   |   \                         |
|                        /  Rule   \                        |
|                       /     |     \                       |
|                      /      |      \                      |
|                     /       |       \                     |
|                    /     E(poss)     \                    |
|              Fact /         o         \ Fact              |
|                  /        *   *        \                  |
|                 /       *       *       \                 |
|                /      *           *      \                |
|               /     *               *     \               |
|              /    *                   *    \              |
|             /   *  Case           Case  *   \             |
|            /  *                           *  \            |
|           / *                               * \           |
|          /*                                   *\          |
|         o<<<---------------<<<---------------<<<o         |
|      E(past)        Analogy Morphism         E(pres)      |
|    More Known                              Less Known     |
|                                                           |
o-----------------------------------------------------------o
Figure 5.  Analogy of Experience

In these terms, the "analogy of experience" proceeds by inducing a Rule about the validity of a current knowledge base and then deducing a Fact, its applicability to a current experience, as in the following sequence:

Inductive Phase:

  • Given Case: E(past) → E(poss), Chosen events fairly sample Collective events.
  • Given Fact: E(past) → K(pres), Chosen events support the Knowledge regime.
-----------------------------------------------------------------------------------------------------------------------------
  • Induce Rule: E(poss) → K(pres), Collective events support the Knowledge regime.

Deductive Phase:

  • Given Case: E(pres) → E(poss), Current events fairly sample Collective events.
  • Given Rule: E(poss) → K(pres), Collective events support the Knowledge regime.
--------------------------------------------------------------------------------------------------------------------------------
  • Deduce Fact: E(pres) → K(pres), Current events support the Knowledge regime.
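
Under the model-theoretic reading introduced above, where each statement is identified with the set of circumstances over which it holds and implication is read as set inclusion, the two phases can be written out as a small illustration. The concrete sets below are invented solely for the example; note that the induced Rule is a fallible generalization, which the toy data is arranged to show.

# Illustrative sketch: statements as sets of circumstances ("models"),
# with the implication X -> Y read as set inclusion X ⊆ Y.  The sets are invented.

E_poss = set(range(10))          # projective total of possible circumstances
E_past = {0, 1, 2, 3, 4}         # circumstances making up past experience
E_pres = {7}                     # the present circumstance
K_pres_models = set(range(9))    # circumstances over which K(pres) holds

def implies(x, y):
    """X -> Y, read model-theoretically: every circumstance in X is in Y."""
    return x <= y

# Inductive phase: generalize the Fact about E(past) to a Rule about E(poss).
case_1 = implies(E_past, E_poss)             # Chosen events fairly sample Collective events.
fact_1 = implies(E_past, K_pres_models)      # Chosen events support the Knowledge regime.
rule_conjectured = case_1 and fact_1         # Induced Rule E(poss) -> K(pres): ampliative, not guaranteed.
rule_actually_holds = implies(E_poss, K_pres_models)   # False here: circumstance 9 refutes it.

# Deductive phase: granting the induced Rule, apply it to the present experience.
case_2 = implies(E_pres, E_poss)             # Current events fairly sample Collective events.
fact_2 = rule_conjectured and case_2         # Deduced Fact E(pres) -> K(pres), granted the Rule.

print(rule_conjectured, rule_actually_holds, fact_2)   # True False True
print(implies(E_pres, K_pres_models))                  # True: the applied conclusion happens to hold here.
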
Testing

If the observer looks up and does not see dark clouds, or if he runs for shelter but it does not rain, then there is fresh occasion to question the utility or the validity of his knowledge base. But we must leave our foul-weather friend for now and defer the logical analysis of this testing phase to another occasion.

End-of-life care

From Wikipedia, the free encyclopedia

End-of-life care (EoLC) refers to health care provided in the time leading up to a person's death. End-of-life care can be provided in the hours, days, or months before a person dies and encompasses care and support for a person's mental and emotional needs, physical comfort, spiritual needs, and practical tasks.

EoLC is most commonly provided at home, in the hospital, or in a long-term care facility with care being provided by family members, nurses, social workers, physicians, and other support staff. Facilities may also have palliative or hospice care teams that will provide end-of-life care services. Decisions about end-of-life care are often informed by medical, financial and ethical considerations.

In most advanced countries, medical spending on people in the last twelve months of life makes up roughly 10% of total aggregate medical spending, while those in the last three years of life can cost up to 25%.

Medical

Advanced care planning

Advances in medicine in the last few decades have provided an increasing number of options to extend a person's life and have highlighted the importance of ensuring that an individual's preferences and values for end-of-life care are honored. Advanced care planning is the process by which a person of any age can state their preferences and ensure that their future medical treatment aligns with their personal values and life goals. It is typically a continual process, with ongoing discussions about a patient's current prognosis and conditions as well as conversations about medical dilemmas and options. A person will typically have these conversations with their doctor and ultimately record their preferences in an advance healthcare directive.

An advance healthcare directive is a legal document that either records a person's decisions about desired treatment or indicates who the person has entrusted to make care decisions on their behalf. The two main types of advance directives are the living will and the durable power of attorney for healthcare. A living will sets out a person's decisions regarding their future care; most of these address resuscitation and life support, but a living will may also cover a patient's preferences regarding hospitalization, pain control, and specific treatments they may undergo in the future. A living will typically takes effect when a patient is terminally ill with a low chance of recovery. A durable power of attorney for healthcare allows a person to appoint another individual to make healthcare decisions for them under a specified set of circumstances. Combined directives, such as the "Five Wishes", which include components of both the living will and the durable power of attorney for healthcare, are being increasingly utilized. Advanced care planning often includes preferences regarding the initiation of CPR and nutrition (tube feeding), as well as decisions about the use of machines to keep a person breathing or to support their heart or kidneys.

Many studies have reported benefits to patients who complete advanced care planning, specifically noting improved patient and surrogate satisfaction with communication and decreased clinician distress. However, there is a notable lack of empirical data about what outcome improvements patients experience, as there are considerable discrepancies in what constitutes advanced care planning and heterogeneity in the outcomes measured. Advanced care planning remains an underutilized tool for patients. Researchers have published data supporting new relationship-based and supported decision-making models that can increase the use and maximize the benefit of advanced care planning.

End-of-life care conversations

End-of-life care conversations are part of the treatment planning process for terminally ill patients requiring palliative care and involve a discussion of a patient's prognosis, specification of the goals of care, and individualized treatment planning. Current studies suggest that many patients prioritize proper symptom management, avoidance of suffering, and care that aligns with ethical and cultural standards. Specific conversations can include discussions about cardiopulmonary resuscitation (ideally occurring before the active dying phase, so as not to force the conversation during a medical crisis or emergency), place of death, organ donation, and cultural or religious traditions. As there are many factors involved in the end-of-life care decision-making process, the attitudes and perspectives of patients and families may vary. For example, family members may differ over whether life extension or life quality is the main goal of treatment. As it can be challenging for families in the grieving process to make timely decisions that respect the patient's wishes and values, having an established advance care directive in place can prevent over-treatment, under-treatment, or further complications in treatment management.

Patients and families may also struggle to grasp the inevitability of death, and the differing risks and effects of the medical and non-medical interventions available for end-of-life care. A systematic literature review examining the frequency of end-of-life care conversations between COPD patients and clinicians found that such conversations occur at a low frequency and often only once a patient has advanced-stage disease. End-of-life care conversations and advance care directives can help ensure patients receive the care they desire, prevent interventions that are not in accordance with the patient's wishes, and reduce confusion and strain for family members.

In the case of critically ill babies, parents are able to participate more in decision making if they are presented with options to be discussed rather than recommendations by the doctor. Utilizing this style of communication also leads to less conflict with doctors and might help the parents cope better with the eventual outcomes.

Signs of dying

The National Cancer Institute in the United States (US) advises that the presence of some of the following signs may indicate that death is approaching:

  • Drowsiness, increased sleep, and/or unresponsiveness (caused by changes in the patient's metabolism).
  • Confusion about time, place, and/or identity of loved ones; restlessness; visions of people and places that are not present; pulling at bed linen or clothing (caused in part by changes in the patient's metabolism).
  • Decreased socialization and withdrawal (caused by decreased oxygen to the brain, decreased blood flow, and mental preparation for dying).
  • Changes in breathing (indicating neurologic compromise and impending death) and accumulation of upper airway secretions (resulting in crackling and gurgling breath sounds).
  • Decreased need for food and fluids, and loss of appetite (caused by the body's need to conserve energy and its decreasing ability to use food and fluids properly).
  • Decreased oral intake and impaired swallowing (caused by general physical weakness and metabolic disturbances, including but not limited to hypercalcemia).
  • Loss of bladder or bowel control (caused by the relaxing of muscles in the pelvic area).
  • Darkened urine or decreased amount of urine (caused by slowing of kidney function and/or decreased fluid intake).
  • Skin becoming cool to the touch, particularly the hands and feet; skin may become bluish in color, especially on the underside of the body (caused by decreased circulation to the extremities).
  • Rattling or gurgling sounds while breathing, which may be loud (death rattle); breathing that is irregular and shallow; decreased number of breaths per minute; breathing that alternates between rapid and slow (caused by congestion from decreased fluid consumption, a buildup of waste products in the body, and/or a decrease in circulation to the organs).
  • Turning of the head toward a light source (caused by decreasing vision).
  • Increased difficulty controlling pain (caused by progression of the disease).
  • Involuntary movements (called myoclonus).
  • Increased heart rate.
  • Hypertension followed by hypotension.
  • Loss of reflexes in the legs and arms.

Symptom management

The following are some of the most common potential problems that can arise in the last days and hours of a patient's life:

Pain
Typically controlled with opioids, like morphine, fentanyl, hydromorphone or, in the United Kingdom, diamorphine. High doses of opioids can cause respiratory depression, and this risk increases with concomitant use of alcohol and other sedatives. Careful use of opioids is important to improve the patient's quality of life while avoiding overdoses.
Agitation
Delirium, terminal anguish, restlessness (e.g. thrashing, plucking, or twitching). Typically controlled using clonazepam or midazolam; antipsychotics such as haloperidol or levomepromazine may also be used instead of, or concomitantly with, benzodiazepines. Symptoms may also sometimes be alleviated by rehydration, which may reduce the effects of some toxic drug metabolites.
Respiratory tract secretions
Saliva and other fluids can accumulate in the oropharynx and upper airways when patients become too weak to clear their throats, leading to a characteristic gurgling or rattle-like sound ("death rattle"). While apparently not painful for the patient, the association of this symptom with impending death can create fear and uncertainty for those at the bedside. The secretions may be controlled using drugs such as hyoscine butylbromide, glycopyrronium, or atropine. Rattle may not be controllable if caused by deeper fluid accumulation in the bronchi or the lungs, such as occurs with pneumonia or some tumours.
Nausea and vomiting
Typically controlled using haloperidol, metoclopramide, ondansetron, cyclizine; or other anti-emetics.
Dyspnea (breathlessness)
Typically controlled with opioids, like morphine, fentanyl or, in the United Kingdom, diamorphine.
Constipation
Low food intake and opioid use can lead to constipation, which can then result in agitation, pain, and delirium. Laxatives and stool softeners are used to prevent constipation. In patients with constipation, the dose of laxatives will be increased to relieve symptoms. Methylnaltrexone is approved to treat constipation due to opioid use.

Other symptoms that may occur, and may be mitigated to some extent, include cough, fatigue, fever, and in some cases bleeding.

Medication administration

Subcutaneous injections are one preferred means of delivery when it has become difficult for patients to swallow or to take pills orally, and if repeated medication is needed, a syringe driver (or infusion pump in the US) is likely to be used to deliver a steady low dose of medication. In some settings, such as the home or hospice, sublingual routes of administration may be used for most prescriptions and medications.

Another means of medication delivery, available for use when the oral route is compromised, is a specialized catheter designed to provide comfortable and discreet administration of ongoing medications via the rectal route. The catheter was developed to make rectal access more practical and provide a way to deliver and retain liquid formulations in the distal rectum so that health practitioners can leverage the established benefits of rectal administration. Its small flexible silicone shaft allows the device to be placed safely and remain comfortably in the rectum for repeated administration of medications or liquids. The catheter has a small lumen, allowing for small flush volumes to get medication to the rectum. Small volumes of medications (under 15mL) improve comfort by not stimulating the defecation response of the rectum and can increase the overall absorption of a given dose by decreasing pooling of medication and migration of medication into more proximal areas of the rectum where absorption can be less effective.

Integrated pathways

Integrated care pathways are an organizational tool used by healthcare professionals to clearly define the roles of each team member and coordinate how and when care will be provided. These pathways are used to ensure that best practices, such as evidence-based and accepted health care protocols, are followed in end-of-life care and to list the required features of care for a specific diagnosis or clinical problem. Many institutions have a predetermined pathway for end-of-life care, and clinicians should be aware of and make use of these plans when possible. In the United Kingdom, end-of-life care pathways are based on the Liverpool Care Pathway. Originally developed to provide evidence-based care to dying cancer patients, this pathway has been adapted and used for a variety of chronic conditions at clinics in the UK and internationally. Despite its increasing popularity, a 2016 Cochrane Review, which analyzed only one trial, found limited high-quality randomized-trial evidence on the effectiveness of end-of-life care pathways for clinical, physical, and emotional/psychological outcomes.

The BEACON Project group developed an integrated care pathway entitled the Comfort Care Order Set, which delineates care for the last days of life in either a hospice or acute care inpatient setting. This order set was implemented and evaluated across six United States Veterans Affairs Medical Centers, and the study found increased orders for opioid medication after pathway implementation, as well as more orders for antipsychotic medications, more patients undergoing palliative care consultations, more advance directives, and increased sublingual drug administration. The intervention did not, however, decrease the proportion of deaths that occurred in an ICU setting or the use of restraints around the time of death.

Home-based end-of-life care

While not possible for every person needing care, surveys of the general public suggest that most people would prefer to die at home. From 2003 to 2017, the proportion of deaths occurring at home in the United States increased from 23.8% to 30.7%, while the proportion of deaths occurring in hospital decreased from 39.7% to 29.8%. Home-based end-of-life care may be delivered in a number of ways, including by an extension of a primary care practice, by a palliative care practice, and by home care agencies such as hospice. High-certainty evidence indicates that implementation of home-based end-of-life care programs increases the number of adults who will die at home and slightly improves their satisfaction at one-month follow-up. There is low-certainty evidence that there may be very little or no difference in the satisfaction of the person needing care over the longer term (6 months). The number of people admitted to hospital during an end-of-life care program is not known. In addition, the impact of home-based end-of-life care on caregivers, healthcare staff, and health service costs is not clear; however, there is weak evidence to suggest that this intervention may reduce health care costs by a small amount.

Disparities in end-of-life care

Not all groups in society have good access to end-of-life care. A systematic review conducted in 2021 investigated the end-of-life care experiences of people with severe mental illness, including those with schizophrenia, bipolar disorder, and major depressive disorder. The review found that individuals with a severe mental illness were unlikely to receive the most appropriate end-of-life care, and it recommended close partnership and communication between mental health and end-of-life care systems, with these teams finding ways to support people to die where they choose. More training, support, and supervision need to be available for professionals working in end-of-life care; this could also decrease prejudice and stigma against individuals with severe mental illness at the end of life, notably those who are homeless. In addition, studies have shown that minority patients face several additional barriers to receiving quality end-of-life care. Minority patients are prevented from accessing care at an equitable rate for a variety of reasons, including individual discrimination from caregivers, cultural insensitivity, racial economic disparities, and medical mistrust.

Non-medical

Family and friends

Family members are often uncertain as to what they should be doing when a person is dying. Many gentle, familiar daily tasks, such as combing hair, putting lotion on delicate skin, and holding hands, are comforting and provide a meaningful method of communicating love to a dying person.

Family members may be suffering emotionally due to the impending death. Their own fear of death may affect their behavior. They may feel guilty about past events in their relationship with the dying person or feel that they have been neglectful. These common emotions can result in tension, fights between family members over decisions, and worsened care; sometimes, in what medical professionals call the "Daughter from California syndrome", a long-absent family member arrives while a patient is dying and demands inappropriately aggressive care.

Family members may also be coping with unrelated problems, such as physical or mental illness, emotional and relationship issues, or legal difficulties. These problems can limit their ability to be involved, civil, helpful, or present.

Spirituality and religion

Spirituality is thought to be of increased importance to an individual's wellbeing during a terminal illness or toward the end of life. Pastoral/spiritual care has a particular significance in end-of-life care and is considered an essential part of palliative care by the WHO. In palliative care, responsibility for spiritual care is shared by the whole team, with leadership given by specialist practitioners such as pastoral care workers. The palliative care approach to spiritual care may, however, be transferred to other contexts and to individual practice.

Spiritual, cultural, and religious beliefs may influence or guide patient preferences regarding end-of-life care. Healthcare providers caring for patients at the end of life can engage family members and encourage conversations about spiritual practices to better address the different needs of diverse patient populations. Studies have shown that people who identify as religious also report higher levels of well-being. Religion has also been shown to be inversely correlated with depression and suicide. While religion provides some benefits to patients, there is some evidence of increased anxiety and other negative outcomes in some studies. While spirituality has been associated with less aggressive end-of-life care, religion has been associated with an increased desire for aggressive care in some patients. Despite these varied outcomes, spiritual and religious care remains an important aspect of care for patients. Studies have shown that barriers to providing adequate spiritual and religious care include a lack of cultural understanding, limited time, and a lack of formal training or experience.

Many hospitals, nursing homes, and hospice centers have chaplains who provide spiritual support and grief counseling to patients and families of all religious and cultural backgrounds.

Ageism

The World Health Organization defines ageism as "the stereotypes (how we think), prejudice (how we feel) and discrimination (how we act) towards others or ourselves based on age." A systematic review in 2017 showed that negative attitudes among nurses towards older individuals were related to the characteristics of the older adults and their demands. This review also highlighted how nurses who had difficulty giving care to their older patients perceived them as "weak, disabled, inflexible, and lacking cognitive or mental ability". Another systematic review, considering structural and individual-level effects of ageism, found that ageism led to significantly worse health outcomes in 95.5% of the studies and in 74.0% of the 1,159 ageism-health associations examined. Studies have also shown that a person's own perception of aging and internalized ageism negatively affect their health. The same review examined this factor as well, concluding that 93.4% of its 142 associations involving self-perceptions of aging showed a significant link between ageism and worse health.

Attitudes of healthcare professionals

End-of-life care is an interdisciplinary endeavor involving physicians, nurses, physical therapists, occupational therapists, pharmacists and social workers. Depending on the facility and level of care needed, the composition of the interprofessional team can vary. Health professional attitudes about end-of-life care depend in part on the provider's role in the care team.

Physicians generally have favorable attitudes towards Advance Directives, which are a key facet of end-of-life care. Medical doctors who have more experience and training in end-of-life care are more likely to cite comfort in having end-of-life-care discussions with patients. Those physicians who have more exposure to end-of-life care also have a higher likelihood of involving nurses in their decision-making process.

A systematic review assessing end-of-life conversations between heart failure patients and healthcare professionals evaluated physician attitudes and preferences towards end-of-life care conversations. The review found that physicians had difficulty initiating end-of-life conversations with their heart failure patients, owing to apprehension about inducing anxiety in patients, uncertainty about a patient's prognosis, and a tendency to wait for patient cues before initiating end-of-life care conversations.

Although physicians make official decisions about end-of-life care, nurses spend more time with patients and often know more about patient desires and concerns. In a Dutch national survey of the attitudes of nursing staff about involvement in medical end-of-life decisions, 64% of respondents thought patients preferred talking with nurses rather than with physicians, and 75% wanted to be involved in end-of-life decision making.

By country

Canada

In 2012, Statistics Canada's General Social Survey on Caregiving and care receiving found that 13% of Canadians (3.7 million) aged 15 and older reported that at some point in their lives they had provided end-of-life or palliative care to a family member or friend. For those in their 50s and 60s, the percentage was higher, with about 20% reporting having provided palliative care to a family member or friend. Women were also more likely to have provided palliative care over their lifetimes, with 16% of women reporting having done so, compared with 10% of men. These caregivers helped terminally ill family members or friends with personal or medical care, food preparation, managing finances or providing transportation to and from medical appointments.

United Kingdom

End of life care has been identified by the UK Department of Health as an area where quality of care has previously been "very variable," and which has not had a high profile in the NHS and social care. To address this, a national end of life care programme was established in 2004 to identify and propagate best practice, and a national strategy document published in 2008. The Scottish Government has also published a national strategy.

In 2006 just over half a million people died in England, about 99% of them adults over the age of 18, and almost two-thirds adults over the age of 75. About three-quarters of deaths could be considered "predictable" and followed a period of chronic illness – for example heart disease, cancer, stroke, or dementia. In all, 58% of deaths occurred in an NHS hospital, 18% at home, 17% in residential care homes (most commonly people over the age of 85), and about 4% in hospices. However, a majority of people would prefer to die at home or in a hospice, and according to one survey less than 5% would rather die in hospital. A key aim of the strategy is therefore to reduce the need for dying patients to go to hospital and/or to stay there, and to improve provision for support and palliative care in the community to make this possible. One study estimated that 40% of the patients who had died in hospital had not had medical needs that required them to be there.

In 2015 and 2010, the UK ranked highest globally in a study of end-of-life care. The 2015 study said "Its ranking is due to comprehensive national policies, the extensive integration of palliative care into the National Health Service, a strong hospice movement, and deep community engagement on the issue." The studies were carried out by the Economist Intelligence Unit and commissioned by the Lien Foundation, a Singaporean philanthropic organisation.

The 2015 National Institute for Health and Care Excellence guidelines introduced religion and spirituality among the factors which physicians should take into account when assessing palliative care needs. In 2016, the UK Minister of Health signed a document which declared that people "should have access to personalised care which focuses on the preferences, beliefs and spiritual needs of the individual." As of 2017, more than 47% of the 500,000 deaths in the UK occurred in hospitals.

In 2021 the National Palliative and End of Life Care Partnership published their six ambitions for 2021–26. These include fair access to end of life care for everyone regardless of who they are, where they live or their circumstances, and the need to maximise comfort and wellbeing. Informed and timely conversations are also highlighted.

Research funded by the UK's National Institute for Health and Care Research (NIHR) has addressed these areas of need. Examples highlight inequalities faced by several groups and offer recommendations. These include the need for close partnership between services caring for people with severe mental illness, improved understanding of the barriers faced by Gypsy, Traveller and Roma communities, and the provision of flexible palliative care services for children from ethnic minorities or deprived areas.

Other research suggests that giving nurses and pharmacists easier access to electronic patient records about prescribing could help people manage their symptoms at home. A named professional to support and guide patients and carers through the healthcare system could also improve the experience of care at home at the end of life. A synthesised review looking at palliative care in the UK created a resource showing which services were available and grouped them according to their intended purpose and benefit to the patient. The authors also stated that palliative services in the UK are currently available only to patients with a timeline to death, usually 12 months or less; they found these timelines to be often inaccurate and to create barriers to patients accessing appropriate services, and they call for a more holistic approach to end-of-life care which is not restricted by arbitrary timelines.

United States

As of 2019, physician-assisted dying is legal in eight states (California, Colorado, Hawaii, Maine, New Jersey, Oregon, Vermont, Washington) and Washington D.C.

Spending on those in the last twelve months of life accounts for 8.5% of total aggregate medical spending in the United States.

When considering only those aged 65 and older, estimates show that about 27% of Medicare's annual $327 billion budget ($88 billion) in 2006 went to care for patients in their final year of life. For the over-65s, between 1992 and 1996, spending on those in their last year of life represented 22% of all medical spending, 18% of all non-Medicare spending, and 25% of all Medicaid spending for the poor. These percentages appear to be falling over time: in 2008, 16.8% of all medical spending on the over-65s went on those in their last year of life.

Predicting death is difficult, which has affected estimates of spending in the last year of life; when controlling for spending on patients who were predicted as likely to die, Medicare spending was estimated at 5% of the total.
