The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge is the one with largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information).
Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. According to this principle, the distribution with maximal information entropy is the best choice.
History
Overview
In most practical cases, the stated prior data or testable information is given by a set of conserved quantities (average values of some moment functions), associated with the probability distribution in question. This is the way the maximum entropy principle is most often used in statistical thermodynamics. Another possibility is to prescribe some symmetries of the probability distribution. The equivalence between conserved quantities and corresponding symmetry groups implies a similar equivalence for these two ways of specifying the testable information in the maximum entropy method.The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular.
The maximum entropy principle makes explicit our freedom in using different forms of prior data. As a special case, a uniform prior probability density (Laplace's principle of indifference, sometimes called the principle of insufficient reason), may be adopted. Thus, the maximum entropy principle is not merely an alternative way to view the usual methods of inference of classical statistics, but represents a significant conceptual generalization of those methods.
However these statements do not imply that thermodynamical systems need not be shown to be ergodic to justify treatment as a statistical ensemble.
In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.
Testable information
The principle of maximum entropy is useful explicitly only when applied to testable information. Testable information is a statement about a probability distribution whose truth or falsity is well-defined. For example, the statements- the expectation of the variable x is 2.87
- p2 + p3 > 0.6
Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.
Entropy maximization with no testable information respects the universal "constraint" that the sum of the probabilities is one. Under this constraint, the maximum entropy discrete probability distribution is the uniform distribution,
Applications
The principle of maximum entropy is commonly applied in two ways to inferential problems:Prior probabilities
The principle of maximum entropy is often used to obtain prior probability distributions for Bayesian inference. Jaynes was a strong advocate of this approach, claiming the maximum entropy distribution represented the least informative distribution. A large amount of literature is now dedicated to the elicitation of maximum entropy priors and links with channel coding.Posterior probabilities
Maximum entropy is a sufficient updating rule for radical probabilism. Richard Jeffrey's probability kinematics is a special case of maximum entropy inference. However, maximum entropy is not a generalisation of all such sufficient updating rules.Maximum entropy models
Alternatively, the principle is often invoked for model specification: in this case the observed data itself is assumed to be the testable information. Such models are widely used in natural language processing. An example of such a model is logistic regression, which corresponds to the maximum entropy classifier for independent observations.Probability Density Estimation
One of the main applications of the maximum entropy principle is in discrete and continuous density estimation. Similar to support vector machine estimators, the maximum entropy principle may require the solution to a quadratic programming and thus provide a sparse mixture model as the optimal density estimator. One important advantage of the method is able to incorporate prior information in the density estimation.General solution for the maximum entropy distribution with linear constraints
Discrete case
We have some testable information I about a quantity x taking values in {x1, x2,..., xn}. We assume this information has the form of m constraints on the expectations of the functions fk; that is, we require our probability distribution to satisfyThe λk parameters are Lagrange multipliers whose particular values are determined by the constraints according to
Continuous case
For continuous distributions, the Shannon entropy cannot be used, as it is only defined for discrete probability spaces. Instead Edwin Jaynes (1963, 1968, 2003) gave the following formula, which is closely related to the relative entropy.A closely related quantity, the relative entropy, is usually defined as the Kullback–Leibler divergence of m from p (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.
We have some testable information I about a quantity x which takes values in some interval of the real numbers (all integrals below are over this interval). We assume this information has the form of m constraints on the expectations of the functions fk, i.e. we require our probability density function to satisfy
Examples
For several examples of maximum entropy distributions, see the article on maximum entropy probability distributions.Justifications for the principle of maximum entropy
Proponents of the principle of maximum entropy justify its use in assigning probabilities in several ways, including the following two arguments. These arguments take the use of Bayesian probability as given, and are thus subject to the same postulates.Information entropy as a measure of 'uninformativeness'
Consider a discrete probability distribution among m mutually exclusive propositions. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero. The least informative distribution would occur when there is no reason to favor any one of the propositions over the others. In that case, the only reasonable probability distribution would be uniform, and then the information entropy would be equal to its maximum possible value, log m. The information entropy can therefore be seen as a numerical measure which describes how uninformative a particular probability distribution is, ranging from zero (completely informative) to log m (completely uninformative).By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess. Thus the maximum entropy distribution is the only reasonable distribution.
The Wallis derivation
The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962. It is essentially the same mathematical argument used for the Maxwell–Boltzmann statistics in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed a priori, but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.Suppose an individual wishes to make a probability assignment among m mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute N quanta of probability (each worth 1/N) at random among the m possibilities. (One might imagine that she will throw N balls into m buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. (For this step to be successful, the information must be a constraint given by an open set in the space of probability measures). If it is inconsistent, she will reject it and try again. If it is consistent, her assessment will be
Now, in order to reduce the 'graininess' of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,
The most probable result is the one which maximizes the multiplicity W. Rather than maximizing W directly, the protagonist could equivalently maximize any monotonic increasing function of W. She decides to maximize
Compatibility with Bayes' theorem
Giffin and Caticha (2007) state that Bayes' theorem and the principle of maximum entropy are completely compatible and can be seen as special cases of the "method of maximum relative entropy". They state that this method reproduces every aspect of orthodox Bayesian inference methods. In addition this new method opens the door to tackling problems that could not be addressed by either the maximal entropy principle or orthodox Bayesian methods individually. Moreover, recent contributions (Lazar 2003, and Schennach 2005) show that frequentist relative-entropy-based inference approaches (such as empirical likelihood and exponentially tilted empirical likelihood – see e.g. Owen 2001 and Kitamura 2006) can be combined with prior information to perform Bayesian posterior analysis.Jaynes stated Bayes' theorem was a way to calculate a probability, while maximum entropy was a way to assign a prior probability distribution.
It is however, possible in concept to solve for a posterior distribution directly from a stated prior distribution using the principle of minimum cross entropy (or the Principle of Maximum Entropy being a special case of using a uniform distribution as the given prior), independently of any Bayesian considerations by treating the problem formally as a constrained optimisation problem, the Entropy functional being the objective function. For the case of given average values as testable information (averaged over the sought after probability distribution), the sought after distribution is formally the Gibbs (or Boltzmann) distribution the parameters of which must be solved for in order to achieve minimum cross entropy and satisfy the given testable information.