A Medley of Potpourri

Wednesday, August 31, 2022

Mathematics education

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Mathematics_education

A mathematics lecture at Aalto University School of Science and Technology

In contemporary education, mathematics education is the practice of teaching and learning mathematics, along with the associated scholarly research.

Researchers in mathematics education are primarily concerned with the tools, methods and approaches that facilitate practice or the study of practice; however, mathematics education research, known on the continent of Europe as the didactics or pedagogy of mathematics, has developed into an extensive field of study, with its concepts, theories, methods, national and international organisations, conferences and literature. This article describes some of the history, influences and recent controversies.

History

Elementary mathematics was part of the education system in most ancient civilisations, including Ancient Greece, the Roman Empire, Vedic society and ancient Egypt. In most cases, formal education was only available to male children with sufficiently high status, wealth or caste.

Illustration at the beginning of the 14th-century translation of Euclid's Elements.

In Plato's division of the liberal arts into the trivium and the quadrivium, the quadrivium included the mathematical fields of arithmetic and geometry. This structure was continued in the structure of classical education that was developed in medieval Europe. The teaching of geometry was almost universally based on Euclid's Elements. Apprentices to trades such as masons, merchants and money-lenders could expect to learn such practical mathematics as was relevant to their profession.

In the Renaissance, the academic status of mathematics declined, because it was strongly associated with trade and commerce, and considered somewhat un-Christian. Although it continued to be taught in European universities, it was seen as subservient to the study of Natural, Metaphysical and Moral Philosophy. The first modern arithmetic curriculum (starting with addition, then subtraction, multiplication, and division) arose at reckoning schools in Italy in the 1300s. Spreading along trade routes, these methods were designed to be used in commerce. They contrasted with Platonic math taught at universities, which was more philosophical and concerned numbers as concepts rather than calculating methods. They also contrasted with mathematical methods learned by artisan apprentices, which were specific to the tasks and tools at hand. For example, the division of a board into thirds can be accomplished with a piece of string, instead of measuring the length and using the arithmetic operation of division.

The first mathematics textbooks to be written in English and French were published by Robert Recorde, beginning with The Grounde of Artes in 1543. However, there are many different writings on mathematics and mathematics methodology that date back to 1800 BCE. These were mostly located in Mesopotamia where the Sumerians were practicing multiplication and division. There are also artifacts demonstrating their methodology for solving equations like the quadratic equation. After the Sumerians, some of the most famous ancient works on mathematics come from Egypt in the form of the Rhind Mathematical Papyrus and the Moscow Mathematical Papyrus. The more famous Rhind Papyrus has been dated to approximately 1650 BCE but it is thought to be a copy of an even older scroll. This papyrus was essentially an early textbook for Egyptian students.

The social status of mathematical study was improving by the seventeenth century, with the University of Aberdeen creating a Mathematics Chair in 1613, followed by the Chair in Geometry being set up at the University of Oxford in 1619 and the Lucasian Chair of Mathematics being established by the University of Cambridge in 1662.

In the 18th and 19th centuries, the Industrial Revolution led to an enormous increase in urban populations. Basic numeracy skills, such as the ability to tell the time, count money and carry out simple arithmetic, became essential in this new urban lifestyle. Within the new public education systems, mathematics became a central part of the curriculum from an early age.

By the twentieth century, mathematics was part of the core curriculum in all developed countries.

During the twentieth century, mathematics education was established as an independent field of research. Here are some of the main events in this development:

In 1893, a Chair in mathematics education was created at the University of Göttingen, under the administration of Felix Klein
The International Commission on Mathematical Instruction (ICMI) was founded in 1908, and Felix Klein became the first president of the organisation
The professional periodical literature on mathematics education in the U.S.A. had generated more than 4000 articles after 1920, so in 1941 William L. Schaaf published a classified index, sorting them into their various subjects.
A renewed interest in mathematics education emerged in the 1960s, and the International Commission was revitalised
In 1968, the Shell Centre for Mathematical Education was established in Nottingham
The first International Congress on Mathematical Education (ICME) was held in Lyon in 1969. The second congress was in Exeter in 1972, and after that, it has been held every four years

In the 20th century, the cultural impact of the "electronic age" (McLuhan) was also taken up by educational theory and the teaching of mathematics. While the previous approach focused on "working with specialized 'problems' in arithmetic", the emerging structural approach to knowledge had "small children meditating about number theory and 'sets'."

Objectives

Boy doing sums, Guinea-Bissau, 1974.

At different times and in different cultures and countries, mathematics education has attempted to achieve a variety of different objectives. These objectives have included:

The teaching and learning of basic numeracy skills to all students
The teaching of practical mathematics (arithmetic, elementary algebra, plane and solid geometry, trigonometry) to most students, to equip them to follow a trade or craft
The teaching of abstract mathematical concepts (such as set and function) at an early age
The teaching of selected areas of mathematics (such as Euclidean geometry) as an example of an axiomatic system and a model of deductive reasoning
The teaching of selected areas of mathematics (such as calculus) as an example of the intellectual achievements of the modern world
The teaching of advanced mathematics to those students who wish to follow a career in science, technology, engineering, and mathematics (STEM) fields
The teaching of heuristics and other problem-solving strategies to solve non-routine problems
The teaching of mathematics in social sciences and actuarial sciences, as well as in some selected arts under liberal arts education in liberal arts colleges or universities

Methods

The method or methods used in any particular context are largely determined by the objectives that the relevant educational system is trying to achieve. Methods of teaching mathematics include the following:

Classical education: the teaching of mathematics within the quadrivium, part of the classical education curriculum of the Middle Ages, which was typically based on Euclid's Elements taught as a paradigm of deductive reasoning.

Games can motivate students to improve skills that are usually learned by rote. In "Number Bingo," players roll 3 dice, then perform basic mathematical operations on those numbers to get a new number, which they cover on the board trying to cover 4 squares in a row. This game was played at a "Discovery Day" organized by Big Brother Mouse in Laos.

Computer-based math: an approach based around the use of mathematical software as the primary tool of computation.
Computer-based mathematics education: the use of computers to teach mathematics. Mobile applications have also been developed to help students learn mathematics.
Conventional approach: the gradual and systematic guiding through the hierarchy of mathematical notions, ideas and techniques. Starts with arithmetic and is followed by Euclidean geometry and elementary algebra taught concurrently. Requires the instructor to be well informed about elementary mathematics since didactic and curriculum decisions are often dictated by the logic of the subject rather than pedagogical considerations. Other methods emerge by emphasizing some aspects of this approach.
Discovery math: a constructivist method of teaching (discovery learning) mathematics which centres around problem-based or inquiry-based learning, with the use of open-ended questions and manipulative tools. This type of mathematics education was implemented in various parts of Canada beginning in 2005. Discovery-based mathematics is at the forefront of the Canadian Math Wars debate with many criticizing its effectiveness due to declining math scores, in comparison to traditional teaching models that value direct instruction, rote learning, and memorization.
Exercises: the reinforcement of mathematical skills by completing large numbers of exercises of a similar type, such as adding vulgar fractions or solving quadratic equations.
Historical method: teaching the development of mathematics within a historical, social and cultural context. Provides more human interest than the conventional approach.
Mastery: an approach in which most students are expected to achieve a high level of competence before progressing.
New Math: a method of teaching mathematics which focuses on abstract concepts such as set theory, functions and bases other than ten. Adopted in the US as a response to the challenge of early Soviet technical superiority in space, it began to be challenged in the late 1960s. One of the most influential critiques of the New Math was Morris Kline's 1973 book Why Johnny Can't Add. The New Math method was the topic of one of Tom Lehrer's most popular parody songs, with his introductory remarks to the song: "...in the new approach, as you know, the important thing is to understand what you're doing, rather than to get the right answer."
Problem solving: the cultivation of mathematical ingenuity, creativity and heuristic thinking by setting students open-ended, unusual, and sometimes unsolved problems. The problems can range from simple word problems to problems from international mathematics competitions such as the International Mathematical Olympiad. Problem-solving is used as a means to build new mathematical knowledge, typically by building on students' prior understandings.
Recreational mathematics: Mathematical problems that are fun can motivate students to learn mathematics and can increase enjoyment of mathematics.
Standards-based mathematics: a vision for pre-college mathematics education in the US and Canada, focused on deepening student understanding of mathematical ideas and procedures, and formalized by the National Council of Teachers of Mathematics which created the Principles and Standards for School Mathematics.
Relational approach: Uses class topics to solve everyday problems and relates the topic to current events. This approach focuses on the many uses of mathematics and helps students understand why they need to know it as well as helping them to apply mathematics to real-world situations outside of the classroom.
Rote learning: the teaching of mathematical results, definitions and concepts by repetition and memorisation, typically without meaning or support from mathematical reasoning. A derisory term is drill and kill. In traditional education, rote learning is used to teach multiplication tables, definitions, formulas, and other aspects of mathematics.

Content and age levels

Different levels of mathematics are taught at different ages and in somewhat different sequences in different countries. Sometimes a class may be taught at an earlier age than typical as a special or honors class.

Elementary mathematics in most countries is taught similarly, though there are differences. Most countries tend to cover fewer topics in greater depth than in the United States. During the primary school years, children learn about whole numbers and arithmetic, including addition, subtraction, multiplication, and division. Comparisons and measurement are taught, in both numeric and pictorial form, as well as fractions and proportionality, patterns, and various topics related to geometry.

At high school level, in most of the U.S., algebra, geometry and analysis (pre-calculus and calculus) are taught as separate courses in different years. Mathematics in most other countries (and in a few U.S. states) is integrated, with topics from all branches of mathematics studied every year. Students in many countries choose an option or pre-defined course of study rather than choosing courses à la carte as in the United States. Students in science-oriented curricula typically study differential calculus and trigonometry at age 16–17 and integral calculus, complex numbers, analytic geometry, exponential and logarithmic functions, and infinite series in their final year of secondary school. Probability and statistics may be taught in secondary education classes. In some countries, these topics are available as "advanced" or "additional" mathematics.

At college and university, science and engineering students will be required to take multivariable calculus, differential equations, and linear algebra; at several US colleges, the minor or AS in mathematics substantively comprises these courses. Mathematics majors continue to study various other areas within pure mathematics, and often in applied mathematics, with the requirement of specified advanced courses in analysis and modern algebra. Applied mathematics may be taken as a major subject in its own right, while specific topics are taught within other courses: for example, civil engineers may be required to study fluid mechanics, and "math for computer science" might include graph theory, permutation, probability, and formal mathematical proofs. Pure and applied math degrees often include modules in probability theory / mathematical statistics, while a course in numerical methods is a common requirement for applied math. (Theoretical) physics is mathematics-intensive, often overlapping substantively with the pure or applied math degree. "Business mathematics" is usually limited to introductory calculus and, sometimes, matrix calculations. Economics programs additionally cover optimization, often differential equations and linear algebra, and sometimes analysis.

Standards

Throughout most of history, standards for mathematics education were set locally, by individual schools or teachers, depending on the levels of achievement that were relevant to, realistic for, and considered socially appropriate for their pupils.

In modern times, there has been a move towards regional or national standards, usually under the umbrella of a wider standard school curriculum. In England, for example, standards for mathematics education are set as part of the National Curriculum for England, while Scotland maintains its own educational system. Many other countries have centralized ministries which set national standards or curricula, and sometimes even textbooks.

Ma (2000) summarised the research of others who found, based on nationwide data, that students with higher scores on standardised mathematics tests had taken more mathematics courses in high school. This led some states to require three years of mathematics instead of two. But because this requirement was often met by taking another lower-level mathematics course, the additional courses had a “diluted” effect in raising achievement levels.

In North America, the National Council of Teachers of Mathematics (NCTM) published the Principles and Standards for School Mathematics in 2000 for the US and Canada, which boosted the trend towards reform mathematics. In 2006, the NCTM released Curriculum Focal Points, which recommend the most important mathematical topics for each grade level through grade 8. However, these standards were guidelines to implement as American states and Canadian provinces chose. In 2010, the National Governors Association Center for Best Practices and the Council of Chief State School Officers published the Common Core State Standards for US states, which were subsequently adopted by most states. Adoption of the Common Core State Standards in mathematics is at the discretion of each state, and is not mandated by the federal government. "States routinely review their academic standards and may choose to change or add onto the standards to best meet the needs of their students." The NCTM has state affiliates that have different education standards at the state level. For example, Missouri has the Missouri Council of Teachers of Mathematics (MCTM) which has its pillars and standards of education listed on its website. The MCTM also offers membership opportunities to teachers and future teachers so they can stay up to date on the changes in math educational standards.

The Programme for International Student Assessment (PISA), created by the Organisation for Economic Co-operation and Development (OECD), is a global program studying the reading, science and mathematical abilities of 15-year-old students. The first assessment was conducted in the year 2000 with 43 countries participating. PISA has repeated this assessment every three years to provide comparable data, helping to guide global education to better prepare youth for future economies. There have been many ramifications following the results of triennial PISA assessments due to implicit and explicit responses of stakeholders, which have led to education reform and policy change.

Research

"Robust, useful theories of classroom teaching do not yet exist". However, there are useful theories on how children learn mathematics and much research has been conducted in recent decades to explore how these theories can be applied to teaching. The following results are examples of some of the current findings in the field of mathematics education:

Important results: One of the strongest results in recent research is that the most important feature of effective teaching is giving students "opportunity to learn". Teachers can set expectations, time, kinds of tasks, questions, acceptable answers, and type of discussions that will influence students' opportunity to learn. This must involve both skill efficiency and conceptual understanding.

Conceptual understanding: Two of the most important features of teaching in the promotion of conceptual understanding are attending explicitly to concepts and allowing students to struggle with important mathematics. Both of these features have been confirmed through a wide variety of studies. Explicit attention to concepts involves making connections between facts, procedures and ideas. (This is often seen as one of the strong points in mathematics teaching in East Asian countries, where teachers typically devote about half of their time to making connections. At the other extreme is the U.S.A., where essentially no connections are made in school classrooms.) These connections can be made through explanation of the meaning of a procedure, questions comparing strategies and solutions of problems, noticing how one problem is a special case of another, reminding students of the main point, discussing how lessons connect, and so on.

Deliberate, productive struggle with mathematical ideas refers to the fact that when students exert effort with important mathematical ideas, even if this struggle initially involves confusion and errors, the result is greater learning. This is true whether the struggle is due to challenging, well-implemented teaching, or due to faulty teaching that the students must struggle to make sense of.

Formative assessment: Formative assessment is both the best and cheapest way to boost student achievement, student engagement and teacher professional satisfaction. Results surpass those of reducing class size or increasing teachers' content knowledge. Effective assessment is based on clarifying what students should know, creating appropriate activities to obtain the evidence needed, giving good feedback, encouraging students to take control of their learning and letting students be resources for one another.

Homework: Homework which leads students to practice past lessons or prepare for future lessons is more effective than homework that goes over today's lesson. Students benefit from feedback. Students with learning disabilities or low motivation may profit from rewards. For younger children, homework helps simple skills, but not broader measures of achievement.

Students with difficulties: Students with genuine difficulties (unrelated to motivation or past instruction) struggle with basic facts, answer impulsively, struggle with mental representations, have poor number sense and have poor short-term memory. Techniques that have been found productive for helping such students include peer-assisted learning, explicit teaching with visual aids, instruction informed by formative assessment and encouraging students to think aloud.

Algebraic reasoning: Elementary school children need to spend a long time learning to express algebraic properties without symbols before learning algebraic notation. When learning symbols, many students believe letters always represent unknowns and struggle with the concept of variable. They prefer arithmetic reasoning to algebraic equations for solving word problems. It takes time to move from arithmetic to algebraic generalizations to describe patterns. Students often have trouble with the minus sign and understand the equals sign to mean "the answer is....".

Methodology

As with other educational research (and the social sciences in general), mathematics education research depends on both quantitative and qualitative studies. Quantitative research includes studies that use inferential statistics to answer specific questions, such as whether a certain teaching method gives significantly better results than the status quo. The best quantitative studies involve randomized trials where students or classes are randomly assigned different methods to test their effects. They depend on large samples to obtain statistically significant results.
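The logic of such a trial can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names and scores are made up, the helper names (`assign_groups`, `permutation_p_value`) are hypothetical, and a simple permutation test stands in for whatever inferential method a real study would use. With only twelve classes it also shows why large samples matter in practice.

```python
import random

def assign_groups(units, rng):
    """Randomly split units into two groups of (near-)equal size."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control, rng, n_resamples=10_000):
    """Estimate a one-sided p-value for mean(treated) - mean(control)."""
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of the pooled scores
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed:
            hits += 1
    return observed, hits / n_resamples

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical class identifiers, randomly assigned to the two methods.
classes = [f"class_{i}" for i in range(12)]
new_method, status_quo = assign_groups(classes, rng)

# Made-up end-of-term scores, one per class, for illustration only.
scores_new = [78, 85, 90, 72, 88, 95]
scores_old = [70, 65, 80, 75, 60, 68]
observed_diff, p_value = permutation_p_value(scores_new, scores_old, rng)
print(f"observed difference = {observed_diff:.1f}, estimated p = {p_value:.4f}")
```

The p-value here is the share of random relabellings that produce a difference at least as large as the observed one, so a small value suggests the difference is unlikely to be due to the random assignment alone.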

Qualitative research, such as case studies, action research, discourse analysis, and clinical interviews, depends on small but focused samples in an attempt to understand student learning and to look at how and why a given method gives the results it does. Such studies cannot conclusively establish that one method is better than another, as randomized trials can, but unless it is understood why treatment X is better than treatment Y, application of results of quantitative studies will often lead to "lethal mutations" of the finding in actual classrooms. Exploratory qualitative research is also useful for suggesting new hypotheses, which can eventually be tested by randomized experiments. Both qualitative and quantitative studies, therefore, are considered essential in education, just as in the other social sciences. Many studies are “mixed”, simultaneously combining aspects of both quantitative and qualitative research, as appropriate.

Randomized trials

There has been some controversy over the relative strengths of different types of research. Because randomized trials provide clear, objective evidence on “what works”, policymakers often consider only those studies. Some scholars have pushed for more randomized experiments in which teaching methods are randomly assigned to classes. In other disciplines concerned with human subjects, like biomedicine, psychology, and policy evaluation, controlled, randomized experiments remain the preferred method of evaluating treatments. Educational statisticians and some mathematics educators have been working to increase the use of randomized experiments to evaluate teaching methods. On the other hand, many scholars in educational schools have argued against increasing the number of randomized experiments, often because of philosophical objections, such as the ethical difficulty of randomly assigning students to various treatments when such treatments are not yet known to be effective, or the difficulty of assuring rigid control of the independent variable in fluid, real school settings.

In the United States, the National Mathematics Advisory Panel (NMAP) published a report in 2008 based on studies, some of which used randomized assignment of treatments to experimental units, such as classrooms or students. The NMAP report's preference for randomized experiments received criticism from some scholars. In 2010, the What Works Clearinghouse (essentially the research arm for the Department of Education) responded to ongoing controversy by extending its research base to include non-experimental studies, including regression discontinuity designs and single-case studies.

Organizations

Maximum entropy probability distribution

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Definition of entropy and differential entropy

If $X$ is a discrete random variable with distribution given by

\operatorname {Pr} (X=x_{k})=p_{k}\quad {\mbox{ for }}k=1,2,\ldots

then the entropy of $X$ is defined as

H(X)=-\sum _{k\geq 1}p_{k}\log p_{k}.

If $X$ is a continuous random variable with probability density $p(x)$, then the differential entropy of $X$ is defined as

H(X)=-\int _{-\infty }^{\infty }p(x)\log p(x)\,dx.

The quantity $p(x)\log p(x)$ is understood to be zero whenever $p(x)=0$.

This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing $H(X)$ will also maximize the more general forms.

The base of the logarithm is not important as long as the same one is used consistently: change of base merely results in a rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists will often prefer the natural logarithm, resulting in a unit of nats for the entropy.
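A minimal Python sketch of the discrete definition, including the convention that terms with zero probability contribute nothing and the rescaling under a change of base. The function name `entropy` and the example distributions are illustrative choices, not part of the definitions above.

```python
import math

def entropy(probs, base=math.e):
    """Entropy -sum_k p_k log p_k of a discrete distribution.

    Terms with p_k == 0 are skipped, matching the convention 0 log 0 = 0.
    """
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be non-negative and sum to 1")
    nats = -sum(p * math.log(p) for p in probs if p > 0)
    return nats / math.log(base)  # change of base is a pure rescaling

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

h_bits = entropy(fair_coin, base=2)  # entropy in bits
h_nats = entropy(fair_coin)          # entropy in nats
print(h_bits, h_nats, entropy(biased_coin, base=2))
```

The two results for the fair coin differ only by the factor $\ln 2$, which is the rescaling described above.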

The choice of the measure $dx$ is however crucial in determining the entropy and the resulting maximum entropy distribution, even though the usual recourse to the Lebesgue measure is often defended as "natural".

Distributions with measured constants

Many statistical distributions of applicable interest are those for which the moments or other measurable quantities are constrained to be constants. The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.

Continuous case

Suppose $S$ is a closed subset of the real numbers $\mathbb {R}$ and we choose to specify $n$ measurable functions $f_{1},\ldots ,f_{n}$ and $n$ numbers $a_{1},\ldots ,a_{n}$. We consider the class $C$ of all real-valued random variables which are supported on $S$ (i.e. whose density function is zero outside of $S$) and which satisfy the $n$ moment conditions:

\operatorname {E} (f_{j}(X))\geq a_{j}\quad {\mbox{ for }}j=1,\ldots ,n

If there is a member in $C$ whose density function is positive everywhere in $S$, and if there exists a maximal entropy distribution for $C$, then its probability density $p(x)$ has the following shape:

p(x)=\exp \left(\sum _{j=0}^{n}\lambda _{j}f_{j}(x)\right)\quad {\mbox{ for all }}x\in S

where we assume that $f_{0}(x)=1$. The constant $\lambda _{0}$ and the $n$ Lagrange multipliers ${\boldsymbol {\lambda }}=(\lambda _{1},\ldots ,\lambda _{n})$ solve the constrained optimization problem with $a_{0}=1$ (this condition ensures that $p$ integrates to unity):

\max _{\lambda _{0};{\boldsymbol {\lambda }}}\left\{\sum _{j=0}^{n}\lambda _{j}a_{j}-\int \exp \left(\sum _{j=0}^{n}\lambda _{j}f_{j}(x)\right)dx\right\}\quad \mathrm {subject\;to:\;\;} {\boldsymbol {\lambda }}\geq \mathbf {0}

Using the Karush–Kuhn–Tucker conditions, it can be shown that the optimization problem has a unique solution, because the objective function in the optimization is concave in ${\boldsymbol {\lambda }}$.

Note that if the moment conditions are equalities (instead of inequalities), that is,

\operatorname {E} (f_{j}(X))=a_{j}\quad {\mbox{ for }}j=1,\ldots ,n,

then the constraint condition ${\boldsymbol {\lambda }}\geq \mathbf {0}$ is dropped, making the optimization over the Lagrange multipliers unconstrained.
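As a worked illustration of the equality-constrained case (an example chosen here, not taken from the text above), take $S=\mathbb {R}$, $f_{1}(x)=x$, $f_{2}(x)=x^{2}$, $a_{1}=\mu$ and $a_{2}=\mu ^{2}+\sigma ^{2}$, i.e. a prescribed mean and variance. The theorem's form is then the exponential of a quadratic, and matching it against the normal density identifies the multipliers:

```latex
p(x)=\exp \left(\lambda _{0}+\lambda _{1}x+\lambda _{2}x^{2}\right),
\qquad
\lambda _{2}=-\frac {1}{2\sigma ^{2}},\quad
\lambda _{1}=\frac {\mu }{\sigma ^{2}},\quad
\lambda _{0}=-\frac {\mu ^{2}}{2\sigma ^{2}}-\frac {1}{2}\ln (2\pi \sigma ^{2}),

p(x)=\frac {1}{\sqrt {2\pi \sigma ^{2}}}\exp \left(-\frac {(x-\mu )^{2}}{2\sigma ^{2}}\right).
```

Here $\lambda _{2}<0$ is what makes $p$ integrable, and $\lambda _{1}$ can take either sign, since no sign restriction applies to the multipliers when the constraints are equalities.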

Discrete case

Suppose $S=\{x_{1},x_{2},\ldots \}$ is a (finite or infinite) discrete subset of the reals and we choose to specify $n$ functions $f_{1},\ldots ,f_{n}$ and $n$ numbers $a_{1},\ldots ,a_{n}$. We consider the class $C$ of all discrete random variables $X$ which are supported on $S$ and which satisfy the $n$ moment conditions

\operatorname {E} (f_{j}(X))\geq a_{j}\quad {\mbox{ for }}j=1,\ldots ,n

If there exists a member of $C$ which assigns positive probability to all members of $S$ and if there exists a maximum entropy distribution for $C$, then this distribution has the following shape:

\operatorname {Pr} (X=x_{k})=\exp \left(\sum _{j=0}^{n}\lambda _{j}f_{j}(x_{k})\right)\quad {\mbox{ for }}k=1,2,\ldots

where we assume that $f_{0}=1$ and the constants $\lambda _{0},\;{\boldsymbol {\lambda }}=(\lambda _{1},\ldots ,\lambda _{n})$ solve the constrained optimization problem with $a_{0}=1$:

\max _{\lambda _{0};{\boldsymbol {\lambda }}}\left\{\sum _{j=0}^{n}\lambda _{j}a_{j}-\sum _{k\geq 1}\exp \left(\sum _{j=0}^{n}\lambda _{j}f_{j}(x_{k})\right)\right\}\quad \mathrm {subject\;to:\;\;} {\boldsymbol {\lambda }}\geq \mathbf {0}

Again, if the moment conditions are equalities (instead of inequalities), then the constraint condition ${\boldsymbol {\lambda }}\geq \mathbf {0}$ is not present in the optimization.
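For a finite support and a single equality constraint, the discrete form can be computed numerically. The sketch below is illustrative only: the function name `maxent_pmf`, the search bracket and the die example (faces 1 to 6 with prescribed mean 4.5) are assumptions made here. Instead of maximizing the objective above directly, it fixes $\lambda _{0}$ by normalisation and finds $\lambda _{1}$ by bisection, which works because the mean of $f_{1}$ under $p$ is non-decreasing in $\lambda _{1}$.

```python
import math

def maxent_pmf(support, f, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximum entropy pmf on a finite support subject to E[f(X)] = target.

    The pmf has the form p_k = exp(lam0 + lam1 * f(x_k)). The bracket
    [lo, hi] for lam1 is an assumption that suits small examples.
    """
    values = [f(x) for x in support]
    if not (min(values) < target < max(values)):
        raise ValueError("target must lie strictly inside the range of f")

    def pmf(lam1):
        shift = max(lam1 * v for v in values)  # avoids overflow in exp
        weights = [math.exp(lam1 * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def moment(lam1):
        return sum(p * v for p, v in zip(pmf(lam1), values))

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if moment(mid) < target:
            lo = mid
        else:
            hi = mid
    lam1 = (lo + hi) / 2
    probs = pmf(lam1)
    lam0 = math.log(probs[0]) - lam1 * values[0]  # recovers the normaliser
    return lam0, lam1, probs

faces = list(range(1, 7))
lam0, lam1, probs = maxent_pmf(faces, lambda x: x, 4.5)
for face, p in zip(faces, probs):
    print(face, round(p, 4))
```

Because the target mean exceeds 3.5, the multiplier $\lambda _{1}$ comes out positive and the probabilities increase geometrically with the face value; a target of exactly 3.5 gives the uniform distribution.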

Proof in the case of equality constraints

In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

\int _{-\infty }^{\infty }f_{j}(x)p(x)dx=a_{j}

We consider the functional

J(p)=\int _{-\infty }^{\infty }p(x)\ln {p(x)}dx-\eta _{0}\left(\int _{-\infty }^{\infty }p(x)dx-1\right)-\sum _{j=1}^{n}\lambda _{j}\left(\int _{-\infty }^{\infty }f_{j}(x)p(x)dx-a_{j}\right)

where $\eta _{0}$ and $\lambda _{j},j\geq 1$ are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability, namely that $p$ integrates to one. The other constraints are that the measurements of the function are given constants up to order $n$. The entropy attains an extremum when the functional derivative is equal to zero:

{\frac {\delta J}{\delta p}}\left(p\right)=\ln {p(x)}+1-\eta _{0}-\sum _{j=1}^{n}\lambda _{j}f_{j}(x)=0

It is an exercise for the reader that this extremum is indeed a maximum. Therefore, the maximum entropy probability distribution in this case must be of the form ($\lambda _{0}:=\eta _{0}-1$)

p(x)=e^{-1+\eta _{0}}\cdot e^{\sum _{j=1}^{n}\lambda _{j}f_{j}(x)}=\exp \left(\sum _{j=0}^{n}\lambda _{j}f_{j}(x)\right).

The proof of the discrete version is essentially the same.

Uniqueness of the maximum

Suppose $p$ , $p'$ are distributions satisfying the expectation-constraints. Letting $\alpha \in (0,1)$ and considering the distribution $q=\alpha \cdot p+(1-\alpha )\cdot p'$ it is clear that this distribution satisfies the expectation-constraints and furthermore has as support $\mathrm {supp} (q)=\mathrm {supp} (p)\cup \mathrm {supp} (p')$ . From basic facts about entropy, it holds that ${\mathcal {H}}(q)\geq \alpha {\mathcal {H}}(p)+(1-\alpha ){\mathcal {H}}(p')$ . Taking limits $\alpha \longrightarrow 1$ and $\alpha \longrightarrow 0$ respectively yields ${\mathcal {H}}(q)\geq {\mathcal {H}}(p),{\mathcal {H}}(p')$ .

It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support, i.e. the distribution is almost everywhere positive. It follows that the maximising distribution must be an internal point in the space of distributions satisfying the expectation-constraints, that is, it must be a local extreme. Thus it suffices to show that the local extreme is unique in order to show that the entropy-maximising distribution is unique (and this also shows that the local extreme is the global maximum).

Suppose $p,p'$ are local extremes. Reformulating the above computations these are characterised by parameters ${\vec {\lambda }},{\vec {\lambda }}'\in \mathbb {R} ^{n}$ via $p(x)={\frac {e^{\langle {\vec {\lambda }},{\vec {f}}(x)\rangle }}{C({\vec {\lambda }})}}$ and similarly for $p'$ , where $C({\vec {\lambda }})=\int _{x\in \mathbb {R} }e^{\langle {\vec {\lambda }},{\vec {f}}(x)\rangle }~dx$ . We now note a series of identities: Via the satisfaction of the expectation-constraints and utilising gradients/directional derivatives, one has ${\displaystyle D\log(C(\cdot ))\vert _{\vec {\lambda }}=\left.{\frac {DC(\cdot )}{C(\cdot )}}\right|_{\vec {\lambda }}=\mathbb {E} _{p}[{\vec {f}}(X)]={\vec {a}}}$ and similarly for ${\vec {\lambda }}'$ . Letting $u={\vec {\lambda }}'-{\vec {\lambda }}\in \mathbb {R} ^{n}$ one obtains:

{\displaystyle 0=\langle u,{\vec {a}}-{\vec {a}}\rangle =D_{u}\log(C(\cdot ))\vert _{{\vec {\lambda }}'}-D_{u}\log(C(\cdot ))\vert _{\vec {\lambda }}=D_{u}^{2}\log(C(\cdot ))\vert _{\vec {\gamma }}}

where ${\vec {\gamma }}=\theta {\vec {\lambda }}+(1-\theta ){\vec {\lambda }}'$ for some $\theta \in (0,1)$ . Computing further one has

{\displaystyle {\begin{array}{rcl}0&=&D_{u}^{2}\log(C(\cdot ))\vert _{\vec {\gamma }}\\&=&\left.D_{u}\left({\frac {D_{u}C(\cdot )}{C(\cdot )}}\right)\right|_{\vec {\gamma }}\\&=&\left.{\frac {D_{u}^{2}C(\cdot )}{C(\cdot )}}\right|_{\vec {\gamma }}-\left.{\frac {(D_{u}C(\cdot ))^{2}}{C(\cdot )^{2}}}\right|_{\vec {\gamma }}\\&=&\mathbb {E} _{q}[(\langle u,{\vec {f}}(X)\rangle )^{2}]-\left(\mathbb {E} _{q}[\langle u,{\vec {f}}(X)\rangle ]\right)^{2}=\mathrm {Var} _{q}(\langle u,{\vec {f}}(X)\rangle )\\\end{array}}}

where $q$ is similar to the distribution above, only parameterised by ${\vec {\gamma }}$ . Assuming that no non-trivial linear combination of the observables is almost everywhere (a.e.) constant (which holds, e.g., if the observables are independent and not a.e. constant), it holds that $\langle u,{\vec {f}}(X)\rangle$ has non-zero variance unless $u=0$ . By the above equation, it is thus clear that the latter must be the case. Hence ${\vec {\lambda }}'-{\vec {\lambda }}=u=0$ , so the parameters characterising the local extrema $p,p'$ are identical, which means that the distributions themselves are identical. Thus, the local extreme is unique and, by the above discussion, the maximum is unique, provided a local extreme actually exists.
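The central identity of this argument, that the second directional derivative of $\log C$ equals a variance, can be checked numerically. The following is a minimal sketch, assuming a finite support (so sums replace the integrals) and a single observable $f(x)=x$; both choices and the function names are illustrative and not part of the proof.

```python
import math

# Assumed finite support and a single observable f(x) = x (illustrative choices).
SUPPORT = [1, 2, 3, 4, 5, 6]

def log_partition(lam):
    """log C(lam), with C(lam) = sum over x of exp(lam * f(x))."""
    return math.log(sum(math.exp(lam * x) for x in SUPPORT))

def tilted_variance(lam):
    """Variance of f(X) under p(x) = exp(lam * f(x)) / C(lam)."""
    weights = [math.exp(lam * x) for x in SUPPORT]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, SUPPORT)) / total
    second = sum(w * x * x for w, x in zip(weights, SUPPORT)) / total
    return second - mean ** 2

def second_derivative(func, lam, h=1e-4):
    """Central finite-difference estimate of the second derivative of func."""
    return (func(lam + h) - 2.0 * func(lam) + func(lam - h)) / (h * h)

for lam in (-0.5, 0.0, 0.7):
    # The two printed numbers should agree up to finite-difference error.
    print(lam, second_derivative(log_partition, lam), tilted_variance(lam))
```

Because the variance is strictly positive whenever $f$ is not constant on the support, $\log C$ is strictly convex, which is what rules out two distinct parameter vectors with the same expectations.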

Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy. It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case the theorem does not apply, but one can work around this by shrinking the set S.

Examples

Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. To see this, rewrite the density as $p(x)=\exp {(\ln {p(x)})}$ and compare to the expression of the theorem above. By choosing $\ln {p(x)}\rightarrow f(x)$ to be the measurable function and

\int \exp {(f(x))}f(x)dx=-H

to be the constant, $p(x)$ is the maximum entropy probability distribution under the constraint

\int p(x)f(x)dx=-H

Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure $\ln {p(x)}\rightarrow f(x)$ and finding that $f(x)$ can be separated into parts.

A table of examples of maximum entropy distributions is given in Lisman (1972) and Park & Bera (2009).

Uniform and piecewise uniform distributions

The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b], and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision a=a₀ < a₁ < ... < a_k = b of the interval [a,b] and probabilities p₁,...,p_k that add up to one, then we can consider the class of all continuous distributions such that

\operatorname {Pr} (a_{j-1}\leq X<a_{j})=p_{j}\quad {\mbox{ for }}j=1,\ldots ,k

The density of the maximum entropy distribution for this class is constant on each of the intervals [a_{j-1}, a_j). The uniform distribution on the finite set {x₁,...,x_n} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
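As an illustration of the piecewise uniform case, the sketch below computes the constant density height on each subinterval and the resulting differential entropy. The function names and the example subdivision are assumptions made for this illustration.

```python
import math

def piecewise_uniform_density(breakpoints, probs):
    """Constant density height p_j / (a_j - a_{j-1}) on each interval [a_{j-1}, a_j)."""
    if len(breakpoints) != len(probs) + 1:
        raise ValueError("need exactly one more breakpoint than probabilities")
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("probabilities must add up to one")
    widths = [b - a for a, b in zip(breakpoints, breakpoints[1:])]
    return [p / w for p, w in zip(probs, widths)]

def differential_entropy(breakpoints, probs):
    """Entropy (in nats) of the piecewise-constant density: -sum_j p_j ln(height_j)."""
    heights = piecewise_uniform_density(breakpoints, probs)
    return -sum(p * math.log(h) for p, h in zip(probs, heights) if p > 0)

# Hypothetical example: [0, 3] split at 1, with probability 1/2 on each piece.
print(piecewise_uniform_density([0.0, 1.0, 3.0], [0.5, 0.5]))
print(differential_entropy([0.0, 1.0, 3.0], [0.5, 0.5]))
# With no subdivision the class is larger, so its maximum entropy is at least as large.
print(differential_entropy([0.0, 3.0], [1.0]))
```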

Positive and specified mean: the exponential distribution

The exponential distribution, for which the density function is

p(x|\lambda )={\begin{cases}\lambda e^{-\lambda x}&x\geq 0,\\0&x<0,\end{cases}}

is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a specified mean of 1/λ.
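This can be recovered from the general theorem as a short worked derivation. Here $\lambda _{0},\lambda _{1}$ denote the multipliers of the theorem (not to be confused with the rate λ), the single observable is $f_{1}(x)=x$ with $a_{1}=1/\lambda$ , and the support is [0,∞):

```latex
\begin{aligned}
p(x) &= \exp(\lambda_0 + \lambda_1 x), \qquad x \ge 0, \quad \lambda_1 < 0 \text{ (needed for normalisability)},\\
\int_0^\infty p(x)\,dx &= \frac{e^{\lambda_0}}{-\lambda_1} = 1,\\
\int_0^\infty x\,p(x)\,dx &= \frac{e^{\lambda_0}}{\lambda_1^{2}} = \frac{1}{\lambda},\\
\Rightarrow\quad \lambda_1 &= -\lambda, \qquad e^{\lambda_0} = \lambda, \qquad p(x) = \lambda e^{-\lambda x}.
\end{aligned}
```

Dividing the mean condition by the normalisation condition gives $-1/\lambda _{1}=1/\lambda$ , which fixes $\lambda _{1}$ ; the normalisation then fixes $\lambda _{0}$ .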

Specified mean and variance: the normal distribution

The normal distribution N(μ,σ²), for which the density function is

p(x|\mu ,\sigma )={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}},

has maximum entropy among all real-valued distributions supported on (−∞,∞) with a specified variance σ² (a particular moment). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond this moment. (See the differential entropy article for a derivation.)
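A quick numerical comparison illustrates the claim. The sketch below evaluates the standard closed-form differential entropies (in nats) of a normal, a Laplace and a uniform distribution that share the same standard deviation; the function names are chosen here for illustration.

```python
import math

def normal_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    """Laplace with scale b has variance 2 b^2 and entropy 1 + ln(2 b)."""
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

def uniform_entropy(sigma):
    """Uniform of width w has variance w^2 / 12 and entropy ln(w)."""
    width = sigma * math.sqrt(12.0)
    return math.log(width)

sigma = 1.0
print("normal :", normal_entropy(sigma))
print("Laplace:", laplace_entropy(sigma))
print("uniform:", uniform_entropy(sigma))
```

For any fixed σ the normal value is the largest of the three, and the gaps do not depend on σ because all three entropies shift by ln σ.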

In the case of distributions supported on [0,∞), the maximum entropy distribution depends on relationships between the first and second moments. In specific cases, it may be the exponential distribution, or may be another distribution, or may be undefinable.

Discrete distributions with specified mean

Among all the discrete distributions supported on the set {x₁,...,x_n} with a specified mean μ, the maximum entropy distribution has the following shape:

\operatorname {Pr} (X=x_{k})=Cr^{x_{k}}\quad {\mbox{ for }}k=1,\ldots ,n

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

For example, suppose a large number N of dice are thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x₁,...,x₆} = {1,...,6} and μ = S/N.
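The constants C and r for the dice example can be found numerically. The following is a minimal sketch, assuming bisection on ln r (the mean is an increasing function of r) and a search bracket of ±30 that is adequate unless μ is extremely close to the smallest or largest value; the function name and the example mean of 4.5 are hypothetical.

```python
import math

def maxent_with_mean(values, mu, tol=1e-12):
    """Return (probabilities, r) with Pr(X = x_k) proportional to r ** x_k and mean mu."""
    if not min(values) < mu < max(values):
        raise ValueError("mean must lie strictly between the smallest and largest value")

    def probs(t):
        # t = ln r; normalising the weights plays the role of the constant C.
        weights = [math.exp(t * x) for x in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(t):
        return sum(p * x for p, x in zip(probs(t), values))

    lo, hi = -30.0, 30.0  # assumed bracket for ln r
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < mu:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return probs(t), math.exp(t)

faces = [1, 2, 3, 4, 5, 6]
p, r = maxent_with_mean(faces, 4.5)  # e.g. S / N = 4.5
print([round(pk, 4) for pk in p], r)
```

For μ = 3.5 the procedure returns r = 1 up to numerical tolerance, i.e. the uniform distribution, as expected for a fair die.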

Finally, among all the discrete distributions supported on the infinite set $\{x_{1},x_{2},...\}$ with mean μ, the maximum entropy distribution has the shape:

\operatorname {Pr} (X=x_{k})=Cr^{x_{k}}\quad {\mbox{ for }}k=1,2,\ldots ,

where again the constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. For example, in the case that x_k = k, this gives

C={\frac {1}{\mu -1}},\quad \quad r={\frac {\mu -1}{\mu }},

so that the respective maximum entropy distribution is the geometric distribution.
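The closed-form constants for x_k = k can be sanity-checked by summing the series. The sketch below truncates the infinite sums at an assumed cutoff of 2000 terms, which is ample for the hypothetical mean μ = 4 used here.

```python
def geometric_maxent(mu):
    """Constants C and r of Pr(X = k) = C * r**k on {1, 2, ...} with mean mu."""
    if mu <= 1:
        raise ValueError("the mean must exceed 1 on the support {1, 2, ...}")
    c = 1.0 / (mu - 1.0)
    r = (mu - 1.0) / mu
    return c, r

mu = 4.0
c, r = geometric_maxent(mu)
terms = 2000  # assumed truncation point; r**2000 is negligible here
total = sum(c * r ** k for k in range(1, terms + 1))
mean = sum(k * c * r ** k for k in range(1, terms + 1))
print(total, mean)  # should be close to 1 and to mu
```

Writing p = 1/μ, the product C·r equals p and r equals 1 − p, so C r^k = p(1 − p)^{k−1}, the usual form of the geometric distribution.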

Circular random variables

For a continuous random variable $\theta _{i}$ distributed about the unit circle, the Von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified or, equivalently, the circular mean and circular variance are specified.

When the mean and variance of the angles $\theta _{i}$ modulo $2\pi$ are specified, the wrapped normal distribution maximizes the entropy.

Maximizer for specified mean, variance and skew

There exists an upper bound on the entropy of continuous random variables on $\mathbb {R}$ with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because $p(x)=c\exp {(\lambda _{1}x+\lambda _{2}x^{2}+\lambda _{3}x^{3})}$ is unbounded when $\lambda _{3}\neq 0$ (see Cover & Thomas (2006: chapter 12)).

However, the maximum entropy is $ε$ -achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many $σ$ larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower order moments.

This is a special case of the general result that the exponential of any odd-order polynomial in x is unbounded on $\mathbb {R}$ . For example, $ce^{\lambda x}$ is likewise unbounded on $\mathbb {R}$ , but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if x lies in the interval [0,∞) and λ < 0, the exponential distribution results).

Maximizer for specified mean and deviation risk measure

Every distribution with log-concave density is a maximal entropy distribution with specified mean μ and deviation risk measure D.

In particular, the maximal entropy distribution with specified mean $E(x)=\mu$ and deviation $D(x)=d$ is:

The normal distribution $N(\mu ,d^{2})$ , if $D(x)={\sqrt {E[(x-\mu )^{2}]}}$ is the standard deviation;
The Laplace distribution, if $D(x)=E(|x-\mu |)$ is the average absolute deviation;
The distribution with density of the form $f(x)=c\exp(ax+b{[x-\mu ]_{-}}^{2})$ if $D(x)={\sqrt {E[{(x-\mu )_{-}}^{2}]}}$ is the standard lower semi-deviation, where $[x]_{-}:=\max\{0,-x\}$ , and a,b,c are constants.

Other examples

In the table below, each listed distribution maximizes the entropy for a particular set of functional constraints listed in the third column, and the constraint that x be included in the support of the probability density, which is listed in the fourth column. Several examples (Bernoulli, geometric, exponential, Laplace, Pareto) listed are trivially true because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity. For reference, $\Gamma (x)=\int _{0}^{\infty }e^{-t}t^{x-1}dt$ is the gamma function, $\psi (x)={\frac {d}{dx}}\ln \Gamma (x)={\frac {\Gamma '(x)}{\Gamma (x)}}$ is the digamma function, $B(p,q)={\frac {\Gamma (p)\Gamma (q)}{\Gamma (p+q)}}$ is the beta function, and $\gamma _{\mathrm {E} }$ is the Euler-Mascheroni constant.

Table of probability distributions and corresponding maximum entropy constraints
Distribution Name	Probability density/mass function	Maximum Entropy Constraint	Support
Uniform (discrete)	$f(k)={\frac {1}{b-a+1}}$	None	$\{a,a+1,...,b-1,b\}\,$
Uniform (continuous)	$f(x)={\frac {1}{b-a}}$	None	$[a,b]\,$
Bernoulli	$f(k)=p^{k}(1-p)^{1-k}$	$\operatorname {E} (k)=p\,$	$\{0,1\}\,$
Geometric	$f(k)=(1-p)^{k-1}\,p$	$\operatorname {E} (k)={\frac {1}{p}}\,$	$\mathbb {N} \setminus \left\{0\right\}=\{1,2,3,...\}$
Exponential	$f(x)=\lambda \exp \left(-\lambda x\right)$	$\operatorname {E} (x)={\frac {1}{\lambda }}\,$	$[0,\infty )\,$
Laplace	$f(x)={\frac {1}{2b}}\exp \left(-{\frac {|x-\mu |}{b}}\right)$	$\operatorname {E} (|x-\mu |)=b\,$	$(-\infty ,\infty )\,$
Asymmetric Laplace	$f(x)={\frac {\lambda \,e^{-(x-m)\lambda s\kappa ^{s}}}{\kappa +1/\kappa }}\,(s\!=\!\operatorname {sgn}(x\!-\!m))$	$\operatorname {E} ((x-m)s\kappa ^{s})=1/\lambda \,$	$(-\infty ,\infty )\,$
Pareto	$f(x)={\frac {\alpha x_{m}^{\alpha }}{x^{\alpha +1}}}$	$\operatorname {E} (\ln(x))={\frac {1}{\alpha }}+\ln(x_{m})\,$	$[x_{m},\infty )\,$
Normal	$f(x)={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right)$	$\operatorname {E} (x)=\mu ,\,\operatorname {E} ((x-\mu )^{2})=\sigma ^{2}$	$(-\infty ,\infty )\,$
Truncated normal	(see article)	$\operatorname {E} (x)=\mu _{T},\,\operatorname {E} ((x-\mu _{T})^{2})=\sigma _{T}^{2}$	$[a,b]$
von Mises	$f(\theta )={\frac {1}{2\pi I_{0}(\kappa )}}\exp {(\kappa \cos {(\theta -\mu )})}$	${\displaystyle \operatorname {E} (\cos \theta )={\frac {I_{1}(\kappa )}{I_{0}(\kappa )}}\cos \mu ,\,\operatorname {E} (\sin \theta )={\frac {I_{1}(\kappa )}{I_{0}(\kappa )}}\sin \mu }$	$[0,2\pi )\,$
Rayleigh	$f(x)={\frac {x}{\sigma ^{2}}}\exp \left(-{\frac {x^{2}}{2\sigma ^{2}}}\right)$	${\displaystyle \operatorname {E} (x^{2})=2\sigma ^{2},\operatorname {E} (\ln(x))={\frac {\ln(2\sigma ^{2})-\gamma _{\mathrm {E} }}{2}}\,}$	$[0,\infty )\,$
Beta	$f(x)={\frac {x^{\alpha -1}(1-x)^{\beta -1}}{B(\alpha ,\beta )}}$ for $0\leq x\leq 1$	$\operatorname {E} (\ln(x))=\psi (\alpha )-\psi (\alpha +\beta )\,$ $\operatorname {E} (\ln(1-x))=\psi (\beta )-\psi (\alpha +\beta )\,$	$[0,1]\,$
Cauchy	$f(x)={\frac {1}{\pi (1+x^{2})}}$	$\operatorname {E} (\ln(1+x^{2}))=2\ln 2$	$(-\infty ,\infty )\,$
Chi	$f(x)={\frac {2}{2^{k/2}\Gamma (k/2)}}x^{k-1}\exp \left(-{\frac {x^{2}}{2}}\right)$	${\displaystyle \operatorname {E} (x^{2})=k,\,\operatorname {E} (\ln(x))={\frac {1}{2}}\left[\psi \left({\frac {k}{2}}\right)\!+\!\ln(2)\right]}$	$[0,\infty )\,$
Chi-squared	$f(x)={\frac {1}{2^{k/2}\Gamma (k/2)}}x^{{\frac {k}{2}}\!-\!1}\exp \left(-{\frac {x}{2}}\right)$	$\operatorname {E} (x)=k,\,\operatorname {E} (\ln(x))=\psi \left({\frac {k}{2}}\right)+\ln(2)$	$[0,\infty )\,$
Erlang	$f(x)={\frac {\lambda ^{k}}{(k-1)!}}x^{k-1}\exp(-\lambda x)$	$\operatorname {E} (x)=k/\lambda ,\,\operatorname {E} (\ln(x))=\psi (k)-\ln(\lambda )$	$[0,\infty )\,$
Gamma	$f(x)={\frac {x^{k-1}\exp(-{\frac {x}{\theta }})}{\theta ^{k}\Gamma (k)}}$	$\operatorname {E} (x)=k\theta ,\,\operatorname {E} (\ln(x))=\psi (k)+\ln(\theta )$	$[0,\infty )\,$
Lognormal	$f(x)={\frac {1}{\sigma x{\sqrt {2\pi }}}}\exp \left(-{\frac {(\ln x-\mu )^{2}}{2\sigma ^{2}}}\right)$	$\operatorname {E} (\ln(x))=\mu ,\operatorname {E} ((\ln(x)-\mu )^{2})=\sigma ^{2}\,$	$(0,\infty )\,$
Maxwell–Boltzmann	$f(x)={\frac {1}{a^{3}}}{\sqrt {\frac {2}{\pi }}}\,x^{2}\exp \left(-{\frac {x^{2}}{2a^{2}}}\right)$	${\displaystyle \operatorname {E} (x^{2})=3a^{2},\,\operatorname {E} (\ln(x))\!=\!1\!+\!\ln \left({\frac {a}{\sqrt {2}}}\right)\!-\!{\frac {\gamma _{\mathrm {E} }}{2}}}$	$[0,\infty )\,$
Weibull	$f(x)={\frac {k}{\lambda ^{k}}}x^{k-1}\exp \left(-{\frac {x^{k}}{\lambda ^{k}}}\right)$	${\displaystyle \operatorname {E} (x^{k})=\lambda ^{k},\operatorname {E} (\ln(x))=\ln(\lambda )-{\frac {\gamma _{\mathrm {E} }}{k}}\,}$	$[0,\infty )\,$
Multivariate normal	$f_{X}({\vec {x}})=$ ${\displaystyle {\frac {\exp \left(-{\frac {1}{2}}({\vec {x}}-{\vec {\mu }})^{\top }\Sigma ^{-1}\cdot ({\vec {x}}-{\vec {\mu }})\right)}{(2\pi )^{n/2}\left|\Sigma \right|^{1/2}}}}$	${\displaystyle \operatorname {E} ({\vec {x}})={\vec {\mu }},\,\operatorname {E} (({\vec {x}}-{\vec {\mu }})({\vec {x}}-{\vec {\mu }})^{\top })=\Sigma \,}$	$\mathbb {R} ^{n}$
Binomial	$f(k)={n \choose k}p^{k}(1-p)^{n-k}$	$\operatorname {E} (x)=\mu ,f\in {\text{n-generalized binomial distribution}}$	$\left\{0,{\ldots },n\right\}$
Poisson	$f(k)={\frac {\lambda ^{k}\exp(-\lambda )}{k!}}$	$\operatorname {E} (x)=\lambda ,f\in {\infty }{\text{-generalized binomial distribution}}$	$\mathbb {N} =\left\{0,1,{\ldots }\right\}$
Logistic	$f(x)={\frac {e^{-x}}{(1+e^{-x})^{2}}}$	$\operatorname {E} (x)=0,\operatorname {E} (\ln(1+e^{-x}))=1$	$(-\infty ,\infty )\,$
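
Individual rows of the table can be spot-checked by numerical integration. The sketch below does this for the logistic row, assuming a trapezoidal rule on the truncated interval [−40, 40] (the neglected tails are far below the tolerance of interest); the helper names are illustrative.

```python
import math

def logistic_pdf(x):
    """Standard logistic density e^{-x} / (1 + e^{-x})^2."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2

def expectation(g, lo=-40.0, hi=40.0, steps=80_000):
    """Trapezoidal estimate of E[g(X)] for the standard logistic density."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) * logistic_pdf(lo) + g(hi) * logistic_pdf(hi))
    for i in range(1, steps):
        x = lo + i * h
        total += g(x) * logistic_pdf(x)
    return total * h

mass = expectation(lambda x: 1.0)                           # total probability
mean = expectation(lambda x: x)                             # constraint E(x) = 0
log_term = expectation(lambda x: math.log1p(math.exp(-x)))  # constraint E(ln(1 + e^{-x})) = 1
print(mass, mean, log_term)
```

The same pattern applies to the other continuous rows by swapping in the relevant density, constraint function and integration range.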