With the publication of "A Dynamical Theory of the Electromagnetic Field" in 1865, Maxwell demonstrated that electric and magnetic fields travel through space as waves moving at the speed of light. He proposed that light is an undulation in the same medium that is the cause of electric and magnetic phenomena. The unification of light and electrical phenomena led to his prediction of the existence of radio waves, and the paper contained the final version of the equations he had been working on since 1856. As a result of these equations, and of other contributions such as introducing an effective method for dealing with network problems and linear conductors, he is regarded as a founder of the modern field of electrical engineering. In 1871, Maxwell became the first Cavendish Professor of Physics, serving until his death in 1879.
Maxwell was the first to derive the Maxwell–Boltzmann distribution, a statistical means of describing aspects of the kinetic theory of gases, which he worked on sporadically throughout his career. He is also known for presenting the first durable colour photograph in 1861, and for showing that any colour can be produced with a mixture of the three primary colours red, green, and blue, the basis for colour television. He also worked on analysing the rigidity of rod-and-joint frameworks (trusses) like those in many bridges. He devised modern dimensional analysis and helped to establish the CGS system of measurement. He is credited with being the first to understand chaos and the first to emphasize the butterfly effect. He correctly proposed that the rings of Saturn were made up of many unattached small fragments. His 1868 paper On Governors serves as an important foundation for control theory and cybernetics, and was also the earliest mathematical analysis of control systems. In 1867, he proposed the thought experiment known as Maxwell's demon. In his seminal 1867 paper On the Dynamical Theory of Gases he introduced the Maxwell model for describing the behaviour of a viscoelastic material and originated the Maxwell–Cattaneo equation for describing the transport of heat in a medium.
His discoveries helped usher in the era of modern physics, laying the foundations for such fields as relativity (he introduced the term into physics) and quantum mechanics. Many physicists regard Maxwell as the 19th-century scientist having the greatest influence on 20th-century physics. His contributions to the science are considered by many to be of the same magnitude as those of Isaac Newton and Albert Einstein. On the centenary of Maxwell's birth, his work was described by
Einstein as the "most profound and the most fruitful that physics has
experienced since the time of Newton". When Einstein visited the University of Cambridge
in 1922, he was told by his host that he had done great things because
he stood on Newton's shoulders; Einstein replied: "No I don't. I stand
on the shoulders of Maxwell." Tom Siegfried described Maxwell as "one of those once-in-a-century
geniuses who perceived the physical world with sharper senses than those
around him".
James Clerk Maxwell was born on 13 June 1831 at 14 India Street, Edinburgh, to John Clerk Maxwell of Middlebie, an advocate, and Frances Cay, daughter of Robert Hodshon Cay and sister of John Cay. (His birthplace now houses a museum operated by the James Clerk Maxwell Foundation.) His father was a man of comfortable means, of the Clerk family of Penicuik, holders of the baronetcy of Clerk of Penicuik; his father's brother was the 6th baronet. He had been born "John Clerk", adding the surname "Maxwell" to his own after he inherited (as an infant in 1793) the Middlebie estate, a Maxwell property in Dumfriesshire. James was a first cousin of both the artist Jemima Blackburn (the daughter of his father's sister) and the civil engineer William Dyce Cay (the son of his mother's brother). Cay and Maxwell were close friends, and Cay acted as his best man when Maxwell married.
Maxwell's parents met and married when they were well into their thirties; his mother was nearly 40 when he was born. They had had one earlier child, a daughter named Elizabeth, who died in infancy.
When Maxwell was young his family moved to Glenlair House, which his parents had built on their 1,500-acre (610 ha) estate in Kirkcudbrightshire. All indications suggest that Maxwell displayed an unquenchable curiosity from an early age. By the age of three, everything that moved, shone, or made a noise drew the question: "what's the go o' that?" In a passage added to a letter from his father to his sister-in-law
Jane Cay in 1834, his mother described this innate sense of
inquisitiveness:
He is a very happy man, and has
improved much since the weather got moderate; he has great work with
doors, locks, keys, etc., and "show me how it doos" is never out of his
mouth. He also investigates the hidden course of streams and bell-wires,
the way the water gets from the pond through the wall....
Education, 1839–1847
Recognising the boy's potential, Maxwell's mother Frances took responsibility for his early education, which in the Victorian era was largely the job of the woman of the house. At eight he could recite long passages of John Milton and the whole of the 119th psalm (176 verses). Indeed, his knowledge of scripture was already detailed; he could give chapter and verse for almost any quotation from the Psalms. His mother was taken ill with abdominal cancer
and, after an unsuccessful operation, died in December 1839 when he was
eight years old. His education was then overseen by his father and his
father's sister-in-law Jane, both of whom played pivotal roles in his
life. His formal schooling began unsuccessfully under the guidance of a
16-year-old hired tutor. Little is known about the young man hired to
instruct Maxwell, except that he treated the younger boy harshly,
chiding him for being slow and wayward. The tutor was dismissed in November 1841. James' father took him to Robert Davidson's
demonstration of electric propulsion and magnetic force on 12 February
1842, an experience with profound implications for the boy.
In 1841, at age ten, Maxwell was sent to the prestigious Edinburgh Academy. He lodged during term times at the house of his aunt Isabella. During
this time his passion for drawing was encouraged by his older cousin
Jemima. The young Maxwell, having been raised in isolation on his father's countryside estate, did not fit in well at school. The first year had been full, obliging him to join the second year with classmates a year his senior. His mannerisms and Galloway
accent struck the other boys as rustic. Having arrived on his first day
of school wearing a pair of homemade shoes and a tunic, he earned the
unkind nickname of "Daftie". He never seemed to resent the epithet, bearing it without complaint for many years. Social isolation at the Academy ended when he met Lewis Campbell and Peter Guthrie Tait, two boys of a similar age who were to become notable scholars later in life. They remained lifelong friends.
Maxwell was fascinated by geometry at an early age, rediscovering the regular polyhedra before he received any formal instruction. Despite his winning the school's scripture biography prize in his second year, his academic work remained unnoticed until, at the age of 13, he won the school's mathematical medal and first prize for both English and poetry.
Maxwell's interests ranged far beyond the school syllabus and he did not pay particular attention to examination performance. He wrote his first scientific paper at the age of 14. In it, he described a mechanical means of drawing mathematical curves with a piece of twine, and the properties of ellipses, Cartesian ovals, and related curves with more than two foci. The work, of 1846, "On the description of oval curves and those having a plurality of foci" was presented to the Royal Society of Edinburgh by James Forbes, a professor of natural philosophy at the University of Edinburgh, because Maxwell was deemed too young to present the work himself. The work was not entirely original, since René Descartes had also examined the properties of such multifocal ellipses in the 17th century, but Maxwell had simplified their construction.
University of Edinburgh, 1847–1850
Old College, University of Edinburgh
Maxwell left the Academy in 1847 at age 16 and began attending classes at the University of Edinburgh. He had the opportunity to attend the University of Cambridge,
but decided, after his first term, to complete the full course of his
undergraduate studies at Edinburgh. The academic staff of the university
included some highly regarded names; his first-year tutors included Sir William Hamilton, who lectured him on logic and metaphysics, Philip Kelland on mathematics, and James Forbes on natural philosophy. He did not find his classes demanding, and was, therefore, able to immerse himself in private study during
free time at the university and particularly when back home at Glenlair. There he would experiment with improvised chemical, electric, and
magnetic apparatus; however, his chief concerns regarded the properties
of polarised light. He constructed shaped blocks of gelatine, subjected them to various stresses, and with a pair of polarising prisms given to him by William Nicol, viewed the coloured fringes that had developed within the jelly. Through this practice he discovered photoelasticity, which is a means of determining the stress distribution within physical structures.
At age 18, Maxwell contributed two papers for the Transactions of the Royal Society of Edinburgh.
One of these, "On the Equilibrium of Elastic Solids", laid the
foundation for an important discovery later in his life, which was the
temporary double refraction produced in viscous liquids by shear stress. His other paper was "Rolling Curves" and, just as with the paper "Oval
Curves" that he had written at the Edinburgh Academy, he was again
considered too young to stand at the rostrum to present it himself. The
paper was delivered to the Royal Society by his tutor Kelland instead.
In October 1850, already an accomplished mathematician, Maxwell left Scotland for the University of Cambridge. He initially attended Peterhouse, but before the end of his first term transferred to Trinity, where he believed it would be easier to obtain a fellowship. At Trinity he was elected to the elite secret society known as the Cambridge Apostles. Maxwell's intellectual understanding of his Christian faith and of
science grew rapidly during his Cambridge years. He joined the
"Apostles", an exclusive debating society of the intellectual elite,
where through his essays he sought to work out this understanding.
Now my great plan, which was
conceived of old, ... is to let nothing be wilfully left unexamined.
Nothing is to be holy ground consecrated to Stationary Faith, whether
positive or negative. All fallow land is to be ploughed up and a regular
system of rotation followed. ... Never hide anything, be it weed or no,
nor seem to wish it hidden. ... Again I assert the Right of Trespass on
any plot of Holy Ground which any man has set apart. ... Now I am
convinced that no one but a Christian can actually purge his land of
these holy spots. ... I do not say that no Christians have enclosed
places of this sort. Many have a great deal, and every one has some. But
there are extensive and important tracts in the territory of the
Scoffer, the Pantheist, the Quietist, Formalist, Dogmatist, Sensualist,
and the rest, which are openly and solemnly Tabooed. ..."
Christianity—that is, the religion of the Bible—is the only
scheme or form of belief which disavows any possessions on such a
tenure. Here alone all is free. You may fly to the ends of the world and
find no God but the Author of Salvation. You may search the Scriptures
and not find a text to stop you in your explorations. ...
The Old Testament and the Mosaic Law and Judaism are commonly supposed
to be "Tabooed" by the orthodox. Sceptics pretend to have read them and
have found certain witty objections ... which too many of the orthodox
unread admit, and shut up the subject as haunted. But a Candle is coming
to drive out all Ghosts and Bugbears. Let us follow the light.
In the summer of his third year, Maxwell spent some time at the Suffolk home of the Rev. C. B. Tayler,
the uncle of a classmate, G. W. H. Tayler. The love of God shown by the
family impressed Maxwell, particularly after he was nursed back from
ill health by the minister and his wife.
On his return to Cambridge, Maxwell writes to his recent host a
chatty and affectionate letter including the following testimony,
... I have the capacity of being
more wicked than any example that man could set me, and ... if I escape,
it is only by God's grace helping me to get rid of myself, partially in
science, more completely in society, —but not perfectly except by
committing myself to God ...
In November 1851, Maxwell studied under William Hopkins, whose success in nurturing mathematical genius had earned him the nickname of "senior wrangler-maker".
In 1854, Maxwell graduated from Trinity with a degree in
mathematics. He scored second highest in the final examination, coming
behind Edward Routh and earning himself the title of Second Wrangler. He was later declared equal with Routh in the more exacting ordeal of the Smith's Prize examination. Immediately after earning his degree, Maxwell read his paper "On the Transformation of Surfaces by Bending" to the Cambridge Philosophical Society. This is one of the few purely mathematical papers he had written, demonstrating his growing stature as a mathematician. Maxwell decided to remain at Trinity after graduating and applied for a
fellowship, which was a process that he could expect to take a couple
of years. Buoyed by his success as a research student, he would be free, apart
from some tutoring and examining duties, to pursue scientific interests
at his own leisure.
The nature and perception of colour was one such interest which
he had begun at the University of Edinburgh while he was a student of
Forbes. With the coloured spinning tops invented by Forbes, Maxwell was able to demonstrate that white light would result from a mixture of red, green, and blue light. His paper "Experiments on Colour" laid out the principles of colour
combination and was presented to the Royal Society of Edinburgh in March
1855. Maxwell was this time able to deliver it himself.
Maxwell was made a fellow of Trinity on 10 October 1855, sooner than was the norm, and was asked to prepare lectures on hydrostatics and optics and to set examination papers. The following February he was urged by Forbes to apply for the newly vacant Chair of Natural Philosophy at Marischal College, Aberdeen. His father assisted him in the task of preparing the necessary
references, but died on 2 April at Glenlair before either knew the
result of Maxwell's candidacy. He accepted the professorship at Aberdeen, leaving Cambridge in November 1856.
Marischal College, Aberdeen, 1856–1860
Maxwell proved that the rings of Saturn were made of numerous small particles.
The 25-year-old Maxwell was a good 15 years younger than any other
professor at Marischal. He engaged himself with his new responsibilities
as head of a department, devising the syllabus and preparing lectures. He committed himself to lecturing 15 hours a week, including a weekly pro bono lecture to the local working men's college. He lived in Aberdeen with his cousin William Dyce Cay,
a Scottish civil engineer, during the six months of the academic year
and spent the summers at Glenlair, which he had inherited from his
father.
Later, his former student described Maxwell as follows:
In the late 1850s shortly before 9 am any winter’s
morning you might well have seen the young James Clerk Maxwell, in his
mid to late 20s, a man of middling height, with frame strongly knit, and
a certain spring and elasticity in his gait; dressed for comfortable
ease rather than elegance; a face expressive at once of sagacity and
good humour, but overlaid with a deep shade of thoughtfulness; features
boldly but pleasingly marked; eyes dark and glowing; hair and beard
perfectly black, and forming a strong contrast to the pallor of his
complexion.
He focused his attention on a problem that had eluded scientists for 200 years: the nature of Saturn's rings. It was unknown how they could remain stable without breaking up, drifting away or crashing into Saturn. The problem took on a particular resonance at that time because St John's College, Cambridge, had chosen it as the topic for the 1857 Adams Prize. Maxwell devoted two years to studying the problem, proving that a
regular solid ring could not be stable, while a fluid ring would be
forced by wave action to break up into blobs. Since neither was
observed, he concluded that the rings must be composed of numerous small
particles he called "brick-bats", each independently orbiting Saturn. Maxwell was awarded the £130 Adams Prize in 1859 for his essay "On the stability of the motion of Saturn's rings"; he was the only entrant to have made enough headway to submit an entry. His work was so detailed and convincing that when George Biddell Airy read it he commented, "It is one of the most remarkable applications of mathematics to physics that I have ever seen." It was considered the final word on the issue until direct observations by the Voyager flybys of the 1980s confirmed Maxwell's prediction that the rings were composed of particles. It is now understood, however, that the rings' particles are not
totally stable, being pulled by gravity onto Saturn. The rings are
expected to vanish entirely over the next 300 million years.
In 1857 Maxwell befriended the Reverend Daniel Dewar, who was then the Principal of Marischal. Through him Maxwell met Dewar's daughter, Katherine Mary Dewar.
They were engaged in February 1858 and married in Aberdeen on 2 June
1858. On the marriage record, Maxwell is listed as Professor of Natural
Philosophy in Marischal College, Aberdeen. Katherine was seven years Maxwell's senior. Comparatively little is
known of her, although it is known that she helped in his lab and worked
on experiments in viscosity. Maxwell's biographer and friend, Lewis Campbell, adopted an
uncharacteristic reticence on the subject of Katherine, though
describing their married life as "one of unexampled devotion".
In 1860 Marischal College merged with the neighbouring King's College to form the University of Aberdeen.
There was no room for two professors of Natural Philosophy, so Maxwell,
despite his scientific reputation, found himself laid off. He was
unsuccessful in applying for Forbes's recently vacated chair at
Edinburgh, the post instead going to Tait. Maxwell was granted the Chair of Natural Philosophy at King's College, London, instead. After recovering from a near-fatal bout of smallpox in 1860, he moved to London with his wife.
King's College, London, 1860–1865
Commemoration of Maxwell's equations at King's College. Two identical IEEE Milestone Plaques are at Maxwell's birthplace in Edinburgh and the family home at Glenlair.
Maxwell's time at King's was probably the most productive of his career. He was awarded the Royal Society's Rumford Medal in 1860 for his work on colour and was later elected to the Society in 1861. This period of his life would see him display the world's first light-fast colour photograph, further develop his ideas on the viscosity of gases, and propose a system of defining physical quantities, now known as dimensional analysis. Maxwell would often attend lectures at the Royal Institution, where he came into regular contact with Michael Faraday.
The relationship between the two men could not be described as close,
because Faraday was 40 years Maxwell's senior and showed signs of senility. They nevertheless maintained a strong respect for each other's talents.
This time is especially noteworthy for the advances Maxwell made
in the fields of electricity and magnetism. He examined the nature of
both electric and magnetic fields in his two-part paper "On physical lines of force", which was published in 1861. In it, he provided a conceptual model for electromagnetic induction, consisting of tiny spinning cells of magnetic flux.
Two further parts were later added to the paper and published in early 1862. In the first additional part, he discussed the nature of electrostatics and displacement current. In the second additional part, he dealt with the rotation of the plane of the polarisation of light in a magnetic field, a phenomenon that had been discovered by Faraday and is now known as the Faraday effect.
Later years, 1865–1879
In 1865 Maxwell resigned the chair at King's College, London, and
returned to Glenlair with Katherine. In his paper "On governors" (1868)
he mathematically described the behaviour of governors—devices that control the speed of steam engines—thereby establishing the theoretical basis of control engineering. In his paper "On reciprocal figures, frames and diagrams of forces"
(1870) he discussed the rigidity of various designs of lattice. He wrote the textbook Theory of Heat (1871) and the treatise Matter and Motion (1876). Maxwell was also the first to make explicit use of dimensional analysis, in 1871.
Maxwell has been credited as being the first to grasp the concept of chaos, as he acknowledged the significance of systems that exhibit "sensitive dependence on initial conditions". He was also the first to emphasize what is now called the "butterfly effect", in two discussions during the 1870s.
In 1871 he returned to Cambridge to become the first Cavendish Professor of Physics. Maxwell was put in charge of the development of the Cavendish Laboratory, supervising every step in the progress of the building and of the purchase of the collection of apparatus. One of Maxwell's last great contributions to science was the editing (with copious original notes) of the research of Henry Cavendish, from which it appeared that Cavendish researched, amongst other things, such questions as the density of the Earth and the composition of water. He was elected as a member to the American Philosophical Society in 1876.
Death
The gravestone at Parton Kirk (Galloway) of James Clerk Maxwell, his parents and his wife
In April 1879 Maxwell began to have difficulty in swallowing, the first symptom of his fatal illness.
Maxwell died in Cambridge of abdominal cancer on 5 November 1879 at the age of 48. His mother had died at the same age of the same type of cancer. The minister who regularly visited him in his last weeks was astonished
at his lucidity and the immense power and scope of his memory, but
comments more particularly,
... his illness drew out the whole
heart and soul and spirit of the man: his firm and undoubting faith in
the Incarnation and all its results; in the full sufficiency of the
Atonement; in the work of the Holy Spirit. He had gauged and fathomed
all the schemes and systems of philosophy, and had found them utterly
empty and unsatisfying—"unworkable" was his own word about them—and he
turned with simple faith to the Gospel of the Saviour.
As death approached Maxwell told a Cambridge colleague,
I have been thinking how very
gently I have always been dealt with. I have never had a violent shove
all my life. The only desire which I can have is like David to serve my
own generation by the will of God, and then fall asleep.
Maxwell is buried at Parton Kirk, near Castle Douglas in Galloway close to where he grew up. The extended biography The Life of James Clerk Maxwell, by his former schoolfellow and lifelong friend Professor Lewis Campbell, was published in 1882. His collected works were issued in two volumes by the Cambridge University Press in 1890.
The executors of Maxwell's estate were his physician George Edward Paget, G. G. Stokes, and Colin Mackenzie, who was Maxwell's cousin. Overburdened with work, Stokes passed Maxwell's papers to William Garnett, who had effective custody of the papers until about 1884.
There is a memorial inscription to him near the choir screen at Westminster Abbey.
As a great lover of Scottish poetry, Maxwell memorised poems and wrote his own. The best known is Rigid Body Sings, closely based on "Comin' Through the Rye" by Robert Burns, which he apparently used to sing while accompanying himself on a guitar. It has the opening lines
Gin a body meet a body
Flyin' through the air.
Gin a body hit a body,
Will it fly? And where?
A collection of his poems was published by his friend Lewis Campbell in 1882.
Descriptions of Maxwell remark that his exceptional intellectual qualities were matched by social awkwardness.
Maxwell wrote the following aphorism for his own conduct as a scientist:
He
that would enjoy life and act with freedom must have the work of the
day continually before his eyes. Not yesterday's work, lest he fall into
despair, not to-morrow's, lest he become a visionary—not that which
ends with the day, which is a worldly work, nor yet that only which
remains to eternity, for by it he cannot shape his action. Happy is the
man who can recognize in the work of to-day a connected portion of the
work of life, and an embodiment of the work of eternity. The foundations
of his confidence are unchangeable, for he has been made a partaker of
Infinity. He strenuously works out his daily enterprises, because the
present is given him for a possession.
Maxwell was an evangelical Presbyterian and in his later years became an Elder of the Church of Scotland. Maxwell's religious beliefs and related activities have been the focus of a number of papers. Attending both Church of Scotland (his father's denomination) and Episcopalian (his mother's denomination) services as a child, Maxwell underwent an evangelical conversion in April 1853. One facet of this conversion may have aligned him with an antipositivist position.
Scientific legacy
Recognition
In a survey of the 100 most prominent physicists conducted by Physics World, Maxwell was voted the third greatest physicist of all time, behind only Newton and Einstein. Another survey of rank-and-file physicists by PhysicsWeb also ranked him third.
Maxwell had studied and commented on electricity and magnetism as
early as 1855 when his paper "On Faraday's lines of force" was read to
the Cambridge Philosophical Society. The paper presented a simplified model of Faraday's work and how
electricity and magnetism are related. He reduced all of the current
knowledge into a linked set of differential equations with 20 equations in 20 variables. This work was later published as "On Physical Lines of Force" in March 1861.
Around 1862, while lecturing at King's College, Maxwell
calculated that the speed of propagation of an electromagnetic field is
approximately that of the speed of light.
He considered this to be more than just a coincidence, commenting, "We
can scarcely avoid the conclusion that light consists in the transverse
undulations of the same medium which is the cause of electric and
magnetic phenomena."
Working on the problem further, Maxwell showed that the equations predict the existence of waves of oscillating electric and magnetic fields
that travel through empty space at a speed that could be predicted from
simple electrical experiments; using the data available at the time,
Maxwell obtained a velocity of 310,740,000 metres per second (1.0195×10⁹ ft/s). In his 1865 paper "A Dynamical Theory of the Electromagnetic Field",
Maxwell wrote, "The agreement of the results seems to show that light
and magnetism are affections of the same substance, and that light is an
electromagnetic disturbance propagated through the field according to
electromagnetic laws".
His famous twenty equations, in their modern form of partial differential equations, first appeared in fully developed form in his textbook A Treatise on Electricity and Magnetism in 1873. Most of this work was done by Maxwell at Glenlair during the period
between holding his London post and his taking up the Cavendish chair. Oliver Heaviside reduced the complexity of Maxwell's theory down to four partial differential equations, known now collectively as Maxwell's Laws or Maxwell's equations. Although potentials became much less popular in the nineteenth century, the use of scalar and vector potentials is now standard in the solution of Maxwell's equations. His work achieved the second great unification in physics.
As Barrett and Grimes (1995) describe:
Maxwell expressed electromagnetism in the algebra of quaternions
and made the electromagnetic potential the centerpiece of his theory.
In 1881 Heaviside replaced the electromagnetic potential field by force
fields as the centerpiece of electromagnetic theory. According to
Heaviside, the electromagnetic potential field was arbitrary and needed
to be "assassinated". (sic) A few years later there was a debate between Heaviside and [Peter Guthrie] Tate (sic) about the relative merits of vector analysis and quaternions. The result was the realization that there was no need for the greater physical insights provided by quaternions if the theory was purely local, and vector analysis became commonplace.
Maxwell was proved correct, and his quantitative connection between
light and electromagnetism is considered one of the great
accomplishments of 19th-century mathematical physics.
Maxwell also introduced the concept of the electromagnetic field in comparison to force lines that Faraday described. By understanding the propagation of electromagnetism as a field emitted
by active particles, Maxwell could advance his work on light. At that
time, Maxwell believed that the propagation of light required a medium
for the waves, dubbed the luminiferous aether. Over time, the existence of such a medium, permeating all space and yet
apparently undetectable by mechanical means, proved impossible to
reconcile with experiments such as the Michelson–Morley experiment. Moreover, it seemed to require an absolute frame of reference
in which the equations were valid, with the distasteful result that the
equations changed form for a moving observer. These difficulties
inspired Albert Einstein to formulate the theory of special relativity; in the process, Einstein dispensed with the requirement of a stationary luminiferous aether.
Einstein acknowledged the groundbreaking work of Maxwell, stating that:
One scientific epoch ended and another began with James Clerk Maxwell.
He also acknowledged the influence that his work had on his relativity theory:
The special theory of relativity owes its origins to Maxwell's equations of the electromagnetic field.
Colour vision
First durable colour photographic image, demonstrated by Maxwell in an 1861 lecture
Along with most physicists of the time, Maxwell had a strong interest in psychology. Following in the steps of Isaac Newton and Thomas Young, he was particularly interested in the study of colour vision. From 1855 to 1872, Maxwell published at intervals a series of investigations concerning the perception of colour, colour-blindness, and colour theory, and was awarded the Rumford Medal for "On the Theory of Colour Vision".
Isaac Newton had demonstrated, using prisms, that white light, such as sunlight, is composed of a number of monochromatic components which could then be recombined into white light. Newton also showed that an orange paint made of yellow and red could
look exactly like a monochromatic orange light, although being composed
of two monochromatic yellow and red lights. Hence the paradox that
puzzled physicists of the time: two complex lights (composed of more
than one monochromatic light) could look alike but be physically
different, called metamers. Thomas Young
later proposed that this paradox could be explained by colours being
perceived through a limited number of channels in the eyes, which he
proposed to be threefold, the trichromatic colour theory. Maxwell used the recently developed linear algebra
to prove Young's theory. Any monochromatic light stimulating three
receptors should be able to be equally stimulated by a set of three
different monochromatic lights (in fact, by any set of three different
lights). He demonstrated that to be the case, inventing colour-matching experiments and colourimetry.
Maxwell was also interested in applying his theory of colour perception, namely in colour photography.
Stemming directly from his psychological work on colour perception: if a
sum of any three lights could reproduce any perceivable colour, then
colour photographs could be produced with a set of three coloured
filters. In the course of his 1855 paper, Maxwell proposed that, if
three black-and-white photographs of a scene were taken through red, green, and blue filters,
and transparent prints of the images were projected onto a screen using
three projectors equipped with similar filters, when superimposed on
the screen the result would be perceived by the human eye as a complete
reproduction of all the colours in the scene.
During an 1861 Royal Institution lecture on colour theory,
Maxwell presented the world's first demonstration of colour photography
by this principle of three-colour analysis and synthesis. Thomas Sutton, inventor of the single-lens reflex camera, took the picture. He photographed a tartan
ribbon three times, through red, green, and blue filters, also making a
fourth photograph through a yellow filter, which, according to
Maxwell's account, was not used in the demonstration. Because Sutton's photographic plates
were insensitive to red and barely sensitive to green, the results of
this pioneering experiment were far from perfect. It was remarked in the
published account of the lecture that "if the red and green images had
been as fully photographed as the blue", it "would have been a
truly-coloured image of the riband. By finding photographic materials
more sensitive to the less refrangible rays, the representation of the
colours of objects might be greatly improved." Researchers in 1961 concluded that the seemingly impossible partial success of the red-filtered exposure was due to ultraviolet
light, which is strongly reflected by some red dyes, not entirely
blocked by the red filter used, and within the range of sensitivity of
the wet collodion process Sutton employed.
Maxwell's demon, a thought experiment where entropy decreases
Maxwell also investigated the kinetic theory of gases. Originating with Daniel Bernoulli, this theory was advanced by the successive labours of John Herapath, John James Waterston, James Joule, and particularly Rudolf Clausius,
to such an extent as to put its general accuracy beyond a doubt; but it
received enormous development from Maxwell, who in this field appeared
as an experimenter (on the laws of gaseous friction) as well as a
mathematician.
Between 1859 and 1866, he developed the theory of the
distributions of velocities in particles of a gas, work later
generalised by Ludwig Boltzmann. The formula, called the Maxwell–Boltzmann distribution, gives the fraction of gas molecules moving at a specified velocity at any given temperature. In the kinetic theory,
temperatures and heat involve only molecular movement. This approach
generalised the previously established laws of thermodynamics and
explained existing observations and experiments in a better way than had
been achieved previously. His work on thermodynamics led him to devise the thought experiment that came to be known as Maxwell's demon, where the second law of thermodynamics is violated by an imaginary being capable of sorting particles by energy.
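In modern notation, the distribution Maxwell derived gives the probability density of molecular speeds. The short sketch below evaluates that standard form directly; the particle mass and temperature used are illustrative assumptions (roughly a nitrogen molecule at room temperature), not values from the source.

```python
import math

def maxwell_boltzmann_speed_pdf(v, m, T, k_B=1.380649e-23):
    """Probability density of molecular speed v (m/s) for particles of mass m (kg)
    at absolute temperature T (K):
    f(v) = 4*pi * (m / (2*pi*k*T))**1.5 * v**2 * exp(-m*v**2 / (2*k*T))."""
    a = m / (2.0 * k_B * T)
    return 4.0 * math.pi * (a / math.pi) ** 1.5 * v ** 2 * math.exp(-a * v ** 2)

# Illustrative evaluation: speed 500 m/s, nitrogen-like mass, 300 K.
print(maxwell_boltzmann_speed_pdf(500.0, m=4.65e-26, T=300.0))
```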
In his 1867 paper On the Dynamical Theory of Gases he introduced the Maxwell model for describing the behavior of a viscoelastic material and originated the Maxwell-Cattaneo equation for describing the transport of heat in a medium.
Peter Guthrie Tait called Maxwell the "leading molecular scientist" of his time. Another person added after Maxwell's death that "only one man lived who
could understand Gibbs's papers. That was Maxwell, and now he is dead."
Maxwell published the paper "On governors" in the Proceedings of the Royal Society, vol. 16 (1867–1868). This paper is considered a central paper of the early days of control theory. Here "governors" refers to the governor or the centrifugal governor used to regulate steam engines.
Example
graph of a logistic regression curve fitted to data. The curve shows
the estimated probability of passing an exam (binary dependent variable)
versus hours studying (scalar independent variable). See § Example for worked details.
In statistics, a logistic model (or logit model) is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (the coefficients in the linear or non-linear combinations). In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.
Binary variables are widely used in statistics to model the
probability of a certain class or event taking place, such as the
probability of a team winning, of a patient being healthy, etc. (see § Applications), and the logistic model has been the most commonly used model for binary regression since about 1970. Binary variables can be generalized to categorical variables
when there are more than two possible values (e.g. whether an image is
of a cat, dog, lion, etc.), and the binary logistic regression
generalized to multinomial logistic regression. If the multiple categories are ordered, one can use the ordinal logistic regression (for example the proportional odds ordinal logistic model). See § Extensions
for further extensions. The logistic regression model itself simply
models probability of output in terms of input and does not perform statistical classification
(it is not a classifier), though it can be used to make a classifier,
for instance by choosing a cutoff value and classifying inputs with
probability greater than the cutoff as one class, below the cutoff as
the other; this is a common way to make a binary classifier.
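A minimal sketch of that thresholding step; the cutoff of 0.5 is an assumption made for illustration, not part of the regression model itself.

```python
def classify(probability, cutoff=0.5):
    """Turn a predicted probability into a class label by thresholding."""
    return 1 if probability >= cutoff else 0

print(classify(0.73))        # 1
print(classify(0.73, 0.8))   # 0 -- a stricter cutoff changes the decision
```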
Analogous linear models for binary variables with a different sigmoid function instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the probit model; see § Alternatives.
The defining characteristic of the logistic model is that increasing
one of the independent variables multiplicatively scales the odds of the
given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio. More abstractly, the logistic function is the natural parameter for the Bernoulli distribution, and in this sense is the "simplest" way to convert a real number to a probability.
The parameters of a logistic regression are most commonly estimated by maximum-likelihood estimation (MLE). This does not have a closed-form expression, unlike linear least squares; see § Model fitting. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by ordinary least squares (OLS) plays for scalar responses: it is a simple, well-analyzed baseline model; see § Comparison with linear regression for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson, beginning in Berkson (1944), where he coined "logit"; see § History.
Applications
General
Logistic
regression is used in various fields, including machine learning, most
medical fields, and social sciences. For example, the Trauma and Injury
Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess severity of a patient have been developed using logistic regression. Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.). Another example might be to predict whether a Nepalese voter will vote
Nepali Congress or Communist Party of Nepal or for any other party,
based on age, income, sex, race, state of residence, votes in previous
elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product. It is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc. In economics,
it can be used to predict the likelihood of a person ending up in the
labor force, and a business application would be to predict the
likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes. These models help in the development of reliable disaster management plans and safer designs for the built environment.
Supervised machine learning
Logistic regression is a supervised machine learning algorithm widely used for binary classification
tasks, such as identifying whether an email is spam or not and
diagnosing diseases by assessing the presence or absence of specific
conditions based on patient test results. This approach utilizes the
logistic (or sigmoid) function to transform a linear combination of
input features into a probability value ranging between 0 and 1. This
probability indicates the likelihood that a given input corresponds to
one of two predefined categories. The essential mechanism of logistic
regression is grounded in the logistic function's ability to model the
probability of binary outcomes accurately. With its distinctive S-shaped
curve, the logistic function effectively maps any real-valued number to
a value within the 0 to 1 interval. This feature renders it
particularly suitable for binary classification tasks, such as sorting
emails into "spam" or "not spam". By calculating the probability that
the dependent variable will be categorized into a specific group,
logistic regression provides a probabilistic framework that supports
informed decision-making.
Example
Problem
As
a simple example, we can use a logistic regression with one explanatory
variable and two categories to answer the following question:
A group of 20 students spends between 0 and 6 hours studying for an
exam. How does the number of hours spent studying affect the probability
of the student passing the exam?
The reason for using logistic regression for this problem is that the
values of the dependent variable, pass and fail, while represented by
"1" and "0", are not cardinal numbers. If the problem was changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple regression analysis could be used.
The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).
Hours (xk): 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50
Pass (yk):  0,    0,    0,    0,    0,    0,    1,    0,    1,    0,    1,    0,    1,    0,    1,    1,    1,    1,    1,    1
We wish to fit a logistic function to the data consisting of the hours studied (xk) and the outcome of the test (yk = 1 for pass, 0 for fail). The data points are indexed by the subscript k, which runs from k = 1 to k = 20. The x variable is called the "explanatory variable", and the y variable is called the "categorical variable" consisting of two categories: "pass" or "fail", corresponding to the categorical values 1 and 0 respectively.
Model
Graph of a logistic regression curve fitted to the (xk, yk) data. The curve shows the probability of passing an exam versus hours studying.
The logistic function is of the form:

p(x) = 1 / (1 + e^(−(x − μ)/s)),

where μ is a location parameter (the midpoint of the curve, where p(μ) = 1/2) and s is a scale parameter. This expression may be rewritten as:

p(x) = 1 / (1 + e^(−(β0 + β1 x))),

where β0 = −μ/s is known as the intercept (it is the vertical intercept or y-intercept of the line t = β0 + β1 x), and β1 = 1/s (inverse scale parameter or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, μ = −β0/β1 and s = 1/β1.
Note that this model is actually an oversimplification, since it assumes everybody will pass if they study long enough (the limit of p(x) is 1 as x grows).
Fit
The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given xk and yk, write pk = p(xk). The pk are the probabilities that the corresponding yk will equal one and 1 − pk are the probabilities that they will be zero (see Bernoulli distribution). We wish to find the values of β0 and β1 which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (yk), the squared error loss, is taken as a measure of the goodness of fit, and the best fit is obtained when that function is minimized.
The log loss for the k-th point is:

−ln(pk) if yk = 1,  and −ln(1 − pk) if yk = 0.
The log loss can be interpreted as the "surprisal" of the actual outcome yk relative to the prediction pk, and is a measure of information content. Log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when pk = 1 and yk = 1, or pk = 0 and yk = 0), and approaches infinity as the prediction gets worse (i.e., when yk = 1 and pk → 0, or yk = 0 and pk → 1), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any points, since yk is either 0 or 1, while 0 < pk < 1.
These can be combined into a single expression:

−yk ln(pk) − (1 − yk) ln(1 − pk).

This expression is more formally known as the cross-entropy of the predicted distribution (pk, 1 − pk) from the actual distribution (yk, 1 − yk), as probability distributions on the two-element space of (pass, fail).
The sum of these, the total loss, is the overall negative log-likelihood −ℓ, and the best fit is obtained for those choices of β0 and β1 for which −ℓ is minimized.
Alternatively, instead of minimizing the loss, one can maximize its inverse, the (positive) log-likelihood:

ℓ = Σk [ yk ln(pk) + (1 − yk) ln(1 − pk) ],

or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:

L = Πk pk^(yk) (1 − pk)^(1 − yk).
Since ℓ is nonlinear in β0 and β1, determining their optimum values will require numerical methods. One method of maximizing ℓ is to require the derivatives of ℓ with respect to β0 and β1 to be zero:

0 = ∂ℓ/∂β0 = Σk (yk − pk),
0 = ∂ℓ/∂β1 = Σk (yk − pk) xk,

and the maximization procedure can be accomplished by solving the above two equations for β0 and β1, which, again, will generally require the use of numerical methods.
The values of β0 and β1 which maximize ℓ and L using the above data are found to be approximately:

β0 ≈ −4.1,  β1 ≈ 1.5

(consistent with the model evaluation table below), which yields a value for μ and s of:

μ = −β0/β1 ≈ 2.7,  s = 1/β1 ≈ 0.67.
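As a sketch of how such a numerical maximization can be carried out, the following uses Newton–Raphson iteration on the log-likelihood (one standard choice; it is not claimed to be the method used to produce the published figures, and the exact estimates depend on the convergence tolerance).

```python
import numpy as np

# Hours studied and pass/fail outcomes from the table above.
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X = np.column_stack([np.ones_like(hours), hours])   # add an intercept column
beta = np.zeros(2)                                   # start at beta0 = beta1 = 0

# Newton-Raphson maximisation of the log-likelihood.
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))              # current fitted probabilities
    gradient = X.T @ (passed - p)                    # d(log-likelihood)/d(beta)
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information matrix
    beta = beta + np.linalg.solve(hessian, gradient)

b0, b1 = beta
print(f"beta0 ~ {b0:.2f}, beta1 ~ {b1:.2f}")      # roughly -4.1 and 1.5
print(f"mu ~ {-b0 / b1:.2f}, s ~ {1 / b1:.2f}")   # roughly 2.7 and 0.67
```

Running this sketch should converge within a few iterations to estimates close to the rounded values quoted above.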
Predictions
The β0 and β1 coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam.

For example, for a student who studies 2 hours, entering the value x = 2 into the equation gives the estimated probability of passing the exam of 0.25:

t = −4.1 + 1.5 × 2 = −1.1,  p = 1 / (1 + e^1.1) ≈ 0.25.

Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87:

t = −4.1 + 1.5 × 4 = 1.9,  p = 1 / (1 + e^−1.9) ≈ 0.87.
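The same predictions can be reproduced in a few lines (a sketch using the rounded coefficients −4.1 and 1.5 from above).

```python
import math

beta0, beta1 = -4.1, 1.5   # rounded estimates from the fit above

def pass_probability(hours):
    """Estimated probability of passing after a given number of hours of study."""
    t = beta0 + beta1 * hours          # log-odds
    return 1.0 / (1.0 + math.exp(-t))  # logistic function

print(round(pass_probability(2), 2))   # ~0.25
print(round(pass_probability(4), 2))   # ~0.87
```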
This table shows the estimated probability of passing the exam for several values of hours studying.
Hours of study (x) | Log-odds of passing (t) | Odds of passing (e^t) | Probability of passing (p)
1                  | −2.57                   | 0.076 ≈ 1:13.1        | 0.07
2                  | −1.07                   | 0.34 ≈ 1:2.91         | 0.26
μ ≈ 2.7            | 0                       | 1                     | 1/2 = 0.50
3                  | 0.44                    | 1.55                  | 0.61
4                  | 1.94                    | 6.96                  | 0.87
5                  | 3.45                    | 31.4                  | 0.97
Model evaluation
The logistic regression analysis gives the following output.
               | Coefficient | Std. Error | z-value | p-value (Wald)
Intercept (β0) | −4.1        | 1.8        | −2.3    | 0.021
Hours (β1)     | 1.5         | 0.9        | 1.7     | 0.017
By the Wald test, the output indicates that hours studying is significantly associated with the probability of passing the exam (p = 0.017). Rather than the Wald method, the recommended method to calculate the p-value for logistic regression is the likelihood-ratio test (LRT), which for these data gives a considerably smaller p-value (see § Deviance and likelihood ratio tests below).
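A sketch of how the likelihood-ratio test could be computed for these data, comparing the fitted model against an intercept-only null model with one degree of freedom (a standard setup for this comparison; the rounded coefficients −4.1 and 1.5 are used as assumptions in place of a fresh fit).

```python
import numpy as np
from scipy.stats import chi2

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

def bernoulli_log_likelihood(p, y):
    """Sum of log-probabilities of the observed outcomes y under probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Fitted model (rounded coefficients from the example above).
p_fit = 1.0 / (1.0 + np.exp(-(-4.1 + 1.5 * hours)))
ll_fit = bernoulli_log_likelihood(p_fit, passed)

# Null model: intercept only, i.e. every student gets the overall pass rate.
p_null = np.full_like(hours, passed.mean())
ll_null = bernoulli_log_likelihood(p_null, passed)

lr_statistic = 2.0 * (ll_fit - ll_null)        # asymptotically chi-squared, 1 d.o.f.
print(lr_statistic, chi2.sf(lr_statistic, df=1))
```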
Generalizations
This
simple model is an example of binary logistic regression, and has one
explanatory variable and a binary categorical variable which can assume
one of two categorical values. Multinomial logistic regression
is the generalization of binary logistic regression to include any
number of explanatory variables and any number of categories.
Background
Figure 1. The standard logistic function σ(t); note that σ(t) ∈ (0, 1) for all t.
Definition of the logistic function
An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input t, and outputs a value between zero and one. For the logit, this is interpreted as taking input log-odds and having output probability. The standard logistic function σ(t) is defined as follows:

σ(t) = e^t / (e^t + 1) = 1 / (1 + e^(−t)).
A graph of the logistic function on the t-interval (−6,6) is shown in Figure 1.
Let us assume that t is a linear function of a single explanatory variable x (the case where t is a linear combination of multiple explanatory variables is treated similarly). We can then express t as follows:

t = β0 + β1 x.

And the general logistic function p(x) can now be written as:

p(x) = σ(t) = 1 / (1 + e^(−(β0 + β1 x))).

In the logistic model, p(x) is interpreted as the probability of the dependent variable equaling a success/case rather than a failure/non-case. It is clear that the response variables Yi are not identically distributed: the probability that Yi = 1 differs from one data point to another, though they are independent given the design matrix X and the shared parameters β.
Definition of the inverse of the logistic function
We can now define the logit (log odds) function as the inverse g = σ⁻¹ of the standard logistic function. It is easy to see that it satisfies:

g(p(x)) = logit p(x) = ln( p(x) / (1 − p(x)) ) = β0 + β1 x,

and equivalently, after exponentiating both sides we have the odds:

p(x) / (1 − p(x)) = e^(β0 + β1 x).
Interpretation of these terms
In the above equations, the terms are as follows:

g is the logit function. The equation for g(p(x)) illustrates that the logit (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression.
p(x) is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for p(x) illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability ranges between 0 and 1.
β0 is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero).
β1 x is the regression coefficient multiplied by some value of the predictor.
base e denotes the exponential function.
Definition of the odds
The odds of the dependent variable equaling a case (given some linear combination of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the logit
serves as a link function between the probability and the linear
regression expression. Given that the logit ranges between negative and
positive infinity, it provides an adequate criterion upon which to
conduct linear regression and the logit is easily converted back into
the odds.
So we define odds of the dependent variable equaling a case (given some linear combination of the predictors) as follows:

odds = e^(β0 + β1 x).
The odds ratio
For a continuous independent variable the odds ratio can be defined as:

OR = odds(x + 1) / odds(x) = e^(β0 + β1 (x + 1)) / e^(β0 + β1 x) = e^(β1).
In simple terms, if we hypothetically get an odds ratio of 2 to 1, we can say: "For every one-unit increase in hours studied, the odds of passing (group 1) or failing (group 0) are (expectedly) 2 to 1" (Denis, 2019).
This exponential relationship provides an interpretation for β1: the odds multiply by e^(β1) for every 1-unit increase in x.
For a binary independent variable the odds ratio is defined as ad/bc, where a, b, c and d are cells in a 2×2 contingency table.
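A quick numerical check of this interpretation, using the rounded coefficients from the exam example above (illustrative only).

```python
import math

beta0, beta1 = -4.1, 1.5   # coefficients from the worked example above

def odds(x):
    """Odds of the outcome at predictor value x: exp(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)

# The ratio of odds at x+1 to odds at x is the same for every x, and equals e^beta1.
print(odds(3.0) / odds(2.0))   # ~4.48
print(math.exp(beta1))         # ~4.48
```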
Multiple explanatory variables
If there are multiple explanatory variables, the above expression β0 + β1 x can be revised to β0 + β1 x1 + β2 x2 + ⋯ + βm xm. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a multiple regression with m explanators; the parameters βi for all i = 0, 1, 2, ..., m are all estimated.

Again, the more traditional equations are:

log( p / (1 − p) ) = β0 + β1 x1 + β2 x2 + ⋯ + βm xm

and

p = 1 / (1 + b^(−(β0 + β1 x1 + β2 x2 + ⋯ + βm xm))),

where usually b = e.
Definition
A dataset contains N points. Each point i consists of a set of m input variables x1,i ... xm,i (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable Yi (also known as a dependent variable,
response variable, output variable, or class), i.e. it can assume only
the two possible values 0 (often meaning "no" or "failure") or 1 (often
meaning "yes" or "success"). The goal of logistic regression is to use
the dataset to create a predictive model of the outcome variable.
As in linear regression, the outcome variables Yi are assumed to depend on the explanatory variables x1,i ... xm,i.
(Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables),
that is, separate explanatory variables taking the value 0 or 1 are
created for each possible value of the discrete variable, with a 1
meaning "variable does have the given value" and a 0 meaning "variable
does not have that value".)
Outcome variables
Formally, the outcomes Yi are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability pi that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

Yi | x1,i, ..., xm,i  ~  Bernoulli(pi)
E[ Yi | x1,i, ..., xm,i ] = pi
Pr( Yi = y | x1,i, ..., xm,i ) = pi if y = 1, and 1 − pi if y = 0
Pr( Yi = y | x1,i, ..., xm,i ) = pi^y (1 − pi)^(1 − y)
The meanings of these four lines are:
The first line expresses the probability distribution of each Yi : conditioned on the explanatory variables, it follows a Bernoulli distribution with parameters pi, the probability of the outcome of 1 for trial i.
As noted above, each separate trial has its own probability of success,
just as each trial has its own explanatory variables. The probability
of success pi is not observed, only the outcome of an individual Bernoulli trial using that probability.
The second line expresses the fact that the expected value of each Yi is equal to the probability of success pi,
which is a general property of the Bernoulli distribution. In other
words, if we run a large number of Bernoulli trials using the same
probability of success pi, then take the average of all the 1 and 0 outcomes, then the result would be close to pi.
This is because doing an average this way simply computes the
proportion of successes seen, which we expect to converge to the
underlying probability of success.
The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.
The fourth line is another way of writing the probability mass
function, which avoids having to write separate cases and is more
convenient for certain types of calculations. This relies on the fact
that Yi can take only the value 0 or 1. In
each case, one of the exponents will be 1, "choosing" the value under
it, while the other is 0, "canceling out" the value under it. Hence,
the outcome is either pi or 1 − pi, as in the previous line.
Linear predictor function
The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability pi using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear predictor function for a particular data point i is written as:
where β0, ..., βm are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.
The model is usually put into a more compact form as follows:
The regression coefficients β0, β1, ..., βm are grouped into a single vector β of size m + 1.
For each data point i, an additional explanatory pseudo-variable x0,i is added, with a fixed value of 1, corresponding to the intercept coefficient β0.
The resulting explanatory variables x0,i, x1,i, ..., xm,i are then grouped into a single vector Xi of size m + 1.
This makes it possible to write the linear predictor function as follows:
f(i) = β · Xi,
using the notation for a dot product between two vectors.
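To make this concrete, the following is a minimal sketch in Python (with made-up coefficients and data, not values from the text) of evaluating the linear predictor as a dot product and converting it to a probability with the logistic function:

    import numpy as np

    def sigmoid(t):
        """Standard logistic function, mapping log-odds t to a probability."""
        return 1.0 / (1.0 + np.exp(-t))

    # Hypothetical coefficients: intercept beta_0 plus two slopes (m = 2).
    beta = np.array([-1.5, 0.8, 0.4])

    # One data point, with the pseudo-variable x0 = 1 prepended for the intercept.
    X_i = np.array([1.0, 2.0, 3.0])

    t_i = np.dot(beta, X_i)   # linear predictor f(i) = beta . X_i
    p_i = sigmoid(t_i)        # modeled probability that Y_i = 1

    print(f"log-odds = {t_i:.3f}, probability = {p_i:.3f}")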
Figure: an example of SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).
Many explanatory variables, two categories
The above example of binary logistic regression on one explanatory variable
can be generalized to binary logistic regression on any number of
explanatory variables x1, x2, ... and any number of categorical values y = 0, 1, 2, ....
To begin with, we may consider a logistic model with M explanatory variables, x1, x2 ... xM and, as in the example above, two categorical values (y = 0 and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds (also called logit) of the event that y = 1. This linear relationship may be extended to the case of M explanatory variables:
t = logb(p/(1 − p)) = β0 + β1x1 + ⋯ + βMxM,
where t is the log-odds and βi are parameters of the model. An additional generalization has been introduced in which the base of the model (b) is not restricted to Euler's number e. In most applications, the base of the logarithm is usually taken to be e. However, in some cases it can be easier to communicate results by working in base 2 or base 10.
For a more compact notation, we will specify the explanatory variables and the β coefficients as (M + 1)-dimensional vectors:
x = {x0, x1, ..., xM},  β = {β0, β1, ..., βM},
with an added explanatory variable x0 = 1. The logit may now be written as:
t = Σm=0..M βm xm = β · x.
Solving for the probability p that y = 1 yields:
p(x) = Sb(t) = Sb(β · x) = 1/(1 + b^(−β · x)),
where Sb is the sigmoid function with base b. The above formula shows that once the βm are fixed, we can easily compute either the log-odds that y = 1 for a given observation, or the probability that y = 1 for a given observation. The main use-case of a logistic model is to be given an observation x, and estimate the probability p(x) that y = 1. The optimum beta coefficients may again be found by maximizing the log-likelihood. For K measurements, defining xk as the explanatory vector of the k-th measurement, and yk as the categorical outcome of that measurement, the log likelihood may be written in a form very similar to the simple case above:
ℓ = Σk=1..K ( yk log(p(xk)) + (1 − yk) log(1 − p(xk)) ).
As in the simple example above, finding the optimum β
parameters will require numerical methods. One useful technique is to
equate the derivatives of the log likelihood with respect to each of the
β parameters to zero, yielding a set of equations which will hold at the maximum of the log likelihood:
∂ℓ/∂βm = 0 = Σk=1..K yk xmk − Σk=1..K p(xk) xmk,
where xmk is the value of the xm explanatory variable from the k-th measurement.
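As an illustration of one such numerical method, the sketch below applies a plain Newton–Raphson iteration to synthetic data; the data, starting point, and stopping rule are assumptions made for the example, not part of the text:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: K measurements, M explanatory variables plus an intercept column.
    K, M = 200, 2
    X = np.hstack([np.ones((K, 1)), rng.normal(size=(K, M))])   # x_{0k} = 1
    true_beta = np.array([-0.5, 1.0, 2.0])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

    beta = np.zeros(M + 1)
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # p(x_k) for every measurement
        score = X.T @ (y - p)                     # derivatives of the log-likelihood
        hessian = -X.T @ (X * (p * (1 - p))[:, None])
        step = np.linalg.solve(hessian, score)
        beta = beta - step                        # Newton update
        if np.max(np.abs(step)) < 1e-10:
            break

    print("estimated beta:", beta)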
Consider an example with M = 2 explanatory variables, b = 10, and coefficients β0 = −3, β1 = 1, and β2 = 2 which have been determined by the above method. To be concrete, the model is:
t = log10(p/(1 − p)) = −3 + x1 + 2x2,
p = 1/(1 + 10^(−(−3 + x1 + 2x2))),
where p is the probability of the event that y = 1. This can be interpreted as follows:
β0 = −3 is the y-intercept. It is the log-odds of the event that y = 1, when the predictors x1 = x2 = 0. By exponentiating, we can see that when x1 = x2 = 0 the odds of the event that y = 1 are 1-to-1000, or 10^(−3). Similarly, the probability of the event that y = 1 when x1 = x2 = 0 can be computed as 1/(1000 + 1) = 1/1001.
β1 = 1 means that increasing x1 by 1 increases the log-odds by 1. So if x1 increases by 1, the odds that y = 1 increase by a factor of 10. The probability of y = 1 has also increased, but it has not increased by as much as the odds have increased.
β2 = 2 means that increasing x2 by 1 increases the log-odds by 2. So if x2 increases by 1, the odds that y = 1 increase by a factor of 100. Note how the effect of x2 on the log-odds is twice as great as the effect of x1, but the effect on the odds is 10 times greater. But the effect on the probability of y = 1 is not as much as 10 times greater, it's only the effect on the odds that is 10 times greater.
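A quick numerical check of this interpretation, using the base-10 coefficients of the example above (the code itself is only an illustrative sketch):

    import math

    b = 10                      # base of the model in this example
    beta0, beta1, beta2 = -3, 1, 2

    def p(x1, x2):
        t = beta0 + beta1 * x1 + beta2 * x2      # log-odds in base b
        return 1.0 / (1.0 + b ** (-t))

    print(p(0, 0))              # 1/1001 ~ 0.000999: odds of 1-to-1000 at the intercept
    print(p(1, 0) / p(0, 0))    # raising x1 by 1 multiplies the odds by 10,
                                # but the probability rises by somewhat less than 10x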
Multinomial logistic regression: Many explanatory variables and many categories
In the above cases of two categories (binomial logistic regression),
the categories were indexed by "0" and "1", and we had two
probabilities: The probability that the outcome was in category 1 was
given by p(x) and the probability that the outcome was in category 0 was given by 1 − p(x). The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup.
In general, if we have M + 1 explanatory variables (including x0) and N + 1 categories, we will need N + 1 separate probabilities, one for each category, indexed by n, which describe the probability that the categorical outcome y will be in category y=n, conditional on the vector of covariates x. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base e, these probabilities are:
Pr(y = n) = e^(βn · x) / (1 + Σu=1..N e^(βu · x))   for n = 1, 2, ..., N,
Pr(y = 0) = 1 − Σn=1..N Pr(y = n) = 1 / (1 + Σu=1..N e^(βu · x)).
Each of the probabilities except Pr(y = 0) will have their own set of regression coefficients βn. It can be seen that, as required, the sum of the Pr(y = n) over all categories n is 1. The selection of Pr(y = 0)
to be defined in terms of the other probabilities is artificial. Any of
the probabilities could have been selected to be so defined. This
special value of n is termed the "pivot index", and the log-odds (tn) are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:
tn = ln( Pr(y = n) / Pr(y = 0) ) = βn · x.
Note also that for the simple case of N = 1, the two-category case is recovered, with p(x) = Pr(y = 1) and 1 − p(x) = Pr(y = 0).
The log-likelihood that a particular set of K measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by k, let the k-th set of measured explanatory variables be denoted by xk and their categorical outcomes be denoted by yk, which can be equal to any integer in [0, N]. The log-likelihood is then:
ℓ = Σk=1..K Σn=0..N Δ(n, yk) ln(Pr(yk = n)),
where Δ(n, yk) is an indicator function which equals 1 if yk = n and zero otherwise. In the two-category case above, this indicator function was yk when n = 1 and 1 − yk when n = 0. This was convenient, but not necessary. Again, the optimum beta coefficients may be found by maximizing the
log-likelihood function generally using numerical methods. A possible
method of solution is to set the derivatives of the log-likelihood with
respect to each beta coefficient equal to zero and solve for the beta
coefficients:
∂ℓ/∂βnm = 0 = Σk=1..K Δ(n, yk) xmk − Σk=1..K Pr(yk = n) xmk,
where βnm is the m-th coefficient of the βn vector and xmk is the m-th explanatory variable of the k-th
measurement. Once the beta coefficients have been estimated from the
data, we will be able to estimate the probability that any subsequent
set of explanatory variables will result in any of the possible outcome
categories.
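As a sketch of how these pivot-indexed probabilities can be evaluated in practice (the coefficients and covariates below are hypothetical):

    import numpy as np

    def multinomial_probs(betas, x):
        """betas: (N, M+1) coefficients for categories 1..N; category 0 is the pivot.
        Returns the N+1 probabilities Pr(y = 0), ..., Pr(y = N) for covariates x."""
        t = betas @ x                        # log-odds t_n relative to the pivot
        expt = np.exp(t)
        denom = 1.0 + expt.sum()
        return np.concatenate(([1.0 / denom], expt / denom))

    # Hypothetical example: 3 categories (N = 2), 2 explanatory variables plus intercept.
    betas = np.array([[0.2, -1.0, 0.5],
                      [-0.3, 0.8, -0.2]])
    x = np.array([1.0, 1.5, -0.5])           # x_0 = 1 for the intercept
    probs = multinomial_probs(betas, x)
    print(probs, probs.sum())                # the probabilities sum to 1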
Interpretations
There
are various equivalent specifications and interpretations of logistic
regression, which fit into different types of more general models, and
allow different generalizations.
As a generalized linear model
The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:
logit(E[Yi | x1,i, ..., xm,i]) = logit(pi) = ln(pi/(1 − pi)) = β0 + β1x1,i + ⋯ + βmxm,i.
Written using the more compact notation described above, this is:
logit(E[Yi | Xi]) = logit(pi) = ln(pi/(1 − pi)) = β · Xi.
This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions
by fitting a linear predictor function of the above form to some sort
of arbitrary transformation of the expected value of the variable.
The intuition for transforming using the logit function (the natural log of the odds) was explained above.
It also has the practical effect of converting the probability (which
is bounded to be between 0 and 1) to a variable that ranges over (−∞, +∞) — thereby matching the potential range of the linear prediction function on the right side of the equation.
Both the probabilities pi and the
regression coefficients are unobserved, and the means of determining
them is not part of the model itself. They are typically determined by
some sort of optimization procedure, e.g. maximum likelihood estimation,
that finds values that best fit the observed data (i.e. that give the
most accurate predictions for the data already observed), usually
subject to regularization
conditions that seek to exclude unlikely values, e.g. extremely large
values for any of the regression coefficients. The use of a
regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution
on the coefficients, but other regularizers are also possible.)
Whether or not regularization is used, it is usually not possible to
find a closed-form solution; instead, an iterative numerical method must
be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.
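A minimal sketch of such a fit, assuming an L2 (zero-mean Gaussian prior) penalty and SciPy's L-BFGS-B optimizer; the synthetic data and penalty weight are illustrative choices, not prescribed by the text:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, m = 300, 3
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.3, -1.0, 2.0, 0.5])))))

    lam = 1.0   # strength of the squared (Gaussian-prior) regularizer

    def negative_penalized_loglik(beta):
        t = X @ beta
        # Bernoulli log-likelihood, written with logaddexp to avoid overflow
        loglik = np.sum(y * t - np.logaddexp(0.0, t))
        return -(loglik - 0.5 * lam * np.sum(beta[1:] ** 2))   # intercept unpenalized

    def gradient(beta):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        grad[1:] -= lam * beta[1:]
        return -grad

    result = minimize(negative_penalized_loglik, np.zeros(m + 1),
                      jac=gradient, method="L-BFGS-B")
    print(result.x)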
The interpretation of the βj parameter estimates is as the additive effect on the log of the odds for a unit change in the j-th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, e^β is the estimate of the odds ratio of having the outcome for, say, males compared with females.
An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:
E[Yi | Xi] = pi = logit^(−1)(β · Xi) = 1/(1 + e^(−β · Xi)).
As a latent-variable model
The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice
models and makes it easier to extend to certain more complicated models
with multiple, correlated choices, as well as to compare logistic
regression to the closely related probit model.
Imagine that, for each trial i, there is a continuous latent variable Yi* (i.e. an unobserved random variable) that is distributed as follows:
Yi* = β · Xi + ε,
where
ε ~ Logistic(0, 1),
i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.
Then Yi can be viewed as an indicator for whether this latent variable is positive:
Yi = 1 if Yi* > 0 (i.e. −ε < β · Xi), and 0 otherwise.
The choice of modeling the error variable specifically with a
standard logistic distribution, rather than a general logistic
distribution with the location and scale set to arbitrary values, seems
restrictive, but in fact, it is not. It must be kept in mind that we
can choose the regression coefficients ourselves, and very often can use
them to offset changes in the parameters of the error variable's
distribution. For example, a logistic error-variable distribution with a
non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for Yi* regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of Yi* will be smaller by a factor of s
than in the former case, for all sets of explanatory variables — but
critically, it will always remain on the same side of 0, and hence lead
to the same Yi choice.
(This predicts that the irrelevancy of the scale parameter may
not carry over into more complex models where more than two choices are
available.)
This formulation—which is standard in discrete choice models—makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution
instead of a standard logistic distribution. Both the logistic and
normal distributions are symmetric with a basic unimodal, "bell curve"
shape. The only difference is that the logistic distribution has
somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).
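The latent-variable formulation can be checked by simulation. The sketch below (with hypothetical coefficients) draws standard-logistic errors and compares the empirical frequency of Yi = 1 with the logistic formula:

    import numpy as np

    rng = np.random.default_rng(2)
    beta = np.array([0.5, -1.2])        # hypothetical coefficients (intercept and slope)
    x = np.array([1.0, 0.7])            # one setting of the explanatory variables

    n_draws = 200_000
    eps = rng.logistic(loc=0.0, scale=1.0, size=n_draws)   # standard logistic errors
    y_star = beta @ x + eps                                # latent variable
    y = (y_star > 0).astype(int)                           # observed indicator

    print("simulated Pr(Y=1):", y.mean())
    print("logistic formula: ", 1 / (1 + np.exp(-(beta @ x))))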
Two-way latent-variable model
Yet another formulation uses two separate latent variables:
Yi0* = β0 · Xi + ε0,
Yi1* = β1 · Xi + ε1,
where ε0, ε1 ~ EV1(0, 1), a standard type-1 extreme value distribution, and
Yi = 1 if Yi1* > Yi0*, and 0 otherwise.
This model has a separate latent variable and a separate set of
regression coefficients for each possible outcome of the dependent
variable. The reason for this separation is that it makes it easy to
extend logistic regression to multi-outcome categorical variables, as in
the multinomial logit
model. In such a model, it is natural to model each possible outcome
using a different set of regression coefficients. It is also possible
to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory.
(In terms of utility theory, a rational actor always chooses the choice
with the greatest associated utility.) This is the approach taken by
economists when formulating discrete choice
models, because it both provides a theoretically strong foundation and
facilitates intuitions about the model, which in turn makes it easy to
consider various sorts of extensions. (See the example below.)
It turns out that this model is equivalent to the previous model,
although this seems non-obvious, since there are now two sets of
regression coefficients and error variables, and the error variables
have a different distribution. In fact, this model reduces directly to
the previous one with the following substitutions:
β = β1 − β0,
ε = ε1 − ε0.
An intuition for this comes from the fact that, since we choose based
on the maximum of two values, only their difference matters, not the
exact values — and this effectively removes one degree of freedom.
Another critical fact is that the difference of two type-1
extreme-value-distributed variables is a logistic distribution, i.e. ε = ε1 − ε0 ~ Logistic(0, 1). We can demonstrate the equivalence as follows:
Pr(Yi = 1) = Pr(Yi1* > Yi0*) = Pr(β1 · Xi + ε1 > β0 · Xi + ε0) = Pr(ε0 − ε1 < (β1 − β0) · Xi) = Pr(−ε < β · Xi) = Pr(ε < β · Xi)   (by the symmetry of the logistic distribution) = logit^(−1)(β · Xi) = pi.
Example
As an example, consider a province-level election where the choice is
between a right-of-center party, a left-of-center party, and a
secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada). We would then use three latent variables, one for each choice. Then, in accordance with utility theory, we can then interpret the latent variables as expressing the utility
that results from making each of the choices. We can also interpret
the regression coefficients as indicating the strength that the
associated factor (i.e. explanatory variable) has in contributing to the
utility — or more correctly, the amount by which a unit change in an
explanatory variable changes the utility of a given choice. A voter
might expect that the right-of-center party would lower taxes,
especially on rich people. This would give low-income people no
benefit, i.e. no change in utility (since they usually don't pay taxes);
would cause moderate benefit (i.e. somewhat more money, or moderate
utility increase) for middle-income people; and would cause significant
benefits for high-income people. On the other hand, the left-of-center
party might be expected to raise taxes and offset it with increased
welfare and other assistance for the lower and middle classes. This
would cause significant positive benefit to low-income people, perhaps a
weak benefit to middle-income people, and significant negative benefit
to high-income people. Finally, the secessionist party would take no
direct actions on the economy, but simply secede. A low-income or
middle-income voter might expect basically no clear utility gain or loss
from this, but a high-income voter might expect negative utility since
he/she is likely to own companies, which will have a harder time doing
business in such an environment and probably lose money.
These intuitions can be expressed as follows:
Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables:

                    Center-right   Center-left   Secessionist
    High-income     strong +       strong −      strong −
    Middle-income   moderate +     weak +        none
    Low-income      none           strong +      none
This clearly shows that:
Separate sets of regression coefficients need to exist for each
choice. When phrased in terms of utility, this can be seen very easily.
Different choices have different effects on net utility; furthermore,
the effects vary in complex ways that depend on the characteristics of
each individual, so there need to be separate sets of coefficients for
each characteristic, not simply a single extra per-choice
characteristic.
Even though income is a continuous variable, its effect on utility
is too complex for it to be treated as a single variable. Either it
needs to be directly split up into ranges, or higher powers of income
need to be added so that polynomial regression on income is effectively done.
As a "log-linear" model
Yet
another formulation combines the two-way latent variable formulation
above with the original formulation higher up without latent variables,
and in the process provides a link to one of the standard formulations
of the multinomial logit.
Here, instead of writing the logit of the probabilities pi as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:
ln Pr(Yi = 0) = β0 · Xi − ln Z,
ln Pr(Yi = 1) = β1 · Xi − ln Z.
Two separate sets of regression coefficients have been introduced,
just as in the two-way latent variable model, and the two equations
take a form that writes the logarithm of the associated probability as a linear predictor, with an extra term − ln Z at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:
Pr(Yi = 0) = (1/Z) e^(β0 · Xi),
Pr(Yi = 1) = (1/Z) e^(β1 · Xi).
In this form it is clear that the purpose of Z is to ensure that the resulting distribution over Yi is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply the sum of all un-normalized probabilities, and by dividing each probability by Z, the probabilities become "normalized". That is:
Z = e^(β0 · Xi) + e^(β1 · Xi),
and the resulting equations are
Pr(Yi = 0) = e^(β0 · Xi) / (e^(β0 · Xi) + e^(β1 · Xi)),
Pr(Yi = 1) = e^(β1 · Xi) / (e^(β0 · Xi) + e^(β1 · Xi)).
Or generally:
Pr(Yi = c) = e^(βc · Xi) / Σh e^(βh · Xi).
This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit.
This general formulation is exactly the softmax function, as in
Pr(Yi = c) = softmax(c, β0 · Xi, β1 · Xi, ...).
To prove that this is equivalent to the previous model, we start by recognizing that the above model is overspecified, in that Pr(Yi = 0) and Pr(Yi = 1) cannot be independently specified: rather Pr(Yi = 0) + Pr(Yi = 1) = 1, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of β0 and β1
will produce the same probabilities for all possible explanatory
variables. In fact, it can be seen that adding any constant vector C to
both of them will produce the same probabilities:
Pr(Yi = 1) = e^((β1 + C) · Xi) / (e^((β0 + C) · Xi) + e^((β1 + C) · Xi)) = e^(β1 · Xi) e^(C · Xi) / (e^(β0 · Xi) e^(C · Xi) + e^(β1 · Xi) e^(C · Xi)) = e^(β1 · Xi) / (e^(β0 · Xi) + e^(β1 · Xi)).
As a result, we can simplify matters, and restore identifiability, by
picking an arbitrary value for one of the two vectors. We choose to
set β0 = 0. Then,
e^(β0 · Xi) = e^(0 · Xi) = 1,
and so
Pr(Yi = 1) = e^(β1 · Xi) / (1 + e^(β1 · Xi)) = 1/(1 + e^(−β1 · Xi)) = pi,
which shows that this formulation is indeed equivalent to the
previous formulation. (As in the two-way latent variable formulation,
any settings where β = β1 − β0 will produce equivalent results.)
Most treatments of the multinomial logit
model start out either by extending the "log-linear" formulation
presented here or the two-way latent variable formulation presented
above, since both clearly show the way that the model could be extended
to multi-way outcomes. In general, the presentation with latent
variables is more common in econometrics and political science, where discrete choice models and utility theory reign, while the "log-linear" formulation here is more common in computer science, e.g. machine learning and natural language processing.
As a single-layer perceptron
The model has an equivalent formulation
pi = 1/(1 + e^(−(β0 + β1x1,i + ⋯ + βkxk,i))).
This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of pi with respect to X = (x1, ..., xk) is computed from the general form:
y = 1/(1 + e^(−f(X))),
where f(X) is an analytic function in X.
With this choice, the single-layer neural network is identical to the
logistic regression model. This function has a continuous derivative,
which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:
dy/dX = y(1 − y) df/dX.
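A short numerical check of this derivative identity (the finite-difference comparison is illustrative only):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = 0.8
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central finite difference
    analytic = sigmoid(z) * (1 - sigmoid(z))                # closed form used in backpropagation
    print(numeric, analytic)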
In terms of binomial data
A closely related model assumes that each i is associated not with a single Bernoulli trial but with ni independent identically distributed trials, where the observation Yi is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a binomial distribution:
Yi ~ Bin(ni, pi), for i = 1, ..., n.
An example of this distribution is the fraction of seeds (pi) that germinate after ni are planted.
In terms of expected values, this model is expressed as follows:
pi = E[Yi/ni | Xi],
so that
logit(E[Yi/ni | Xi]) = logit(pi) = ln(pi/(1 − pi)) = β · Xi.
Or equivalently:
Pr(Yi = y | Xi) = C(ni, y) pi^y (1 − pi)^(ni − y)   for y = 0, 1, ..., ni,
where C(ni, y) is the binomial coefficient.
This model can be fit using the same sorts of methods as the above more basic model.
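For instance, the binomial log-likelihood under this model can be written down directly; the sketch below uses synthetic seed-germination counts and hypothetical coefficients:

    import numpy as np
    from scipy.special import gammaln

    def binomial_loglik(beta, X, successes, trials):
        """Log-likelihood of counts Y_i ~ Bin(n_i, p_i) with logit(p_i) = beta . X_i."""
        t = X @ beta
        p = 1 / (1 + np.exp(-t))
        log_binom = (gammaln(trials + 1) - gammaln(successes + 1)
                     - gammaln(trials - successes + 1))
        return np.sum(log_binom + successes * np.log(p)
                      + (trials - successes) * np.log(1 - p))

    # Illustrative data: seeds planted (n_i) and germinating (Y_i) in three batches.
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
    trials = np.array([50, 50, 50])
    successes = np.array([10, 22, 35])
    print(binomial_loglik(np.array([-1.0, 0.8]), X, successes, trials))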
Model fitting
Maximum likelihood estimation (MLE)
The regression coefficients are usually estimated using maximum likelihood estimation. Unlike linear regression with normally distributed residuals, it is not
possible to find a closed-form expression for the coefficient values
that maximize the likelihood function so an iterative process must be
used instead; for example Newton's method.
This process begins with a tentative solution, revises it slightly to
see if it can be improved, and repeats this revision until no more
improvement is made, at which point the process is said to have
converged.
In some instances, the model may not reach convergence.
Non-convergence of a model indicates that the coefficients are not
meaningful because the iterative process was unable to find appropriate
solutions. A failure to converge may occur for a number of reasons:
having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation.
Having a large ratio of variables to cases results in an overly
conservative Wald statistic (discussed below) and can lead to
non-convergence. Regularized logistic regression is specifically intended to be used in this situation.
Multicollinearity refers to unacceptably high correlations between
predictors. As multicollinearity increases, coefficients remain unbiased
but standard errors increase and the likelihood of model convergence
decreases. To detect multicollinearity amongst the predictors, one can conduct a
linear regression analysis with the predictors of interest for the sole
purpose of examining the tolerance statistic used to assess whether multicollinearity is unacceptably high.
Sparseness in the data refers to having a large proportion of empty
cells (cells with zero counts). Zero cell counts are particularly
problematic with categorical predictors. With continuous predictors, the
model can infer values for the zero cell counts, but this is not the
case with categorical predictors. The model will not converge with zero
cell counts for categorical predictors because the natural logarithm of
zero is an undefined value so that the final solution to the model
cannot be reached. To remedy this problem, researchers may collapse
categories in a theoretically meaningful way or add a constant to all
cells.
Another numerical problem that may lead to a lack of convergence is
complete separation, which refers to the instance in which the
predictors perfectly predict the criterion – all cases are accurately
classified and the likelihood maximized with infinite coefficients. In
such instances, one should re-examine the data, as there may be some
kind of error.
One can also take semi-parametric or non-parametric approaches,
e.g., via local-likelihood or nonparametric quasi-likelihood methods,
which avoid assumptions of a parametric form for the index function and
are robust to the choice of the link function (e.g., probit or logit).
Iteratively reweighted least squares (IRLS)
Binary logistic regression (y = 0 or y = 1) can, for example, be calculated using iteratively reweighted least squares (IRLS), which is equivalent to maximizing the log-likelihood of a Bernoulli distributed process using Newton's method. If the problem is written in vector matrix form, with parameters β, explanatory variables x(i) and expected value of the Bernoulli distribution μ(i) = 1/(1 + e^(−β · x(i))), the parameters β can be found using the following iterative algorithm:
βk+1 = (X^T Sk X)^(−1) X^T (Sk X βk + y − μk),
where Sk = diag(μk(1 − μk)) is a diagonal weighting matrix, μk the vector of expected values evaluated at βk,
X the regressor matrix (one row per observation, including the constant column x0 = 1) and y the vector of response variables. More details can be found in the literature.
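A compact sketch of this IRLS iteration on synthetic data (the tolerance, data, and iteration cap are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 500, 2
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # regressor matrix
    true_beta = np.array([0.5, -1.0, 1.5])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))       # response vector

    beta = np.zeros(m + 1)
    for _ in range(50):
        mu = 1 / (1 + np.exp(-X @ beta))          # vector of expected values
        S = mu * (1 - mu)                         # diagonal of the weighting matrix
        # IRLS update: beta <- (X^T S X)^(-1) X^T (S X beta + y - mu)
        new_beta = np.linalg.solve(X.T @ (X * S[:, None]),
                                   X.T @ (S * (X @ beta) + y - mu))
        converged = np.max(np.abs(new_beta - beta)) < 1e-10
        beta = new_beta
        if converged:
            break

    print("IRLS estimate:", beta)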
The widely used "one in ten rule"
states that logistic regression models give stable values for the
explanatory variables if based on a minimum of about 10 events per
explanatory variable (EPV); where event denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use k explanatory variables for an event (e.g. myocardial infarction) expected to occur in a proportion p of participants in the study will require a total of 10k/p
participants. However, there is considerable debate about the
reliability of this rule, which is based on simulation studies and lacks
a secure theoretical underpinning. According to some authors the rule is overly conservative in some circumstances, with the authors
stating, "If we (somewhat subjectively) regard confidence interval
coverage less than 93 percent, type I error greater than 7 percent, or
relative bias greater than 15 percent as problematic, our results
indicate that problems are fairly frequent with 2–4 EPV, uncommon with
5–9 EPV, and still observed with 10–16 EPV. The worst instances of each
problem were not severe with 5–9 EPV and usually comparable to those
with 10–16 EPV".
Others have found results that are not consistent with the above,
using different criteria. A useful criterion is whether the fitted
model will be expected to achieve the same predictive discrimination in a
new sample as it appeared to achieve in the model development sample.
For that criterion, 20 events per candidate variable may be required. Also, one can argue that 96 observations are needed only to estimate
the model's intercept precisely enough that the margin of error in
predicted probabilities is ±0.1 with a 0.95 confidence level.
Error and significance of fit
Deviance and likelihood ratio test ─ a simple case
In
any fitting procedure, the addition of another fitting parameter to a
model (e.g. the beta parameters in a logistic regression model) will
almost always improve the ability of the model to predict the measured
outcomes. This will be true even if the additional term has no
predictive value, since the model will simply be "overfitting"
to the noise in the data. The question arises as to whether the
improvement gained by the addition of another fitting parameter is
significant enough to recommend the inclusion of the additional term, or
whether the improvement is simply that which may be expected from
overfitting.
In short, for logistic regression, a statistic known as the deviance
is defined which is a measure of the error between the logistic model
fit and the outcome data. In the limit of a large number of data points,
the deviance is chi-squared distributed, which allows a chi-squared test to be implemented in order to determine the significance of the explanatory variables.
Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of K data points (xk, yk) are fitted to a proposed model function of the form y = b0 + b1x. The fit is obtained by choosing the b parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point:
ε² = Σk=1..K (yk − b0 − b1xk)².
The minimum value which constitutes the fit will be denoted by ε̂².
The idea of a null model may be introduced, in which it is assumed that the x variable is of no use in predicting the yk outcomes: The data points are fitted to a null model function of the form y = b0 with a squared error term:
ε² = Σk=1..K (yk − b0)².
The fitting process consists of choosing a value of b0 which minimizes ε² of the fit to the null model, denoted by ε̂²φ, where the subscript φ denotes the null model. It is seen that the null model is optimized by b0 = ȳ, where ȳ is the mean of the yk values, and the optimized ε̂²φ is:
ε̂²φ = Σk=1..K (yk − ȳ)²,
which is proportional to the square of the (uncorrected) sample standard deviation of the yk data points.
We can imagine a case where the yk data points are randomly assigned to the various xk,
and then fitted using the proposed model. Specifically, we can consider
the fits of the proposed model to every permutation of the yk
outcomes. It can be shown that the optimized error of any of these fits
will never be less than the optimum error of the null model, and that
the difference between these minimum errors will follow a chi-squared distribution, with degrees of freedom equal to those of the proposed model minus those of the null model which, in this case, will be 2 − 1 = 1. Using the chi-squared test, we may then estimate how many of these permuted sets of yk will yield a minimum error less than or equal to the minimum error using the original yk, and so we can estimate how significant an improvement is given by the inclusion of the x variable in the proposed model.
For logistic regression, the measure of goodness-of-fit is the likelihood function L, or its logarithm, the log-likelihood ℓ. The likelihood function L is analogous to the ε²
in the linear regression case, except that the likelihood is maximized
rather than minimized. Denote the maximized log-likelihood of the
proposed model by ℓ̂.
In the case of simple binary logistic regression, the set of K data points are fitted in a probabilistic sense to a function of the form:
p(x) = 1/(1 + e^(−t)),
where p(x) is the probability that y = 1. The log-odds are given by:
t = β0 + β1x,
and the log-likelihood is:
ℓ = Σk=1..K ( yk ln(p(xk)) + (1 − yk) ln(1 − p(xk)) ).
For the null model, the probability that y = 1 is given by:
pφ = 1/(1 + e^(−tφ)).
The log-odds for the null model are given by:
tφ = β0,
and the log-likelihood is:
ℓφ = Σk=1..K ( yk ln(pφ) + (1 − yk) ln(1 − pφ) ).
Since we have pφ = ȳ at the maximum of L, the maximum log-likelihood for the null model is
ℓ̂φ = K ( ȳ ln(ȳ) + (1 − ȳ) ln(1 − ȳ) ).
The optimum β0 is:
β0 = ln(ȳ/(1 − ȳ)),
where ȳ is again the mean of the yk values. Again, we can conceptually consider the fit of the proposed model to every permutation of the yk
and it can be shown that the maximum log-likelihood of these
permutation fits will never be smaller than that of the null model:
ℓ̂ ≥ ℓ̂φ.
Also, as an analog to the error of the linear regression case, we may define the deviance of a logistic regression fit as:
D = ln(L̂²/L̂φ²) = 2(ℓ̂ − ℓ̂φ),
which will always be positive or zero. The reason for this choice is
that not only is the deviance a good measure of the goodness of fit, it
is also approximately chi-squared distributed, with the approximation
improving as the number of data points (K) increases, becoming
exactly chi-square distributed in the limit of an infinite number of
data points. As in the case of linear regression, we may use this fact
to estimate the probability that a random set of data points will give a
better fit than the fit obtained by the proposed model, and so have an
estimate of how significantly the model is improved by including the xk data points in the proposed model.
For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is ℓ̂φ = −13.8629..., the maximum value of the log-likelihood for the simple model is ℓ̂ = −8.0299..., so that the deviance is D = 2(ℓ̂ − ℓ̂φ) = 11.6661...
Using the chi-squared test of significance, the integral of the chi-squared distribution with one degree of freedom from 11.6661... to infinity is equal to 0.00063649...
This effectively means that about 6 out of 10,000 fits to random yk can be expected to have a better fit (smaller deviance) than the given yk, and so we can conclude that the inclusion of the x variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the null hypothesis with confidence.
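This tail probability can be reproduced directly from the chi-squared distribution, for example with SciPy's survival function (a sketch):

    from scipy.stats import chi2

    deviance = 11.6661
    p_value = chi2.sf(deviance, df=1)   # upper-tail area of chi-squared with 1 degree of freedom
    print(p_value)                      # ~ 0.00064, i.e. roughly 6 in 10,000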
Goodness of fit summary
Goodness of fit in linear regression models is generally measured using R2. Since this has no direct analog in logistic regression, various methods including the following can be used instead.
Deviance and likelihood ratio tests
In linear regression analysis, one is concerned with partitioning variance via the sum of squares
calculations – variance in the criterion is essentially divided into
variance accounted for by the predictors and residual variance. In
logistic regression analysis, deviance is used in lieu of sum of squares calculations. Deviance is analogous to the sum of squares calculations in linear regression and is a measure of the lack of fit to the data in a logistic regression model. When a "saturated" model is available (a model with a theoretically
perfect fit), deviance is calculated by comparing a given model with the
saturated model. This computation gives the likelihood-ratio test:
D = −2 ln( likelihood of the fitted model / likelihood of the saturated model ).
In the above equation, D
represents the deviance and ln represents the natural logarithm. The
log of this likelihood ratio (the ratio of the fitted model to the
saturated model) will produce a negative value, hence the need for a
negative sign. D can be shown to follow an approximate chi-squared distribution. Smaller values indicate better fit as the fitted model deviates less
from the saturated model. When assessed upon a chi-square distribution,
nonsignificant chi-square values indicate very little unexplained
variance and thus, good model fit. Conversely, a significant chi-square
value indicates that a significant amount of the variance is
unexplained.
When the saturated model is not available (a common case),
deviance is calculated simply as −2·(log likelihood of the fitted
model), and the reference to the saturated model's log likelihood can be
removed from all that follows without harm.
Two measures of deviance are particularly important in logistic
regression: null deviance and model deviance. The null deviance
represents the difference between a model with only the intercept (which
means "no predictors") and the saturated model. The model deviance
represents the difference between a model with at least one predictor
and the saturated model. In this respect, the null model provides a baseline upon which to
compare predictor models. Given that deviance is a measure of the
difference between a given model and the saturated model, smaller values
indicate better fit. Thus, to assess the contribution of a predictor or
set of predictors, one can subtract the model deviance from the null
deviance and assess the difference on a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated.
Let
Dnull = −2 ln( likelihood of null model / likelihood of the saturated model ),
Dfitted = −2 ln( likelihood of fitted model / likelihood of the saturated model ).
Then the difference of both is:
Dnull − Dfitted = −2 ln( likelihood of null model / likelihood of the saturated model ) + 2 ln( likelihood of fitted model / likelihood of the saturated model ) = −2 ln( likelihood of null model / likelihood of fitted model ).
If the model deviance is significantly smaller than the null deviance
then one can conclude that the predictor or set of predictors
significantly improve the model's fit. This is analogous to the F-test used in linear regression analysis to assess the significance of prediction.
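A sketch of this comparison with hypothetical deviance values; the reference distribution is chi-squared with degrees of freedom equal to the difference in estimated parameters:

    from scipy.stats import chi2

    null_deviance = 135.2      # hypothetical: intercept-only model
    model_deviance = 118.7     # hypothetical: model with 3 predictors
    df = 3                     # difference in the number of estimated parameters

    lr_statistic = null_deviance - model_deviance
    p_value = chi2.sf(lr_statistic, df)
    print(lr_statistic, p_value)   # small p-value => predictors significantly improve fit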
In linear regression the squared multiple correlation, R2
is used to assess goodness of fit as it represents the proportion of
variance in the criterion that is explained by the predictors. In logistic regression analysis, there is no agreed upon analogous
measure, but there are several competing measures each with limitations.
Four of the most commonly used indices and one less commonly used one are examined on this page:
Likelihood ratio R2L
Cox and Snell R2CS
Nagelkerke R2N
McFadden R2McF
Tjur R2T
Hosmer–Lemeshow test
The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a χ² distribution
to assess whether or not the observed event rates match expected event
rates in subgroups of the model population. This test is considered to
be obsolete by some statisticians because of its dependence on arbitrary
binning of predicted probabilities and relatively low power.
Coefficient significance
After
fitting the model, it is likely that researchers will want to examine
the contribution of individual predictors. To do so, they will want to
examine the regression coefficients. In linear regression, the
regression coefficients represent the change in the criterion for each
unit change in the predictor. In logistic regression, however, the regression coefficients represent
the change in the logit for each unit change in the predictor. Given
that the logit is not intuitive, researchers are likely to focus on a
predictor's effect on the exponential function of the regression
coefficient – the odds ratio (see definition). In linear regression, the significance of a regression coefficient is assessed by computing a t
test. In logistic regression, there are several different tests
designed to assess the significance of an individual predictor, most
notably the likelihood ratio test and the Wald statistic.
Likelihood ratio test
The likelihood-ratio test
discussed above to assess model fit is also the recommended procedure
to assess the contribution of individual "predictors" to a given model. In the case of a single predictor model, one simply compares the
deviance of the predictor model with that of the null model on a
chi-square distribution with a single degree of freedom. If the
predictor model has significantly smaller deviance (c.f. chi-square
using the difference in degrees of freedom of the two models), then one
can conclude that there is a significant association between the
"predictor" and the outcome. Although some common statistical packages
(e.g. SPSS) do provide likelihood ratio test statistics, without this
computationally intensive test it would be more difficult to assess the
contribution of individual predictors in the multiple logistic
regression case. To assess the contribution of individual predictors one can enter the
predictors hierarchically, comparing each new model with the previous to
determine the contribution of each predictor. There is some debate among statisticians about the appropriateness of so-called "stepwise" procedures. The fear is that they may not preserve nominal statistical properties and may become misleading.
Wald statistic
Alternatively,
when assessing the contribution of individual predictors in a given
model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the t-test
in linear regression, is used to assess the significance of
coefficients. The Wald statistic is the ratio of the square of the
regression coefficient to the square of the standard error of the
coefficient and is asymptotically distributed as a chi-square
distribution.
Although several statistical packages (e.g., SPSS, SAS) report the
Wald statistic to assess the contribution of individual predictors, the
Wald statistic has limitations. When the regression coefficient is
large, the standard error of the regression coefficient also tends to be
larger, increasing the probability of Type-II error. The Wald statistic also tends to be biased when data are sparse.
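A sketch of the Wald calculation for a single coefficient, using a hypothetical estimate and standard error:

    from scipy.stats import chi2

    beta_hat = 1.2        # hypothetical estimated coefficient
    std_err = 0.4         # hypothetical standard error of that coefficient

    wald = (beta_hat / std_err) ** 2          # ratio of squared coefficient to squared SE
    p_value = chi2.sf(wald, df=1)             # compared against chi-squared with 1 df
    print(wald, p_value)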
Case-control sampling
Suppose
cases are rare. Then we might wish to sample them more frequently than
their prevalence in the population. For example, suppose there is a
disease that affects 1 person in 10,000 and to collect our data we need
to do a complete physical. It may be too expensive to do thousands of
physicals of healthy people in order to obtain data for only a few
diseased individuals. Thus, we may evaluate more diseased individuals,
perhaps all of the rare outcomes. This is also retrospective sampling,
or equivalently it is called unbalanced data. As a rule of thumb,
sampling controls at a rate of five times the number of cases will
produce sufficient control data.
Logistic regression is unique in that it may be estimated on
unbalanced data, rather than randomly sampled data, and still yield
correct coefficient estimates of the effects of each independent
variable on the outcome. That is to say, if we form a logistic model
from such data, if the model is correct in the general population, the βj parameters are all correct except for β0. We can correct β0 if we know the true prevalence as follows:
β̂0* = β̂0 + log(π/(1 − π)) − log(π̃/(1 − π̃)),
where π is the true prevalence and π̃ is the prevalence in the sample.
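A sketch of this intercept correction with hypothetical prevalences and a hypothetical intercept estimated from the sample:

    import math

    beta0_sample = -0.2      # hypothetical intercept estimated from the case-control sample
    pi_true = 0.0001         # true prevalence in the population (e.g. 1 in 10,000)
    pi_sample = 0.20         # prevalence of cases in the sampled data

    beta0_corrected = (beta0_sample
                       + math.log(pi_true / (1 - pi_true))
                       - math.log(pi_sample / (1 - pi_sample)))
    print(beta0_corrected)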
Discussion
Like other forms of regression analysis,
logistic regression makes use of one or more predictor variables that
may be either continuous or categorical. Unlike ordinary linear
regression, however, logistic regression is used for predicting
dependent variables that take membership in one of a limited number of categories (treating the dependent variable in the binomial case as the outcome of a Bernoulli trial)
rather than a continuous outcome. Given this difference, the
assumptions of linear regression are violated. In particular, the
residuals cannot be normally distributed. In addition, linear regression
may make nonsensical predictions for a binary dependent variable. What
is needed is a way to convert a binary variable into a continuous one
that can take on any real value (negative or positive). To do that,
binomial logistic regression first calculates the odds of the event happening for different levels of each independent variable, and then takes its logarithm to create a continuous criterion as a transformed version of the dependent variable. The logarithm of the odds is the logit of the probability; the logit is defined as follows:
logit(p) = ln(p/(1 − p))   for 0 < p < 1.
Although the dependent variable in logistic regression is Bernoulli, the logit is on an unrestricted scale. The logit function is the link function in this kind of generalized linear model, i.e.
logit(E[Y]) = β0 + β1x.
Y is the Bernoulli-distributed response variable and x is the predictor variable; the β values are the linear parameters.
The logit of the probability of success is then fitted to the predictors. The predicted value of the logit is converted back into predicted odds, via the inverse of the natural logarithm – the exponential function.
Thus, although the observed dependent variable in binary logistic
regression is a 0-or-1 variable, the logistic regression estimates the
odds, as a continuous variable, that the dependent variable is a
'success'. In some applications, the odds are all that is needed. In
others, a specific yes-or-no prediction is needed for whether the
dependent variable is or is not a 'success'; this categorical prediction
can be based on the computed odds of success, with predicted odds above
some chosen cutoff value being translated into a prediction of success.
Machine learning and cross-entropy loss function
In machine learning applications where logistic regression is used for binary classification, the MLE minimises the cross-entropy loss function.
Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable being 0 or 1 given experimental data.
Writing the model probability as hβ(x) = Pr(Y = 1 | X = x) = 1/(1 + e^(−β · x)), we have Pr(Y = 0 | X = x) = 1 − hβ(x), and since y ∈ {0, 1}, we see that Pr(y | x) is given by Pr(y | x) = hβ(x)^y (1 − hβ(x))^(1 − y). We now calculate the likelihood function assuming that all the observations in the sample are independently Bernoulli distributed,
L(β | y; x) = Πi=1..N hβ(xi)^(yi) (1 − hβ(xi))^(1 − yi).
Typically, the log likelihood is maximized,
(1/N) log L(β | y; x) = (1/N) Σi=1..N ( yi log hβ(xi) + (1 − yi) log(1 − hβ(xi)) ),
which is maximized using optimization techniques such as gradient descent.
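A minimal gradient-descent sketch for this objective (equivalently, minimizing the average cross-entropy loss); the learning rate, iteration count, and synthetic data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(4)
    N, m = 400, 2
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, m))])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.2, 1.0, -1.5]))))

    beta = np.zeros(m + 1)
    learning_rate = 0.1
    for _ in range(2000):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (p - y) / N          # gradient of the average cross-entropy loss
        beta -= learning_rate * grad      # gradient-descent step

    print("gradient-descent estimate:", beta)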
Assuming the (x, y) pairs are drawn uniformly from the underlying distribution, then in the limit of large N,
limN→∞ (1/N) log L(β | y; x) = −H(Y | X) − DKL( Y ∥ Yβ ),
where H(Y | X) is the conditional entropy and DKL is the Kullback–Leibler divergence.
This leads to the intuition that by maximizing the log-likelihood of a
model, you are minimizing the KL divergence of your model from the
maximal entropy distribution. Intuitively, this searches for the model that
makes the fewest assumptions in its parameters.
Comparison with linear regression
Logistic regression can be seen as a special case of the generalized linear model and thus analogous to linear regression.
The model of logistic regression, however, is based on quite different
assumptions (about the relationship between the dependent and
independent variables) from those of linear regression. In particular,
the key differences between these two models can be seen in the
following two features of logistic regression. First, the conditional
distribution is a Bernoulli distribution rather than a Gaussian distribution,
because the dependent variable is binary. Second, the predicted values
are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes rather than the outcomes themselves.
Alternatives
A common alternative to the logistic model (logit model) is the probit model, as the related names suggest. From the perspective of generalized linear models, these differ in the choice of link function: the logistic model uses the logit function (inverse logistic function), while the probit model uses the probit function (inverse error function). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors. Other sigmoid functions or error distributions can be used instead.
Logistic regression is an alternative to Fisher's 1936 method, linear discriminant analysis. If the assumptions of linear discriminant analysis hold, the
conditioning can be reversed to produce logistic regression. The
converse is not true, however, because logistic regression does not
require the multivariate normal assumption of discriminant analysis.
The assumption of linear predictor effects can easily be relaxed using techniques such as spline functions.
History
A detailed history of the logistic regression is given in Cramer (2002). The logistic function was developed as a model of population growth and named "logistic" by Pierre François Verhulst in the 1830s and 1840s, under the guidance of Adolphe Quetelet; see Logistic function § History for details. In his earliest paper (1838), Verhulst did not specify how he fit the curves to the data. In his more detailed paper (1845), Verhulst determined the three
parameters of the model by making the curve pass through three observed
points, which yielded poor predictions.
The logistic function was independently developed in chemistry as a model of autocatalysis (Wilhelm Ostwald, 1883). An autocatalytic reaction is one in which one of the products is itself a catalyst
for the same reaction, while the supply of one of the reactants is
fixed. This naturally gives rise to the logistic equation for the same
reason as population growth: the reaction is self-reinforcing but
constrained.
The logistic function was independently rediscovered as a model of population growth in 1920 by Raymond Pearl and Lowell Reed, published as Pearl & Reed (1920),
which led to its use in modern statistics. They were initially unaware
of Verhulst's work and presumably learned about it from L. Gustave du Pasquier, but they gave him little credit and did not adopt his terminology. Verhulst's priority was acknowledged and the term "logistic" revived by Udny Yule in 1925 and has been followed since. Pearl and Reed first applied the model to the population of the United
States, and also initially fitted the curve by making it pass through
three points; as with Verhulst, this again yielded poor results.
The logistic model was likely first used as an alternative to the probit model in bioassay by Edwin Bidwell Wilson and his student Jane Worcester in Wilson & Worcester (1943). However, the development of the logistic model as a general alternative to the probit model was principally due to the work of Joseph Berkson over many decades, beginning in Berkson (1944), where he coined "logit", by analogy with "probit", and continuing through Berkson (1951) and following years. The logit model was initially dismissed as inferior to the probit
model, but "gradually achieved an equal footing with the probit", particularly between 1960 and 1970. By 1970, the logit model achieved
parity with the probit model in use in statistics journals and
thereafter surpassed it. This relative popularity was due to the
adoption of the logit outside of bioassay, rather than displacing the
probit within bioassay, and its informal use in practice; the logit's
popularity is credited to the logit model's computational simplicity,
mathematical properties, and generality, allowing its use in varied
fields.
Various refinements occurred during that time, notably by David Cox, as in Cox (1958).
The multinomial logit model was introduced independently in Cox (1966) and Theil (1969), which greatly increased the scope of application and the popularity of the logit model. In 1973 Daniel McFadden linked the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit followed from the assumption of independence of irrelevant alternatives and interpreting odds of alternatives as relative preferences; this gave a theoretical foundation for the logistic regression.
Extensions
There are large numbers of extensions:
Multinomial logistic regression (or multinomial logit) handles the case of a multi-way categorical
dependent variable (with unordered values, also called
"classification"). The general case of having dependent variables with
more than two values is termed polytomous regression.