
Saturday, March 6, 2021

Flying Spaghetti Monster

From Wikipedia, the free encyclopedia
 
Pastafarianism
Oil painting in the style of "The Creation of Adam" by Michelangelo (which shows Adam reclining and reaching out to touch God), but instead of God there is the Flying Spaghetti Monster; two large meatballs wrapped in noodles, with eyes on stalks which are also noodles, all floating in mid-air.
Touched by His Noodly Appendage, a parody of Michelangelo's The Creation of Adam, is an iconic image of the Flying Spaghetti Monster by Arne Niklas Jansson.
Abode: spaghettimonster.org
Symbol: FSM Logo.svg
Texts: The Gospel of the Flying Spaghetti Monster; The Loose Canon, the Holy Book of the Church of the Flying Spaghetti Monster
Festivals: "Holiday"

The Flying Spaghetti Monster (FSM) is the deity of the Church of the Flying Spaghetti Monster or Pastafarianism, a social movement that promotes a light-hearted view of religion and opposes the teaching of intelligent design and creationism in public schools. According to adherents, Pastafarianism (a portmanteau of pasta and Rastafarianism) is a "real, legitimate religion, as much as any other". It has received some limited recognition as such.

The "Flying Spaghetti Monster" was first described in a satirical open letter written by Bobby Henderson in 2005 to protest the Kansas State Board of Education decision to permit teaching intelligent design as an alternative to evolution in public school science classes. In the letter, Henderson demanded equal time in science classrooms for "Flying Spaghetti Monsterism", alongside intelligent design and evolution. After Henderson published the letter on his website, the Flying Spaghetti Monster rapidly became an Internet phenomenon and a symbol of opposition to the teaching of intelligent design in public schools.

Pastafarian tenets (generally satires of creationism) are presented on Henderson's Church of the Flying Spaghetti Monster website (where he is described as "prophet"), and are also elucidated in The Gospel of the Flying Spaghetti Monster, written by Henderson in 2006, and in The Loose Canon, the Holy Book of the Church of the Flying Spaghetti Monster. The central creation myth is that an invisible and undetectable Flying Spaghetti Monster created the universe after drinking heavily. Pirates are revered as the original Pastafarians. Henderson asserts that a decline in the number of pirates over the years is the cause of global warming. The FSM community congregates at Henderson's website to share ideas about the Flying Spaghetti Monster and crafts representing images of it.

Because of its popularity and exposure, the Flying Spaghetti Monster is often used as a contemporary version of Russell's teapot—an argument that the philosophic burden of proof lies upon those who make unfalsifiable claims, not on those who reject them. Pastafarians have engaged in disputes with creationists, including in Polk County, Florida, where they played a role in dissuading the local school board from adopting new rules on teaching evolution. Pastafarianism has received praise from the scientific community and criticism from proponents of intelligent design.

History

In January 2005, Bobby Henderson, a 24-year-old Oregon State University physics graduate, sent an open letter regarding the Flying Spaghetti Monster to the Kansas State Board of Education. In that letter, Henderson satirized creationism by professing his belief that whenever a scientist carbon-dates an object, a supernatural creator that closely resembles spaghetti and meatballs is there "changing the results with His Noodly Appendage". Henderson argued that his beliefs were just as valid as intelligent design, and called for equal time in science classrooms alongside intelligent design and evolution. The letter was sent prior to the Kansas evolution hearings as an argument against the teaching of intelligent design in biology classes. Henderson, describing himself as a "concerned citizen" representing more than ten million others, argued that intelligent design and his belief that "the universe was created by a Flying Spaghetti Monster" were equally valid. In his letter, he noted,

I think we can all look forward to the time when these three theories are given equal time in our science classrooms across the country, and eventually the world; one third time for Intelligent Design, one third time for Flying Spaghetti Monsterism, and one third time for logical conjecture based on overwhelming observable evidence.

— Bobby Henderson

According to Henderson, since the intelligent design movement uses ambiguous references to a designer, any conceivable entity may fulfill that role, including a Flying Spaghetti Monster. Henderson explained, "I don't have a problem with religion. What I have a problem with is religion posing as science. If there is a god and he's intelligent, then I would guess he has a sense of humor."

In May 2005, having received no reply from the Kansas State Board of Education, Henderson posted the letter on his website, gaining significant public interest. Shortly thereafter, Pastafarianism became an Internet phenomenon. Henderson published the responses he then received from board members. Three board members, all of whom opposed the curriculum amendments, responded positively; a fourth board member responded with the comment "It is a serious offense to mock God". Henderson has also published the significant amount of hate mail, including death threats, that he has received. Within one year of sending the open letter, Henderson received thousands of emails on the Flying Spaghetti Monster, eventually totaling over 60,000, of which he has said that "about 95 percent have been supportive, while the other five percent have said I am going to hell". During that time, his site garnered tens of millions of hits.

Internet phenomenon

Drawing of the Flying Spaghetti Monster; crudely drawn with thick lines. Image shows a plain oval for the body, six noodles for the arms and two eye stalks.
A version of the FSM "fish" emblem, the symbol of the Church of the Flying Spaghetti Monster. The symbol was created by readers of the Boing Boing web site in 2005. It is a parody of the Christian Ichthys symbol.

As word of Henderson's challenge to the board spread, his website and cause received more attention and support. The satirical nature of Henderson's argument made the Flying Spaghetti Monster popular with bloggers as well as humor and Internet culture websites. The Flying Spaghetti Monster was featured on websites such as Boing Boing, Something Awful, Uncyclopedia, and Fark.com. Moreover, an International Society for Flying Spaghetti Monster Awareness and other fan sites emerged. As public awareness grew, the mainstream media picked up on the phenomenon. The Flying Spaghetti Monster became a symbol for the case against intelligent design in public education. The open letter was printed in several major newspapers, including The New York Times, The Washington Post, and Chicago Sun-Times, and received worldwide press attention. Henderson himself was surprised by its success, stating that he "wrote the letter for my own amusement as much as anything".

In August 2005, in response to a challenge from a reader, Boing Boing announced a $250,000 prize—later raised to $1,000,000—of "Intelligently Designed currency" payable to any individual who could produce empirical evidence proving that Jesus is not the son of the Flying Spaghetti Monster. It was modeled as a parody of a similar challenge issued by young-earth creationist Kent Hovind.

According to Henderson, newspaper articles on the Flying Spaghetti Monster attracted the attention of book publishers; he said that at one point, there were six publishers interested in the Flying Spaghetti Monster. In November 2005, Henderson received an advance from Villard to write The Gospel of the Flying Spaghetti Monster.

In November 2005, the Kansas State Board of Education voted to allow criticisms of evolution, including language about intelligent design, as part of testing standards. On February 13, 2007, the board voted 6–4 to reject the amended science standards enacted in 2005. This was the fifth time in eight years that the board had rewritten the standards on evolution.

Tenets

With millions, if not thousands, of devout worshipers, the Church of the FSM is widely considered a legitimate religion, even by its opponents—mostly fundamentalist Christians, who have accepted that our God has larger balls than theirs.

Bobby Henderson

Although Henderson has stated that "the only dogma allowed in the Church of the Flying Spaghetti Monster is the rejection of dogma", some general beliefs are held by Pastafarians. Henderson proposed many Pastafarian tenets in reaction to common arguments by proponents of intelligent design. These "canonical beliefs" are presented by Henderson in his letter to the Kansas State Board of Education, The Gospel of the Flying Spaghetti Monster, and on Henderson's web site, where he is described as a "prophet". They tend to satirize creationism.

Creation

The central creation myth is that an invisible and undetectable Flying Spaghetti Monster created the universe "after drinking heavily". According to these beliefs, the Monster's intoxication was the cause for a flawed Earth. Furthermore, according to Pastafarianism, all evidence for evolution was planted by the Flying Spaghetti Monster in an effort to test the faith of Pastafarians—parodying certain biblical literalists. When scientific measurements such as radiocarbon dating are taken, the Flying Spaghetti Monster "is there changing the results with His Noodly Appendage".

Afterlife

The Pastafarian conception of Heaven includes a beer volcano and a stripper (or sometimes prostitute) factory. The Pastafarian Hell is similar, except that the beer is stale and the strippers have sexually transmitted diseases.

Pirates and global warming

chart showing that in 1820 there were 25,000 pirates and the global average temperature was 14.2 degrees C, while in 2000 there were 17 pirates and the global average temperature was 15.9 degrees C.
A misleading graph that is claimed to correlate the number of pirates with global temperature

According to Pastafarian beliefs, pirates are "absolute divine beings" and the original Pastafarians. Furthermore, Pastafarians believe that the concept of pirates as "thieves and outcasts" is misinformation spread by Christian theologians in the Middle Ages and by Hare Krishnas. Instead, Pastafarians believe that they were "peace-loving explorers and spreaders of good will" who distributed candy to small children, adding that modern pirates are in no way similar to "the fun-loving buccaneers from history". In addition, Pastafarians believe that ghost pirates are responsible for all of the mysteriously lost ships and planes of the Bermuda Triangle. Pastafarians are among those who celebrate International Talk Like a Pirate Day on September 19.

The inclusion of pirates in Pastafarianism was part of Henderson's original letter to the Kansas State Board of Education, in an effort to illustrate that correlation does not imply causation. Henderson presented the argument that "global warming, earthquakes, hurricanes, and other natural disasters are a direct effect of the shrinking numbers of pirates since the 1800s". A deliberately misleading graph accompanying the letter (with numbers humorously disordered on the x-axis) shows that as the number of pirates decreased, global temperatures increased. This parodies the suggestion from some religious groups that the high numbers of disasters, famines, and wars in the world are due to a lack of respect and worship toward their deity. In 2008, Henderson interpreted the growing pirate activities at the Gulf of Aden as additional support, pointing out that Somalia has "the highest number of pirates and the lowest carbon emissions of any country".
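The statistical point behind the joke can be made concrete with a small sketch. It takes only the two data points quoted in the chart description above (1820 and 2000) and computes a Pearson correlation coefficient; the helper name `pearson` and the variable names are illustrative assumptions, not anything from Henderson's letter. With just two points the coefficient is forced to an extreme value, which is part of why such a chart demonstrates nothing about cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The two points quoted in the chart description: (1820, 2000).
pirates = [25000, 17]
temps_c = [14.2, 15.9]

r = pearson(pirates, temps_c)
print(round(r, 6))  # very close to -1: fewer pirates, higher temperature
```

A coefficient near -1 says only that the two series move in opposite directions in this sample. It carries no information about whether one drives the other, which is the point the pirate graph parodies.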

Holidays

An alternative tree-topper for Pastafarians, handmade from pipe cleaners and pom poms.

Pastafarian beliefs extend into lighthearted religious ceremony. Pastafarians celebrate every Friday as a holy day. Prayers are concluded with a final declaration of affirmation, "R'amen" (or "rAmen"); the term is a parodic portmanteau of the terms "Amen" and "Ramen", referring to instant noodles and to the "noodly appendages" of their deity. The celebration of "Pastover" requires consuming large amounts of pasta, and during "Ramendan", only Ramen noodles are consumed; International Talk Like a Pirate Day is observed as a holiday.

Around the time of Christmas, Hanukkah, and Kwanzaa, Pastafarians celebrate a vaguely defined holiday named "Holiday". Holiday does not take place on "a specific date so much as it is the Holiday season, itself". According to Henderson, as Pastafarians "reject dogma and formalism", there are no specific requirements for Holiday. Pastafarians celebrate Holiday in any manner they please. Pastafarians interpret the increasing usage of "Happy Holidays", rather than more traditional greetings (such as "Merry Christmas"), as support for Pastafarianism. In December 2005, George W. Bush's White House Christmas greeting cards wished people a happy "holiday season", leading Henderson to write the President a note of thanks, including a "fish" emblem depicting the Flying Spaghetti Monster for his limousine or plane. Henderson also thanked Walmart for its use of the phrase.

Books

The Gospel of the Flying Spaghetti Monster

image of the cover of a book, red cover, gold lettering, title, small crudely drawn logo, author. No picture.

In December 2005 Bobby Henderson received a reported US$80,000 advance from Villard to write The Gospel of the Flying Spaghetti Monster. Henderson said he planned to use proceeds from the book to build a pirate ship, with which he would spread the Pastafarian religion. The book was released on March 28, 2006, and elaborates on Pastafarian beliefs established in the open letter. Henderson employs satire to present perceived flaws with evolutionary biology and discusses history and lifestyle from a Pastafarian perspective. The gospel urges readers to try Pastafarianism for thirty days, saying, "If you don't like us, your old religion will most likely take you back". Henderson states on his website that more than 100,000 copies of the book have been sold.

Scientific American described the gospel as "an elaborate spoof on Intelligent Design" and "very funny". In 2006, it was nominated for the Quill Award in Humor, but was not selected as the winner. Wayne Allen Brenner of The Austin Chronicle characterized the book as "a necessary bit of comic relief in the overly serious battle between science and superstition". Simon Singh of The Daily Telegraph wrote that the gospel "might be slightly repetitive...but overall it is a brilliant, provocative, witty and important gem of a book".

Casey Luskin of the Discovery Institute, which advocates intelligent design, labeled the gospel "a mockery of the Christian New Testament".

The Loose Canon

In September 2005, before Henderson had received an advance to write The Gospel of the Flying Spaghetti Monster, a Pastafarian member of the Venganza forums known as Solipsy announced the beginning of a project to collect texts from fellow Pastafarians and compile them into The Loose Canon, the Holy Book of the Church of the Flying Spaghetti Monster, essentially analogous to the Bible. The book was completed in 2010 and was made available for download.

Some excerpts from The Loose Canon include:

I am the Flying Spaghetti Monster. Thou shalt have no other monsters before Me (Afterwards is OK; just use protection). The only Monster who deserves capitalization is Me! Other monsters are false monsters, undeserving of capitalization.

— Suggestions 1:1

"Since you have done a half-ass job, you will receive half an ass!" The Great Pirate Solomon grabbed his ceremonial scimitar and struck his remaining donkey, cleaving it in two.

— Slackers 1:51–52

Influence

various people standing around a small Flying Spaghetti Monster Parade float.
Flying Spaghetti Monster contingent preparing for the 2009 Summer Solstice Parade and Pageant in Fremont, Seattle, Washington

As a cultural phenomenon

A bottle of Flying Spaghetti Monster red wine.

The Church of the Flying Spaghetti Monster now consists of thousands of followers, primarily concentrated on college campuses in North America and Europe. According to the Associated Press, Henderson's website has become "a kind of cyber-watercooler for opponents of intelligent design". On it, visitors track meetings of pirate-clad Pastafarians, sell trinkets and bumper stickers, and sample photographs that show "visions" of the Flying Spaghetti Monster.

In August 2005, the Swedish concept designer Niklas Jansson created an adaptation of Michelangelo's The Creation of Adam, superimposing the Flying Spaghetti Monster over God. This became and remains the Flying Spaghetti Monster's de facto brand image. The Hunger Artists Theatre Company produced a comedy called The Flying Spaghetti Monster Holiday Pageant in December 2006, detailing the history of Pastafarianism. The production has spawned a sequel called Flying Spaghetti Monster Holy Mug of Grog, performed in December 2008. This communal activity attracted the attention of three University of Florida religious scholars, who assembled a panel at the 2007 American Academy of Religion meeting to discuss the Flying Spaghetti Monster.

small handmade knit Flying Spaghetti Monster sitting on a table with people dressed as pirates in background.
Handmade knitted and felted Flying Spaghetti Monster

In November 2007, four talks about the Flying Spaghetti Monster were delivered at the American Academy of Religion's annual meeting in San Diego. The talks, with titles such as Holy Pasta and Authentic Sauce: The Flying Spaghetti Monster's Messy Implications for Theorizing Religion, examined the elements necessary for a group to constitute a religion. Speakers inquired whether "an anti-religion like Flying Spaghetti Monsterism [is] actually a religion". The talks were based on the paper, Evolutionary Controversy and a Side of Pasta: The Flying Spaghetti Monster and the Subversive Function of Religious Parody, published in the GOLEM Journal of Religion and Monsters. The panel garnered an audience of one hundred of the more than 9,000 conference attendees, and conference organizers received critical e-mails from Christians offended by it.

Since October 2008, the local chapter of the Church of the Flying Spaghetti Monster has sponsored an annual convention called Skepticon on the campus of Missouri State University. Atheists and skeptics give speeches on various topics, and a debate with Christian experts is held. Organizers tout the event as the "largest gathering of atheists in the Midwest".

The Moldovan-born poet, fiction writer, and culturologist Igor Ursenco entitled his 2012 poetry book The Flying Spaghetti Monster (thriller poems).

On the nonprofit microfinancing site Kiva, the Flying Spaghetti Monster group is in an ongoing competition to top all other "religious congregations" in the number of loans issued via their team. The group's motto is "Thou shalt share, that none may seek without funding", an allusion to the Loose Canon, which states "Thou shalt share, that none may seek without finding." As of October 2018, the group reported having funded US$4,002,350 in loans.

Bathyphysa conifera, a siphonophore, has been called "Flying Spaghetti Monster" in reference to the FSM.

The 2020 documentary called I, Pastafari details the Church of the Flying Spaghetti Monster and its fight for legal recognition.

In September 2019, the Pastafarian pastor Barrett Fletcher offered an opening prayer on behalf of the Church of the Flying Spaghetti Monster to open a Kenai Peninsula Borough Assembly government meeting in Alaska.

Use in religious disputes

Owing to its popularity and media exposure, the Flying Spaghetti Monster is often used as a modern version of Russell's teapot. Proponents argue that, since the existence of the invisible and undetectable Flying Spaghetti Monster—similar to other proposed supernatural beings—cannot be falsified, it demonstrates that the burden of proof rests on those who affirm the existence of such beings. Richard Dawkins explains, "The onus is on somebody who says, I want to believe in God, Flying Spaghetti Monster, fairies, or whatever it is. It is not up to us to disprove it." Furthermore, according to Lance Gharavi, an editor of The Journal of Religion and Theater, the Flying Spaghetti Monster is "ultimately...an argument about the arbitrariness of holding any one view of creation", since any one view is equally as plausible as the Flying Spaghetti Monster. A similar argument was discussed in the books The God Delusion and The Atheist Delusion.

In December 2007 the Church of the Flying Spaghetti Monster was credited with spearheading successful efforts in Polk County, Florida, to dissuade the Polk County School Board from adopting new science standards on evolution. The issue was raised after five of the seven board members declared a personal belief in intelligent design. Opponents describing themselves as Pastafarians e-mailed members of the Polk County School Board demanding equal instruction time for the Flying Spaghetti Monster. Board member Margaret Lofton, who supported intelligent design, dismissed the e-mail as ridiculous and insulting, stating, "they've made us the laughing stock of the world". Lofton later stated that she had no interest in engaging with the Pastafarians or anyone else seeking to discredit intelligent design. As the controversy developed, scientists expressed opposition to intelligent design. In response to hopes for a new "applied science" campus at the University of South Florida in Lakeland, university vice president Marshall Goodman expressed surprise, stating, "[intelligent design is] not science. You can't even call it pseudo-science." While unhappy with the outcome, Lofton chose not to resign over the issue. She and the other board members expressed a desire to return to the day-to-day work of running the school district.

Legal status

National branches of the Church of the Flying Spaghetti Monster have been striving in many countries to have Pastafarianism become an officially (legally) recognized religion, with varying degrees of success. In New Zealand, Pastafarian representatives have been authorized as marriage celebrants, as the movement satisfies criteria laid down for organisations that primarily promote religious, philosophical, or humanitarian convictions.

A federal court in the US state of Nebraska ruled that Pastafarianism is a satirical parody religion, rather than an actual religion, and that as a result, Pastafarians are not entitled to religious accommodation under the Religious Land Use and Institutionalized Persons Act:

"This is not a question of theology", the ruling reads in part. "The FSM Gospel is plainly a work of satire, meant to entertain while making a pointed political statement. To read it as religious doctrine would be little different from grounding a 'religious exercise' on any other work of fiction."

Pastafarians have used their claimed faith as a test case to argue for freedom of religion, and to oppose government discrimination against people who do not follow a recognized religion.

Marriage

The Church of the Flying Spaghetti Monster operates an ordination mill on its website, which issues credentials to would-be officiants in jurisdictions where credentials are needed to officiate weddings. Pastafarians say that separation of church and state precludes the government from arbitrarily labelling one denomination religiously valid but another an ordination mill. In November 2014, Rodney Michael Rogers and Minneapolis-based Atheists for Human Rights sued Washington County, Minnesota under the Fourteenth Amendment equal protection clause and the First Amendment free speech clause, with their attorney claiming discrimination against atheists: "When the statute clearly permits recognition of a marriage celebrant whose religious credentials consist of nothing more than a $20 'ordination' obtained from the Church of the Flying Spaghetti Monster... the requirement is absolutely meaningless in terms of ensuring the qualifications of a marriage celebrant." A few days prior to a hearing on the matter, Washington County changed its policy to allow Rogers to officiate weddings. This action was taken in an effort to deny the court jurisdiction over the underlying claim. On May 13, 2015, the federal court held that the issue had become moot and dismissed the case. The first legally recognized Pastafarian wedding occurred in New Zealand on April 16, 2016.

Free speech

In March 2007, Bryan Killian, a high school student in Buncombe County, North Carolina, was suspended for wearing "pirate regalia" which he said was part of his Pastafarian faith. Killian protested the suspension, saying it violated his First Amendment rights to religious freedom and freedom of expression. "If this is what I believe in, no matter how stupid it might sound, I should be able to express myself however I want to", he said.

A man dressed in pirate regalia standing next to a person costumed as the Flying Spaghetti Monster.
Two Pastafarians dressed as the Flying Spaghetti Monster and a pirate respectively

In March 2008, Pastafarians in Crossville, Tennessee, were permitted to place a Flying Spaghetti Monster statue in a free speech zone on the courthouse lawn, and proceeded to do so. The display gained national interest on blogs and online news sites and was even covered by Rolling Stone magazine. As a result of the controversy over the statue, it was later removed from the premises along with all the other long-term statues. In December 2011, Pastafarianism was one of multiple denominations given equal access to placing holiday displays on the Loudoun County courthouse lawn in Leesburg, Virginia.

In 2012, Tracy McPherson of the Pennsylvanian Pastafarians petitioned the Chester County, Pennsylvania Commissioners to allow representation of the FSM at the county courthouse, equally with a Jewish menorah and a Christian nativity scene. One commissioner stated that either all religions should be allowed or no religion should be represented, but without support from the other commissioners the motion was rejected. Another commissioner stated that this petition garnered more attention than any he had seen before.

On September 21, 2012, Pastafarian Giorgos Loizos was arrested in Greece on charges of malicious blasphemy and offense of religion for creating a satirical Facebook page called "Elder Pastitsios", based on a well-known deceased Greek Orthodox monk, Elder Paisios, in which the monk's name and face were replaced with pastitsio, a local pasta and béchamel sauce dish. The case, which started as a Facebook flame, reached the Greek Parliament and created a strong political reaction to the arrest.

In August 2013, Christian Orthodox religious activists from an unregistered group known as "God's Will" attacked a peaceful rally that Russian Pastafarians had organized. Activists as well as police knocked some rally participants to the ground. Police arrested and charged eight of the Pastafarians with attempting to hold an unsanctioned rally. One of the Pastafarians later complained that they were arrested "just for walking".

In February 2014, union officials at London South Bank University forbade an atheist group to display posters of the Flying Spaghetti Monster at a student orientation conference and later banned the group from the conference, leading to complaints about interference with free speech. The Students' Union subsequently apologized.

In November 2014, the Church of the FSM obtained city signage in Templin, Germany, announcing the time of Friday's weekly Nudelmesse ("pasta mass"), alongside signage for various Catholic and Protestant Sunday services.

Headgear in identity photos

smiling woman wearing a colander on her head being "blessed" by a brass Flying Spaghetti Monster in the style of a Roman Catholic scepter.
A woman wearing a colander as Pastafarian headgear
 
Pastafarian protester wears a colander while showing an icon of the Flying Spaghetti Monster

Origins and overview

In July 2011, Austrian Pastafarian Niko Alm won the legal right to be shown in his driving license photo wearing a pasta strainer on his head, after three years spent pursuing permission and obtaining an examination certifying that he was psychologically fit to drive. He got the idea after reading that Austrian regulations allow headgear in official photos only when it is worn for religious reasons. Some sources report that in 2011 Austrian authorities recognised the colander, in the form of a pasta strainer, as religious headgear of the parody religion Pastafarianism. Austrian authorities denied this, saying that religious motives were not the reason for granting permission to wear the headgear in a passport.

Alm's initiative has since been replicated in several (mostly Western) countries around the world, with mixed success. Many national or subnational authorities (such as U.S. states) granted driver's licences, identity cards or passports featuring photos of citizens wearing a colander. Other authorities rejected applications on the grounds either that Pastafarianism was 'not a (real) religion' and reflected satire rather than sincerity or seriousness, or that wearing a colander could not be demonstrated to be a religious obligation, as other head-covering items were claimed to be in other religions, such as the hijab in Islam and the kippah/yarmulke in Judaism. Applicants and their attorneys retorted, also with mixed success, that Pastafarianism did constitute a real religion, or that it was not up to the government to decide what qualifies as a religion, whether certain religious beliefs are valid or invalid, or whether certain practices within religions had the status of obligation, established doctrine, recommendation, or personal choice. Moreover, some Pastafarians argued, satire and parody themselves are or could be a religious practice or an integral part of a religion such as Pastafarianism; the government has no right to decide which beliefs should be taken seriously and which should not, and it is only up to the individual believers themselves to decide which elements of their religion to take seriously, and to what degree.

Europe

On August 9, 2011, the chairman of the Church of the Flying Spaghetti Monster Germany, Rüdiger Weida, obtained a driver's license with a picture of him wearing a pirate bandana. In contrast to the Austrian officials in the case of Niko Alm, the German officials allowed the headgear as a religious exception.

Some anti-clerical protesters wore colanders to Piazza XXIV Maggio in Milan, Italy, on June 2, 2012, in mock obedience to the Flying Spaghetti Monster.

In March 2013, a Belgian's identity photos were refused by the local and national administrations because he wore a pasta strainer on his head.

The Czech Republic recognised the pasta strainer as religious headgear in 2013. In July that year, Lukáš Nový, a member of the Czech Pirate Party from Brno, was given permission to wear a pasta strainer on his head for the photograph on his official Czech Republic ID card.

A man's Irish driving licence photograph including a colander was rejected by the Road Safety Authority (RSA) in December 2013. In March 2016, an Equality Officer of the Workplace Relations Commission reviewed the RSA's decision under the Equal Status Acts and upheld it, on the basis that the complaint did "not come within the definition of religion and/or religious belief".

In January 2016, Russian Pastafarian Andrei Filin obtained a driver's license with a photo of him wearing a colander.

In the Netherlands, Dirk Jan Dijkstra applied for a Dutch passport around 2015 using a colander on his identity photo, which was rejected by the municipality of Emmen, after which Dijkstra successfully registered the Church of the Flying Spaghetti Monster as a church association (kerkgenootschap) at the – initially hesitant – Dutch Chamber of Commerce in January 2016. However, the municipality continued rejecting his application, arguing that registering as a church association did not mean that Pastafarianism was now a (recognised) religion, leading Dijkstra to sue the municipality for discrimination, and gathering dozens of colander-wearing FSM Church members and sympathisers at the trial in Groningen on 7 July 2016. Meanwhile, other Pastafarians succeeded in obtaining colander-featuring passports and driver's licences from the municipalities of Leiden and The Hague. On 1 August 2016, the Groningen court ruled that, although Pastafarianism is a life stance, it is not a religion, nor is there a duty in Pastafarianism to wear the colander, and therefore the religious exemption to the prohibition on wearing headgear in identity photos did not apply to Pastafarians. In January 2017, Nijmegen Pastafarian and law student Mienke de Wilde petitioned the Arnhem court to be allowed to wear a colander in her driver's licence photo. She lost the petition, both at first instance in February 2017 and on appeal at the Council of State in August 2018.

United States

In February 2013, a Pastafarian was denied the right to wear a spaghetti strainer on his head for his driver's license photo by the New Jersey Motor Vehicle Commission, which stated that a pasta strainer was not on a list of approved religious headwear.

In August 2013 Eddie Castillo, a student at Texas Tech University, got approval to wear a pasta strainer on his head in his driver's license photo. He said, "You might think this is some sort of a gag or prank by a college student, but thousands, including myself, see it as a political and religious milestone for all atheists everywhere."

In January 2014 a member of the Pomfret, New York Town Council wore a colander while taking the oath of office.

In November 2014 former porn star Asia Carrera obtained an identity photo with the traditional Pastafarian headgear from a Department of Motor Vehicles office in Hurricane, Utah. The director of Utah's Driver License Division says that about a dozen Pastafarians have had their state driver's license photos taken with a similar pasta strainer over the years.

In November 2015, Massachusetts resident Lindsay Miller was allowed to wear a colander on her head in her driver's license photo after she cited her religious beliefs. Miller (who resides in Lowell) said on Friday, November 13 that she "absolutely loves the history and the story" of Pastafarianism, which, according to its website, has existed in secrecy for hundreds of years and entered the mainstream in 2005. Ms. Miller was represented in her quest by the American Humanist Association's Appignani Humanist Legal Center.

In February 2016, a man from Madison, Wisconsin won a legal struggle against the state, which, reasoning that Pastafarianism was not a religion, had initially refused him a colander photo on his driver's licence. The man's attorney successfully defended his request on the basis of the First Amendment to the United States Constitution, arguing that it was 'not up to the government to decide what qualifies as a religion'.

After the Drivers Services of Schaumburg, Illinois initially granted Rachel Hoover, a student at Northern Illinois University, a colander-featuring photo in her driver's licence in June 2016, the Illinois Secretary of State's office overturned the decision in July 2016, stating that such a photo was 'incorrect' and a new one had to be taken before her old licence expired on 29 July. The office did not recognise Pastafarianism as a religion, with a spokesperson saying 'If you look into their history, it’s more of a mockery of religion than a practice itself'. Hoover lodged a religious discrimination complaint with the American Civil Liberties Union, but was unsure whether to pursue further legal action since it didn't fit into her college budget. Previously, Pastafarian David Hoover from Pekin, Illinois had his request for a driving licence featuring a colander picture rejected in May 2013.

In June 2017, Sean Corbett from Chandler, Arizona succeeded in obtaining a driver's licence with a colander picture after trying several Arizona motor vehicle locations for two years.

In October 2019, the Ohio Bureau of Motor Vehicles rejected a Cincinnati man's driver's licence colander photo, saying its policy allows people to wear religious head coverings in driver's licence photos only if they wear them in public in daily life.

Commonwealth of Nations

In June 2014 a New Zealand man called Russell obtained a driver's license with a photograph of himself wearing a blue spaghetti strainer on his head. This was granted under a law allowing the wearing of religious headgear in official photos.

In October 2014, Obi Canuel, an ordained minister in the Church of the Flying Spaghetti Monster residing in Surrey, British Columbia, Canada, effectively lost his right to drive. After initially refusing Canuel's request for a licence renewal in autumn 2013 because he insisted on wearing a colander in the photo, the Insurance Corporation of British Columbia (ICBC) granted him temporary driving permits while it considered whether to reject or grant his request. ICBC claimed its definitive refusal in October 2014 was based on the fact that it would only 'accommodate customers with head coverings where their faith prohibits them from removing it', and that 'Mr. Canuel was not able to provide us with any evidence that he cannot remove his head covering for his photo'.

The states of Australia have differed in dealing with applications for official documents featuring colander photos. Sydney science student Preshalin Moodley got a provisional driver's licence from New South Wales in September 2014, while Brisbane tradesman Simon Leadbetter was denied a licence renewal by Queensland's Department of Transport and Main Roads the same month. Earlier in 2014, South Australia refused Adelaide resident Guy Ablon a gun licence with a photo of him wearing a colander; the authorities even seized his legally obtained guns, questioned his religion and forced him to undergo a psychiatric evaluation before his weapons were returned. The state of Victoria issued its first strainer-featuring driver's licence in November 2016.

List of identity photo applications with headgear

Jurisdiction | Document | Status | Date | Person(s) involved
Austria | Driver's licence | granted | July 2011 | Niko Alm
Germany | Driver's licence | granted | August 2011 | Rüdiger Weida
New Jersey | Driver's licence | refused | February 2013 | Aaron Williams
Belgium | Identity card | refused | March 2013 | Alain Graulus
Czech Republic | Identity card | granted | July 2013 | Lukáš Nový
Texas | Driver's licence | granted | August 2013 | Eddie Castillo
South Australia | Gun licence | refused | 2014 | Guy Ablon
New Zealand | Driver's licence | granted | June 2014 | Russell
California | Driver's licence | granted | August 2014 | Beth
Oklahoma | Driver's licence | granted | September 2014 | Shawna Hammond
New South Wales | Driver's licence | granted | September 2014 | Preshalin Moodley
Queensland | Driver's licence | refused | September 2014 | Simon Leadbetter
British Columbia | Driver's licence | refused | October 2014 | Obi Canuel
Utah | Driver's licence | granted | November 2014 | Asia Carrera
Tennessee | Driver's licence | granted | December 2014 | Joy Camacho
Massachusetts | Driver's licence | granted | November 2015 | Lindsay Miller
Georgia (U.S. state) | Driver's licence | refused | December 2015 | Chris Avino
Israel | Passport | granted | 2016 | Michael Afanasyev
Nevada | Driver's licence | granted | January 2016 | Chris Avino
Russia | Driver's licence | granted | January 2016 | Andrei Filin
Wisconsin | Driver's licence | granted | February 2016 | Michael Schumacher
Ireland | Driver's licence | refused | 2013, 2016 | Noel Mulryan
Illinois | Driver's licence | granted, then refused | July 2016 | Rachel Hoover
Victoria | Driver's licence | granted | November 2016 | Marcus Bowring
Arizona | Driver's licence | granted | June 2017 | Sean Corbett
Netherlands | Passport and driver's licence | refused | 2015 – 2018 | Mienke de Wilde and others
Ohio | Driver's licence | refused | October 2019 | Richard Moser

Critical reception

With regard to Henderson's 2005 open letter, according to Justin Pope of the Associated Press:

Between the lines, the point of the letter was this: there's no more scientific basis for intelligent design than there is for the idea an omniscient creature made of pasta created the universe. If intelligent design supporters could demand equal time in a science class, why not anyone else? The only reasonable solution is to put nothing into sciences classes but the best available science.

— Justin Pope
U.S. Army ID tag (dog tag) listing "Atheist/FSM" as the religious/belief system preference

Pope praised the Flying Spaghetti Monster as "a clever and effective argument". Simon Singh of the Daily Telegraph described the Flying Spaghetti Monster as "a masterstroke, which underlined the absurdity of Intelligent Design", and applauded Henderson for "galvanis[ing] a defence of science and rationality". Sarah Boxer of the New York Times said that Henderson "has wit on his side". In addition, the Flying Spaghetti Monster was mentioned in an article footnote of the Harvard Civil Rights-Civil Liberties Law Review as an example of evolution "enter[ing] the fray in popular culture", which the author deemed necessary for evolution to prevail over intelligent design. The abstract of the paper, Evolutionary Controversy and a Side of Pasta: The Flying Spaghetti Monster and the Subversive Function of Religious Parody, describes the Flying Spaghetti Monster as "a potent example of how monstrous humor can be used as a popular tool of carnivalesque subversion". Its author praised Pastafarianism for its "epistemological humility". Moreover, Henderson's website contains numerous endorsements from the scientific community. As Jack Schofield of The Guardian noted, "The joke, of course, is that it's arguably more rational than Intelligent Design."

Conservative columnist Jeff Jacoby wrote in The Boston Globe that intelligent design "isn't primitivism or Bible-thumping or flying spaghetti. It's science." This view of science, however, was rejected by the United States National Academy of Sciences. Peter Gallings of Answers in Genesis, a Young Earth Creationist ministry said "Ironically enough, Pastafarians, in addition to mocking God himself, are lampooning the Intelligent Design Movement for not identifying a specific deity—that is, leaving open the possibility that a spaghetti monster could be the intelligent designer... Thus, the satire is possible because the Intelligent Design Movement hasn't affiliated with a particular religion, exactly the opposite of what its other critics claim!"

Friday, March 5, 2021

Statistical hypothesis testing

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Statistical_hypothesis_testing

A statistical hypothesis is a hypothesis that is testable on the basis of observed data modelled as the realised values taken by a collection of random variables. A set of data is modelled as being realised values of a collection of random variables having a joint probability distribution in some set of possible joint distributions. The hypothesis being tested is exactly that set of possible probability distributions. A statistical hypothesis test is a method of statistical inference. An alternative hypothesis is proposed for the probability distribution of the data, either explicitly or only informally. The comparison of the two models is deemed statistically significant if, according to a threshold probability—the significance level—the data would be unlikely to occur if the null hypothesis were true. A hypothesis test specifies which outcomes of a study may lead to a rejection of the null hypothesis at a pre-specified level of significance, while using a pre-chosen measure of deviation from that hypothesis (the test statistic, or goodness-of-fit measure). The pre-chosen level of significance is the maximal allowed "false positive rate". One wants to control the risk of incorrectly rejecting a true null hypothesis.

The process of distinguishing between the null hypothesis and the alternative hypothesis is aided by considering two conceptual types of errors. The first type of error occurs when the null hypothesis is wrongly rejected. The second type of error occurs when the null hypothesis is wrongly not rejected. (The two types are known as type 1 and type 2 errors.)

Hypothesis tests based on statistical significance are another way of expressing confidence intervals (more precisely, confidence sets). In other words, every hypothesis test based on significance can be obtained via a confidence interval, and every confidence interval can be obtained via a hypothesis test based on significance.

Significance-based hypothesis testing is the most common framework for statistical hypothesis testing. An alternative framework for statistical hypothesis testing is to specify a set of statistical models, one for each candidate hypothesis, and then use model selection techniques to choose the most appropriate model. The most common selection techniques are based on either the Akaike information criterion or the Bayes factor. However, this is not really an "alternative framework", though one can call it a more complex framework. It is a situation in which one wants to distinguish between many possible hypotheses, not just two. Alternatively, one can see it as a hybrid between testing and estimation, where one of the parameters is discrete, and specifies which of a hierarchy of more and more complex models is correct.

Null hypothesis significance testing is the name for a version of hypothesis testing with no explicit mention of possible alternatives, and not much consideration of error rates. It was championed by Ronald Fisher in a context in which he downplayed any explicit choice of alternative hypothesis and consequently paid no attention to the power of a test. One simply set up a null hypothesis as a kind of straw man, or more kindly, as a formalisation of a standard, establishment, default idea of how things were. One tried to overthrow this conventional view by showing that it led to the conclusion that something extremely unlikely had happened, thereby discrediting the theory.

The testing process

In the statistics literature, statistical hypothesis testing plays a fundamental role. There are two mathematically equivalent processes that can be used.

The usual line of reasoning is as follows:

  1. There is an initial research hypothesis of which the truth is unknown.
  2. The first step is to state the relevant null and alternative hypotheses. This is important, as mis-stating the hypotheses will muddy the rest of the process.
  3. The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important as invalid assumptions will mean that the results of the test are invalid.
  4. Decide which test is appropriate, and state the relevant test statistic T.
  5. Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student's t distribution with known degrees of freedom, or a normal distribution with known mean and variance. If the distribution of the test statistic is completely fixed by the null hypothesis we call the hypothesis simple, otherwise it is called composite.
  6. Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.
  7. The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected—the so-called critical region—and those for which it is not. The probability of the critical region is α. In the case of a composite null hypothesis, the maximal probability of the critical region is α.
  8. Compute from the observations the observed value t_obs of the test statistic T.
  9. Decide to either reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis H0 if the observed value t_obs is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.

A common alternative formulation of this process goes as follows:

  1. Compute from the observations the observed value t_obs of the test statistic T.
  2. Calculate the p-value. This is the probability, under the null hypothesis, of sampling a test statistic at least as extreme as that which was observed (the maximal probability of that event, if the hypothesis is composite).
  3. Reject the null hypothesis, in favor of the alternative hypothesis, if and only if the p-value is less than (or equal to) the significance level threshold (α).
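
The two formulations can be compared side by side in a short sketch. It assumes a setting that is not in the text above: a one-sided (upper-tail) z-test of a mean with known standard deviation, so that the test statistic is standard normal under the null hypothesis. The function name and the numbers in the usage lines are hypothetical.

```python
from statistics import NormalDist


def z_test_two_ways(sample_mean, mu0, sigma, n, alpha=0.05):
    """Upper-tail z-test decided by critical region and by p-value.

    Assumes (for illustration only) a known standard deviation `sigma`,
    so the test statistic T is standard normal under H0: mu = mu0.
    """
    std_normal = NormalDist()  # distribution of T under H0

    # Observed value t_obs of the test statistic T.
    t_obs = (sample_mean - mu0) / (sigma / n ** 0.5)

    # First process: critical region {T > c} with P(T > c | H0) = alpha.
    critical_value = std_normal.inv_cdf(1 - alpha)
    reject_by_region = t_obs > critical_value

    # Second process: p-value = P(T >= t_obs | H0), compared with alpha.
    p_value = 1 - std_normal.cdf(t_obs)
    reject_by_p_value = p_value < alpha

    return t_obs, critical_value, p_value, reject_by_region, reject_by_p_value


# Hypothetical data: 64 observations with mean 10.5, tested against mu0 = 10.
t_obs, c, p, by_region, by_p = z_test_two_ways(10.5, 10.0, 2.0, 64)
print(f"t_obs={t_obs:.3f}, critical value={c:.3f}, p-value={p:.4f}")
print("reject (critical region):", by_region, "| reject (p-value):", by_p)
```

Both decisions agree by construction, since the observed statistic exceeds the critical value exactly when its upper-tail probability falls below α.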

The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results. The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a probability is useful for reporting. The calculations are now trivially performed with appropriate software.

The difference in the two processes applied to the Radioactive suitcase example (below):

  • "The Geiger-counter reading is 10. The limit is 9. Check the suitcase."
  • "The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase."

The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked.

The difference between accepting the null hypothesis and simply failing to reject it is important. The "fail to reject" terminology highlights the fact that a non-significant result provides no way to determine which of the two hypotheses is true, so all that can be concluded is that the null hypothesis has not been rejected. The phrase "accept the null hypothesis" may suggest it has been proved simply because it has not been disproved, a logical fallacy known as the argument from ignorance. Unless a test with particularly high power is used, the idea of "accepting" the null hypothesis is likely to be incorrect. Nonetheless the terminology is prevalent throughout statistics, where the meaning actually intended is well understood.

The processes described here are perfectly adequate for computation, but they seriously neglect design-of-experiments considerations.

It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.
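
As a rough illustration of such a calculation, the sketch below uses the common normal-approximation formula n ≈ ((z_(1−α/2) + z_power) · σ / δ)² for a two-sided z-test of a mean. The formula choice, the function name, and the planning values are assumptions made for this example, not part of the text above; real studies often need a calculation tailored to the actual test.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_z_test(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n for a two-sided z-test to detect a mean shift `delta`.

    Uses the usual normal approximation, which ignores the negligible
    chance of rejecting in the tail opposite to the true shift.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical planning values: detect a shift of 0.5 when sigma = 1.
n_required = sample_size_for_z_test(delta=0.5, sigma=1.0)
print("observations needed:", n_required)
```

Halving the shift to be detected roughly quadruples the required sample size, which is why the estimate is best made before any data are collected.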

The phrase "test of significance" was coined by statistician Ronald Fisher.

Interpretation

The p-value is the probability that a given result (or a more significant result) would occur under the null hypothesis (or, in the case of a composite null, the largest such probability; see Chapter 10 of Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference, Springer, 2004). For example, say that a fair coin is tested for fairness (the null hypothesis). At a significance level of 0.05, a test of the fair coin would be expected to (incorrectly) reject the null hypothesis in about 1 out of every 20 tests. The p-value does not provide the probability that either hypothesis is correct (a common source of confusion). If the p-value is less than the chosen significance threshold (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the chosen level of significance. Rejection of the null hypothesis is a conclusion. This is like a "guilty" verdict in a criminal trial: the evidence is sufficient to reject innocence, thus proving guilt. We might accept the alternative hypothesis (and the research hypothesis).

If the p-value is not less than the chosen significance threshold (equivalently, if the observed test statistic is outside the critical region), then the evidence is insufficient to support a conclusion. (This is similar to a "not guilty" verdict.) The researcher typically gives extra consideration to those cases where the p-value is close to the significance level.
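
The coin example can be made concrete with a small exact calculation. The sketch below assumes a two-sided test that rejects fairness when the number of heads in n flips is too far from n/2; the function names and the choice of n = 100 are hypothetical. Because the number of heads is discrete, the actual rate of false rejections is at most 0.05 rather than exactly 0.05.

```python
from math import comb


def two_sided_tail(n, k):
    """P(|X - n/2| >= k) for X ~ Binomial(n, 1/2), i.e. under a fair coin."""
    favourable = sum(comb(n, x) for x in range(n + 1) if abs(x - n / 2) >= k)
    return favourable / 2 ** n


def smallest_cutoff(n, alpha=0.05):
    """Smallest k such that rejecting when |X - n/2| >= k has size <= alpha."""
    k = 0
    while two_sided_tail(n, k) > alpha:
        k += 1
    return k


n = 100                      # hypothetical number of flips
k = smallest_cutoff(n)       # reject fairness if heads differ from 50 by >= k
size = two_sided_tail(n, k)  # chance of wrongly rejecting a fair coin
print(f"reject when |heads - {n // 2}| >= {k}; type I error rate = {size:.4f}")
```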

Some people find it helpful to think of the hypothesis testing framework as analogous to a mathematical proof by contradiction.

In the Lady tasting tea example (below), Fisher required the Lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. His test revealed that if the lady was effectively guessing at random (the null hypothesis), there was a 1.4% chance that the observed results (perfectly ordered tea) would occur.

Whether rejection of the null hypothesis truly justifies acceptance of the research hypothesis depends on the structure of the hypotheses. Rejecting the hypothesis that a large paw print originated from a bear does not immediately prove the existence of Bigfoot. Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance, which requires extra steps of logic.

"The probability of rejecting the null hypothesis is a function of five factors: whether the test is one- or two-tailed, the level of significance, the standard deviation, the amount of deviation from the null hypothesis, and the number of observations." These factors are a source of criticism; factors under the control of the experimenter/analyst give the results an appearance of subjectivity.

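How the five factors interact can be sketched for one simple case. The example assumes a z-test of a mean with known standard deviation, which is not specified in the quotation; the function name and the parameter values in the usage lines are hypothetical.

```python
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05, two_tailed=True):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu0 + delta.

    Assumes a z-test with known standard deviation `sigma`; the one-tailed
    version rejects in the upper tail only.
    """
    z = NormalDist()
    shift = delta * n ** 0.5 / sigma  # mean of the z statistic under the alternative
    if two_tailed:
        c = z.inv_cdf(1 - alpha / 2)
        return (1 - z.cdf(c - shift)) + z.cdf(-c - shift)
    c = z.inv_cdf(1 - alpha)
    return 1 - z.cdf(c - shift)


# Vary one factor at a time around a hypothetical baseline.
print("baseline      :", round(z_test_power(0.5, 1.0, 32), 3))
print("one-tailed    :", round(z_test_power(0.5, 1.0, 32, two_tailed=False), 3))
print("alpha = 0.01  :", round(z_test_power(0.5, 1.0, 32, alpha=0.01), 3))
print("sigma doubled :", round(z_test_power(0.5, 2.0, 32), 3))
print("delta halved  :", round(z_test_power(0.25, 1.0, 32), 3))
print("n quadrupled  :", round(z_test_power(0.5, 1.0, 128), 3))
```

With no deviation from the null hypothesis (delta = 0) the function returns α, the type I error rate, which is a quick sanity check on the formula.
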
Use and importance

Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing which can justify conclusions even when no scientific theory exists. In the Lady tasting tea example, it was "obvious" that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the "obvious".

Real world applications of hypothesis testing include:

  • Testing whether more men than women suffer from nightmares
  • Establishing authorship of documents
  • Evaluating the effect of the full moon on behavior
  • Determining the range at which a bat can detect an insect by echo
  • Deciding whether hospital carpeting results in more infections
  • Selecting the best means to stop smoking
  • Checking whether bumper stickers reflect car owner behavior
  • Testing the claims of handwriting analysts

Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992) in a review of the fundamental paper by Neyman and Pearson (1933) says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future".

Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s). Other fields have favored the estimation of parameters (e.g. effect size). Significance testing is used as a substitute for the traditional comparison of predicted value and experimental result at the core of the scientific method. When theory is only capable of predicting the sign of a relationship, a directional (one-sided) hypothesis test can be configured so that only a statistically significant result supports theory. This form of theory appraisal is the most heavily criticized application of hypothesis testing.

Cautions

"If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed." This caution applies to hypothesis tests and alternatives to them.

The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong.

The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed including:

  • The clever Hans effect. A horse appeared to be capable of doing simple arithmetic.
  • The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse.
  • The placebo effect. Pills with no medically active ingredients were remarkably effective.

A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy.

The book How to Lie with Statistics is the most popular book on statistics ever published. It does not much consider hypothesis testing, but its cautions are applicable, including: Many claims are made on the basis of samples too small to convince. If a report does not mention sample size, be doubtful.

Hypothesis testing acts as a filter of statistical conclusions; only those results meeting a probability threshold are publishable. Economics also acts as a publication filter; only those results favorable to the author and funding source may be submitted for publication. The impact of filtering on publication is termed publication bias. A related problem is that of multiple testing (sometimes linked to data mining), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported. These are often dealt with by using multiplicity correction procedures that control the family wise error rate (FWER) or the false discovery rate (FDR).
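
Two widely used multiplicity corrections can be sketched in a few lines: the Bonferroni correction, which controls the FWER, and the Benjamini-Hochberg step-up procedure, which controls the FDR (under independence of the tests). These are chosen here only as familiar examples; the text above does not name specific procedures, and the p-values in the usage lines are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p-value is <= k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Made-up p-values from eight tests on a single data set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("uncorrected :", [p <= 0.05 for p in p_values])
print("Bonferroni  :", bonferroni(p_values))
print("BH (FDR)    :", benjamini_hochberg(p_values))
```

In this made-up example five results pass the uncorrected 5% threshold, while fewer survive either correction.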

Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).

Examples

Human sex ratio

The earliest use of statistical hypothesis testing is generally credited to the question of whether male and female births are equally likely (null hypothesis), which was addressed in the 1700s by John Arbuthnot (1710), and later by Pierre-Simon Laplace (1770s).

Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710, and applied the sign test, a simple non-parametric test. In every year, the number of males born in London exceeded the number of females. Considering more male or more female births as equally likely, the probability of the observed outcome is 0.5^82, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the p-value. Arbuthnot concluded that this is too small to be due to chance and must instead be due to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the p = 1/2^82 significance level.
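
Arbuthnot's arithmetic is easy to reproduce exactly. The sketch below computes a one-sided sign-test p-value as a binomial tail probability with success probability 1/2; the function name is hypothetical, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from math import comb


def sign_test_p_value(successes, n):
    """One-sided sign-test p-value: P(X >= successes), X ~ Binomial(n, 1/2)."""
    tail = sum(comb(n, x) for x in range(successes, n + 1))
    return Fraction(tail, 2 ** n)


# 82 years, every one of them with more male than female births.
p_value = sign_test_p_value(82, 82)
print("p-value = 1 /", p_value.denominator)
print("as a decimal:", float(p_value))
```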

Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.

Lady tasting tea

In a famous example of hypothesis testing, known as the Lady tasting tea, Dr. Muriel Bristol, a female colleague of Fisher, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was of her getting the number she got correct, but just by chance. The null hypothesis was that the Lady had no such ability. The test statistic was a simple count of the number of successes in selecting the 4 cups. The critical region was the single case of 4 successes of 4 possible, based on a conventional probability criterion (< 5%). A pattern of 4 successes corresponds to 1 out of 70 possible combinations (p ≈ 1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup, which would be considered a statistically significant result.
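
The "1 out of 70" figure, and the reason the critical region contains only the perfect score, can be checked with a short counting sketch. It treats the lady's choice of 4 cups out of 8 as a random draw under the null hypothesis (a hypergeometric model); the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_correct(k, cups=8, milk_first=4):
    """P(exactly k of the selected cups are right) under random guessing."""
    ways_total = comb(cups, milk_first)  # 70 ways to pick 4 cups out of 8
    ways_k = comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
    return Fraction(ways_k, ways_total)


p_all_four = prob_correct(4)
p_three_or_more = prob_correct(3) + prob_correct(4)
print("P(4 correct)  =", p_all_four, "≈", round(float(p_all_four), 4))
print("P(>=3 correct) =", p_three_or_more, "≈", round(float(p_three_or_more), 4))
```

Including the case of 3 successes would push the probability of the critical region well above 5%, so only 4 out of 4 meets the conventional criterion.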

Courtroom trial

A statistical test procedure is comparable to a criminal trial; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough evidence for the prosecution is the defendant convicted.

At the start of the procedure, there are two hypotheses, H0: "the defendant is not guilty", and H1: "the defendant is guilty". The first one, H0, is called the null hypothesis, and is for the time being accepted. The second one, H1, is called the alternative hypothesis. It is the alternative hypothesis that one hopes to support.

The hypothesis of innocence is rejected only when an error is very unlikely, because one doesn't want to convict an innocent defendant. Such an error is called error of the first kind (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, an error of the second kind (acquitting a person who committed the crime), is more common.


Decision | H0 is true (truly not guilty) | H1 is true (truly guilty)
Accept null hypothesis (acquittal) | Right decision | Wrong decision (Type II error)
Reject null hypothesis (conviction) | Wrong decision (Type I error) | Right decision

A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.

Philosopher's beans

The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was formalized and popularized.

Few beans of this handful are white.
Most beans in this bag are white.
Therefore: Probably, these beans were taken from another bag.
This is an hypothetical inference.

The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null-hypothesis is the "obvious" difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.

A simple generalization of the example considers a mixed bag of beans and a handful that contains either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged: if the composition of the handful is greatly different from that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or a one-tailed test while the generalization is termed a two-sided or two-tailed test.

The statement also relies on the inference that the sampling was random. If someone had been picking through the bag to find white beans, then it would explain why the handful had so many white beans, and also explain why the number of white beans in the bag was depleted (although the bag is probably intended to be assumed much larger than one's hand).

Clairvoyant card game

A person (the subject) is tested for clairvoyance. They are shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X.

As we try to find evidence of their clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is: the person is (more or less) clairvoyant.

If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly p. The hypotheses, then, are:

  • null hypothesis $H_0: p = \tfrac{1}{4}$ (just guessing)

and

  • alternative hypothesis $H_1: p > \tfrac{1}{4}$ (true clairvoyant).

When the test subject correctly predicts all 25 cards, we will consider them clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider them so. But what about 12 hits, or 17 hits? What is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? With the choice c=25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with c=10. In the first case almost no test subjects will be recognized to be clairvoyant, in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind – a false positive, or Type I error. With c = 25 the probability of such an error is:

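$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X = 25 \mid p = \tfrac{1}{4}) = \left(\tfrac{1}{4}\right)^{25} \approx 10^{-15},$$

and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times.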

Being less critical, with c=10, gives:

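$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \geq 10 \mid p = \tfrac{1}{4}) = \sum_{k=10}^{25} \binom{25}{k} \left(\tfrac{1}{4}\right)^{k} \left(\tfrac{3}{4}\right)^{25-k} \approx 0.0713$$

(where $\binom{25}{k}$ is the binomial coefficient 25 choose k). Thus, c = 10 yields a much greater probability of false positive.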

Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus:

$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \geq c \mid p = \tfrac{1}{4}) \leq 0.01.$$

From all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select c = 13.
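The figures in this example can be checked numerically. The following Python sketch is illustrative only: the function names tail_prob and critical_value are hypothetical, and it assumes that the number of hits X follows a binomial distribution with n = 25 and p = 1/4 under the null hypothesis.

```python
from math import comb

def tail_prob(c, n=25, p=0.25):
    """P(X >= c) for X ~ Binomial(n, p): the Type I error rate
    when the null hypothesis is rejected at c or more hits."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(alpha, n=25, p=0.25):
    """Smallest c such that P(X >= c) <= alpha."""
    for c in range(n + 2):          # c = n + 1 gives an empty sum, i.e. 0
        if tail_prob(c, n, p) <= alpha:
            return c

print(tail_prob(25))         # all 25 cards guessed correctly
print(tail_prob(10))         # less critical choice c = 10
print(critical_value(0.01))  # critical value for a 1% error rate
```

Under these assumptions the search returns 13 for α = 1%, because requiring only 12 hits would leave a false-positive probability slightly above 1%.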

Radioactive suitcase

As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that we would observe 10 counts per minute if the null hypothesis were true. If the null hypothesis predicts (say) on average 9 counts per minute, then according to the Poisson distribution typical for radioactive decay there is about 41% chance of recording 10 or more counts. Thus we can say that the suitcase is compatible with the null hypothesis (this does not guarantee that there is no radioactive material, just that we don't have enough evidence to suggest there is). On the other hand, if the null hypothesis predicts 3 counts per minute (for which the Poisson distribution predicts only 0.1% chance of recording 10 or more counts) then the suitcase is not compatible with the null hypothesis, and other factors are likely responsible for the measurements.
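The two tail probabilities quoted above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the counts per minute follow a Poisson distribution with the stated mean; the helper name poisson_tail is hypothetical.

```python
from math import exp, factorial

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

print(poisson_tail(10, 9))  # background of 9 counts per minute: about 0.41
print(poisson_tail(10, 3))  # background of 3 counts per minute: about 0.001
```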

The test does not directly assert the presence of radioactive material. A successful test asserts that the claim of no radioactive material present is unlikely given the reading (and therefore ...). The double negative (disproving the null hypothesis) of the method is confusing, but using a counter-example to disprove is standard mathematical practice. The attraction of the method is its practicality. We know (from experience) the expected range of counts with only ambient radioactivity present, so we can say that a measurement is unusually large. Statistics just formalizes the intuitive by using numbers instead of adjectives. We probably do not know the characteristics of the radioactive suitcases; we just assume that they produce larger readings.

To slightly formalize intuition: radioactivity is suspected if the Geiger-count with the suitcase is among or exceeds the greatest (5% or 1%) of the Geiger-counts made with ambient radiation alone. This makes no assumptions about the distribution of counts. Many ambient radiation observations are required to obtain good probability estimates for rare events.
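The distribution-free rule just described can be sketched as follows. The ambient readings are invented purely for illustration, and the function name empirical_p_value is hypothetical; the idea is only to compare the suitcase reading with the largest readings seen under ambient radiation alone.

```python
def empirical_p_value(ambient_counts, observed):
    """Fraction of ambient-only readings at least as large as the observed one."""
    exceed = sum(1 for c in ambient_counts if c >= observed)
    return exceed / len(ambient_counts)

# Hypothetical ambient readings in counts per minute (not real data).
ambient = [2, 3, 1, 4, 3, 2, 5, 3, 2, 4, 3, 1, 2, 6, 3, 4, 2, 3, 5, 2]

p = empirical_p_value(ambient, 10)
print(p, "suspect" if p <= 0.05 else "not suspect")
```

With only 20 ambient readings the smallest non-zero value this estimate can take is 0.05, which illustrates why many ambient observations are needed before a 1% threshold becomes meaningful.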

The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.

Definition of terms

The following definitions are mainly based on the exposition in the book by Lehmann and Romano:

Statistical hypothesis
A statement about the parameters describing a population (not a sample).
Statistic
A value calculated from a sample without any unknown parameters, often to summarize the sample for comparison purposes.
Simple hypothesis
Any hypothesis which specifies the population distribution completely.
Composite hypothesis
Any hypothesis which does not specify the population distribution completely.
Null hypothesis (H0)
A hypothesis associated with a contradiction to a theory one would like to prove.
Positive data
Data that enable the investigator to reject a null hypothesis.
Alternative hypothesis (H1)
A hypothesis (often composite) associated with a theory one would like to prove.
Statistical test
A procedure whose inputs are samples and whose result is a hypothesis.
Region of acceptance
The set of values of the test statistic for which we fail to reject the null hypothesis.
Region of rejection / Critical region
The set of values of the test statistic for which the null hypothesis is rejected.
Critical value
The threshold value delimiting the regions of acceptance and rejection for the test statistic.
Power of a test (1 − β)
The test's probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. The complement of the false negative rate, β. Power is termed sensitivity in biostatistics. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
Size
For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis. The false positive rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
Significance level of a test (α)
It is the upper bound imposed on the size of a test. Its value is chosen by the statistician prior to looking at the data or choosing any particular test to be used. It is the maximum exposure to erroneously rejecting H0 that they are ready to accept. Testing H0 at significance level α means testing H0 with a test whose size does not exceed α. In most cases, one uses tests whose size is equal to the significance level.
p-value
The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic. In case of a composite null hypothesis, the worst case probability.
Statistical significance test
A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used for the modern version which is now part of statistical hypothesis testing.
Conservative test
A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Exact test
A test in which the significance level or critical value can be computed exactly, i.e., without any approximation. In some contexts this term is restricted to tests applied to categorical data and to permutation tests, in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.
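Several of these terms can be made concrete with the clairvoyant card game. The sketch below is illustrative: it takes the rejection region X ≥ 13 out of 25 cards and, as an assumption made only for this example, a particular alternative in which the subject guesses correctly with probability 1/2. The name reject_prob is hypothetical.

```python
from math import comb

def reject_prob(c, p, n=25):
    """P(X >= c) for X ~ Binomial(n, p): probability of landing in the critical region."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

alpha = 0.01                    # significance level chosen in advance
size = reject_prob(13, 0.25)    # probability of rejecting H0 when H0 is true
power = reject_prob(13, 0.5)    # probability of rejecting H0 when p = 1/2 (assumed alternative)
beta = 1 - power                # false negative rate at that alternative

print(size <= alpha)            # the size does not exceed the significance level
print(power, beta)
```

Because the size is strictly below the nominal 1% level, this is also an example of a conservative test in the sense defined above.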

A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:

Most powerful test
For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
Uniformly most powerful test (UMP)
A test with the greatest power for all values of the parameter(s) being tested, contained in the alternative hypothesis.

Common test statistics

Variations and sub-classes

Statistical hypothesis testing is a key technique of both frequentist inference and Bayesian inference, although the two types of inference have notable differences. Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect. The procedure is based on how likely it would be for a set of observations to occur if the null hypothesis were true. Note that this probability of making an incorrect decision is not the probability that the null hypothesis is true, nor whether any specific alternative hypothesis is true. This contrasts with other possible techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis.

One naïve Bayesian approach to hypothesis testing is to base decisions on the posterior probability, but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory, attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via decision theory and optimal decisions, some of which have desirable properties. Hypothesis testing, though, is a dominant approach to data analysis in many fields of science. Extensions to the theory of hypothesis testing include the study of the power of tests, i.e. the probability of correctly rejecting the null hypothesis given that it is false. Such considerations can be used for the purpose of sample size determination prior to the collection of data.
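As a rough illustration of sample size determination, the sketch below searches for the smallest number of trials at which a one-sided binomial test reaches a target power. All names are hypothetical, and the null value, alternative value, significance level and target power passed in at the end are assumed values chosen only for the example.

```python
from math import comb

def tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(n, p0, alpha):
    """Smallest c whose Type I error rate P(X >= c | p0) is at most alpha."""
    for c in range(n + 2):
        if tail(c, n, p0) <= alpha:
            return c

def smallest_n(p0, p1, alpha, target_power, n_max=500):
    """First n (with its critical value) whose power against p1 reaches the target."""
    for n in range(1, n_max + 1):
        c = critical_value(n, p0, alpha)
        if tail(c, n, p1) >= target_power:
            return n, c
    return None  # target not reached within n_max trials

# Assumed design values: H0 p = 0.25, alternative p = 0.5, alpha = 5%, power 80%.
print(smallest_n(0.25, 0.5, 0.05, 0.8))
```

Because the binomial distribution is discrete, power does not increase smoothly with n, so the sketch reports only the first n that meets the target.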

History

Early use

While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to John Arbuthnot (1710), followed by Pierre-Simon Laplace (1770s), in analyzing the human sex ratio at birth; see § Human sex ratio.

Modern origins and early controversy

Modern significance testing is largely the product of Karl Pearson (p-value, Pearson's chi-squared test), William Sealy Gosset (Student's t-distribution), and Ronald Fisher ("null hypothesis", analysis of variance, "significance test"), while hypothesis testing was developed by Jerzy Neyman and Egon Pearson (son of Karl). Ronald Fisher began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved (namely use of the principle of indifference when determining prior probabilities), and sought to provide a more "objective" approach to inductive inference.

Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher and Neyman/Pearson formulations, methods and terminology developed in the early 20th century.

Fisher popularized the "significance test". He required a null-hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-hypothesis or not. Significance testing did not utilize an alternative hypothesis so there was no concept of a Type II error.

The p-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's faith in the null hypothesis. Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's p-value, also meant to determine researcher behaviour, but without requiring any inductive inference by the researcher.

Neyman & Pearson considered a different problem (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.

Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing. (The defining paper was abstract. Mathematicians have generalized and refined the theory for decades.) Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions based on models formulated before data is collected was incompatible with this common scenario faced by scientists and that attempts to apply this method to scientific research would lead to mass confusion.

The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference.

Events intervened: Neyman accepted a position in the western hemisphere, breaking his partnership with Pearson and separating disputants (who had occupied the same building) by much of the planetary diameter. World War II provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy. Some of Neyman's later publications reported p-values and significance levels.

The modern version of hypothesis testing is a hybrid of the two approaches that resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s. (But signal detection, for example, still uses the Neyman/Pearson formulation.) Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs. This history explains the inconsistent terminology (example: the null hypothesis is never accepted, but there is a region of acceptance).

Sometime around 1940, in an apparent effort to provide researchers with a "non-controversial" way to have their cake and eat it too, the authors of statistical text books began anonymously combining these two strategies by using the p-value in place of the test statistic (or data) to test against the Neyman–Pearson "significance level". Thus, researchers were encouraged to infer the strength of their data against some null hypothesis using p-values, while also thinking they are retaining the post-data collection objectivity provided by hypothesis testing. It then became customary for the null hypothesis, which was originally some realistic research hypothesis, to be used almost solely as a strawman "nil" hypothesis (one where a treatment has no effect, regardless of the context).

A comparison between the Fisherian and frequentist (Neyman–Pearson) approaches

Step 1
Fisher's null hypothesis testing: Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
Neyman–Pearson decision theory: Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.

Step 2
Fisher's null hypothesis testing: Report the exact level of significance (e.g. p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses. If the result is "not significant", draw no conclusions and make no decisions, but suspend judgement until further data is available.
Neyman–Pearson decision theory: If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Note that accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.

Step 3
Fisher's null hypothesis testing: Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.
Neyman–Pearson decision theory: The usefulness of the procedure is limited among others to situations where you have a disjunction of hypotheses (e.g. either μ1 = 8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta.

Early choices of null hypothesis

Paul Meehl has argued that the epistemological importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment. An examination of the origins of the latter practice may therefore be useful:

1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus Laplace's null hypothesis is that the birthrates of boys and girls should be equal, given "conventional wisdom".

1900: Karl Pearson develops the chi-squared test to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of fives and sixes in the Weldon dice throw data.

1904: Karl Pearson develops the concept of "contingency" in order to determine whether outcomes are independent of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g. scar formation and death rates from smallpox). The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".

Null hypothesis statistical significance testing

An example of Neyman–Pearson hypothesis testing can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The Neyman–Pearson lemma of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a likelihood ratio). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources and intermediate counts imply one source. Note also that proving a negative is usually problematic. Null hypotheses should be at least falsifiable.
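The "simple method of solution" can be sketched in Python as picking the hypothesis under which the observed count is most probable. The mean count rates below are invented for illustration (a background of 3 counts per minute and 10 more per source), as are the names RATES and most_likely; the sketch also assumes Poisson-distributed counts.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical mean counts per minute under each hypothesis (assumed values).
RATES = {"no source": 3.0, "one source": 13.0, "two sources": 23.0}

def most_likely(count):
    """Hypothesis with the highest probability for the observed count."""
    return max(RATES, key=lambda h: poisson_pmf(count, RATES[h]))

for count in (2, 12, 30):
    print(count, most_likely(count))
```

With these assumed rates, a count of 2 selects "no source", 12 selects "one source" and 30 selects "two sources", matching the intuition described above.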

Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions. The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses.

The two forms of hypothesis testing are based on different problem formulations. The original test is analogous to a true/false question; the Neyman–Pearson test is more like multiple choice. In the view of Tukey the former produces a conclusion on the basis of only strong evidence while the latter produces a decision on the basis of available evidence. While the two tests seem quite different both mathematically and philosophically, later developments lead to the opposite claim. Consider many tiny radioactive sources. The hypotheses become 0,1,2,3... grains of radioactive sand. There is little distinction between none or some radiation (Fisher) and 0 grains of radioactive sand versus all of the alternatives (Neyman–Pearson). The major Neyman–Pearson paper of 1933 also considered composite hypotheses (ones whose distribution includes an unknown parameter). An example proved the optimality of the (Student's) t-test, "there can be no better test for the hypothesis under consideration" (p 321). Neyman–Pearson theory was proving the optimality of Fisherian methods from its inception.

Fisher's significance testing has proven a popular flexible statistical tool in application with little mathematical growth potential. Neyman–Pearson hypothesis testing is claimed as a pillar of mathematical statistics, creating a new paradigm for the field. It also stimulated new applications in statistical process control, detection theory, decision theory and game theory. Both formulations have been successful, but the successes have been of a different character.

The dispute over formulations is unresolved. Science primarily uses Fisher's (slightly modified) formulation as taught in introductory statistics. Statisticians study Neyman–Pearson theory in graduate school. Mathematicians are proud of uniting the formulations. Philosophers consider them separately. Learned opinions deem the formulations variously competitive (Fisher vs Neyman), incompatible or complementary. The dispute has become more complex since Bayesian inference has achieved respectability.

The terminology is inconsistent. Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion.

Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control; however, he strongly disagreed that hypothesis testing could be useful for scientists. Hypothesis testing provides a means of finding test statistics used in significance testing. The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in sample size determination. The two methods remain philosophically distinct. They usually (but not always) produce the same mathematical answer. The preferred answer is context dependent. While the existing merger of Fisher and Neyman–Pearson theories has been heavily criticized, modifying the merger to achieve Bayesian goals has been considered.

Criticism

Criticism of statistical hypothesis testing fills volumes. Much of the criticism can be summarized by the following issues:

  • The interpretation of a p-value is dependent upon stopping rule and definition of multiple comparison. The former often changes during the course of a study and the latter is unavoidably ambiguous. (i.e. "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't").
  • Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson which are conceptually distinct.
  • Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.
  • Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias. Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused.
  • When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g. increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%. However, this absurd assumption that the mean difference between two groups cannot be zero implies that the data cannot be independent and identically distributed (i.i.d.) because the expected difference between any two subgroups of i.i.d. random variates is zero; therefore, the i.i.d. assumption is also absurd.
  • Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts. If the decisions are based on convention they are termed arbitrary or mindless while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples so "...it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis." "Statistically significant findings are often misleading" in psychology. Statistical significance does not imply practical significance and correlation does not imply causation. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis.
  • "[I]t does not tell us what we want to know". Lists of dozens of complaints are available.
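The point that very large samples make almost any nonzero difference statistically significant can be illustrated numerically. The sketch below is a simplification under stated assumptions: a z-test with a known standard deviation of 1, and an observed mean difference fixed at a practically negligible 0.01. The function names are hypothetical.

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return erfc(abs(z) / sqrt(2))

def z_statistic(effect, sigma, n):
    """z statistic for a mean difference `effect` estimated from n observations."""
    return effect / (sigma / sqrt(n))

# Same tiny effect, increasingly large (assumed) sample sizes.
for n in (100, 10_000, 1_000_000):
    z = z_statistic(0.01, 1.0, n)
    print(n, round(z, 2), two_sided_p(z))
```

The effect size never changes, yet the p-value falls from far above to far below any conventional threshold as n grows, which is why statistical significance alone says little about practical significance.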

Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): While it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis. The continuing controversy concerns the selection of the best statistical practices for the near-term future given the (often poor) existing practices. Critics would prefer to ban NHST completely, forcing a complete departure from those practices, while supporters suggest a less absolute change.

Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review, medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively. Textbooks have added some cautions and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Major organizations have not abandoned use of significance tests although some have discussed doing so.

Alternatives

A unifying position of critics is that statistics should not lead to an accept-reject conclusion or decision, but to an estimated value with an interval estimate; this data-analysis philosophy is broadly referred to as estimation statistics. Estimation statistics can be accomplished with either frequentist or Bayesian methods.
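In the estimation spirit, the clairvoyant card game could be reported as an estimated hit rate with an interval rather than as an accept-reject decision. The sketch below uses the Wilson score interval, which is one common frequentist choice and not one prescribed by the text; the outcome of 12 hits in 25 cards is a hypothetical result, and the function name is an assumption.

```python
from math import sqrt

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = hits / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical outcome: 12 correct suits out of 25 cards.
low, high = wilson_interval(12, 25)
print(f"estimated hit rate {12 / 25:.2f}, interval ({low:.2f}, {high:.2f})")
```

The interval conveys both the size of the apparent effect and its uncertainty; a reader can still see whether the guessing rate of 1/4 lies inside it.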

One strong critic of significance testing suggested a list of reporting alternatives: effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality. None of these suggested alternatives produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals. "The distinction between the ... approaches is largely one of reporting and interpretation."

On one "alternative" there is no disagreement: Fisher himself said, "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." Cohen, an influential critic of significance testing, concurred, "... don't look for a magic alternative to NHST [null hypothesis significance testing] ... It doesn't exist." "... given the problems of statistical induction, we must finally rely, as have the older sciences, on replication." The "alternative" to significance testing is repeated testing. The easiest way to decrease statistical uncertainty is by obtaining more data, whether by increased sample size or by repeated tests. Nickerson claimed to have never seen the publication of a literally replicated experiment in psychology. An indirect approach to replication is meta-analysis.

Bayesian inference is one proposed alternative to significance testing. (Nickerson cited 10 sources suggesting it, including Rozeboom (1960)). For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available. Psychologist John K. Kruschke has suggested Bayesian estimation as an alternative for the t-test. Alternatively, two competing models/hypotheses can be compared using Bayes factors. Bayesian methods could be criticized for requiring information that is seldom available in the cases where significance testing is most heavily used. Neither the prior probabilities nor the probability distribution of the test statistic under the alternative hypothesis is often available in the social sciences.
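A Bayes factor comparison can be sketched for the clairvoyant card game. This is an illustration under a strong assumption that is not part of the original example: under the alternative, the hit probability p is given a uniform prior on [0, 1], for which the marginal probability of any number of hits in n trials is 1/(n + 1). The function name bayes_factor_10 is hypothetical.

```python
from math import comb

def bayes_factor_10(hits, n, p0=0.25):
    """Bayes factor for H1 (p uniform on [0, 1], an assumed prior) against H0 (p = p0)."""
    likelihood_h0 = comb(n, hits) * p0**hits * (1 - p0)**(n - hits)
    marginal_h1 = 1 / (n + 1)  # binomial likelihood integrated over the uniform prior
    return marginal_h1 / likelihood_h0

print(bayes_factor_10(12, 25))  # above 1: the data favour H1
print(bayes_factor_10(6, 25))   # below 1: the data favour pure guessing
```

The result depends on the prior chosen for the alternative, which is the kind of information the paragraph above notes is seldom available in practice.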

Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to objectively assess the probability that a hypothesis is true based on the data they have collected. Neither Fisher's significance testing nor Neyman–Pearson hypothesis testing can provide this information, and neither claims to. The probability that a hypothesis is true can only be derived from the use of Bayes' theorem, which was unsatisfactory to both the Fisher and Neyman–Pearson camps due to the explicit use of subjectivity in the form of the prior probability. Fisher's strategy was to sidestep this with the p-value (an objective index based on the data alone) followed by inductive inference, while Neyman and Pearson devised their approach of inductive behaviour.
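The role of the prior probability can be made concrete with a small sketch that applies Bayes' theorem to two simple hypotheses about a binomial proportion. The hypotheses (p = 0.5 versus p = 0.7), the data (7 successes in 10 trials) and the function names are all hypothetical choices made for illustration.

```python
import math


def binomial_likelihood(p: float, successes: int, trials: int) -> float:
    """Probability of the observed count if the true proportion is p."""
    return (math.comb(trials, successes)
            * p ** successes
            * (1.0 - p) ** (trials - successes))


def posterior_probability_h1(prior_h1: float, successes: int, trials: int,
                             p_h0: float = 0.5, p_h1: float = 0.7) -> float:
    """Posterior probability of H1 (p = p_h1) against H0 (p = p_h0).

    Bayes' theorem: P(H1 | data) is proportional to P(data | H1) * P(H1).
    """
    weight_h1 = prior_h1 * binomial_likelihood(p_h1, successes, trials)
    weight_h0 = (1.0 - prior_h1) * binomial_likelihood(p_h0, successes, trials)
    return weight_h1 / (weight_h1 + weight_h0)


if __name__ == "__main__":
    # The same data give different posterior probabilities under different priors.
    for prior in (0.1, 0.5, 0.9):
        print(prior, posterior_probability_h1(prior, successes=7, trials=10))
```

The same data lead to different posterior probabilities depending on the prior, which is the subjectivity that both the Fisher and Neyman–Pearson camps objected to.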

Philosophy

Hypothesis testing and philosophy intersect. Inferential statistics, which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the philosophy of science.

Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical.

Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments. Hypothesis testing is of continuing interest to philosophers.

Education

Statistics is increasingly being taught in schools, with hypothesis testing being one of the elements taught. Many conclusions reported in the popular press (from political opinion polls to medical studies) are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from such data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly. An introductory college statistics class places much emphasis on hypothesis testing, perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis. An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics, but a limited amount of development continues.

An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received, unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference), and that these misconceptions persisted among instructors. While the problem was addressed more than a decade ago, and calls for educational reform continue, students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing. Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics, and emphasizing the controversy in a generally dry subject.

 

Cooperative

From Wikipedia, the free encyclopedia ...