
Monday, October 1, 2018

The World’s First Practical Quantum Computer May Be Just Five Years Away

A practical quantum computer would leave your desktop in the dust.


You’ve read the headlines: quantum computers are going to cure disease by discovering new pharmaceuticals! They’re going to pore through all the world’s data and find solutions to problems like poverty and inequality!

Alternatively, they might not do any of that. We’re really not sure what a quantum computer will even look like, but boy are we excited.

It often feels like quantum computers are in their own quantum state — they’re revolutionizing the world, but are still a distant pipe dream.

Now, though, the National Science Foundation has plans to pluck quantum computers from the realm of the fantastic and drop them squarely in its research labs. And it’s willing to pay an awful lot to do so.

In August, the federal agency announced the Software-Tailored Architecture for Quantum co-design (STAQ) project. Physicists, engineers, computer scientists, and other researchers from Duke and six other universities (including MIT and University of California-Berkeley) will band together to embark on the five-year, $15 million mission.

The goal is to create the world’s first practical quantum computer — one that goes beyond a proof-of-concept and actually outperforms the best classical computers out there — from the ground up.

A little background: there are a few key differences between a classical computer and a quantum computer. Where a classical computer uses bits that are each in either a 0 or a 1 state, quantum bits, or qubits, can also be in a superposition of 0 and 1 at the same time. The quantum circuits that use these qubits to transfer information or carry out a calculation are built from quantum logic gates; just as a classical logic gate controls the flow of electricity within a computer’s circuitry, these gates steer the individual qubits, which can be implemented with photons or trapped ions.
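The idea of a qubit being "both 1 and 0 at the same time" can be made a little more concrete with a toy calculation. The sketch below is only an illustration under simple assumptions: it models one ideal, noise-free qubit as a pair of complex amplitudes and applies a standard Hadamard gate to it. The names `ZERO`, `hadamard`, and `probabilities` are made up for this example and do not come from the STAQ project or any particular quantum software library.

```python
import math

# A single-qubit state is modeled as two complex amplitudes:
# one for reading 0 and one for reading 1.
ZERO = (1 + 0j, 0 + 0j)  # a qubit that is definitely 0


def hadamard(state):
    """Apply the Hadamard gate, which turns a definite 0 into an equal superposition."""
    a0, a1 = state
    s = 1 / math.sqrt(2)
    return (s * (a0 + a1), s * (a0 - a1))


def probabilities(state):
    """Chance of measuring 0 or 1: the squared magnitude of each amplitude."""
    return tuple(abs(a) ** 2 for a in state)


superposed = hadamard(ZERO)
print("before gate:", probabilities(ZERO))
print("after gate: ", probabilities(superposed))
```

Before the gate, the qubit reads 0 every time. After it, a measurement gives 0 or 1 with roughly equal probability, which is the sense in which the qubit holds both values until it is measured.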

In order to develop quantum computers that are actually useful, scientists need to figure out how to improve both the hardware we use to build the physical devices and the software we run on them. That means figuring out how to build systems with more qubits that are less error-prone, and determining how to sort out the correct responses to our queries when we get lots of noise back with them. It’s likely that part of the answer is building automated tools that can optimize how certain algorithms are mapped onto the specific hardware, ultimately tackling both problems at once.
Image Credit: TheDigitalArtist/Victor Tangermann
To better understand what this program might produce, Futurism caught up with Kenneth Brown, the Duke University engineer in charge of STAQ. Here’s our conversation, which has been lightly edited and condensed for clarity. We’ve supplemented Brown’s answers with hyperlinks.

Futurism: A lot of what we hear about quantum computing is very abstract and theoretical — there’s lots of research that might lead towards quantum computers, but doesn’t show any clear path on how to get there. What will your team be able to do that others haven’t been able to do in the past?

Kenneth Brown: I think it’s important to remember that quantum computers can be made out of a wide variety of things. I usually make an analogy to classical computers. The first classical computers were just gears, pretty much because that was the best technology we had. And then there was this vacuum tube phase of classical computers that was quite useful and good. And then the first silicon transistor appeared. And it’s important to remember that when the silicon transistor first appeared, it couldn’t quite compete with vacuum tubes. Sometimes I think people forget it was such an amazing discovery.

Quantum computing is the same thing. There are lots of ways to represent quantum information. Right now, the two technologies that have demonstrated the most useful applications are superconducting qubits and trapped ion qubits. They’re different and they have pluses and minuses, but in our group, we’ve been collectively focused on these trapped ion qubits.

With trapped ion qubits, what’s nice is that on a small scale of tens of ions, all the qubits are directly connected. That’s very different from a superconducting system or a solid-state system, in which you have to talk to the qubits that are nearby. So I think we have very concrete plans to get to 30, 32 qubits. That’s clear. We would like to extend that to something closer to 64 or so, and that is going to require some new research.
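Why does it matter whether qubits are "directly connected"? A rough way to see it is to count the extra operations needed before two distant qubits can interact. The sketch below is a deliberately simplified toy model, not a description of any real device: it assumes qubits sit on a line where only neighbors can interact, and that one SWAP moves a qubit one position along the line. Real superconducting chips use richer layouts and real compilers are far more sophisticated. The function names are hypothetical.

```python
def swaps_needed_on_chain(a, b):
    """Extra SWAPs to make the qubits at positions a and b adjacent
    when only neighboring positions on a line can interact (toy model)."""
    return max(abs(a - b) - 1, 0)


def swaps_needed_all_to_all(a, b):
    """With every qubit directly connected, no routing SWAPs are needed."""
    return 0


# A two-qubit gate between the first and last qubit of a 32-qubit register.
first, last = 0, 31
print("nearest-neighbor line:", swaps_needed_on_chain(first, last), "extra SWAPs")
print("all-to-all:", swaps_needed_all_to_all(first, last), "extra SWAPs")
```

Every extra operation is another chance for noise to creep in, which is one reason direct connectivity is attractive at small scales.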

F: What makes this a “practical” computer compared to all of the other people working on quantum computers?

KB: I do think there are industrial efforts pushing towards building exactly these practical devices. The thing which really distinguishes us is being on the academic side. I think it allows for more exploration, with the goal of making a device which enables people to test wildly different ideas on how the architecture should be and what applications should be on it, these sort of things.

Just to pick an example, the guys at IBM have their quantum device. I actually collaborate with them through other projects, and I think they’re pretty open. But right now, the way you interact with it, you’re already at a level of abstraction [in that people can ask things of the computer online but can’t change how it’s programmed]. If you were thinking about totally optimizing this thing, you can’t. They have a tradeoff: they have their computer totally open for access on the web, but to make it stable like that, you have to turn off some knobs. [The IBM computer, because it’s sequestered off and intended for many researchers to use, can’t be customized to do everything an individual might want it to.]

So our goal is to make a device reaching this practical scale where researchers can play with all the knobs.

F: How would quantum computing change things for the average person? 

KB: I think in the long term, quantum computing and communication will change how we deal with encoded information on the internet. In Google Chrome, in fact, you can already change your cryptography to a possible post-quantum cryptography setup.

The second thing is I think people don’t think about all the ways that molecular design impacts materials — from boring things like water bottles to fancy things like specific new medicines. So what’s interesting is if the quantum computer fulfills its promise to efficiently and accurately calculate those molecular properties, that could really change the materials and medicines we see in the future.

But on what you’re going to do on your home computer, the way I think about it is most people use their computer to watch Netflix and occasionally write a letter or email or whatever. Those are not places where quantum computers really help you.

So it’s sort of funny — I don’t know what the user base would be. But when computers were first built, people had the same impression. They said that computers would just be for scientists doing lab work. And, clearly, that’s no longer the case.

F: What sort of person will be able to use a quantum computer? How do you train someone to use it, and what might the quantum computing degree of the future look like?

KB: When I try to explain quantum computing to someone, if they know the physics or chemistry of quantum mechanics, then I can usually start there to explain how to do the computing side. And the other side is also true: if people understand computing pretty well, I can explain the extra modules that quantum computing gives you.

In the future, we probably need people trained from both of those disciplines. We need people who have a physical sciences background who we get up to speed on the computer science side and the opposite.

The specific thing we’re going to try to do is have this quantum summer school, with the idea to bring in people from industry who are maybe excellent microwave engineers or software engineers, and try to give them enough tools so they can start to think about the extra rules you have to think about with quantum.

F: What sorts of new research will you need to sort out before this thing can be built? What will that take?

KB: We have some ideas. In a classic computer you work with voltage, but in quantum computing, I need to somehow carry information from one place to another. Do messenger qubits that carry information to other parts of the computer have to be the same type of qubit that the rest of the computer is made of? We’re not sure yet.

A common way to think about scaling up the complexity of quantum computers is called the CCD architecture. The idea is to shuttle manageable chains of ions from point to point. That’s one possibility.

There’s been some theoretical work looking at whether you can have photons interconnect between ion chains. The idea across all kinds of supercomputers is to use photons as these messenger qubits. And by doing that, you can basically have a bunch of small quantum computers wired up by all these photons that collectively act like a larger computer.

But that’s farther out. I think getting that to work at the bandwidth we need in the next five years would be pretty challenging. If it happens, that would be great, but it’s probably farther out.

F: Along the way, how will you know that you’re making tangible progress? Do you have benchmarks for knowing that you’re, say, halfway there? How can you test to know for sure that it’s working?

KB: On the hardware side, we can increase the number of qubits and get the gates [these are, if you recall, the things that move ions or photons to transfer information] better and call it tangible progress. We have a sense that we have to get, even though the number moves, somewhere above fifty qubits to have a fighting chance. [As of March, Google holds the record with a 72-qubit system.]

At the same time, we’re going to take algorithms and applications that we know, and we’re going to map them onto the hardware. We’re going to try to optimize the algorithm as we map it in a way that makes the overall application less vulnerable to noise.

Before we run these applications, we have a rule of thumb about how often they should fail in tests and general use. But after this software optimization that my team is working on, ideally it will fail much less often. That helps us explore more in the algorithm space because it gives you confidence that you can push quantum computers towards more complicated systems. I think it’s important to note that we have the space to be very exploratory, to look at problems people aren’t thinking about.

F: What’s the worst misconception about quantum computers that you run into? What do people always seem to get wrong about them?

KB: The one misconception is that it’s magic. Quantum computers aren’t magic; they don’t allow you to solve all problems.

Here’s the thing — in classical computing, we have the sense that there are some problems that are easy and some problems which are really hard, which means we can’t solve them in polynomial time [a computer science term used to denote whether a computer is able to complete a task quickly].

It turns out we spend a lot of our computing power trying to solve the problems that we can’t solve in polynomial time, and we just have approximations.

Quantum computers do allow you to solve some of the problems which are intractable on a classic computer, but they don’t solve them all. Usually, the thing which drives me crazy is when a quantum computer article says they can solve all problems instantly because they do infinite parallel calculations at once.

I’m really excited for when we have large scale quantum computers. With some problems — the famous example is the Traveling Salesman Problem — we know we can’t solve it for all possible routes of salesmen, but we have to solve it anyway. The classical computer does the best it can, and then when it blows it, nobody’s upset. You’re like, ‘oh okay, well it’s going to get it wrong some of the time.’

When we have large-scale quantum computers, we can test algorithms like that more accurately. We’ll know we can solve the classical problem, just occasionally the new computer gets bogged down.

I’m a big optimist. I guess that’s how you end up working in this kind of field.
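To make the Traveling Salesman example from the interview a bit more tangible, here is a small classical sketch. It compares an exact search, which tries every route and becomes hopeless as the number of cities grows, with a quick "nearest neighbor" rule of thumb that does the best it can and sometimes gets it wrong. The city coordinates are invented for illustration, and nothing here is a quantum algorithm; it only shows the kind of approximation classical computers settle for today.

```python
import itertools
import math


def tour_length(points, order):
    """Total length of the round trip that visits points in the given order."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]]) for i in range(n)
    )


def brute_force(points):
    """Exact answer: try every ordering that starts at city 0 (factorial time)."""
    rest = range(1, len(points))
    candidates = ((0,) + perm for perm in itertools.permutations(rest))
    return list(min(candidates, key=lambda order: tour_length(points, order)))


def nearest_neighbor(points):
    """Fast heuristic: always travel to the closest city not yet visited."""
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        here = points[order[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(here, points[i]))
        unvisited.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical city coordinates, chosen only for this example.
cities = [(0, 0), (2, 6), (5, 1), (6, 5), (1, 3), (7, 2)]
print("exact tour length:    ", round(tour_length(cities, brute_force(cities)), 3))
print("heuristic tour length:", round(tour_length(cities, nearest_neighbor(cities)), 3))
```

With six cities the exact search checks 120 routes; with 20 cities it would need to check more than 10^17, which is why heuristics are the norm.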


Open science

From Wikipedia, the free encyclopedia
 
The six principles of open science

Open science is the movement to make scientific research, data and dissemination accessible to all levels of an inquiring society, amateur or professional. Open science is transparent and accessible knowledge that is shared and developed through collaborative networks. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge.

Open science began in the 17th century with the advent of the academic journal, when the societal demand for access to scientific knowledge reached a point where it became necessary for groups of scientists to share resources with each other so that they could collectively do their work. In modern times there is debate about the extent to which scientific information should be shared. The conflict is between the desire of scientists to have access to shared resources versus the desire of individual entities to profit when other entities partake of their resources. Additionally, the status of open access and resources that are available for its promotion are likely to differ from one field of academic inquiry to another.

Background

Science is broadly understood as collecting, analyzing, publishing, reanalyzing, critiquing, and reusing data. Proponents of open science identify a number of barriers that impede or dissuade the broad dissemination of scientific data. These include financial paywalls of for-profit research publishers, restrictions on usage applied by publishers of data, poor formatting of data or use of proprietary software that makes it difficult to re-purpose, and cultural reluctance to publish data for fears of losing control of how the information is used.

Open Science Taxonomy

According to the FOSTER taxonomy, open science can often include aspects of open access, open data, and the open source movement, since modern science requires software in order to process data and information. Open research computation also addresses the problem of reproducibility of scientific results. The FOSTER Open Science taxonomy is available in RDF/XML and as a high-resolution image.

Types

The term "open science" does not have any one fixed definition or operationalization. On the one hand, it has been referred to as a "puzzling phenomenon". On the other hand, the term has been used to encapsulate a series of principles that aim to foster scientific growth and its complementary access to the public. Two influential sociologists, Benedikt Fecher and Sascha Friesike, have created multiple "schools of thought" that describe the different interpretations of the term.

According to Fecher and Friesike ‘Open Science’ is an umbrella term for various assumptions about the development and dissemination of knowledge. To show the term’s multitudinous perceptions, they differentiate between five Open Science schools of thought:

Infrastructure School

The infrastructure school is founded on the assumption that "efficient" research depends on the availability of tools and applications. Therefore, the "goal" of the school is to promote the creation of openly available platforms, tools, and services for scientists. Hence, the infrastructure school is concerned with the technical infrastructure that promotes the development of emerging and developing research practices through the use of the internet, including the use of software and applications, in addition to conventional computing networks. In that sense, the infrastructure school regards open science as a technological challenge. The infrastructure school is tied closely with the notion of "cyberscience", which describes the trend of applying information and communication technologies to scientific research, which has led to an amicable development of the infrastructure school. Specific elements of this prosperity include increasing collaboration and interaction between scientists, as well as the development of "open-source science" practices. The sociologists discuss two central trends in the Infrastructure school:

1. Distributed computing: This trend encapsulates practices that outsource complex, process-heavy scientific computing to a network of volunteer computers around the world. The example that the sociologists cite in their paper is the Open Science Grid, which enables the development of large-scale projects that require high-volume data management and processing, accomplished through a distributed computer network. Moreover, the grid provides the necessary tools that the scientists can use to facilitate this process.

2. Social and Collaboration Networks of Scientists: This trend encapsulates the development of software that makes interaction with other researchers and scientific collaborations much easier than traditional, non-digital practices. Specifically, the trend is focused on implementing newer Web 2.0 tools to facilitate research-related activities on the internet. De Roure and colleagues (2008) list a series of four key capabilities which they believe compose a Social Virtual Research Environment (SVRE):
  • The SVRE should primarily aid the management and sharing of research objects. The authors define these to be a variety of digital commodities that are used repeatedly by researchers.
  • Second, the SVRE should have inbuilt incentives for researchers to make their research objects available on the online platform.
  • Third, the SVRE should be "open" as well as "extensible", implying that different types of digital artifacts composing the SVRE can be easily integrated.
  • Fourth, the authors propose that the SVRE is more than a simple storage tool for research information. Instead, the researchers propose that the platform should be "actionable". That is, the platform should be built in such a way that research objects can be used in the conduct of research as opposed to simply being stored.

Measurement School

The measurement school, in the view of the authors, deals with developing alternative methods to determine scientific impact. This school acknowledges that measurements of scientific impact are crucial to a researcher's reputation, funding opportunities, and career development. Hence, the authors argue that any discourse about Open Science pivots around developing a robust measure of scientific impact in the digital age. The authors then discuss other research indicating support for the measurement school. The three key currents of previous literature discussed by the authors are:
  • Peer review is described as being time-consuming.
  • The impact of an article, tied to the name of the authors of the article, is related more to the circulation of the journal rather than the overall quality of the article itself.
  • New publishing formats that are closely aligned with the philosophy of Open Science are rarely found in the format of a journal that allows for the assignment of the impact factor.
Hence, this school argues that there are faster impact measurement technologies that can account for a range of publication types as well as social media web coverage of a scientific contribution to arrive at a complete evaluation of how impactful the science contribution was. The gist of the argument for this school is that hidden uses like reading, bookmarking, sharing, discussing and rating are traceable activities, and these traces can and should be used to develop a newer measure of scientific impact. The umbrella term for this new type of impact measurement is altmetrics, coined in a 2011 article by Priem et al. Notably, the authors discuss evidence that altmetrics differ from traditional webometrics, which are slow and unstructured. Altmetrics are proposed to rely upon a greater set of measures that account for tweets, blogs, discussions, and bookmarks. The authors claim that the existing literature has often proposed that altmetrics should also encapsulate the scientific process, and measure the process of research and collaboration to create an overall metric. However, the authors are explicit in their assessment that few papers offer methodological details as to how to accomplish this. The authors use this and the general dearth of evidence to conclude that research in the area of altmetrics is still in its infancy.
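As a rough illustration of what such a measure might look like, the sketch below combines counts of tweets, blog posts, discussions, and bookmarks into a single number. It is only a toy: the weights are invented for this example and are not taken from Priem et al. or from any real altmetrics provider, and, as noted above, the literature offers few methodological details on how a real metric should be built.

```python
# Hypothetical weights, chosen only for illustration. They are NOT from
# Priem et al. or any real altmetrics service.
WEIGHTS = {
    "tweets": 0.25,
    "blog_posts": 5.0,
    "discussions": 1.0,
    "bookmarks": 0.5,
}


def altmetric_score(counts, weights=WEIGHTS):
    """Weighted sum of online activity counts; missing sources count as zero."""
    return sum(weight * counts.get(source, 0) for source, weight in weights.items())


# Example: an article with 4 tweets and 1 blog post, and no other activity.
article = {"tweets": 4, "blog_posts": 1}
print(altmetric_score(article))
```

A weighted sum is the simplest possible design; it makes the choice of weights explicit, which is exactly where a real metric would need justification.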

Public School

According to the authors, the central concern of the school is to make science accessible to a wider audience. The inherent assumption of this school, as described by the authors, is that the newer communication technologies such as Web 2.0 allow scientists to open up the research process and also allow scientists to better prepare their "products of research" for interested non-experts. Hence, the school is characterized by two broad streams: one argues for the access of the research process to the masses, whereas the other argues for increased access to the scientific product to the public.
  • Accessibility to the Research Process: Communication technology allows not only for the constant documentation of research but also promotes the inclusion of many different external individuals in the process itself. The authors cite citizen science, the participation of non-scientists and amateurs in research. The authors discuss instances in which gaming tools allow scientists to harness the brain power of a volunteer workforce to run through several permutations of protein-folded structures. This allows scientists to eliminate many more plausible protein structures while also "enriching" the citizens about science. The authors also discuss a common criticism of this approach: the amateur nature of the participants threatens to undermine the scientific rigor of experimentation.
  • Comprehensibility of the Research Result: This stream of research concerns itself with making research understandable for a wider audience. The authors describe a host of authors that promote the use of specific tools for scientific communication, such as microblogging services, to direct users to relevant literature. The authors claim that this school proposes that it is the obligation of every researcher to make their research accessible to the public. The authors then proceed to discuss if there is an emerging market for brokers and mediators of knowledge that is otherwise too complicated for the public to grasp effortlessly.

Democratic School

The democratic school concerns itself with the concept of access to knowledge. As opposed to focusing on the accessibility of research and its understandability, advocates of this school focus on the access of products of research to the public. The central concern of the school is with the legal and other obstacles that hinder the access of research publications and scientific data to the public. The authors argue that proponents of this school assert that any research product should be freely available. The authors argue that the underlying notion of this school is that everyone has the same, equal right of access to knowledge, especially in the instances of state-funded experiments and data. The authors categorize two central currents that characterize this school: Open Access and Open Data.
  • Open Data: The authors discuss existing attitudes in the field that rebel against the notion that publishing journals should claim copyright over experimental data, which prevents the re-use of data and therefore lowers the overall efficiency of science in general. The claim is that journals have no use for the experimental data and that allowing other researchers to utilize this data will be fruitful. The authors cite other literature streams that discovered that only a quarter of researchers agree to share their data with other researchers because of the effort required for compliance.
  • Open Access to Research Publication: According to this school, there is a gap between the creation and sharing of knowledge. Proponents argue, as the authors describe, that even though scientific knowledge doubles every 5 years, access to this knowledge remains limited. These proponents consider access to knowledge as a necessity for human development, especially in the economic sense.

Pragmatic School

The pragmatic school considers Open Science as the possibility to make knowledge creation and dissemination more efficient by increasing the collaboration throughout the research process. Proponents argue that science could be optimized by modularizing the process and opening up the scientific value chain. ‘Open’ in this sense follows very much the concept of open innovation: it transfers the outside-in (including external knowledge in the production process) and inside-out (spillovers from the formerly closed production process) principles to science. Web 2.0 is considered a set of helpful tools that can foster collaboration (sometimes also referred to as Science 2.0). Further, citizen science is seen as a form of collaboration that includes knowledge and information from non-scientists. Fecher and Friesike describe data sharing as an example of the pragmatic school as it enables researchers to use other researchers’ data to pursue new research questions or to conduct data-driven replications.

History

The widespread adoption of the institution of the scientific journal marks the beginning of the modern concept of open science. Before this time societies pressured scientists into secretive behaviors.

Before journals

Before the advent of scientific journals, scientists had little to gain and much to lose by publicizing scientific discoveries. Many scientists, including Galileo, Kepler, Isaac Newton, Christiaan Huygens, and Robert Hooke, made claim to their discoveries by describing them in papers coded in anagrams or cyphers and then distributing the coded text. Their intent was to develop their discovery into something off which they could profit, then reveal their discovery to prove ownership when they were prepared to make a claim on it.

The system of not publicizing discoveries caused problems because discoveries were not shared quickly and because it sometimes was difficult for the discoverer to prove priority. Newton and Gottfried Leibniz both claimed priority in discovering calculus. Newton said that he wrote about calculus in the 1660s and 1670s, but did not publish until 1693. Leibniz published "Nova Methodus pro Maximis et Minimis", a treatise on calculus, in 1684. Debates over priority are inherent in systems where science is not published openly, and this was problematic for scientists who wanted to benefit from priority.

These cases are representative of a system of aristocratic patronage in which scientists received funding to develop either immediately useful things or to entertain. In this sense, funding of science gave prestige to the patron in the same way that funding of artists, writers, architects, and philosophers did. Because of this, scientists were under pressure to satisfy the desires of their patrons, and discouraged from being open with research which would bring prestige to persons other than their patrons.

Emergence of academies and journals

Eventually the individual patronage system ceased to provide the scientific output which society began to demand. Single patrons could not sufficiently fund scientists, who had unstable careers and needed consistent funding. The development which changed this was a trend to pool research by multiple scientists into an academy funded by multiple patrons. In 1660 England established the Royal Society and in 1666 the French established the French Academy of Sciences. Between the 1660s and 1793, governments gave official recognition to 70 other scientific organizations modeled after those two academies. In 1665, Henry Oldenburg became the editor of Philosophical Transactions of the Royal Society, the first academic journal devoted to science, and the foundation for the growth of scientific publishing. By 1699 there were 30 scientific journals; by 1790 there were 1052. Since then publishing has expanded at even greater rates.

Popular Science Writing

The first popular science periodical of its kind was published in 1872, under a suggestive name that is still a modern portal for science journalism: Popular Science. The magazine claims to have documented the invention of the telephone, the phonograph, the electric light and the onset of automobile technology. The magazine goes so far as to claim that the "history of Popular Science is a true reflection of humankind's progress over the past 129+ years". Discussions of popular science writing most often center their arguments on some type of "Science Boom". A recent historiographic account of popular science traces mentions of the term "science boom" to Daniel Greenberg's Science and Government Reports in 1979, which posited that "Scientific magazines are bursting out all over". Similarly, this account discusses the publication Time, and its cover story of Carl Sagan in 1980, as propagating the claim that popular science has "turned into enthusiasm". Crucially, this secondary account asks the important question of what was considered popular "science" to begin with. The paper claims that any account of how popular science writing bridged the gap between the informed masses and the expert scientists must first consider who was considered a scientist to begin with.

Collaboration among academies

In modern times many academies have pressured researchers at publicly funded universities and research institutions to engage in a mix of sharing research and making some technological developments proprietary. Some research products have the potential to generate commercial revenue, and in hope of capitalizing on these products, many research institutions withhold information and technology which otherwise would lead to overall scientific advancement if other research institutions had access to these resources. It is difficult to predict the potential payouts of technology or to assess the costs of withholding it, but there is general agreement that the benefit to any single institution of holding technology is not as great as the cost of withholding it from all other research institutions.

Coining of phrase "Open Science"

The exact phrase "Open Science" was coined by Steve Mann in 1998, at which time he also registered the domain names openscience.com and openscience.org, which he sold to degruyter.com in 2011.

Politics

In many countries, governments fund some science research. Scientists often publish the results of their research by writing articles and donating them to be published in scholarly journals, which frequently are commercial. Public entities such as universities and libraries subscribe to these journals. Michael Eisen, a founder of the Public Library of Science, has described this system by saying that "taxpayers who already paid for the research would have to pay again to read the results."

In December 2011, some United States legislators introduced a bill called the Research Works Act, which would prohibit federal agencies from issuing grants with any provision requiring that articles reporting on taxpayer-funded research be published for free to the public online. Darrell Issa, a co-sponsor of the bill, explained the bill by saying that "Publicly funded research is and must continue to be absolutely available to the public. We must also protect the value added to publicly funded research by the private sector and ensure that there is still an active commercial and non-profit research community." One response to this bill was protests from various researchers; among them was a boycott of commercial publisher Elsevier called The Cost of Knowledge.

The Dutch Presidency of the Council of the European Union called for action in April 2016 to migrate European Commission funded research to Open Science. European Commissioner Carlos Moedas introduced the Open Science Cloud at the Open Science Conference in Amsterdam on April 4–5. The Amsterdam Call for Action on Open Science, a living document outlining concrete actions for the European Community to move to Open Science, was also presented at this meeting.

Reaction

Arguments against

The open sharing of research data is not widely practiced

Arguments against open science tend to advance several concerns. These include the potential for some scholars to capitalize on data other scholars have worked hard to collect, without collecting data themselves, the potential for less qualified individuals to misuse open data and arguments that novel data are more critical than reproducing or replicating older findings.
Too much unsorted information overwhelms scientists
Some scientists find inspiration in their own thoughts by restricting the amount of information they get from others. Alexander Grothendieck has been cited as a scientist who wanted to learn with restricted influence when he said that he wanted to "reach out in (his) own way to the things (he) wished to learn, rather than relying on the notions of consensus."
Potential misuse
In 2011, Dutch researchers announced their intention to publish a research paper in the journal Science describing the creation of a strain of H5N1 influenza which can be easily passed between ferrets, the mammals which most closely mimic the human response to the flu. The announcement triggered a controversy in both political and scientific circles about the ethical implications of publishing scientific data which could be used to create biological weapons. These events are examples of how science data could potentially be misused. Scientists have collaboratively agreed to limit their own fields of inquiry on occasions such as the Asilomar conference on recombinant DNA in 1975, and a proposed 2015 worldwide moratorium on a human-genome-editing technique.
The public will misunderstand science data
In 2009 NASA launched the Kepler spacecraft and promised that they would release collected data in June 2010. Later they decided to postpone release so that their scientists could look at it first. Their rationale was that non-scientists might unintentionally misinterpret the data, and NASA scientists thought it would be preferable for them to be familiar with the data in advance so that they could report on it with their level of accuracy.
Increasing the scale of science will make verification of any discovery more difficult
When more people report data it will take longer for anyone to consider all data, and perhaps more data of lower quality, before drawing any conclusion.
Low-quality science
Post-publication peer review, a staple of open science, has been criticized as promoting the production of lower quality papers that are extremely voluminous. Specifically, critics assert that because quality is not guaranteed by preprint servers, the veracity of papers will be difficult for individual readers to assess. This will lead to rippling effects of false science, akin to the recent epidemic of false news, propagated with ease on social media websites. A commonly cited solution is a new format in which everything is allowed to be published but a subsequent filter-curator model is imposed to ensure that all publications meet some basic standards of quality.

Arguments in favor

A number of scholars across disciplines have advanced various arguments in favor of open science. These generally focus on the perceived value of open science in improving the transparency and validity of research as well as in regards to public ownership of science, particularly that which is publicly funded. For example, in January 2014 J. Christopher Bare published a comprehensive "Guide to Open Science". Likewise in January, 2017, a group of scholars known for advocating open science published a "manifesto" for open science in the journal Nature. In November 2017, a group of early career researchers published their "manifesto" in the journal Genome Biology, stating that it is their task to change scientific research into open scientific research and commit to Open Science principles. 
Open access publication of research reports and data allows for rigorous peer-review
An article published by a team of NASA astrobiologists in 2010 in Science reported a bacterium known as GFAJ-1 that could purportedly metabolize arsenic (unlike any previously known species of lifeform). This finding, along with NASA's claim that the paper "will impact the search for evidence of extraterrestrial life", met with criticism within the scientific community. Much of the scientific commentary and critique around this issue took place in public forums, most notably on Twitter, where hundreds of scientists and non-scientists created a hashtag community around the hashtag #arseniclife. University of British Columbia astrobiologist Rosie Redfield, one of the most vocal critics of the NASA team's research, also submitted a draft of a research report of a study that she and colleagues conducted which contradicted the NASA team's findings; the draft report appeared in arXiv, an open-research repository, and Redfield called in her lab's research blog for peer review both of their research and of the NASA team's original paper. Researcher Jeff Rouder defined Open Science as "endeavoring to preserve the rights of others to reach independent conclusions about your data and work".

Science is publicly funded so all results of the research should be publicly available

Public funding of research has long been cited as one of the primary reasons for providing Open Access to research articles. Since there is significant value in other parts of the research such as code, data, protocols, and research proposals a similar argument is made that since these are publicly funded, they should be publicly available under a creative commons licence.

Open Science will make science more reproducible and transparent
 
Increasingly the reproducibility of science is being questioned and the term "reproducibility crisis" has been coined. For example, psychologist Stuart Vyse notes that "(r)ecent research aimed at previously published psychology studies has demonstrated--shockingly--that a large number of classic phenomena cannot be reproduced, and the popularity of p-hacking is thought to be one of the culprits." Open Science approaches are proposed as one way to help increase the reproducibility of work as well as to help mitigate against manipulation of data.

Open Science has more impact
 
There are several components to impact in research, many of which are hotly debated. However, under traditional scientific metrics, parts of Open Science such as Open Access and Open Data have been shown to outperform their traditional counterparts.

Open Science will help answer uniquely complex questions
 
Recent arguments in favor of Open Science have maintained that Open Science is a necessary tool to begin answering immensely complex questions, such as the neural basis of consciousness. The typical argument is that these types of investigations are too complex to be carried out by any one individual and must therefore rely on a network of open scientists. By their nature, such investigations also make this "open science" a form of "big science".

Organizations and projects

Big scientific projects are more likely to practice open science than small projects. Different projects conduct, advocate, develop tools for, or fund open science, and many organizations run multiple projects. For example, the Allen Institute for Brain Science conducts numerous open science projects while the Center for Open Science has projects to conduct, advocate, and create tools for open science. Open science is stimulating the emergence of sub-branches such as open synthetic biology and open therapeutics.
Organizations have extremely diverse sizes and structures. The Open Knowledge Foundation (OKF) is a global organization sharing large data catalogs, running face to face conferences, and supporting open source software projects. In contrast, Blue Obelisk is an informal group of chemists and associated cheminformatics projects. The tableau of organizations is dynamic with some organizations becoming defunct, e.g., Science Commons, and new organizations trying to grow, e.g., the Self-Journal of Science. Common organizing forces include the knowledge domain, type of service provided, and even geography, e.g., OCSDNet's concentration on the developing world.

Conduct

Many open science projects focus on gathering and coordinating encyclopedic collections of large amounts of organized data. The Allen Brain Atlas maps gene expression in human and mouse brains; the Encyclopedia of Life documents all the terrestrial species; the Galaxy Zoo classifies galaxies; the International HapMap Project maps the haplotypes of the human genome; the Monarch Initiative makes available integrated public model organism and clinical data; and the Sloan Digital Sky Survey regularizes and publishes data sets from many sources. All these projects accrete information provided by many different researchers with different standards of curation and contribution.

Other projects are organized around completion of projects that require extensive collaboration. For example, OpenWorm seeks to make a cellular level simulation of a roundworm, a multidisciplinary project. The Polymath Project seeks to solve difficult mathematical problems by enabling faster communications within the discipline of mathematics. The Collaborative Replications and Education project  recruits undergraduate students as citizen scientists by offering funding. Each project defines its needs for contributors and collaboration.

Advocacy

Numerous documents, organizations, and social movements advocate wider adoption of open science. Statements of principles include the Budapest Open Access Initiative from a December 2001 conference and the Panton Principles. New statements are constantly developed, such as the Amsterdam Call for Action on Open Science to be presented to the Dutch Presidency of the Council of the European Union in late May, 2016. These statements often try to regularize licenses and disclosure for data and scientific literature.

Other advocates concentrate on educating scientists about appropriate open science software tools. Education is available as training seminars, e.g., the Software Carpentry project; as domain specific training materials, e.g., the Data Carpentry project; and as materials for teaching graduate classes, e.g., the Open Science Training Initiative. Many organizations also provide education in the general principles of open science.

Within scholarly societies there are also sections and interest groups that promote open science practices. The Ecological Society of America has an Open Science Section. Similarly, the Society for American Archaeology has an Open Science Interest Group.

Publishing

Replacing the current scientific publishing model is one goal of open science. High costs to access literature gave rise to protests such as The Cost of Knowledge and to sharing papers without publisher consent, e.g., Sci-hub and ICanHazPDF. New organizations are experimenting with the open access model: the Public Library of Science, or PLOS, is creating a library of open access journals and scientific literature; F1000Research provides open publishing and open peer review for the life sciences; figshare archives and shares images, readings, and other data; arXiv provides electronic preprints across many fields; and many individual journals are experimenting as well. Other publishing experiments include delayed and hybrid models.

Software

A variety of computer resources support open science. These include software like the Open Science Framework from the Center for Open Science to manage project information, data archiving and team coordination; distributed computing services like Ibercivis to utilize unused CPU time for computationally intensive tasks; and services like Experiment.com to provide crowdsourced funding for research projects.

Blockchain platforms for open science have been proposed. The first such platform is the Open Science Organization, which aims to solve urgent problems with fragmentation of the scientific ecosystem and difficulties of producing validated, quality science. The initiatives of the Open Science Organization include the Interplanetary Idea System (IPIS), Researcher Index (RR-index), Unique Researcher Identity (URI), and Research Network. The Interplanetary Idea System is a blockchain-based system that tracks the evolution of scientific ideas over time. It serves to quantify ideas based on uniqueness and importance, thus allowing the scientific community to identify pain points with current scientific topics and preventing unnecessary re-invention of previously conducted science. The Researcher Index aims to establish a data-driven statistical metric for quantifying researcher impact. The Unique Researcher Identity is a blockchain-based solution for creating a single unifying identity for each researcher, which is connected to the researcher's profile, research activities, and publications. The Research Network is a social networking platform for researchers.

Preprint Servers

Preprint servers come in many varieties, but they share stable traits: they seek to create a quick, free mode of communicating scientific knowledge to the public. Preprint servers act as a venue to quickly disseminate research and vary in their policies concerning when articles may be submitted relative to journal acceptance. Also typical of preprint servers is their lack of a peer-review process; they usually have some type of quality check in place to ensure a minimum standard of publication, but this mechanism is not the same as peer review. Some preprint servers have explicitly partnered with the broader open science movement. Preprint servers can offer services similar to those of journals, and Google Scholar indexes many preprint servers and collects information about citations to preprints. The case for preprint servers is often made based on the slow pace of conventional publication formats. The motivation to start SocArXiv, an open-access preprint server for social science research, is the claim that valuable research submitted to traditional venues often takes several months to years to get published, which slows down the process of science significantly. Another argument made in favor of preprint servers like SocArXiv is the quality and speed of feedback offered to scientists on their pre-published work. The founders of SocArXiv claim that their platform allows researchers to gain easy feedback from their colleagues, thereby allowing scientists to improve their work before formal publication and circulation. The founders further claim that their platform affords authors the greatest level of flexibility in updating and editing their work to ensure that the latest version is available for rapid dissemination. The founders claim that this is not traditionally the case with formal journals, which require formal procedures to update published articles.
Perhaps the strongest advantage of some preprint servers is their seamless compatibility with Open Science software such as the Open Science Framework. The founders of SocArXiv claim that their preprint server connects all aspects of the research life cycle in OSF with the article being published on the preprint server. According to the founders, this allows for greater transparency and minimal work on the authors' part.

One criticism of preprint servers is their potential to foster a culture of plagiarism. For example, the popular physics preprint server arXiv had to withdraw 22 papers when it came to light that they were plagiarized. In June 2002, Watanabe, a high-energy physicist in Japan, was contacted by a man called Ramy Naboulsi, a non-institutionally affiliated mathematical physicist. Naboulsi asked Watanabe to upload his papers to arXiv, as he was unable to do so himself because of his lack of an institutional affiliation. Later, the papers were found to have been copied from the proceedings of a physics conference. Preprint servers are increasingly developing measures to counter this plagiarism problem. In developing nations like India and China, explicit measures are being taken to combat it. These measures usually involve creating some type of central repository for all available preprints, allowing the use of traditional plagiarism-detection algorithms to detect the fraud. Nonetheless, this is a pressing issue in the discussion of preprint servers, and consequently for Open Science.

Simple linear regression

From Wikipedia, the free encyclopedia
Okun's law in macroeconomics is an example of the simple linear regression. Here the dependent variable (GDP growth) is presumed to be in a linear relationship with the changes in the unemployment rate.

In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

It is common to make the additional stipulation that the ordinary least squares method should be used: the accuracy of each predicted value is measured by its squared residual (vertical distance between the point of the data set and the fitted line), and the goal is to make the sum of these squared deviations as small as possible. Other regression methods that can be used in place of ordinary least squares include least absolute deviations (minimizing the sum of absolute values of residuals) and the Theil–Sen estimator (which chooses a line whose slope is the median of the slopes determined by pairs of sample points). Deming regression (total least squares) also finds a line that fits a set of two-dimensional sample points, but (unlike ordinary least squares, least absolute deviations, and median slope regression) it is not really an instance of simple linear regression, because it does not separate the coordinates into one dependent and one independent variable and could potentially return a vertical line as its fit.
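To make the contrast between ordinary least squares and the Theil–Sen estimator concrete, the following is a minimal Python sketch. The function names and the small data set (an exact line y = 2x with one gross outlier) are made up for illustration and are not part of the original article.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(xs, ys):
    """Median of the slopes of the lines through all pairs of points with distinct x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return median(slopes)


def ols_slope(xs, ys):
    """Ordinary least squares slope, shown for comparison."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: y = 2x exactly, except for one outlier at x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 30.0]

ts = theil_sen_slope(xs, ys)   # stays at the slope of the four clean points
ols = ols_slope(xs, ys)        # pulled upward by the outlier
```

In this toy case the median of pairwise slopes ignores the outlier, while the squared-residual criterion lets a single point dominate the fit.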

The remainder of the article assumes an ordinary least squares regression. In this case, the slope of the fitted line is equal to the correlation between y and x corrected by the ratio of standard deviations of these variables. The intercept of the fitted line is such that the line passes through the center of mass ({\bar {x}}, {\bar {y}}) of the data points.

Fitting the regression line

Consider the model function
y=\alpha +\beta x,
which describes a line with slope β and y-intercept α. In general such a relationship may not hold exactly for the largely unobserved population of values of the independent and dependent variables; we call the unobserved deviations from the above equation the errors. Suppose we observe n data pairs and call them {(x_i, y_i), i = 1, ..., n}. We can describe the underlying relationship between y_i and x_i involving this error term ε_i by
 y_i = \alpha + \beta x_i + \varepsilon_i.
This relationship between the true (but unobserved) underlying parameters α and β and the data points is called a linear regression model.

The goal is to find estimated values {\hat {\alpha }} and {\hat {\beta }} for the parameters α and β which would provide the "best" fit in some sense for the data points. As mentioned in the introduction, in this article the "best" fit will be understood as in the least-squares approach: a line that minimizes the sum of squared residuals {\displaystyle {\hat {\varepsilon }}_{i}} (differences between actual and predicted values of the dependent variable y), each of which is given by, for any candidate parameter values a and b,
{\displaystyle {\hat {\varepsilon }}_{i}=y_{i}-a-bx_{i}.}
In other words, {\hat {\alpha }} and {\hat {\beta }} solve the following minimization problem:
{\displaystyle {\text{Find }}\min _{a,\,b}Q(a,b),\quad {\text{for }}Q(a,b)=\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{\,2}=\sum _{i=1}^{n}(y_{i}-a-bx_{i})^{2}\ .}
By expanding to get a quadratic expression in a and b, we can derive values of a and b that minimize the objective function Q (these minimizing values are denoted \hat{\alpha} and \hat{\beta}):
{\displaystyle {\begin{aligned}{\hat {\alpha }}&={\bar {y}}-{\hat {\beta }}\,{\bar {x}},\\{\hat {\beta }}&={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}\\[6pt]&={\frac {\operatorname {Cov} (x,y)}{\operatorname {Var} (x)}}\\&=r_{xy}{\frac {s_{y}}{s_{x}}}.\\[6pt]\end{aligned}}}
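Here we have introduced {\bar {x}} and {\bar {y}} as the averages of the x_i and y_i, respectively, r_{xy} as the sample correlation coefficient between x and y, and s_x and s_y as the uncorrected sample standard deviations of x and y.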
Substituting the above expressions for \hat{\alpha} and \hat{\beta} into
{\displaystyle f={\hat {\alpha }}+{\hat {\beta }}x,}
yields
{\displaystyle {\frac {f-{\bar {y}}}{s_{y}}}=r_{xy}{\frac {x-{\bar {x}}}{s_{x}}}.}
This shows that r_{xy} is the slope of the regression line of the standardized data points (and that this line passes through the origin).
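The closed-form estimators above can be checked numerically. The sketch below, in Python, is only an illustration: the helper names `fit_line` and `standardize` and the five data points are assumptions, not part of the article. It fits the line, then refits on standardized data to show that the slope becomes the correlation coefficient and the intercept vanishes.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat) from the closed-form least-squares solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def standardize(values):
    """Subtract the mean and divide by the uncorrected standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = fit_line(xs, ys)

# Refit on standardized data: the intercept is zero and the slope is r_xy.
alpha_std, r_xy = fit_line(standardize(xs), standardize(ys))
```

Multiplying `r_xy` by the ratio of the standard deviations of y and x recovers `beta_hat`, which is the last form of the slope formula.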

Generalizing the {\bar {x}} notation, we can write a horizontal bar over an expression to indicate the average value of that expression over the set of samples. For example:
{\displaystyle {\overline {xy}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}y_{i}.}
This notation allows us a concise formula for r_{xy}:
{\displaystyle r_{xy}={\frac {{\overline {xy}}-{\bar {x}}{\bar {y}}}{\sqrt {\left({\overline {x^{2}}}-{\bar {x}}^{2}\right)\left({\overline {y^{2}}}-{\bar {y}}^{2}\right)}}}.}
The coefficient of determination ("R squared") is equal to r_{xy}^2 when the model is linear with a single independent variable.
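As a rough check of the averaged-product formula and of the statement about R squared, the following Python sketch computes r_{xy} from sample averages and compares its square with 1 − SS_res/SS_tot for the fitted line. The function names and data are illustrative assumptions only.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def correlation(xs, ys):
    """r_xy written with the overline (sample average) notation."""
    x_bar, y_bar = mean(xs), mean(ys)
    xy_bar = mean([x * y for x, y in zip(xs, ys)])
    x2_bar = mean([x * x for x in xs])
    y2_bar = mean([y * y for y in ys])
    return (xy_bar - x_bar * y_bar) / sqrt(
        (x2_bar - x_bar ** 2) * (y2_bar - y_bar ** 2)
    )


def r_squared(xs, ys):
    """Coefficient of determination, 1 - SS_res / SS_tot, for the OLS line."""
    x_bar, y_bar = mean(xs), mean(ys)
    beta = (mean([x * y for x, y in zip(xs, ys)]) - x_bar * y_bar) / (
        mean([x * x for x in xs]) - x_bar ** 2
    )
    alpha = y_bar - beta * x_bar
    ss_res = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(xs, ys)
r2 = r_squared(xs, ys)   # agrees with r * r up to rounding
```

The averaged-product form subtracts nearly equal quantities, so for data with a large mean relative to its spread the centered sums used elsewhere in this article may be numerically safer.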

Simple linear regression without the intercept term (single regressor)

Sometimes it is appropriate to force the regression line to pass through the origin, because x and y are assumed to be proportional. For the model without the intercept term, y = βx, the OLS estimator for β simplifies to
{\displaystyle {\hat {\beta }}={\frac {\sum _{i=1}^{n}x_{i}y_{i}}{\sum _{i=1}^{n}x_{i}^{2}}}={\frac {\overline {xy}}{\overline {x^{2}}}}}
Substituting (x − h, y − k) in place of (x, y) gives the regression through (h, k):
{\displaystyle {\begin{aligned}{\hat {\beta }}&={\frac {\overline {(x-h)(y-k)}}{\overline {(x-h)^{2}}}}\\[6pt]&={\frac {{\overline {xy}}+k{\bar {x}}-h{\bar {y}}-hk}{{\overline {x^{2}}}-2h{\bar {x}}+h^{2}}}\\[6pt]&={\frac {{\overline {xy}}-{\bar {x}}{\bar {y}}+({\bar {x}}-h)({\bar {y}}-k)}{{\overline {x^{2}}}-{\bar {x}}^{2}+({\bar {x}}-h)^{2}}}\\[6pt]&={\frac {\operatorname {Cov} (x,y)+({\bar {x}}-h)({\bar {y}}-k)}{\operatorname {Var} (x)+({\bar {x}}-h)^{2}}},\end{aligned}}}
where Cov and Var refer to the covariance and variance of the sample data (uncorrected for bias).
The last form above demonstrates how moving the line away from the center of mass of the data points affects the slope.
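The two ways of writing the slope of a line forced through a chosen point can be compared directly. In this Python sketch the function names, the data, and the point (1, 0.5) are hypothetical choices for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def slope_through_origin(xs, ys):
    """OLS slope for the model y = beta * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def slope_through_point(xs, ys, h, k):
    """Shift the data so that (h, k) becomes the origin, then fit without intercept."""
    return slope_through_origin([x - h for x in xs], [y - k for y in ys])


def slope_through_point_cov_form(xs, ys, h, k):
    """Same slope, written with the uncorrected sample covariance and variance."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = mean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])
    var = mean([(x - x_bar) ** 2 for x in xs])
    return (cov + (x_bar - h) * (y_bar - k)) / (var + (x_bar - h) ** 2)


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

b_origin = slope_through_origin(xs, ys)
b_point = slope_through_point(xs, ys, 1.0, 0.5)
b_point_cov = slope_through_point_cov_form(xs, ys, 1.0, 0.5)

# Forcing the line through the center of mass reproduces the ordinary slope Cov/Var.
b_centroid = slope_through_point(xs, ys, mean(xs), mean(ys))
```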

Numerical properties

  1. The regression line goes through the center of mass point, {\displaystyle ({\bar {x}},\,{\bar {y}})}, if the model includes an intercept term (i.e., not forced through the origin).
  2. The sum of the residuals is zero if the model includes an intercept term:
    {\displaystyle \sum _{i=1}^{n}{\hat {\varepsilon }}_{i}=0.}
  3. The residuals and x values are uncorrelated, meaning (whether or not there is an intercept term in the model):
    {\displaystyle \sum _{i=1}^{n}x_{i}{\hat {\varepsilon }}_{i}\;=\;0}
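The three properties listed above can be verified numerically on any data set. The following Python sketch uses made-up data and helper names; it fits the model with and without an intercept and evaluates the sums in question.

```python
def fit_with_intercept(xs, ys):
    """Closed-form OLS estimates (alpha_hat, beta_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - beta * x_bar, beta


def fit_through_origin(xs, ys):
    """OLS slope for the model without an intercept term."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def residuals(xs, ys, alpha, beta):
    return [y - alpha - beta * x for x, y in zip(xs, ys)]


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Model with intercept: line passes through the centroid, both sums vanish.
alpha_hat, beta_hat = fit_with_intercept(xs, ys)
res = residuals(xs, ys, alpha_hat, beta_hat)
centroid_gap = alpha_hat + beta_hat * x_bar - y_bar
sum_res = sum(res)
sum_x_res = sum(x * e for x, e in zip(xs, res))

# Model through the origin: only the x-weighted sum is guaranteed to vanish.
beta_origin = fit_through_origin(xs, ys)
res_origin = residuals(xs, ys, 0.0, beta_origin)
sum_res_origin = sum(res_origin)
sum_x_res_origin = sum(x * e for x, e in zip(xs, res_origin))
```

For this data the plain residual sum of the no-intercept model is not zero, which illustrates why property 2 is stated only for models with an intercept.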

Model-based properties

Description of the statistical properties of estimators from the simple linear regression estimates requires the use of a statistical model. The following is based on assuming the validity of a model under which the estimates are optimal. It is also possible to evaluate the properties under other assumptions, such as inhomogeneity, but this is discussed elsewhere.

Unbiasedness

The estimators \hat{\alpha} and \hat{\beta} are unbiased.

To formalize this assertion we must define a framework in which these estimators are random variables. We consider the errors ε_i as random variables drawn independently from some distribution with mean zero. In other words, for each value of x, the corresponding value of y is generated as a mean response α + βx plus an additional random variable ε called the error term, equal to zero on average. Under such interpretation, the least-squares estimators {\hat {\alpha }} and {\hat {\beta }} will themselves be random variables whose means will equal the "true values" α and β. This is the definition of an unbiased estimator.
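Unbiasedness can be illustrated, though not proved, by simulation. The Python sketch below assumes normally distributed errors, fixed x values, and arbitrarily chosen "true" parameters α = 1 and β = 2; all names and settings are illustrative assumptions.

```python
import random


def fit_line(xs, ys):
    """Closed-form OLS estimates (alpha_hat, beta_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - beta * x_bar, beta


def average_estimates(true_alpha, true_beta, xs, sigma, trials, seed):
    """Average the OLS estimates over many simulated data sets."""
    rng = random.Random(seed)
    alpha_sum = 0.0
    beta_sum = 0.0
    for _ in range(trials):
        # y = alpha + beta * x + error, with zero-mean Gaussian error.
        ys = [true_alpha + true_beta * x + rng.gauss(0.0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        alpha_sum += a
        beta_sum += b
    return alpha_sum / trials, beta_sum / trials


xs = [float(i) for i in range(10)]
mean_alpha, mean_beta = average_estimates(
    true_alpha=1.0, true_beta=2.0, xs=xs, sigma=1.0, trials=2000, seed=0
)
```

Individual fits scatter around the true values, but their averages should approach α and β as the number of trials grows.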

Confidence intervals

The formulas given in the previous section allow one to calculate the point estimates of α and β — that is, the coefficients of the regression line for the given set of data. However, those formulas don't tell us how precise the estimates are, i.e., how much the estimators \hat{\alpha} and \hat{\beta} vary from sample to sample for the specified sample size. Confidence intervals were devised to give a plausible set of values to the estimates one might have if one repeated the experiment a very large number of times.
The standard method of constructing confidence intervals for linear regression coefficients relies on the normality assumption, which is justified if either:
  1. the errors in the regression are normally distributed (the so-called classic regression assumption), or
  2. the number of observations n is sufficiently large, in which case the estimator is approximately normally distributed.
The latter case is justified by the central limit theorem.

Normality assumption

Under the first assumption above, that of the normality of the error terms, the estimator of the slope coefficient will itself be normally distributed with mean β and variance {\displaystyle \sigma ^{2}/\sum (x_{i}-{\bar {x}})^{2},} where σ^2 is the variance of the error terms. At the same time the sum of squared residuals Q is distributed proportionally to χ^2 with n − 2 degrees of freedom, and independently from \hat{\beta}. This allows us to construct a t-value
{\displaystyle t={\frac {{\hat {\beta }}-\beta }{s_{\hat {\beta }}}}\ \sim \ t_{n-2},}
where
{\displaystyle s_{\hat {\beta }}={\sqrt {\frac {{\frac {1}{n-2}}\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{\,2}}{\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}}}
is the standard error of the estimator \hat{\beta}.

This t-value has a Student's t-distribution with n − 2 degrees of freedom. Using it we can construct a confidence interval for β:
{\displaystyle \beta \in \left[{\hat {\beta }}-s_{\hat {\beta }}t_{n-2}^{*},\ {\hat {\beta }}+s_{\hat {\beta }}t_{n-2}^{*}\right],}
at confidence level (1 − γ), where {\displaystyle t_{n-2}^{*}} is the {\displaystyle \scriptstyle \left(1\;-\;{\frac {\gamma }{2}}\right){\text{-th}}} quantile of the t_{n-2} distribution. For example, if γ = 0.05 then the confidence level is 95%.

Similarly, the confidence interval for the intercept coefficient α is given by
{\displaystyle \alpha \in \left[{\hat {\alpha }}-s_{\hat {\alpha }}t_{n-2}^{*},\ {\hat {\alpha }}+s_{\hat {\alpha }}t_{n-2}^{*}\right],}
at confidence level (1 − γ), where
{\displaystyle s_{\hat {\alpha }}=s_{\hat {\beta }}{\sqrt {{\frac {1}{n}}\sum _{i=1}^{n}x_{i}^{2}}}={\sqrt {{\frac {1}{n(n-2)}}\left(\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{\,2}\right){\frac {\sum _{i=1}^{n}x_{i}^{2}}{\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}}}}
The US "changes in unemployment – GDP growth" regression with the 95% confidence bands.
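The interval formulas for the slope and intercept can be sketched in Python as follows. The function name and data are hypothetical, and the t quantile is passed in as a parameter rather than computed, because the Python standard library has no Student's t quantile function; the value 3.182 used here is the usual table value for the 0.975 quantile with 3 degrees of freedom.

```python
from math import sqrt


def coefficient_intervals(xs, ys, t_star):
    """Confidence intervals for alpha and beta, given the t quantile t_star."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    ssr = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))
    # Standard errors of the slope and of the intercept.
    s_beta = sqrt(ssr / (n - 2) / sxx)
    s_alpha = s_beta * sqrt(sum(x * x for x in xs) / n)
    return {
        "alpha": (alpha_hat - t_star * s_alpha, alpha_hat + t_star * s_alpha),
        "beta": (beta_hat - t_star * s_beta, beta_hat + t_star * s_beta),
    }


# Hypothetical data with n = 5, so n - 2 = 3 degrees of freedom.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_STAR_3DF = 3.182  # table value, 0.975 quantile of Student's t with 3 d.o.f.

intervals = coefficient_intervals(xs, ys, T_STAR_3DF)
beta_low, beta_high = intervals["beta"]
alpha_low, alpha_high = intervals["alpha"]
```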

The confidence intervals for α and β give us the general idea where these regression coefficients are most likely to be. For example, in the Okun's law regression shown here the point estimates are
{\displaystyle {\hat {\alpha }}=0.859,\qquad {\hat {\beta }}=-1.817.}
The 95% confidence intervals for these estimates are
{\displaystyle \alpha \in \left[0.76,0.96\right],\qquad \beta \in \left[-2.06,-1.58\right].}
In order to represent this information graphically, in the form of the confidence bands around the regression line, one has to proceed carefully and account for the joint distribution of the estimators. It can be shown that at confidence level (1 − γ) the confidence band has hyperbolic form given by the equation
{\displaystyle (\alpha +\beta \xi )\in \left[{\hat {\alpha }}+{\hat {\beta }}\xi \pm t_{n-2}^{*}{\sqrt {\left({\frac {1}{n-2}}\sum {\hat {\varepsilon }}_{i}^{\,2}\right)\cdot \left({\frac {1}{n}}+{\frac {(\xi -{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}\right)}}\right].}
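The hyperbolic shape of the band follows from the (ξ − x̄)² term: the band is narrowest at the mean of the x values and widens away from it. A minimal Python sketch, with an assumed function name, made-up data, and a tabulated t quantile supplied by the caller:

```python
from math import sqrt


def confidence_band(xs, ys, xi, t_star):
    """Lower and upper band limits for alpha + beta * xi at the point xi."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    ssr = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))
    half_width = t_star * sqrt(
        (ssr / (n - 2)) * (1.0 / n + (xi - x_bar) ** 2 / sxx)
    )
    center = alpha_hat + beta_hat * xi
    return center - half_width, center + half_width


# Hypothetical data; 3.182 is the table value of the 0.975 t quantile, 3 d.o.f.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

low_mid, high_mid = confidence_band(xs, ys, 3.0, 3.182)    # at the mean of x
low_edge, high_edge = confidence_band(xs, ys, 5.0, 3.182)  # at the edge of the data
width_mid = high_mid - low_mid
width_edge = high_edge - low_edge
```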

Asymptotic assumption

The alternative second assumption states that when the number of points in the dataset is "large enough", the law of large numbers and the central limit theorem become applicable, and then the distribution of the estimators is approximately normal. Under this assumption all formulas derived in the previous section remain valid, with the only exception that the quantile t_{n-2}^{*} of Student's t distribution is replaced with the quantile q^{*} of the standard normal distribution. Occasionally the fraction 1/(n − 2) is replaced with 1/n. When n is large such a change does not alter the results appreciably.
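To get a feel for the size of these two substitutions, the Python sketch below computes the standard normal quantile with the standard library and compares it with the t quantile for 13 degrees of freedom quoted in the numerical example of this article (2.1604). The helper name `scale_ratio` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

gamma = 0.05

# Quantile q* of the standard normal distribution at level 1 - gamma/2.
q_star = NormalDist().inv_cdf(1 - gamma / 2)

# t quantile for 13 degrees of freedom, as quoted in the numerical example.
t_star_13 = 2.1604

# How much wider the t-based interval is than the normal-based one for n = 15.
widening = t_star_13 / q_star


def scale_ratio(n):
    """Factor by which a standard error shrinks if 1/(n - 2) is replaced with 1/n."""
    return sqrt((n - 2) / n)


ratio_small = scale_ratio(15)    # noticeable for a small sample
ratio_large = scale_ratio(1000)  # close to 1 for a large sample
```

For n = 15 the t-based interval is about ten percent wider, so the normal approximation is better reserved for larger samples.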

Numerical example

This data set gives average masses for women as a function of their height in a sample of American women of age 30–39. Although the OLS article argues that it would be more appropriate to run a quadratic regression for this data, the simple linear regression model is applied here instead.

Height (m), x_i 1.47 1.50 1.52 1.55 1.57 1.60 1.63 1.65 1.68 1.70 1.73 1.75 1.78 1.80 1.83
Mass (kg), y_i 52.21 53.12 54.48 55.84 57.20 58.57 59.93 61.29 63.11 64.47 66.28 68.10 69.92 72.19 74.46

There are n = 15 points in this data set. Hand calculations would be started by finding the following five sums:
{\displaystyle {\begin{aligned}&S_{x}=\sum x_{i}=24.76,\quad S_{y}=\sum y_{i}=931.17\\&S_{xx}=\sum x_{i}^{2}=41.0532,\quad S_{xy}=\sum x_{i}y_{i}=1548.2453,\quad S_{yy}=\sum y_{i}^{2}=58498.5439\end{aligned}}}
These quantities would be used to calculate the estimates of the regression coefficients, and their standard errors.
{\displaystyle {\begin{aligned}{\hat {\beta }}&={\frac {nS_{xy}-S_{x}S_{y}}{nS_{xx}-S_{x}^{2}}}=61.272\\{\hat {\alpha }}&={\frac {1}{n}}S_{y}-{\hat {\beta }}{\frac {1}{n}}S_{x}=-39.062\\s_{\varepsilon }^{2}&={\frac {1}{n(n-2)}}\left[nS_{yy}-S_{y}^{2}-{\hat {\beta }}^{2}(nS_{xx}-S_{x}^{2})\right]=0.5762\\s_{\hat {\beta }}^{2}&={\frac {ns_{\varepsilon }^{2}}{nS_{xx}-S_{x}^{2}}}=3.1539\\s_{\hat {\alpha }}^{2}&=s_{\hat {\beta }}^{2}{\frac {1}{n}}S_{xx}=8.63185\end{aligned}}}
The 0.975 quantile of Student's t-distribution with 13 degrees of freedom is t_{13}^{*} = 2.1604, and thus the 95% confidence intervals for α and β are
{\displaystyle {\begin{aligned}&\alpha \in [\,{\hat {\alpha }}\mp t_{13}^{*}s_{\hat {\alpha }}\,]=[\,{-45.4},\ {-32.7}\,]\\&\beta \in [\,{\hat {\beta }}\mp t_{13}^{*}s_{\hat {\beta }}\,]=[\,57.4,\ 65.1\,]\end{aligned}}}
The product-moment correlation coefficient might also be calculated:
{\displaystyle {\hat {r}}={\frac {nS_{xy}-S_{x}S_{y}}{\sqrt {(nS_{xx}-S_{x}^{2})(nS_{yy}-S_{y}^{2})}}}=0.9946}
This example also demonstrates that sophisticated calculations will not overcome the use of badly prepared data. The heights were originally given in inches, and have been converted to the nearest centimetre. Since the conversion has introduced rounding error, this is not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding: if this is done, the results become
\hat\beta = 61.6746, \qquad \hat\alpha = -39.7468.
Thus a seemingly small variation in the data has a real effect.
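The hand calculation above can be reproduced with a short Python sketch. The function name and dictionary keys are illustrative; the data and the t quantile 2.1604 are taken from the text. The second call repeats the fit after recovering the original inches with round(x / 0.0254) and converting back to metres without rounding.

```python
from math import sqrt

heights_m = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
             1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
masses_kg = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
             63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
T_STAR_13 = 2.1604  # 0.975 quantile of Student's t with 13 d.o.f., from the text


def regression_summary(xs, ys, t_star):
    """Estimates, variances, intervals and correlation from the five sums."""
    n = len(xs)
    s_x, s_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_yy = sum(y * y for y in ys)
    denom = n * s_xx - s_x ** 2
    beta = (n * s_xy - s_x * s_y) / denom
    alpha = s_y / n - beta * s_x / n
    s_eps2 = (n * s_yy - s_y ** 2 - beta ** 2 * denom) / (n * (n - 2))
    s_beta2 = n * s_eps2 / denom
    s_alpha2 = s_beta2 * s_xx / n
    r = (n * s_xy - s_x * s_y) / sqrt(denom * (n * s_yy - s_y ** 2))
    return {
        "beta": beta,
        "alpha": alpha,
        "s_eps2": s_eps2,
        "s_beta2": s_beta2,
        "s_alpha2": s_alpha2,
        "beta_ci": (beta - t_star * sqrt(s_beta2), beta + t_star * sqrt(s_beta2)),
        "alpha_ci": (alpha - t_star * sqrt(s_alpha2), alpha + t_star * sqrt(s_alpha2)),
        "r": r,
    }


summary = regression_summary(heights_m, masses_kg, T_STAR_13)

# Recover the original whole inches, then convert to metres without rounding.
inches = [round(x / 0.0254) for x in heights_m]
exact_heights_m = [i * 0.0254 for i in inches]
summary_exact = regression_summary(exact_heights_m, masses_kg, T_STAR_13)
```

Comparing `summary["beta"]` with `summary_exact["beta"]` shows the shift in the slope that the rounding to centimetres introduces.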

From Wikipedia, the free encyclopedia   ...