You’ve
read the headlines: quantum computers are going to cure disease by
discovering new pharmaceuticals! They’re going to pore through all the
world’s data and find solutions to problems like poverty and inequality!
Alternatively, they might not do any of that. We’re really not sure what a quantum computer will even look like, but boy are we excited.
It often feels like quantum computers are in their own quantum state —
they’re revolutionizing the world, but are still a distant pipe dream.
Now, though, the National Science Foundation has plans to pluck
quantum computers from the realm of the fantastic and drop them squarely
in its research labs. And it’s willing to pay an awful lot to do so.
In August, the federal agency announced the Software-Tailored Architecture for Quantum co-design (STAQ) project.
Physicists, engineers, computer scientists, and other researchers from
Duke and six other universities (including MIT and University of
California-Berkeley) will band together to embark on the five-year, $15
million mission.
The goal is to create the world’s first practical quantum computer —
one that goes beyond a proof-of-concept and actually outperforms the
best classical computers out there — from the ground up.
A little background: there are a few key differences
between a classical computer and a quantum computer. Where a classical
computer uses bits that are either in a 0 or 1 state, quantum bits, or
qubits, can also be both 1 and 0 at the same time. The basic operations that use these qubits
to transfer information or carry out a calculation are called quantum
logic gates; just as a classical circuit controls the flow of electricity
within a computer’s circuitry, these gates steer the individual qubits
via photons or trapped ions.
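To make the bit-versus-qubit distinction concrete, here is a minimal sketch in plain Python. It assumes the standard textbook picture of a qubit as a pair of complex amplitudes and uses the Hadamard gate as an example of a quantum logic gate. The function names are ours, chosen for illustration, and nothing here reflects how the STAQ hardware is actually programmed.

```python
import math

# A single-qubit state is a pair of complex amplitudes (a, b) for the
# outcomes 0 and 1, normalized so that |a|^2 + |b|^2 = 1.
ZERO = (1 + 0j, 0 + 0j)  # a qubit that is definitely 0, like a classical bit


def hadamard(state):
    """Apply the Hadamard gate, which turns a definite 0 into an equal mix of 0 and 1."""
    a, b = state
    s = 1 / math.sqrt(2)
    return (s * (a + b), s * (a - b))


def probabilities(state):
    """Return the chances of reading 0 and of reading 1 when the qubit is measured."""
    a, b = state
    return (abs(a) ** 2, abs(b) ** 2)


superposed = hadamard(ZERO)
p0, p1 = probabilities(superposed)
print(f"P(0) = {p0:.2f}, P(1) = {p1:.2f}")
```

After the gate, a measurement is equally likely to return 0 or 1, which is the sense in which the qubit is "both at the same time." Applying the same gate a second time returns the qubit to a definite 0, something a coin flip cannot do.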
In order to develop quantum computers that are actually useful,
scientists need to figure out how to improve both the hardware we use to
build the physical devices and the software we run on them. That means
figuring out how to build systems with more qubits that are less
error-prone, and determining how to sort out the correct responses to
our queries when we get lots of noise back with them. It’s likely that
part of the answer is building automated tools that can optimize how
certain algorithms are mapped onto the specific hardware, ultimately
tackling both problems at once.
To better understand what this program might produce, Futurism caught up with Kenneth Brown,
the Duke University engineer in charge of STAQ. Here’s our
conversation, which has been lightly edited and condensed for clarity.
We’ve supplemented Brown’s answers with hyperlinks.
Futurism: A lot of what we hear about quantum computing is very abstract and theoretical — there’s lots of research that might
lead towards quantum computers, but doesn’t show any clear path on how
to get there. What will your team be able to do that others haven’t been
able to do in the past?
Kenneth Brown: I think it’s important to remember that quantum computers can be made out of a wide variety of things.
I usually make an analogy to classical computers. The first classical
computers were just gears, pretty much because that was the best
technology we had. And then there was this vacuum tube phase of
classical computers that was quite useful and good. And then the
silicon transistor first appeared. And it’s important to remember that
when the silicon transistor first appeared, it couldn’t quite compete
with vacuum tubes. Sometimes I think people forget it was such an
amazing discovery.
Quantum computing is the same thing. There are lots of ways to
represent quantum information. Right now, the two technologies that have
demonstrated the most useful applications are superconducting qubits
and trapped ion qubits. They’re different and they have pluses and
minuses, but in our group, we’ve been collectively focused on these
trapped ion qubits.
With trapped ion qubits, what’s nice is that on a small scale of tens
of ions, all the qubits are directly connected. That’s very different
from a superconducting system or a solid-state system, in which you have
to talk to the qubits that are nearby. So I think we have very concrete
plans to get to 30, 32 qubits. That’s clear. We would like to extend
that to something closer to 64 or so, and that is going to require some
new research.
F: What makes this a “practical” computer compared to all of the other people working on quantum computers?
KB: I do think there are industrial efforts
pushing towards building exactly these practical devices. The thing
which really distinguishes us is being on the academic side. I think it
allows for more exploration, with the goal of making a device which
enables people to test wildly different ideas on how the architecture
should be and what applications should be on it, these sort of things.
Just to pick an example, the guys at IBM
have their quantum device. I actually collaborate with them through
other projects, and I think they’re pretty open. But right now, the way
you interact with it, you’re already at a level of abstraction [in that
people can ask things of the computer online but can’t change how it’s
programmed]. If you were thinking about totally optimizing this thing,
you can’t. They have a tradeoff: they have their computer totally open
for access on the web, but to make it stable like that, you have to turn
off some knobs. [The IBM computer, because it’s sequestered off and
intended for many researchers to use, can’t be customized to do
everything an individual might want it to.]
So our goal is to make a device reaching this practical scale where researchers can play with all the knobs.
F: How would quantum computing change things for the average person?
KB: I think in the long term, quantum computing and communication will change how we deal with encoded
information on the internet. In Google Chrome, in fact, you can already
change your cryptography to a possible post-quantum cryptography setup.
The second thing is I think people don’t think about all the ways
that molecular design impacts materials — from boring things like water
bottles to fancy things like specific new medicines. So what’s
interesting is if the quantum computer fulfills its promise to
efficiently and accurately calculate those molecular properties, that
could really change the materials and medicines we see in the future.
But on what you’re going to do on your home computer, the way I think
about it is most people use their computer to watch Netflix and
occasionally write a letter or email or whatever. Those are not places
where quantum computers really help you.
So it’s sort of funny — I don’t know what the user base would be. But
when computers were first built, people had the same impression. They
said that computers would just be for scientists doing lab work. And,
clearly, that’s no longer the case.
F: What sort of person will be able to use a quantum
computer? How do you train someone to use it, and what might the quantum
computing degree of the future look like?
KB: When I try to explain quantum computing to
someone, if they know the physics or chemistry of quantum mechanics,
then I can usually start there to explain how to do the computing side.
And the other side is also true: if people understand computing pretty
well, I can explain the extra modules that quantum computing gives you.
In the future, we probably need people trained from both of those
disciplines. We need people who have a physical sciences background who
we get up to speed on the computer science side and the opposite.
The specific thing we’re going to try to do is have this quantum
summer school, with the idea to bring in people from industry who are
maybe excellent microwave engineers or software engineers, and try to
give them enough tools so they can start to think about the extra rules
you have to think about with quantum.
F: What sorts of new research will you need to sort out before this thing can be built? What will that take?
KB: We have some ideas. In a classic computer you
work with voltage, but in quantum computing, I need to somehow carry
information from one place to another. Do messenger qubits that carry
information to other parts of the computer have to be the same type of
qubit that the rest of the computer is made of? We’re not sure yet.
A common way to think about scaling up the complexity of quantum computers is called the CCD architecture. The idea is to shuttle manageable chains of ions from point to point. That’s one possibility.
There’s been some theoretical work looking at whether you can have
photons interconnect between ion chains. The idea across all kinds of
supercomputers is to use photons as these messenger qubits. And by doing
that, you can basically have a bunch of small quantum computers wired
up by all these photons that collectively act like a larger computer.
But that’s farther out. I think getting that to work at the bandwidth
we need in the next five years would be pretty challenging. If it
happens, that would be great, but it’s probably farther out.
F: Along the way, how will you know that you’re making
tangible progress? Do you have benchmarks for knowing that you’re, say,
halfway there? How can you test to know for sure that it’s working?
KB: On the hardware side, we can increase the number
of qubits and get the gates [these are, if you recall, the things that
move ions or photons to transfer information] better and call it
tangible progress. We have a sense that we have to get, even though the
number moves, somewhere above fifty qubits to have a fighting chance.
[As of March, Google holds the record with a 72-qubit system]
At the same time, we’re going to take algorithms and applications
that we know, and we’re going to map them onto the hardware. We’re going
to try to optimize the algorithm as we map it in a way that makes the
overall application less vulnerable to noise.
Before we run these applications, we have a rule of thumb about how
often they should fail in tests and general use. But after this software
optimization that my team is working on, ideally it will fail much less
often. That helps us explore more in the algorithm space because it
gives you confidence that you can push quantum computers towards
more complicated systems. I think it’s important to note that we have
the space to be very exploratory, to look at problems people aren’t
thinking about.
F: What’s the worst misconception about quantum computers that you run into? What do people always seem to get wrong about them?
KB: The one misconception is that it’s magic. Quantum computers aren’t magic; they don’t allow you to solve all problems.
Here’s the thing — in classical computing, we have the sense that
there are some problems that are easy and some problems which are really
hard, which means we can’t solve them in polynomial time [a computer science term used to denote whether a computer is able to complete a task quickly].
It turns out we spend a lot of our computing power trying to solve
the problems that we can’t solve in polynomial time, and we just have
approximations.
Quantum computers do allow you to solve some of the problems which
are intractable on a classic computer, but they don’t solve them all.
Usually, the thing which drives me crazy is when a quantum computer
article says they can solve all problems instantly because they do
infinite parallel calculations at once.
I’m really excited for when we have large scale quantum computers. With some problems — the famous example is the Traveling Salesman Problem — we know we can’t solve it for all
possible routes of salesmen, but we have to solve it anyway. The
classical computer does the best it can, and then when it blows it,
nobody’s upset. You’re like, ‘oh okay, well it’s going to get it wrong
some of the time.’
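As a rough illustration of why problems like this become intractable, the sketch below solves a tiny Traveling Salesman instance by checking every possible route. The distance table is made up, the function names are ours, and exhaustive search is only the most naive approach; it is not the method classical solvers use in practice, and it is not something Brown describes.

```python
import itertools
import math


def tour_length(order, dist):
    """Total length of a round trip that visits the cities in the given order."""
    n = len(order)
    return sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))


def brute_force_tsp(dist):
    """Try every tour that starts at city 0 and return (length, order) of the shortest."""
    n = len(dist)
    best = None
    for perm in itertools.permutations(range(1, n)):
        order = (0,) + perm
        length = tour_length(order, dist)
        if best is None or length < best[0]:
            best = (length, order)
    return best


# Hypothetical symmetric distances between four cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

best_length, best_order = brute_force_tsp(DIST)
print("Shortest tour:", best_order, "length:", best_length)

# With the start city fixed there are (n - 1)! tours to check,
# so 20 cities already means 19! candidate routes.
tours_for_20_cities = math.factorial(19)
print("Tours to check for 20 cities:", tours_for_20_cities)
```

Four cities mean only six routes to compare, but the count grows factorially with each added city, which is why real solvers settle for approximations on large instances.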
When we have large-scale quantum computers, we can test algorithms
like that more accurately. We’ll know we can solve the classical
problem, just occasionally the new computer gets bogged down.
I’m a big optimist. I guess that’s how you end up working in this kind of field.
Open science is the movement to make scientific research, data
and dissemination accessible to all levels of an inquiring society,
amateur or professional. Open science is transparent and accessible
knowledge that is shared and developed through collaborative networks. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge.
Open science began in the 17th century with the advent of the academic journal,
when the societal demand for access to scientific knowledge reached a
point where it became necessary for groups of scientists to share
resources with each other so that they could collectively do their work. In modern times there is debate about the extent to which scientific information should be shared.
The conflict is between the desire of scientists to have access to
shared resources versus the desire of individual entities to profit when
other entities partake of their resources. Additionally, the status of open access and resources that are available for its promotion are likely to differ from one field of academic inquiry to another.
Background
Science
is broadly understood as collecting, analyzing, publishing,
reanalyzing, critiquing, and reusing data. Proponents of open science
identify a number of barriers that impede or dissuade the broad
dissemination of scientific data.
These include financial paywalls
of for-profit research publishers, restrictions on usage applied by
publishers of data, poor formatting of data or use of proprietary
software that makes it difficult to re-purpose, and cultural reluctance
to publish data for fears of losing control of how the information is
used.
Open Science Taxonomy
According to the FOSTER taxonomy, open science can often include aspects of open access, open data and the open source movement, since modern science requires software in order to process data and information. Open research computation also addresses the problem of reproducibility of scientific results. The FOSTER Open Science taxonomy is available in RDF/XML and as a high-resolution image.
Types
The term
"open science" does not have any one fixed definition or
operationalization. On the one hand, it has been referred to as a
"puzzling phenomenon".
On the other hand, the term has been used to encapsulate a series of
principles that aim to foster scientific growth and its complementary
access to the public. Two influential sociologists, Benedikt Fecher and
Sascha Friesike, have created multiple "schools of thought" that
describe the different interpretations of the term.
According to Fecher and Friesike ‘Open Science’ is an umbrella
term for various assumptions about the development and dissemination of
knowledge. To show the term’s multitudinous perceptions, they
differentiate between five Open Science schools of thought:
Infrastructure School
The
infrastructure school is founded on the assumption that "efficient"
research depends on the availability of tools and applications.
Therefore, the "goal" of the school is to promote the creation of openly
available platforms, tools, and services for scientists. Hence, the
infrastructure school is concerned with the technical infrastructure
that promotes the development of emerging and developing research
practices through the use of the internet, including the use of software
and applications, in addition to conventional computing networks. In
that sense, the infrastructure school regards open science as a
technological challenge. The infrastructure school is tied closely with
the notion of "cyberscience", which describes the trend of applying
information and communication technologies to scientific research, which
has led to an amicable development of the infrastructure school.
Specific elements of this prosperity include increasing collaboration
and interaction between scientists, as well as the development of
"open-source science" practices. The sociologists discuss two central
trends in the Infrastructure school:
1. Distributed computing: This trend encapsulates practices that
outsource complex, process-heavy scientific computing to a network of
volunteer computers around the world. The example that the
sociologists cite in their paper is the Open Science Grid, which
enables the development of large-scale projects that require
high-volume data management and processing, which is accomplished
through a distributed computer network. Moreover, the grid provides the
necessary tools that the scientists can use to facilitate this process.
2. Social and Collaboration Networks for Scientists: This trend
encapsulates the development of software that makes interaction with
other researchers and scientific collaborations much easier than
traditional, non-digital practices. Specifically, the trend is focused
on implementing newer Web 2.0 tools to facilitate research related
activities on the internet. De Roure and colleagues (2008) list four key capabilities which they believe define a Social Virtual Research Environment (SVRE):
The SVRE should primarily aid the management and sharing of
research objects. The authors define these to be a variety of digital
commodities that are used repeatedly by researchers.
Second, the SVRE should have inbuilt incentives for researchers to make their research objects available on the online platform.
Third, the SVRE should be "open" as well as "extensible", implying
that different types of digital artifacts composing the SVRE can be
easily integrated.
Fourth, the authors propose that the SVRE is more than a simple
storage tool for research information. Instead, the researchers propose
that the platform should be "actionable". That is, the platform should
be built in such a way that research objects can be used in the conduct
of research as opposed to simply being stored.
Measurement School
The
measurement school, in the view of the authors, deals with developing
alternative methods to determine scientific impact. This school
acknowledges that measurements of scientific impact are crucial to a
researcher's reputation, funding opportunities, and career development.
Hence, the authors argue that any discourse about Open Science
pivots around developing a robust measure of scientific impact in the
digital age. The authors then discuss other research indicating support
for the measurement school. The three key currents of previous
literature discussed by the authors are:
Peer review is described as being time-consuming.
The impact of an article, tied to the name of the authors of the
article, is related more to the circulation of the journal rather than
the overall quality of the article itself.
New publishing formats that are closely aligned with the philosophy
of Open Science are rarely found in the format of a journal that allows
for the assignment of the impact factor.
Hence, this school argues that there are faster impact measurement
technologies that can account for a range of publication types as well
as social media web coverage of a scientific contribution to arrive at a
complete evaluation of how impactful the science contribution was. The
gist of the argument for this school is that hidden uses like reading,
bookmarking, sharing, discussing and rating are traceable activities,
and these traces can and should be used to develop a newer measure of
scientific impact. The umbrella term for this new type of impact
measurement is altmetrics, coined in a 2011 article by Priem et
al.
Markedly, the authors discuss evidence that altmetrics differ from
traditional webometrics which are slow and unstructured. Altmetrics are
proposed to rely upon a greater set of measures that account for tweets,
blogs, discussions, and bookmarks. The authors claim that the existing
literature has often proposed that altmetrics should also encapsulate
the scientific process, and measure the process of research and
collaboration to create an overall metric. However, the authors are
explicit in their assessment that few papers offer methodological
details as to how to accomplish this. The authors use this and the
general dearth of evidence to conclude that research in the area of
altmetrics is still in its infancy.
Public School
According
to the authors, the central concern of the school is to make science
accessible to a wider audience. The inherent assumption of this school,
as described by the authors, is that the newer communication
technologies such as Web 2.0 allow scientists to open up the research
process and also allow scientists to better prepare their "products of
research" for interested non-experts. Hence, the school is characterized
by two broad streams: one argues for the access of the research process
to the masses, whereas the other argues for increased access to the
scientific product to the public.
Accessibility to the Research Process: Communication technology
allows not only for the constant documentation of research but also
promotes the inclusion of many different external individuals in the
process itself. The authors cite citizen science, the
participation of non-scientists and amateurs in research. The authors
discuss instances in which gaming tools allow scientists to harness the
brain power of a volunteer workforce to run through several permutations
of protein-folded structures. This allows for scientists to eliminate
many more plausible protein structures while also "enriching" the
citizens about science. The authors also discuss a common criticism of
this approach: the amateur nature of the participants threatens to
undermine the scientific rigor of experimentation.
Comprehensibility of the Research Result: This stream of research
concerns itself with making research understandable for a wider
audience. The authors describe a host of authors that promote the use of
specific tools for scientific communication, such as microblogging
services, to direct users to relevant literature. The authors claim that
this school proposes that it is the obligation of every researcher to
make their research accessible to the public. The authors then proceed
to discuss if there is an emerging market for brokers and mediators of
knowledge that is otherwise too complicated for the public to grasp
effortlessly.
Democratic School
The
democratic school concerns itself with the concept of access to
knowledge. As opposed to focusing on the accessibility of research and
its understandability, advocates of this school focus on the access of
products of research to the public. The central concern of the school is
with the legal and other obstacles that hinder the access of research
publications and scientific data to the public. The authors argue that
proponents of this school assert that any research product should be
freely available. The authors argue that the underlying notion of this
school is that everyone has the same, equal right of access to
knowledge, especially in the instances of state-funded experiments and
data. The authors categorize two central currents that characterize
this school: Open Access and Open Data.
Open Data: The authors discuss existing attitudes in the field
that rebel against the notion that publishing journals should claim
copyright over experimental data, which prevents the re-use of data and
therefore lowers the overall efficiency of science in general. The claim
is that journals have no use of the experimental data and that allowing
other researchers to utilize this data will be fruitful. The authors
cite other literature streams that discovered that only a quarter of
researchers agree to share their data with other researchers because of
the effort required for compliance.
Open Access to Research Publication: According to this school, there
is a gap between the creation and sharing of knowledge. Proponents
argue, as the authors describe, that even though scientific knowledge doubles
every 5 years, access to this knowledge remains limited. These
proponents consider access to knowledge as a necessity for human
development, especially in the economic sense.
Pragmatic School
The
pragmatic school considers Open Science as the possibility to make
knowledge creation and dissemination more efficient by increasing the
collaboration throughout the research process. Proponents argue that
science could be optimized by modularizing the process and opening up
the scientific value chain. ‘Open’ in this sense follows very much the
concept of open innovation.
This view, for instance, transfers the outside-in (including external
knowledge in the production process) and inside-out (spillovers from the
formerly closed production process) principles to science. Web 2.0 is considered a set of helpful tools that can foster collaboration (sometimes also referred to as Science 2.0). Further, citizen science
is seen as a form of collaboration that includes knowledge and
information from non-scientists. Fecher and Friesike describe data
sharing as an example of the pragmatic school as it enables researchers
to use other researchers’ data to pursue new research questions or to
conduct data-driven replications.
History
The widespread adoption of the institution of the scientific journal
marks the beginning of the modern concept of open science. Before this
time societies pressured scientists into secretive behaviors.
Before journals
Before the advent of scientific journals, scientists had little to gain and much to lose by publicizing scientific discoveries. Many scientists, including Galileo, Kepler, Isaac Newton, Christiaan Huygens, and Robert Hooke,
made claim to their discoveries by describing them in papers coded in
anagrams or cyphers and then distributing the coded text.
Their intent was to develop their discovery into something off which
they could profit, then reveal their discovery to prove ownership when
they were prepared to make a claim on it.
The system of not publicizing discoveries caused problems because
discoveries were not shared quickly and because it sometimes was
difficult for the discoverer to prove priority. Newton and Gottfried Leibniz both claimed priority in discovering calculus. Newton said that he wrote about calculus in the 1660s and 1670s, but did not publish until 1693. Leibniz published "Nova Methodus pro Maximis et Minimis",
a treatise on calculus, in 1684. Debates over priority are inherent in
systems where science is not published openly, and this was problematic
for scientists who wanted to benefit from priority.
These cases are representative of a system of aristocratic
patronage in which scientists received funding to develop either
immediately useful things or to entertain.
In this sense, funding of science gave prestige to the patron in the
same way that funding of artists, writers, architects, and philosophers
did.
Because of this, scientists were under pressure to satisfy the desires
of their patrons, and discouraged from being open with research which
would bring prestige to persons other than their patrons.
Emergence of academies and journals
Eventually the individual patronage system ceased to provide the scientific output which society began to demand. Single patrons could not sufficiently fund scientists, who had unstable careers and needed consistent funding.
The development which changed this was a trend to pool research by
multiple scientists into an academy funded by multiple patrons. In 1660 England established the Royal Society and in 1666 the French established the French Academy of Sciences.
Between the 1660s and 1793, governments gave official recognition to 70
other scientific organizations modeled after those two academies. In 1665, Henry Oldenburg became the editor of Philosophical Transactions of the Royal Society, the first academic journal devoted to science, and the foundation for the growth of scientific publishing. By 1699 there were 30 scientific journals; by 1790 there were 1052. Since then publishing has expanded at even greater rates.
Popular Science Writing
The
first popular science periodical of its kind was published in 1872,
under a suggestive name that is still a modern portal for offering
science journalism: Popular Science. The magazine claims to have
documented the invention of the telephone, the phonograph, the electric
light and the onset of automobile technology. The magazine goes so far
as to claim that the "history of Popular Science is a true reflection of
humankind's progress over the past 129+ years".
Discussions of popular science writing most often center their
arguments on some type of "Science Boom". A recent historiographic
account of popular science traces mentions of the term "science boom" to
Daniel Greenberg's Science and Government Reports in 1979, which posited
that "Scientific magazines are bursting out all over". Similarly, this
account discusses the publication Time, and its cover story of Carl
Sagan in 1980, as propagating the claim that popular science has "turned
into enthusiasm". Crucially, this secondary account asks the important question of
what was considered popular "science" to begin with. The paper claims
that any account of how popular science writing bridged the gap between
the informed masses and the expert scientists must first consider who
was considered a scientist to begin with.
Collaboration among academies
In
modern times many academies have pressured researchers at publicly
funded universities and research institutions to engage in a mix of
sharing research and making some technological developments proprietary.
Some research products have the potential to generate commercial
revenue, and in hope of capitalizing on these products, many research
institutions withhold information and technology which otherwise would
lead to overall scientific advancement if other research institutions
had access to these resources.
It is difficult to predict the potential payouts of technology or to
assess the costs of withholding it, but there is general agreement that
the benefit to any single institution of holding technology is not as
great as the cost of withholding it from all other research
institutions.
Coining of phrase "Open Science"
The exact phrase "Open Science" was coined by Steve Mann
in 1998, at which time he also registered the domain names
openscience.com and openscience.org, which he sold to egruyter.com in
2011.
Politics
In
many countries, governments fund some science research. Scientists often
publish the results of their research by writing articles and donating
them to be published in scholarly journals, which frequently are
commercial. Public entities such as universities and libraries subscribe
to these journals. Michael Eisen, a founder of the Public Library of Science,
has described this system by saying that "taxpayers who already paid
for the research would have to pay again to read the results."
In December 2011, some United States legislators introduced a bill called the Research Works Act,
which would prohibit federal agencies from issuing grants with any
provision requiring that articles reporting on taxpayer-funded research
be published for free to the public online.
Darrell Issa, a co-sponsor of the bill, explained the bill by saying
that "Publicly funded research is and must continue to be absolutely
available to the public. We must also protect the value added to
publicly funded research by the private sector and ensure that there is
still an active commercial and non-profit research community." One response to this bill was protests from various researchers; among them was a boycott of commercial publisher Elsevier called The Cost of Knowledge.
The Dutch Presidency of the Council of the European Union called for action in April 2016 to migrate European Commission funded research to Open Science. European Commissioner Carlos Moedas introduced the Open Science Cloud at the Open Science Conference in Amsterdam on
April 4–5. The Amsterdam Call for Action on Open Science, a living document outlining concrete actions for the European Community to move to Open Science, was also presented during this meeting.
Reaction
Arguments against
The open sharing of research data is not widely practiced
Arguments against open science tend to advance several concerns. These include
the potential for some scholars to capitalize on data other scholars
have worked hard to collect, without collecting data themselves; the
potential for less qualified individuals to misuse open data; and
arguments that novel data are more critical than reproducing or
replicating older findings.
Too much unsorted information overwhelms scientists
Some scientists find inspiration in their own thoughts by restricting the amount of information they get from others. Alexander Grothendieck
has been cited as a scientist who wanted to learn with restricted
influence when he said that he wanted to "reach out in (his) own way to
the things (he) wished to learn, rather than relying on the notions of
consensus."
Potential misuse
In 2011, Dutch researchers announced their intention to publish a research paper in the journal Science describing the creation of a strain of H5N1 influenza which can be easily passed between ferrets, the mammals which most closely mimic the human response to the flu. The announcement triggered a controversy in both political and scientific circles about the ethical implications of publishing scientific data which could be used to create biological weapons. These events are examples of how science data could potentially be misused. Scientists have collaboratively agreed to limit their own fields of inquiry on occasions such as the Asilomar conference on recombinant DNA in 1975, and a proposed 2015 worldwide moratorium on a human-genome-editing technique.
The public will misunderstand science data
In 2009 NASA launched the Kepler
spacecraft and promised that they would release collected data in June
2010. Later they decided to postpone release so that their scientists
could look at it first. Their rationale was that non-scientists might
unintentionally misinterpret the data, and NASA scientists thought it
would be preferable for them to be familiar with the data in advance so
that they could report on it accurately.
Increasing the scale of science will make verification of any discovery more difficult
When more people report data it will take longer for anyone to
consider all data, and perhaps more data of lower quality, before
drawing any conclusion.
Low-quality science
Post-publication peer review, a staple of open science, has been
criticized as promoting the production of papers that are lower in
quality and extremely voluminous.
Specifically, critics assert that because preprint servers do not
guarantee quality, individual readers will find it difficult to assess
the veracity of papers. This will lead to rippling effects of false science,
akin to the recent epidemic of false news, propagated with ease on
social media websites.
A commonly cited solution to this problem is a new format in which
everything is allowed to be published but a subsequent filter-curator
model is imposed to ensure that all publications meet some basic
standard of quality.
Arguments in favor
A
number of scholars across disciplines have advanced various arguments
in favor of open science. These generally focus on the perceived value
of open science in improving the transparency and validity of research
as well as in regards to public ownership of science, particularly that
which is publicly funded. For example, in January 2014 J. Christopher
Bare published a comprehensive "Guide to Open Science".
Likewise in January, 2017, a group of scholars known for advocating
open science published a "manifesto" for open science in the journal Nature. In November 2017, a group of early career researchers published their "manifesto" in the journal Genome Biology,
stating that it is their task to change scientific research into open
scientific research and commit to Open Science principles.
Open access publication of research reports and data allows for rigorous peer-review
An article published by a team of NASA astrobiologists in 2010 in Science reported a bacterium known as GFAJ-1 that could purportedly metabolize arsenic (unlike any previously known species of lifeform). This finding, along with NASA's claim that the paper "will impact the search for evidence of extraterrestrial life", met with criticism within the scientific community. Much of the scientific commentary and critique around this issue took place in public forums, most notably on Twitter, where hundreds of scientists and non-scientists created a hashtag community around the hashtag #arseniclife.
University of British Columbia astrobiologist Rosie Redfield, one of
the most vocal critics of the NASA team's research, also submitted a
draft of a research report of a study that she and colleagues conducted
which contradicted the NASA team's findings; the draft report appeared
in arXiv,
an open-research repository, and Redfield called in her lab's research
blog for peer review both of their research and of the NASA team's
original paper.
Researcher Jeff Rouder defined Open Science as "endeavoring to preserve
the rights of others to reach independent conclusions about your data
and work".
Science is publicly funded so all results of the research should be publicly available
Public funding of research has long been cited as one of the primary reasons for providing Open Access to research articles.
Since there is significant value in other parts of the research such as
code, data, protocols, and research proposals a similar argument is
made that since these are publicly funded, they should be publicly
available under a creative commons licence.
Open Science will make science more reproducible and transparent
Increasingly the reproducibility of science is being questioned and the term "reproducibility crisis" has been coined.
For example, psychologist Stuart Vyse notes that "(r)ecent research
aimed at previously published psychology studies has
demonstrated--shockingly--that a large number of classic phenomena
cannot be reproduced, and the popularity of p-hacking is thought to be one of the culprits." Open Science approaches are proposed as one way to help increase the reproducibility of work as well as to help mitigate against manipulation of data.
Open Science has more impact
There are several components to impact in research, many of which are hotly debated.
However, under traditional scientific metrics, parts of Open Science such
as Open Access and Open Data have been shown to outperform their traditional
counterparts.
Open Science will help answer uniquely complex questions
Recent arguments in favor of Open Science have maintained that
Open Science is a necessary tool to begin answering immensely complex
questions, such as the neural basis of consciousness.
The typical argument holds that these types of
investigations are too complex to be carried out by any one individual,
and therefore, they must rely on a network of open scientists to be
accomplished. By default, the nature of these investigations also makes
this "open science" a form of "big science".
Organizations and projects
Big scientific projects are more likely to practice open science than small projects.
Different projects conduct, advocate, develop tools for, or fund open
science, and many organizations run multiple projects. For example, the
Allen Institute for Brain Science conducts numerous open science projects while the Center for Open Science
has projects to conduct, advocate, and create tools for open science.
Open science is stimulating the emergence of sub-branches such as open synthetic biology and open therapeutics.
Organizations have extremely diverse sizes and structures. The Open Knowledge Foundation
(OKF) is a global organization sharing large data catalogs, running
face to face conferences, and supporting open source software projects.
In contrast, Blue Obelisk is an informal group of chemists and associated cheminformatics projects. The tableau of organizations is dynamic with some organizations becoming defunct, e.g., Science Commons, and new organizations trying to grow, e.g., the Self-Journal of Science. Common organizing forces include the knowledge domain, type of service provided, and even geography, e.g., OCSDNet's concentration on the developing world.
Conduct
Many
open science projects focus on gathering and coordinating encyclopedic
collections of large amounts of organized data. The Allen Brain Atlas maps gene expression in human and mouse brains; the Encyclopedia of Life documents all the terrestrial species; the Galaxy Zoo classifies galaxies; the International HapMap Project maps the haplotypes of the human genome; the Monarch Initiative makes available integrated public model organism and clinical data; and the Sloan Digital Sky Survey
regularizes and publishes data sets from many sources. All these
projects accrete information provided by many different researchers
with different standards of curation and contribution.
Other projects are organized around completion of projects that require extensive collaboration. For example, OpenWorm seeks to make a cellular level simulation of a roundworm, a multidisciplinary project. The Polymath Project
seeks to solve difficult mathematical problems by enabling faster
communications within the discipline of mathematics. The Collaborative
Replications and Education project recruits undergraduate students as citizen scientists by offering funding. Each project defines its needs for contributors and collaboration.
Other advocates concentrate on educating scientists about
appropriate open science software tools. Education is available as
training seminars, e.g., the Software Carpentry project; as domain specific training materials, e.g., the Data Carpentry project; and as materials for teaching graduate classes, e.g., the Open Science Training Initiative. Many organizations also provide education in the general principles of open science.
Within scholarly societies there are also sections and interest groups that promote open science practices. The Ecological Society of America has an Open Science Section. Similarly, the Society for American Archaeology has an Open Science Interest Group.
Publishing
Replacing
the current scientific publishing model is one goal of open science.
High costs to access literature gave rise to protests such as The Cost of Knowledge and to sharing papers without publisher consent, e.g., Sci-Hub and ICanHazPDF. New organizations are experimenting with the open access model: the Public Library of Science, or PLOS, is creating a library of open access journals and scientific literature; F1000Research provides open publishing and open peer review for the life sciences; figshare archives and shares images, readings, and other data; and arXiv provides electronic preprints across many fields. Many individual journals are experimenting with the model as well. Other publishing experiments include delayed and hybrid models.
Software
A variety of computer resources support open science. These include software like the Open Science Framework from the Center for Open Science to manage project information, data archiving and team coordination; distributed computing services like Ibercivis to utilize unused CPU time for computationally intensive tasks; and services like Experiment.com to provide crowdsourced funding for research projects.
Blockchain platforms for open science have been proposed. The first such platform is the Open Science Organization,
which aims to solve urgent problems with fragmentation of the
scientific ecosystem and difficulties of producing validated, quality
science. The initiatives of the Open Science Organization include the
Interplanetary Idea System (IPIS), Researcher Index (RR-index), Unique
Researcher Identity (URI), and Research Network. The Interplanetary Idea
System is a blockchain based system that tracks the evolution of
scientific ideas over time. It serves to quantify ideas based on
uniqueness and importance, thus allowing the scientific community to
identify pain points with current scientific topics and preventing
unnecessary re-invention of previously conducted science. The Researcher
Index aims to establish a data-driven statistical metric for
quantifying researcher impact. The Unique Researcher Identity is a
blockchain technology based solution for creating a single unifying
identity for each researcher, which is connected to the researcher's
profile, research activities, and publications. The Research Network is a
social networking platform for researchers.
Preprint Servers
Preprint Servers come in many varieties, but the standard traits
across them are stable: they seek to create a quick, free mode of
communicating scientific knowledge to the public. Preprint servers act
as a venue to quickly disseminate research and vary on their policies
concerning when articles may be submitted relative to journal acceptance.
Also typical of preprint servers is their lack of a peer-review process
- typically, preprint servers have some type of quality check in place
to ensure a minimum standard of publication, but this mechanism is not
the same as a peer-review mechanism. Some preprint servers have
explicitly partnered with the broader open science movement. Preprint servers can offer service similar to those of journals, and Google Scholar indexes many preprint servers and collects information about citations to preprints. The case for preprint servers is often made based on the slow pace of conventional publication formats.
The motivation to start SocArXiv, an open-access preprint server for
social science research, is the claim that valuable research often
takes several months to years to be published in traditional venues,
which slows down the process of science significantly. Another
argument made in favor of preprint servers like SocArXiv is the quality
and speed of feedback offered to scientists on their pre-published
work.
The founders of SocArXiv claim that their platform allows researchers
to gain easy feedback from their colleagues on the platform, thereby
allowing scientists to develop their work into the highest possible
quality before formal publication and circulation. The founders of
SocArXiv further claim that their platform affords the authors the
greatest level of flexibility in updating and editing their work to
ensure that the latest version is available for rapid dissemination. The
founders claim that this is not traditionally the case with formal
journals, which impose formal procedures for making updates to published
articles.
Perhaps the strongest advantage of some preprint servers is their
seamless compatibility with Open Science software such as the Open
Science Framework. The founders of SocArXiv claim that their preprint
server connects all aspects of the research life cycle in OSF with the
article being published on the preprint server. According to the
founders, this allows for greater transparency and minimal work on the
authors' part.
One criticism of preprint servers is their potential to foster a
culture of plagiarism. For example, the popular physics preprint server
arXiv had to withdraw 22 papers when it came to light that they were
plagiarized. In June 2002, Watanabe, a high-energy physicist in Japan, was
contacted by a man called Ramy Naboulsi, a mathematical physicist with
no institutional affiliation. Naboulsi asked Watanabe to upload
his papers to arXiv, as he was unable to do so himself because of his lack of
an institutional affiliation. Later, the papers were found to have
been copied from the proceedings of a physics conference.
Preprint servers are increasingly developing measures to counter
this plagiarism problem. In developing nations like India and China,
explicit measures are being taken to combat it.
These measures usually involve creating some type of central repository
for all available pre-prints, allowing the use of traditional
plagiarism detecting algorithms to detect the fraud. Nonetheless, this is a pressing issue in the discussion of pre-print servers, and consequently for Open Science.
Okun's law in macroeconomics
is an example of the simple linear regression. Here the dependent
variable (GDP growth) is presumed to be in a linear relationship with
the changes in the unemployment rate.
In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable.
The adjective simple refers to the fact that the outcome variable is related to a single predictor.
It is common to make the additional stipulation that the ordinary least squares method should be used: the accuracy of each predicted value is measured by its squared residual
(vertical distance between the point of the data set and the fitted
line), and the goal is to make the sum of these squared deviations as
small as possible. Other regression methods that can be used in place of
ordinary least squares include least absolute deviations (minimizing the sum of absolute values of residuals) and the Theil–Sen estimator (which chooses a line whose slope is the median of the slopes determined by pairs of sample points). Deming regression
(total least squares) also finds a line that fits a set of
two-dimensional sample points, but (unlike ordinary least squares, least
absolute deviations, and median slope regression) it is not really an
instance of simple linear regression, because it does not separate the
coordinates into one dependent and one independent variable and could
potentially return a vertical line as its fit.
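As a rough illustration of how the median-slope idea differs from ordinary least squares, the Python sketch below fits both to a small made-up data set with one corrupted point. The function names (`ols_fit`, `theil_sen_fit`) and the data are assumptions for this example, and taking the intercept as the median of y − slope·x is one common convention rather than something the text above specifies.

```python
from itertools import combinations
from statistics import median


def ols_fit(xs, ys):
    """Ordinary least squares: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def theil_sen_fit(xs, ys):
    """Theil-Sen: the slope is the median of the slopes of all point pairs."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # pairs with equal x have no finite slope
    ]
    slope = median(slopes)
    # Assumed convention: intercept is the median of y_i - slope * x_i.
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope


# Hypothetical data on the line y = 1 + 2x, with the last point corrupted.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 40.0]

ols_intercept, ols_slope = ols_fit(xs, ys)
ts_intercept, ts_slope = theil_sen_fit(xs, ys)
print("OLS:      ", ols_intercept, ols_slope)
print("Theil-Sen:", ts_intercept, ts_slope)
```

On this data the median-slope fit recovers the underlying line, while the least-squares slope is pulled upward by the single outlier.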
The remainder of the article assumes an ordinary least squares regression.
In this case, the slope of the fitted line is equal to the correlation between y and x
corrected by the ratio of standard deviations of these variables. The
intercept of the fitted line is such that the line passes through the
center of mass $(\bar{x}, \bar{y})$ of the data points.
Fitting the regression line
Consider the model function
$$y = \alpha + \beta x,$$
which describes a line with slope β and y-intercept α.
In general such a relationship may not hold exactly for the largely
unobserved population of values of the independent and dependent
variables; we call the unobserved deviations from the above equation the
errors. Suppose we observe n data pairs and call them $\{(x_i, y_i),\ i = 1, \ldots, n\}$. We can describe the underlying relationship between $y_i$ and $x_i$ involving this error term $\varepsilon_i$ by
$$y_i = \alpha + \beta x_i + \varepsilon_i.$$
This relationship between the true (but unobserved) underlying parameters α and β and the data points is called a linear regression model.
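The goal is to find estimated values $\hat{\alpha}$ and $\hat{\beta}$ for the parameters α and β
which would provide the "best" fit in some sense for the data points.
As mentioned in the introduction, in this article the "best" fit will be
understood as in the least-squares approach: a line that minimizes the sum of squared residuals (differences between actual and predicted values of the dependent variable y), each of which is given by, for any candidate parameter values α and β,
$$\hat{\varepsilon}_i = y_i - \alpha - \beta x_i.$$
In other words, $\hat{\alpha}$ and $\hat{\beta}$ solve the following minimization problem:
$$(\hat{\alpha}, \hat{\beta}) = \operatorname*{arg\,min}_{\alpha,\, \beta} Q(\alpha, \beta), \qquad Q(\alpha, \beta) = \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2} = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$
By expanding to get a quadratic expression in α and β, we can derive values of α and β that minimize the objective function Q (these minimizing values are denoted $\hat{\alpha}$ and $\hat{\beta}$):
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x},$$
$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} = r_{xy}\,\frac{s_y}{s_x}.$$
Here $\bar{x}$ and $\bar{y}$ are the averages of the $x_i$ and $y_i$, $s_x$ and $s_y$ are their uncorrected sample standard deviations, $s_{xy}$ is the sample covariance, and $r_{xy}$ is the sample correlation coefficient between x and y. Substituting these expressions into the fitted line $\hat{y} = \hat{\alpha} + \hat{\beta} x$ yields
$$\frac{\hat{y} - \bar{y}}{s_y} = r_{xy}\,\frac{x - \bar{x}}{s_x}.$$
This shows that $r_{xy}$ is the slope of the regression line of the standardized data points (and that this line passes through the origin).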
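The least-squares criterion can be checked numerically. The sketch below, in Python with made-up data and hypothetical function names, computes the standard closed-form estimates (slope as the ratio of centered cross-products to centered squares, intercept chosen so the line passes through the center of mass) and confirms that nudging either parameter away from them does not reduce the sum of squared residuals Q.

```python
def sum_squared_residuals(alpha, beta, xs, ys):
    """Objective Q(alpha, beta): sum of squared vertical deviations."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


def least_squares_estimates(xs, ys):
    """Closed-form minimizers (alpha_hat, beta_hat) of Q."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    alpha_hat = y_bar - beta_hat * x_bar  # line passes through (x_bar, y_bar)
    return alpha_hat, beta_hat


# Hypothetical sample, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = least_squares_estimates(xs, ys)
q_min = sum_squared_residuals(alpha_hat, beta_hat, xs, ys)

# Perturb the parameters in a few directions and compare Q with q_min.
perturbations = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05), (0.1, -0.05)]
never_lower = all(
    sum_squared_residuals(alpha_hat + da, beta_hat + db, xs, ys) >= q_min
    for da, db in perturbations
)
print(alpha_hat, beta_hat, q_min, never_lower)
```

A handful of perturbations is only a spot check, not a proof; the minimum is guaranteed because Q is a convex quadratic in α and β.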
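Generalizing the $\bar{x}$
notation, we can write a horizontal bar over an expression to indicate
the average value of that expression over the set of samples. For
example:
$$\overline{xy} = \frac{1}{n} \sum_{i=1}^{n} x_i y_i.$$
This notation allows us a concise formula for $r_{xy}$:
$$r_{xy} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sqrt{\left(\overline{x^2} - \bar{x}^2\right)\left(\overline{y^2} - \bar{y}^2\right)}}.$$
The coefficient of determination ("R squared") is equal to $r_{xy}^2$ when the model is linear with a single independent variable.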
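To make the averaging notation concrete, the following Python sketch computes the correlation coefficient from the averages of x, y, xy, x² and y², and compares its square with R squared computed directly as 1 − SS_res/SS_tot for the least-squares line. The data and helper names are invented for illustration.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def correlation(xs, ys):
    """r_xy written with 'bar' averages of x, y, xy, x^2 and y^2."""
    x_bar, y_bar = mean(xs), mean(ys)
    xy_bar = mean(x * y for x, y in zip(xs, ys))
    x2_bar = mean(x * x for x in xs)
    y2_bar = mean(y * y for y in ys)
    return (xy_bar - x_bar * y_bar) / math.sqrt(
        (x2_bar - x_bar ** 2) * (y2_bar - y_bar ** 2)
    )


def r_squared(xs, ys):
    """Coefficient of determination of the OLS line: 1 - SS_res / SS_tot."""
    x_bar, y_bar = mean(xs), mean(ys)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    alpha = y_bar - beta * x_bar
    ss_res = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical, deliberately noisy sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

r_xy = correlation(xs, ys)
print(r_xy ** 2, r_squared(xs, ys))  # the two values agree up to rounding
```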
Simple linear regression without the intercept term (single regressor)
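Sometimes it is appropriate to force the regression line to pass through the origin, because x and y are assumed to be proportional. For the model without the intercept term, y = βx, the OLS estimator for β simplifies to
$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{\overline{xy}}{\overline{x^2}}.$$
Substituting (x − h, y − k) in place of (x, y) gives the regression through (h, k):
$$\hat{\beta} = \frac{\overline{(x - h)(y - k)}}{\overline{(x - h)^2}} = \frac{\overline{xy} - k\bar{x} - h\bar{y} + hk}{\overline{x^2} - 2h\bar{x} + h^2} = \frac{\operatorname{Cov}(x, y) + (\bar{x} - h)(\bar{y} - k)}{\operatorname{Var}(x) + (\bar{x} - h)^2},$$
where Cov and Var refer to the covariance and variance of the sample data (uncorrected for bias).
The last form above demonstrates how moving the line away from the center of mass of the data points affects the slope.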
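A short Python sketch of the no-intercept case follows. It assumes the usual estimator, the sum of products x·y divided by the sum of squares of x, handles a line forced through a point (h, k) by shifting the data, and compares the result with the equivalent covariance and variance form. The data, the point (h, k) and the function names are made up for this example.

```python
def slope_through_origin(xs, ys):
    """OLS slope for the model y = beta * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def slope_through_point(xs, ys, h, k):
    """OLS slope for a line forced through (h, k): shift, then fit through 0."""
    return slope_through_origin([x - h for x in xs], [y - k for y in ys])


def slope_through_point_cov_form(xs, ys, h, k):
    """Same slope via uncorrected sample covariance and variance."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var = sum((x - x_bar) ** 2 for x in xs) / n
    return (cov + (x_bar - h) * (y_bar - k)) / (var + (x_bar - h) ** 2)


# Hypothetical data and anchor point.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.2, 3.9, 6.1, 8.0]
h, k = 1.0, 2.0

b_origin = slope_through_origin(xs, ys)
b_shift = slope_through_point(xs, ys, h, k)
b_cov = slope_through_point_cov_form(xs, ys, h, k)
print(b_origin, b_shift, b_cov)  # b_shift and b_cov agree up to rounding
```

Choosing (h, k) equal to the center of mass makes both correction terms vanish, so the slope reduces to the ordinary least-squares slope with an intercept.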
Numerical properties
The regression line goes through the center of mass point, $(\bar{x}, \bar{y})$, if the model includes an intercept term (i.e., not forced through the origin).
The sum of the residuals is zero if the model includes an intercept term:
$$\sum_{i=1}^{n} \hat{\varepsilon}_i = 0.$$
The residuals and x values are uncorrelated (whether or not there is an intercept term in the model), meaning:
$$\sum_{i=1}^{n} x_i \hat{\varepsilon}_i = 0.$$
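These properties are easy to verify numerically. The Python sketch below uses an invented data set and hypothetical helper names; it fits the line with and without an intercept and evaluates the sum of the residuals and the sum of the products of x values and residuals in each case.

```python
def fit_with_intercept(xs, ys):
    """OLS with an intercept: returns (alpha_hat, beta_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - beta * x_bar, beta


def fit_through_origin(xs, ys):
    """OLS without an intercept: returns (0.0, beta_hat)."""
    return 0.0, sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def residuals(alpha, beta, xs, ys):
    return [y - alpha - beta * x for x, y in zip(xs, ys)]


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

res_a = residuals(*fit_with_intercept(xs, ys), xs, ys)
res_0 = residuals(*fit_through_origin(xs, ys), xs, ys)

sum_res_a = sum(res_a)                             # ~0 with an intercept
sum_xres_a = sum(x * e for x, e in zip(xs, res_a))  # ~0
sum_res_0 = sum(res_0)                             # generally not 0
sum_xres_0 = sum(x * e for x, e in zip(xs, res_0))  # ~0 even without intercept
print(sum_res_a, sum_xres_a, sum_res_0, sum_xres_0)
```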
Model-based properties
Description of the statistical properties of estimators from the simple linear regression estimates requires the use of a statistical model.
The following is based on assuming the validity of a model under which
the estimates are optimal. It is also possible to evaluate the
properties under other assumptions, such as inhomogeneity, but this is discussed elsewhere.
To formalize this assertion we must define a framework in which
these estimators are random variables. We consider the errors $\varepsilon_i$ as random variables drawn independently from some distribution with mean zero. In other words, for each value of x, the corresponding value of y is generated as a mean response α + βx plus an additional random variable ε called the error term, equal to zero on average. Under such interpretation, the least-squares estimators $\hat{\alpha}$ and $\hat{\beta}$ will themselves be random variables whose means will equal the "true values" α and β. This is the definition of an unbiased estimator.
Confidence intervals
The formulas given in the previous section allow one to calculate the point estimates of α and β
— that is, the coefficients of the regression line for the given set of
data. However, those formulas don't tell us how precise the estimates
are, i.e., how much the estimators $\hat{\alpha}$ and $\hat{\beta}$ vary from sample to sample for the specified sample size. Confidence intervals
were devised to give a plausible set of values to the estimates one
might have if one repeated the experiment a very large number of times.
The standard method of constructing confidence intervals for
linear regression coefficients relies on the normality assumption, which
is justified if either:
the errors in the regression are normally distributed (the so-called classic regression assumption), or
the number of observations n is sufficiently large, in which case the estimator is approximately normally distributed.
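Under
the first assumption above, that of the normality of the error terms,
the estimator of the slope coefficient will itself be normally
distributed with mean β and variance $\sigma^2 / \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\sigma^2$ is the variance of the error terms. At the same time the sum of squared residuals Q is distributed proportionally to $\chi^2$ with n − 2 degrees of freedom, and independently from $\hat{\beta}$. This allows us to construct a t-value
$$t = \frac{\hat{\beta} - \beta}{s_{\hat{\beta}}} \sim t_{n-2},$$
where
$$s_{\hat{\beta}} = \sqrt{\frac{\frac{1}{n-2} \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
is the standard error of the estimator $\hat{\beta}$.
This t-value has a Student's t-distribution with n − 2 degrees of freedom. Using it we can construct a confidence interval for β:
$$\beta \in \left[\hat{\beta} - s_{\hat{\beta}}\, t^*_{n-2},\ \hat{\beta} + s_{\hat{\beta}}\, t^*_{n-2}\right]$$
at confidence level (1 − γ), where $t^*_{n-2}$ is the $(1 - \gamma/2)$ quantile of the $t_{n-2}$ distribution. For example, if γ = 0.05 then the confidence level is 95%.
Similarly, the confidence interval for the intercept coefficient α is given by
$$\alpha \in \left[\hat{\alpha} - s_{\hat{\alpha}}\, t^*_{n-2},\ \hat{\alpha} + s_{\hat{\alpha}}\, t^*_{n-2}\right]$$
at confidence level (1 − γ), where
$$s_{\hat{\alpha}} = s_{\hat{\beta}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2} = \sqrt{\frac{1}{n(n-2)} \left(\sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2}\right) \frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.$$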
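The interval construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the data are invented, the function name `coefficient_intervals` is hypothetical, the standard errors use the usual residual-based formulas with n − 2 degrees of freedom, and the t quantile is passed in by the caller (here 3.182, the tabulated 0.975 quantile for 3 degrees of freedom) because the Python standard library has no Student's t quantile function.

```python
import math


def coefficient_intervals(xs, ys, t_star):
    """Two-sided confidence intervals for intercept and slope.

    t_star is the (1 - gamma/2) quantile of Student's t with n - 2 degrees
    of freedom, supplied by the caller (for example from a table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar

    # Residual sum of squares of the fitted line.
    rss = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))

    s_beta = math.sqrt(rss / (n - 2) / sxx)                   # slope std. error
    s_alpha = s_beta * math.sqrt(sum(x * x for x in xs) / n)  # intercept std. error

    return {
        "alpha": (alpha_hat - t_star * s_alpha, alpha_hat + t_star * s_alpha),
        "beta": (beta_hat - t_star * s_beta, beta_hat + t_star * s_beta),
    }


# Hypothetical sample with n = 5, so n - 2 = 3 degrees of freedom.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intervals = coefficient_intervals(xs, ys, t_star=3.182)
print(intervals)
```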
The US "changes in unemployment – GDP growth" regression with the 95% confidence bands.
The confidence intervals for α and β give us the general idea where these regression coefficients are most likely to be. For example, in the Okun's law regression shown here the point estimates are
The 95% confidence intervals for these estimates are
In order to represent this information graphically, in the form of
the confidence bands around the regression line, one has to proceed
carefully and account for the joint distribution of the estimators. It
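can be shown that at confidence level (1 − γ) the confidence band has hyperbolic form given by the equation
$$(\alpha + \beta \xi) \in \left[\hat{\alpha} + \hat{\beta} \xi \pm t^*_{n-2} \sqrt{\left(\frac{1}{n-2} \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2}\right) \cdot \left(\frac{1}{n} + \frac{(\xi - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)}\,\right].$$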
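The hyperbolic shape of the band can be seen by evaluating its half-width at several points. The Python sketch below assumes the standard pointwise band for the mean response, whose half-width is the t quantile times the square root of the residual variance estimate multiplied by 1/n + (ξ − x̄)²/Σ(xᵢ − x̄)². The data, the function name `confidence_band` and the supplied quantile are assumptions for this example.

```python
import math


def confidence_band(xs, ys, t_star):
    """Return a function xi -> (lower, upper) for the mean response at xi."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    rss = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)  # residual variance estimate

    def band(xi):
        center = alpha_hat + beta_hat * xi
        half_width = t_star * math.sqrt(s2 * (1.0 / n + (xi - x_bar) ** 2 / sxx))
        return center - half_width, center + half_width

    return band


# Hypothetical sample; 3.182 is the tabulated 0.975 t quantile for 3 d.o.f.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
band = confidence_band(xs, ys, t_star=3.182)

for xi in (1.0, 3.0, 5.0, 8.0):
    lo, hi = band(xi)
    print(xi, round(hi - lo, 4))  # width is smallest at the mean of x (3.0)
```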
Asymptotic assumption
The alternative second assumption states that when the number of points in the dataset is "large enough", the law of large numbers and the central limit theorem
become applicable, and then the distribution of the estimators is
approximately normal. Under this assumption all formulas derived in the
previous section remain valid, with the only exception that the quantile
$t^*_{n-2}$ of Student's t distribution is replaced with the quantile $q^*$ of the standard normal distribution. Occasionally the fraction $\frac{1}{n-2}$ is replaced with $\frac{1}{n}$. When n is large such a change does not alter the results appreciably.
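For a sense of scale, the normal quantile can be computed with Python's standard library and compared with the t quantile for 13 degrees of freedom that is quoted in the numerical example of this article. The variable names are illustrative only.

```python
from statistics import NormalDist

gamma = 0.05  # 95% confidence level

# (1 - gamma/2) quantile of the standard normal distribution, about 1.96.
q_star = NormalDist().inv_cdf(1 - gamma / 2)

# Student's t quantile for 13 degrees of freedom, as quoted in the article.
t_star_13 = 2.1604

# Relative widening of an interval when t is used instead of the normal.
relative_widening = t_star_13 / q_star - 1
print(q_star, relative_widening)
```

With only 15 observations the t-based interval is noticeably wider, which is why the normal approximation is reserved for large n.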
Numerical example
This data set gives average masses for women as a function of their
height in a sample of American women of age 30–39. Although the OLS
article argues that it would be more appropriate to run a quadratic
regression for this data, the simple linear regression model is applied
here instead.
Height (m), xi    Mass (kg), yi
1.47              52.21
1.50              53.12
1.52              54.48
1.55              55.84
1.57              57.20
1.60              58.57
1.63              59.93
1.65              61.29
1.68              63.11
1.70              64.47
1.73              66.28
1.75              68.10
1.78              69.92
1.80              72.19
1.83              74.46
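There are n = 15 points in this data set. Hand calculations would be started by finding the following five sums:
$$S_x = \sum x_i = 24.76, \qquad S_y = \sum y_i = 931.17,$$
$$S_{xx} = \sum x_i^2 = 41.0532, \qquad S_{xy} = \sum x_i y_i = 1548.2453, \qquad S_{yy} = \sum y_i^2 = 58498.5439.$$
These quantities would be used to calculate the estimates of the regression coefficients, and their standard errors.
$$\hat{\beta} = \frac{n S_{xy} - S_x S_y}{n S_{xx} - S_x^2} = 61.272, \qquad \hat{\alpha} = \frac{1}{n} S_y - \hat{\beta}\, \frac{1}{n} S_x = -39.062,$$
$$s_\varepsilon^2 = \frac{1}{n(n-2)} \left[ n S_{yy} - S_y^2 - \hat{\beta}^2 \left(n S_{xx} - S_x^2\right) \right] = 0.5762,$$
$$s_{\hat{\beta}}^2 = \frac{n\, s_\varepsilon^2}{n S_{xx} - S_x^2} = 3.1539, \qquad s_{\hat{\alpha}}^2 = s_{\hat{\beta}}^2\, \frac{1}{n} S_{xx} = 8.6319.$$
The 0.975 quantile of Student's t-distribution with 13 degrees of freedom is $t^*_{13} = 2.1604$, and thus the 95% confidence intervals for α and β are
$$\alpha \in \left[\hat{\alpha} \mp t^*_{13}\, s_{\hat{\alpha}}\right] = [-45.4,\ -32.7], \qquad \beta \in \left[\hat{\beta} \mp t^*_{13}\, s_{\hat{\beta}}\right] = [57.4,\ 65.1].$$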
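The hand calculation can be reproduced with a short Python script. The sketch below uses the height and mass data from the table and the quantile 2.1604 quoted above; the variable names and the use of the standard sums-based formulas for the estimates and their standard errors are choices made for this illustration.

```python
import math

# Data from the table above: heights in metres, masses in kilograms.
heights = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
           1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
masses = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
          63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
n = len(heights)

# The five sums used in the hand calculation.
s_x = sum(heights)
s_y = sum(masses)
s_xx = sum(x * x for x in heights)
s_xy = sum(x * y for x, y in zip(heights, masses))
s_yy = sum(y * y for y in masses)

# Point estimates of slope and intercept.
denom = n * s_xx - s_x ** 2
beta_hat = (n * s_xy - s_x * s_y) / denom
alpha_hat = s_y / n - beta_hat * s_x / n

# Residual variance estimate and standard errors.
s_eps2 = (n * s_yy - s_y ** 2 - beta_hat ** 2 * denom) / (n * (n - 2))
s_beta = math.sqrt(n * s_eps2 / denom)
s_alpha = s_beta * math.sqrt(s_xx / n)

# 95% confidence intervals, using the t quantile quoted in the text.
t_star = 2.1604
alpha_ci = (alpha_hat - t_star * s_alpha, alpha_hat + t_star * s_alpha)
beta_ci = (beta_hat - t_star * s_beta, beta_hat + t_star * s_beta)

print(f"beta_hat = {beta_hat:.3f}, alpha_hat = {alpha_hat:.3f}")
print(f"alpha in [{alpha_ci[0]:.1f}, {alpha_ci[1]:.1f}]")
print(f"beta in [{beta_ci[0]:.1f}, {beta_ci[1]:.1f}]")
```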
This example also demonstrates that sophisticated calculations will
not overcome the use of badly prepared data. The heights were originally
given in inches, and have been converted to the nearest centimetre.
Since the conversion has introduced rounding error, this is not
an exact conversion. The original inches can be recovered by
Round(x/0.0254) and then re-converted to metric without rounding: if
this is done, the results become
$$\hat{\beta} = 61.6746, \qquad \hat{\alpha} = -39.7468.$$
Thus a seemingly small variation in the data has a real effect.
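The effect of the rounding can be reproduced directly. The Python sketch below recovers whole inches with Round(x/0.0254) as described above, converts them back to metres without rounding, and refits the line; the helper name `ols` and the variable names are assumptions for this example.

```python
def ols(xs, ys):
    """Ordinary least squares for one predictor: returns (alpha_hat, beta_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - beta * x_bar, beta


# Heights as tabulated (rounded to the nearest centimetre) and masses in kg.
heights_m = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
             1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
masses = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
          63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

# Recover the original whole inches, then convert back without rounding.
inches = [round(h / 0.0254) for h in heights_m]
heights_exact = [i * 0.0254 for i in inches]

alpha_rounded, beta_rounded = ols(heights_m, masses)
alpha_exact, beta_exact = ols(heights_exact, masses)

print(f"rounded heights:   beta = {beta_rounded:.4f}, alpha = {alpha_rounded:.4f}")
print(f"unrounded heights: beta = {beta_exact:.4f}, alpha = {alpha_exact:.4f}")
```

The recovered heights are the consecutive whole inches from 58 to 72, so the only difference between the two fits is the sub-centimetre rounding of the x values.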