The movement was committed to abstaining from all partisan politics and communist revolution. It gained strength in the 1930s, but in 1940 it was banned in Canada because of its opposition to the Second World War. The ban was lifted in 1943 when it was apparent
that 'Technocracy Inc. was committed to the war effort, proposing a
program of total conscription.' The movement continued to expand during the remainder of the war and new sections were formed in Ontario and the Maritime Provinces.
The Technocracy movement survives into the present day and, as of 2013, was continuing to publish a newsletter, maintain a website, and hold member meetings. Smaller groups included the Technical Alliance, The New Machine and the Utopian Society of America.
Overview
Technocracy advocates contended that price system-based
forms of government and economy are structurally incapable of effective
action, and promoted a society headed by technical experts, which they
argued would be more rational and productive.
The coming of the Great Depression ushered in radically different ideas of social engineering, culminating in reforms introduced by the New Deal. By late 1932, various groups across the United States were calling themselves technocrats and proposing reforms.
By the mid-1930s, interest in the technocracy movement was declining. Some historians have attributed the decline of the technocracy movement to the rise of Roosevelt's New Deal.
Historian William E. Akin rejects that thesis, arguing instead that the
movement declined in the mid-1930s as a result of the failure of its
proponents to devise a 'viable political theory for achieving change'
(William E. Akin, Technocracy and the American Dream: The Technocrat Movement, 1900–1941, p. 111), although many technocrats in the United States
were sympathetic to the electoral efforts of anti-New Deal third
parties.
One of the most widely circulated images in Technocracy Inc.’s
promotional materials used the example of a streetcar to argue that
engineering solutions will always succeed where legislation or fines
fail to adequately deal with social problems. If passengers insist on
riding on the car’s dangerous outer platform, the solution consists in
designing cars without platforms.
Origins
The technocratic movement has its origins in the progressive engineers of the early twentieth century and in the writings of Edward Bellamy, along with some of the later works of Thorstein Veblen such as The Engineers and the Price System, written in 1921. William H. Smyth, a California engineer, invented the word technocracy in 1919 to describe "the rule of the people made effective through the agency of their servants, the scientists and engineers", and in the 1920s it was used to describe the works of Thorstein Veblen.
Early technocratic organisations formed after the First World War. These included Henry Gantt’s "The New Machine" and Veblen’s "Soviet of Technicians". These organisations folded after a short time.
Writers such as Henry Gantt, Thorstein Veblen, and Howard Scott
suggested that businesspeople were incapable of reforming their
industries in the public interest and that control of industry should
thus be given to engineers.
United States and Canada
Howard Scott has been called the "founder of the technocracy movement" and he started the Technical Alliance
in New York near the end of 1919. Members of the Alliance were mostly
scientists and engineers. The Technical Alliance started an Energy Survey of North America, which aimed to provide a scientific background from which ideas about a new social structure could be developed. However, the group broke up in 1921 before the survey was completed.
In 1932, Scott and others interested in the problems of
technological growth and economic change began meeting in New York City.
Their ideas gained national attention and the "Committee on
Technocracy" was formed at Columbia University by Howard Scott and Walter Rautenstrauch. However, the group was short-lived and in January 1933 splintered into two other groups, the "Continental Committee on Technocracy" (led by Harold Loeb) and "Technocracy Incorporated" (led by Scott).
Smaller groups included the Technical Alliance, The New Machine and the
Utopian Society of America, though Bellamy had the most success, owing to
his nationalistic stances and to Veblen's rhetoric, which called for
removing the current pricing system and set out a blueprint for a
national directorate that would reorganize all production and supply and,
ultimately, radically increase industrial output.
At the core of Scott's vision was "an energy theory of value".
Since the basic measure common to the production of all goods and
services was energy, he reasoned "that the sole scientific foundation
for the monetary system was also energy", and that society could be
designed more efficiently by using an energy metric instead of a
monetary metric (energy certificates or 'energy accounting').
Technocracy Inc. officials wore a uniform, consisting of a
"well-tailored double-breasted suit, gray shirt, and blue necktie, with a
monad insignia on the lapel", and its members saluted Scott in public.
Public interest in technocracy peaked in the early 1930s:
Technocracy's heyday lasted only from June 16, 1932, when the New York Times
became the first influential press organ to report its activities,
until January 13, 1933, when Scott, attempting to silence his critics,
delivered what some critics called a confusing and uninspiring address
on a well-publicized nationwide radio hookup.
Following Scott's radio address (Hotel Pierre Address),
the condemnation of both him and technocracy in general reached a peak.
The press and businesspeople reacted with ridicule and almost unanimous
hostility. The American Engineering Council charged the technocrats
with "unprofessional activity, questionable data, and drawing
unwarranted conclusions".
The technocrats made a believable case for a kind of technological
utopia, but their asking price was too high. The idea of political
democracy still represented a stronger ideal than technological elitism.
In the end, critics believed that the socially desirable goals that
technology made possible could be achieved without the sacrifice of
existing institutions and values and without incurring the apocalypse
that technocracy predicted.
The faction-ridden Continental Committee on Technocracy collapsed in October 1936. However, Technocracy Incorporated continued.
On October 7, 1940, the Royal Canadian Mounted Police
arrested members of Technocracy Incorporated, charging them with
belonging to an illegal organisation. One of the arrested was Joshua
Norman Haldeman, a Regina chiropractor, former director of Technocracy Incorporated, and the grandfather of Elon Musk.
There were some speaking tours of the US and Canada in 1946 and 1947, and a motorcade from Los Angeles to Vancouver:
Hundreds of cars, trucks, and trailers, all regulation grey, from all
over the Pacific Northwest, participated. An old school bus, repainted
and retrofitted with sleeping and office facilities, a two-way radio,
and a public address system, impressed observers. A huge war surplus
searchlight mounted on a truck bed was included, and grey-painted
motorcycles acted as parade marshals. A small grey aircraft, with a
Monad symbol on its wings, flew overhead. All this was recorded by the
Technocrats on 16-mm 900-foot colour film.
In 1948 activity declined while dissent increased within the
movement. One central factor contributing to dissent was that "the Price
System had not collapsed, and predictions about the expected demise
were becoming more and more vague". Some quite specific predictions about the price system collapse were made during the Great Depression, the first giving 1937 as the date, and the second forecasting the collapse as occurring "prior to 1940".
Membership and activity declined steadily in the years after
1948, but some activity persisted, mostly around Vancouver in Canada and
on the West Coast of the United States. Technocracy Incorporated
currently maintains a website, distributes a monthly newsletter, and
holds membership meetings.
An extensive archive of Technocracy's materials is held at the University of Alberta.
Technocrats' plan
In a 1938 publication, Technocracy Inc., the main organization, made the following statement defining its proposal:
Technocracy is the science of social engineering, the scientific
operation of the entire social mechanism to produce and distribute goods
and services to the entire population of this continent. For the first
time in human history it will be done as a scientific, technical,
engineering problem. There will be no place for Politics or Politicians,
Finance or Financiers, Rackets or Racketeers. Technocracy states that
this method of operating the social mechanism of the North American Continent
is now mandatory because we have passed from a state of actual scarcity
into the present status of potential abundance in which we are now held
to an artificial scarcity forced upon us in order to continue a Price System
which can distribute goods only by means of a medium of exchange.
Technocracy states that price and abundance are incompatible; the
greater the abundance the smaller the price. In a real abundance there
can be no price at all. Only by abandoning the interfering price control
and substituting a scientific method of production and distribution can
an abundance be achieved. Technocracy will distribute by means of a
certificate of distribution available to every citizen from birth to
death. The Technate will encompass the entire American Continent from Panama to the North Pole because the natural resources and the natural boundary of this area make it an independent, self-sustaining geographical unit.
Calendar
The Technocratic movement planned to reform the work schedule, to
achieve the goal of uninterrupted production, maximizing the efficiency
and profitability of resources, transport and entertainment facilities,
avoiding the "weekend effect".
According to the movement's calculations, it would be enough for
every citizen to work a cycle of four consecutive days, four hours a
day, followed by three days off. By "tiling" the days and working hours
of seven groups, industry and services could be operated 24 hours a day, seven days a week. This system would include holiday periods allocated to each citizen.
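The day-level "tiling" described above can be illustrated with a minimal sketch. It assumes, hypothetically, that the seven groups start their four-day stretch on successive days of a seven-day cycle; the text does not say how the start days or the four-hour shifts within a day were actually assigned, so the sketch models only the days on and off, and all names in it are illustrative.

```python
# Sketch of a staggered "four days on, three days off" rota.
# Assumption (not from the source): group g begins its work stretch
# on day g of a repeating seven-day cycle.

DAYS_IN_CYCLE = 7      # four work days plus three days off
WORK_DAYS = 4          # consecutive work days per cycle
HOURS_PER_DAY = 4      # hours worked on each work day
GROUPS = 7             # number of staggered groups


def is_working(group: int, day: int) -> bool:
    """Return True if `group` is on a work day on `day` of the cycle."""
    return (day - group) % DAYS_IN_CYCLE < WORK_DAYS


def groups_on_duty(day: int) -> list[int]:
    """List the groups that have a work day on `day`."""
    return [g for g in range(GROUPS) if is_working(g, day)]


# Hours each citizen works per seven-day cycle under this scheme.
weekly_hours = WORK_DAYS * HOURS_PER_DAY

for day in range(DAYS_IN_CYCLE):
    print(f"day {day}: groups on duty {groups_on_duty(day)}")
print(f"hours per citizen per cycle: {weekly_hours}")
```

Under this assumed staggering, the same number of groups is on duty every day of the cycle, so no day is left unstaffed; how the four-hour shifts are laid out across the clock would need further assumptions that the text does not supply.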
Europe
In Germany before the Second World War, a technocratic movement based on
the American model introduced by Technocracy Incorporated existed but
ran afoul of the political system there.
There was also a Soviet movement whose early history resembled
the North American one during the interwar period. One of its leading
members was engineer Peter Palchinsky.
Technocratic ideology was also promoted by the Engineer's Herald
journal. The Soviet technocrats advanced the scientization of
economic development and management, as well as industrial and organizational psychology,
under the slogan "The future belongs to the managing-engineers and the
engineering-managers." Those viewpoints were supported by leading Right Opposition members Nikolai Bukharin and Alexei Rykov. The promotion of an alternative view on the country's industrialization and the engineer's role in society incurred Joseph Stalin's
wrath. Palchinsky was executed in 1929, and a year later leading Soviet
engineers were accused of an anti-government conspiracy in the Industrial Party Trial.
A large scale persecution of engineers followed, forcing them to focus
on narrow technical issues assigned to them by communist party leaders. The concept of Tectology developed by Alexander Bogdanov,
perhaps the most important of the non-Leninist Bolsheviks, bears some
resemblance to technocratic ideas. Both Bogdanov's fiction and his
political writings, as presented by Zenovia Sochor, imply that he expected a coming revolution against capitalism to lead to a technocratic society.
Year On, formerly UnCollege, is an organization which aims to equip students with the tools for self-directed learning and career building. Its flagship program is a year-long gap year program involving training in work skills and life skills, volunteer service in a foreign country, and an internship or personal project.
UnCollege was founded by Dale J. Stephens in 2011. Stephens is a self-described "elementary school dropout", as he was homeschooled with emphasis on real-world experience for the majority of his childhood. He briefly attended Hendrix College.
While there, Stephens had a night-long discussion with a friend
regarding the disconnect between the theoretical subject matter taught
in college and its real-world applications. This discussion would become the basis for UnCollege, which Stephens founded in 2011 in San Francisco.
In 2010, Stephens also applied for the Thiel Fellowship, a program founded by Peter Thiel which grants fellows US$100,000 to forgo college for two years and focus on their passions.
After his initial proposal was rejected, Stephens was encouraged to
reapply in pursuit of his work as an educational futurist through
UnCollege. His second application succeeded and he was in the first
batch of Thiel Fellows.
The UnCollege website featured resources, forums, and workshops
designed to help students, both in and out of college, with the task of
self-directed learning.
The site also hosted The UnCollege Manifesto, a 25-page document
written by Stephens that covers subjects like "The value (or lack
thereof) of a college degree" and "Twelve steps to self-directed
learning." Stephens expanded upon the approach described therein in a book, Hacking Your Education: Ditch the Lectures, Save Tens of Thousands, and Learn More Than Your Peers Ever Will, which was published in 2013.
In 2013, UnCollege began offering its flagship year-long program.
The first iteration of the year-long program included four phases:
Launch, in which students learnt life skills, work skills, and how to
engage in self-directed learning; Voyage, in which students traveled to a foreign country, for a service learning trip;
Internship, in which the students, using work skills acquired at
Launch, sought out and worked through an internship of their choosing;
and Project, in which students completed, from start to finish, a
capstone project of their choosing. The inaugural class of students in
the year-long program included ten students from four countries.
Later iterations of the year-long program reversed the order of Launch
and Voyage, eliminated the Project, and/or allowed students to choose
between an Internship and a Project.
Initially, the year-long program was presented as an alternative
to university. Students were encouraged to take a pathway, similar to
Stephens' own, that avoided university studies. However, Stephens found that this pathway was neither practical nor desirable for most of his students.
Subsequently, the year-long program was rebranded as a gap year
program, designed for students to take in between high school and
university or vocational training, or for university students taking a
year-long break from their studies.
A semester-long program was subsequently introduced, branded as a gap
semester, consisting only of the Voyage and Launch phases of the
year-long program.
In 2018, UnCollege rebranded as Year On.
The goal of the programs shifted to preparing students for life at
university and in the world of work. The year-long program and the
semester-long program were retained; additional programs were added.
Programs
Year On currently offers several programs. The year-long program
has three phases: Explore, in which students take a service learning
trip abroad; Focus, in which students learn life skills and work skills,
and receive one-on-one mentorship; and Launch, in which students may
take an internship, take an apprenticeship, work on a personal project,
and/or apply for college. The semester-long program has students complete portions of the year-long program. The flexible program
has students complete portions of the life skills, work skills, and/or
self-directed learning curriculum, over a shorter period of time.
Open scientific data or open research data is a type of open data focused on making observations
and results of scientific activities available for anyone to analyze
and reuse. A major purpose of the drive for open data is to allow the
verification of scientific claims, by allowing others to look at the
reproducibility of results, and to allow data from many sources to be integrated to give new knowledge.
The modern concept of scientific data emerged in the second half
of the 20th century, with the development of large knowledge
infrastructure to compute scientific information and observation. The
sharing and distribution of data was identified early as an
important issue but was impeded by the technical limitations of the
infrastructure and the lack of common standards for data communication.
The World Wide Web
was immediately conceived as a universal protocol for the sharing of
scientific data, especially coming from high-energy physics.
Definition
Scientific data
The concept of open scientific data has developed in parallel with the concept of scientific data.
Scientific data was not formally defined until the late
20th century. Before the generalization of computational analysis, data
was mostly an informal term, frequently used interchangeably with
knowledge or information.
Institutional and epistemological discourses favored alternative
concepts and outlooks on scientific activities: "Even histories of
science and epistemology comments, mention data only in passing. Other
foundational works on the making of meaning in science discuss facts,
representations, inscriptions, and publications, with little attention
to data per se."
The first influential policy definition of scientific data appeared as late as 1999, when the National Academies of Science described data as "facts, letters, numbers or symbols that describe an object, condition, situation or other factors". Terminologies have continued to evolve: in 2011, the National Academies updated the definition to include a large variety of dataified
objects such as "spectrographic, genomic sequencing, and electron
microscopy data; observational data, such as remote sensing, geospatial,
and socioeconomic data; and other forms of data either generated or
compiled, by humans or machines" as well as "digital representation of
literature".
While the forms and shapes of data remain expansive and
unsettled, standard definitions and policies have recently tended to
restrict scientific data to computational or digital data.
The open data pilot of Horizon 2020 has been voluntarily restricted to
digital research: "'Digital research data' is information in digital
form (in particular facts or numbers), collected to be examined and used
as a basis for reasoning, discussion or calculation; this includes
statistics, results of experiments, measurements, observations resulting
from fieldwork, survey results, interview recordings and images".
Overall, the status of scientific data remains a flexible point of
discussion among individual researchers, communities and policy-makers:
"In broader terms, whatever 'data' is of interest to researchers should
be treated as 'research data'".
Important policy reports, like the 2012 collective synthesis of the
National Academies of Science on data citation, have intentionally
adopted a relative and nominalist definition of data: "we will devote
little time to definitional issues (e.g., what are data?), except to
acknowledge that data often exist in the eyes of the beholder." For Christine Borgman,
the main issue is not to define scientific data ("what are data") but
to contextualize the point where data became a focal point of discussion
within a discipline, an institution or a national research program
("when are data").
In the 2010s, the expansion of available data sources and the
sophistication of data analysis methods have expanded the range of
disciplines primarily affected by data management issues to "computational social science, digital humanities, social media data, citizen science research projects, and political science."
Open scientific data
Opening and sharing have both been major topics of discussion in regard to
scientific data management, and also a motivation to make data emerge as a relevant issue within an institution, a discipline or a policy framework.
For Paul Edwards, whether or not to share the data, to what extent it should be shared and with whom have been major causes of data friction,
which revealed the otherwise hidden infrastructures of science:
"Edwards' metaphor of data friction describes what happens at the
interfaces between data 'surfaces': the points where data move between
people, substrates, organizations, or machines (...) Every movement of
data across an interface comes at some cost in time, energy, and human
attention. Every interface between groups and organizations, as well as
between machines, represents a point of resistance where data can be
garbled, misinterpreted, or lost. In social systems, data friction
consumes energy and produces turbulence and heat – that is, conflicts,
disagreements, and inexact, unruly processes."
The opening of scientific data is both a data friction in itself and a
way to collectively manage data frictions by weakening complex issues of
data ownership. Scientific or epistemic cultures have been
acknowledged as primary factors in the adoption of open data policies:
"data sharing practices would be expected to be community-bound and
largely determined by epistemic culture."
In the 2010s, new concepts have been introduced by scientists and
policy-makers to define more accurately what open scientific data is. Since
its introduction in 2016, FAIR data has become a major focus of open research policies. The acronym describes an ideal type of Findable, Accessible, Interoperable, and Reusable data. Open scientific data has been categorized as a commons or a public good,
which is primarily maintained, enriched and preserved by collective
rather than individual action: "What makes collective action useful in
understanding scientific data sharing is its focus on how the
appropriation of individual gains is determined by adjusting the costs
and benefits that accrue with contributions to a common resource".
History
Development of knowledge infrastructures (1945-1960)
The emergence of scientific data is associated with a semantic shift in the way core scientific concepts like data, information and knowledge are commonly understood. Following the development of computing technologies, data and information are increasingly described as "things":
"Like computation, data always have a material aspect. Data are things.
They are not just numbers but also numerals, with dimensionality,
weight, and texture".
After the Second World War, large scientific projects increasingly relied on knowledge infrastructure
to collect, process and analyze large amounts of data. Punch-card
systems were first used experimentally on climate data in the 1920s and
were applied on a large scale in the following decade: "In one of the
first Depression-era government make-work projects, Civil Works
Administration workers punched some 2 million ship log observations for
the period 1880–1933." By 1960, the meteorological data collections of the US National Weather Records Center
had expanded to 400 million cards and had a global reach. The
physicality of scientific data was by then fully apparent and threatened
the stability of entire buildings: "By 1966 the cards occupied so much
space that the Center began to fill its main entrance hall with card
storage cabinets (figure 5.4). Officials became seriously concerned that
the building might collapse under their weight".
By the end of the 1960s, knowledge infrastructures had been
embedded in a varied set of disciplines and communities. The first
initiative to create an electronic bibliographic database of open
access data was the Educational Resources Information Center (ERIC) in 1966. In the same year, MEDLINE was created: a free-access online database managed by the National Library of Medicine and the National Institutes of Health (USA), with bibliographical citations from journals in the biomedical area, which would later be called PubMed and currently holds over 14 million complete articles.
Knowledge infrastructures were also set up in space engineering (with
NASA/RECON), library search (with OCLC Worldcat) or the social sciences:
"The 1960s and 1970s saw the establishment of over a dozen services and
professional associations to coordinate quantitative data collection".
Opening and sharing data: early attempts (1960-1990)
Early discourses and policy frameworks on open scientific data emerged
immediately in the wake of the creation of the first large knowledge
infrastructure. The World Data Center system (now the World Data System) aimed to make observation data more readily available in preparation for the International Geophysical Year of 1957–1958. The International Council of Scientific Unions (now the International Council for Science)
established several World Data Centers to minimize the risk of data
loss and to maximize data accessibility, further recommending in 1955
that data be made available in machine-readable form. In 1966, the International Council for Science created CODATA, an initiative to "promote cooperation in data management and use".
These early forms of open scientific data did not develop much
further. There were too many data frictions and technical resistance to
the integration of external data to implement a durable ecosystem of
data sharing. Data infrastructures were mostly invisible to researchers,
as most of the searching was done by professional librarians. Not only
were the search operating systems complicated to use, but the search had
to be performed very efficiently given the prohibitive cost of
long-distance telecommunication.
While their designers had originally anticipated direct use by
researchers, this could not really emerge because of technical and economic
impediments:
The designers of the first online
systems had presumed that searching would be done by end users; that
assumption undergirded system design. MEDLINE was intended to be used by
medical researchers and clinicians, NASA/RECON was designed for
aerospace engineers and scientists. For many reasons, however, most
users through the seventies were librarians and trained intermediaries
working on behalf of end users. In fact, some professional searchers
worried that even allowing eager end users to get at the terminals was a
bad idea.
Christine Borgman does not recall any significant policy debates over
the meaning, the production and the circulation of scientific data save
for a few specific fields (like climatology) after 1966. The insulated scientific infrastructures could hardly be connected before the advent of the web.
Projects and communities relied on their own unconnected networks at a
national or institutional level: "the Internet was nearly invisible in
Europe because people there were pursuing a separate set of network
protocols".
Communication between scientific infrastructures was not only
challenging across space, but also across time. Whenever a communication
protocol was no longer maintained, the data and knowledge it
disseminated were likely to disappear as well: "the relationship between
historical research and computing has been durably affected by aborted
projects, data loss and unrecoverable formats".
Sharing scientific data on the web (1990-1995)
The World Wide Web
was originally conceived as an infrastructure for open scientific data.
Sharing of data and data documentation was a major focus in the initial
communication of the World Wide Web when the project was first unveiled
in August 1991: "The WWW project was started to allow high energy
physicists to share data, news, and documentation. We are very
interested in spreading the web to other areas, and having gateway
servers for other data".
The project stemmed from a closed knowledge infrastructure, ENQUIRE, an information management program commissioned from Tim Berners-Lee by CERN
for the specific needs of high energy physics. The structure of ENQUIRE
was closer to an internal web of data: it connected "nodes" that "could
refer to a person, a software module, etc. and that could be interlined
with various relations such as made, include, describes and so forth".
While it "facilitated some random linkage between information", ENQUIRE
was not able to "facilitate the collaboration that was desired for in
the international high-energy physics research community".
Like any significant scientific computing infrastructure before the
1990s, the development of ENQUIRE was ultimately impeded by the lack of
interoperability and the complexity of managing network communications:
"although Enquire provided a way to link documents and databases, and
hypertext provided a common format in which to display them, there was
still the problem of getting different computers with different
operating systems to communicate with each other".
The web rapidly superseded pre-existing closed infrastructures for
scientific data, even when they included more advanced computing
features. From 1991 to 1994, users of the Worm Community System,
a major biology database on worms, switched to the Web and Gopher.
While the Web did not include many advanced functions for data retrieval
and collaboration, it was easily accessible. Conversely, the Worm Community System
could only be browsed on specific terminals shared across scientific
institutions: "To take on board the custom-designed, powerful WCS (with
its convenient interface) is to suffer inconvenience at the intersection
of work habits, computer use, and lab resources (…) The World-Wide Web,
on the other hand, can be accessed from a broad variety of terminals
and connections, and Internet computer support is readily available at
most academic institutions and through relatively inexpensive commercial
services."
Publication on the web completely changed the economics of data
publishing. While in print "the cost of reproducing large datasets is
prohibitive", the storage expenses of most datasets are low.
In this new editorial environment, the main limiting factors for data
sharing are no longer technical or economic but social and cultural.
Defining open scientific data (1995-2010)
The development and the generalization of the World Wide Web lifted numerous technical barriers and frictions
that had constrained the free circulation of data. Scientific data, however, had
yet to be defined, and new research policy had to be implemented to
realize the original vision laid out by Tim Berners-Lee of a web of data.
At this point, scientific data was largely defined through the
process of opening scientific data, as the implementation of open
policies created new incentives for setting up actionable guidelines,
principles and terminologies.
Climate research has been a pioneering field in the conceptual
definition of open scientific data, as it has been in the construction
of the first large knowledge infrastructure in the 1950s and the 1960s.
In 1995 the GCDIS articulated a clear commitment On the Full and Open Exchange of Scientific Data:
"International programs for global change research and environmental
monitoring crucially depend on the principle of full and open data
exchange (i.e., data and information are made available without
restriction, on a non-discriminatory basis, for no more than the cost of
reproduction and distribution)."
The expansion of the scope and the management of knowledge
infrastructures also created incentives to share data, as the
"allocation of data ownership" between a large number of individual and
institutional stakeholders became increasingly complex. Open data creates a simplified framework to ensure that all contributors and users of the data have access to it.
Open data has been rapidly identified as a key objective of the
emerging open science movement. While initially focused on publications
and scholarly articles, the international initiatives in favor of open
access expanded their scope to all the main scientific productions.
In 2003 the Berlin Declaration supported the diffusion of "original
scientific research results, raw data and metadata, source materials and
digital representations of pictorial and graphical and scholarly
multimedia materials".
After 2000, international organizations, like the OECD
(Organisation for Economic Co-operation and Development), have played
an instrumental role in devising generic and transdisciplinary
definitions of scientific data, as open data policies have to be
implemented beyond the specific scale of a discipline or a country. One of the first influential definitions of scientific data was coined in 1999
by a report of the National Academies of Science: "Data are facts,
numbers, letters, and symbols that describe an object, idea, condition,
situation, or other factors".
In 2004, the Science Ministers of all nations of the OECD signed a
declaration which essentially states that all publicly funded archive
data should be made publicly available. In 2007 the OECD "codified the principles for access to research data from public funding" through the Principles and Guidelines for Access to Research Data from Public Funding
which defined scientific data as "factual records (numerical scores,
textual records, images and sounds) used as primary sources for
scientific research, and that are commonly accepted in the scientific
community as necessary to validate research findings." The Principles acted as a soft-law
recommendation and affirmed that "access to research data increases the
returns from public investment in this area; reinforces open scientific
inquiry; encourages diversity of studies and opinion; promotes new
areas of work and enables the exploration of topics not envisioned by
the initial investigators."
Policy implementations (2010-…)
After 2010, national and supra-national institutions took a more
interventionist stance. New policies have been implemented to
ensure and incentivize the opening of scientific data, usually in
continuation of existing open data programs. In Europe, the "European
Union Commissioner for Research, Science, and Innovation, Carlos Moedas
made open research data one of the EU's priorities in 2015."
First published in 2016, the FAIR Guiding Principles have become an influential framework for opening scientific data. The principles were originally designed two years earlier during a policy and research workshop at Lorentz, Jointly Designing a Data FAIRport.
During the deliberations of the workshop, "the notion emerged that,
through the definition of, and widespread support for, a minimal set of
community-agreed guiding principles and practice"
The principles do not attempt to define scientific data, which
remains a relatively plastic concept, but strive to describe "what
constitutes 'good data management'".
They cover four foundational principles, "that serve to guide data
producer": Findability, Accessibility, Interoperability, and
Reusability. They also aim to provide a step toward machine-actionability by making the underlying semantics of data explicit.
As they fully acknowledge the complexity of data management, the
principles do not claim to introduce a set of rigid recommendations but
rather "degrees of FAIRness", that can be adjusted depending on the
organizational costs but also external restrictions in regards to
copyright or privacy.
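The notion of "degrees of FAIRness" can be illustrated with a small checklist. The following is only a minimal sketch, not an official FAIR assessment: the `DatasetRecord` fields, the choice of a single indicator per principle, the scoring rule and the placeholder identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: Optional[str] = None   # persistent identifier such as a DOI (Findability)
    access_url: Optional[str] = None   # where the data can be retrieved (Accessibility)
    data_format: Optional[str] = None  # shared, documented format (Interoperability)
    license: Optional[str] = None      # explicit reuse terms (Reusability)


def fairness_degree(record: DatasetRecord) -> float:
    """Return the share of the four illustrative indicators that are filled in."""
    indicators = [record.identifier, record.access_url,
                  record.data_format, record.license]
    return sum(1 for value in indicators if value) / len(indicators)


# Placeholder values: this dataset has an identifier and a license only.
record = DatasetRecord(identifier="doi:10.1234/example", license="CC0-1.0")
print(fairness_degree(record))  # 0.5
```

A graded score rather than a pass/fail flag mirrors the idea that FAIRness can be adjusted to organizational costs and external restrictions.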
The FAIR principles were quickly adopted by major
international organizations: "FAIR experienced rapid development, gaining
recognition from the European Union, G7, G20 and US-based Big Data to
Knowledge (BD2K)". In August 2016, the European Commission set up an expert group to turn "FAIR Data into reality". As of 2020, the FAIR principles remain "the most advanced technical standards for open scientific data to date".
In 2022, the French Open Science Monitor started to publish an
experimental survey of research data publications from text mining
tools. Retrospective analysis showed that the rate of publications
mentioning sharing of their associated data has nearly doubled in 10 years,
from 13% (in 2013) to 22% (in 2021).
By the end of the 2010s, open data policies were well supported by
scientific communities. Two large surveys commissioned by the European
Commission in 2016 and 2018 found a commonly perceived benefit: "74% of
researchers say that having access to other data would benefit them".
Yet, more qualitative observations gathered in the same investigation
also showed that "what scientists proclaim ideally, versus what they
actually practice, reveals a more ambiguous situation."
Until the 2010s, the publication of scientific data referred mostly
to "the release of datasets associated with an individual journal
article". This release is documented by a Data Availability Statement or DAS. Several typologies of data availability statements have been proposed. In 2021, Colavizza et al. identified three categories or levels of access:
DAS 1: "Data available on request or similar"
DAS 2: "Data available with the paper and its supplementary files"
DAS 3: "Data available in a repository"
Supplementary data files appeared in the early phase of the
transition to scientific digital publishing. While the format of
publications has largely kept the constraints of the printing format,
additional materials could be included in "supplementary information".
As a publication, supplementary data files have an ambiguous status. In
theory they are meant to be raw documents, giving access to the
background of research. In practice, the released datasets often have to
be specially curated for publication. They will usually focus on the
primary data sources, not on the entire range of observations or
measurements done for the purpose of the research: "Identifying what are
"the data" associated with any individual article, conference paper,
book, or other publication is often difficult [as] investigators collect
data continually."
The selection of the data is also further influenced by the publisher.
The editorial policy of the journal largely determines what "goes in the main
text, what in the supplemental information" and editors are especially
wary of including large datasets which may be difficult to maintain in
the long run.
Scientific datasets have been increasingly acknowledged as an
autonomous scientific publication. The assimilation of data to academic
articles aimed to increase the prestige and recognition of published
datasets: "implicit in this argument is that familiarity will encourage
data release".
This approach has been favored by several publishers and repositories
as it made it possible to easily integrate data in existing publishing
infrastructure and to extensively reuse editorial concepts initially
created around articles. Data papers were explicitly introduced as "a mechanism to incentivize data publishing in biodiversity science".
Citation and indexation
The first digital databases of the 1950s and the 1960s immediately
raised issues of citability and bibliographic descriptions.
The mutability of computer memory was especially challenging: in
contrast with printed publications, digital data could not be expected
to remain stable in the long run. In 1965, Ralph Bisco
underlined that this uncertainty affected all the associated documents
like code notebooks, which may become increasingly out of date. Data
management has to find a middle ground between continuous enhancements
and some form of generic stability: "the concept of a fluid, changeable,
continually improving data archive means that study cleaning and other
processing must be carried to such a point that changes will not
significantly affect prior analyses".
Structured bibliographic metadata for databases has been a debated topic since the 1960s.
In 1977, the American Standard for Bibliographic Reference adopted a
definition of "data file" with a strong focus on the materiality and
the mutability of the dataset: neither dates nor authors were indicated
but the medium or "Packaging Method" had to be specified.
Two years later, Sue Dodd introduced an alternative convention that
brought the citation of data closer to the standard of references for
other scientific publications.
Dodd's recommendation included the use of titles, authors, editions and
dates, as well as alternative mentions for sub-documentation like code
notebooks.
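The elements listed in this recommendation (author, title, edition, date and accompanying documentation) map naturally onto a small record type. The sketch below is illustrative only: the `DatasetCitation` class, its field names and the output layout are assumptions, and the formatted string is not Dodd's actual citation convention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Hypothetical record holding the citation elements named above."""
    authors: List[str]
    title: str
    date: str
    edition: Optional[str] = None
    codebook: Optional[str] = None  # accompanying documentation, if any

    def format(self) -> str:
        """Assemble a reference string; the layout is an assumption."""
        parts = [f"{'; '.join(self.authors)}.", f"{self.title} [data file]."]
        if self.edition:
            parts.append(f"{self.edition}.")
        parts.append(f"{self.date}.")
        if self.codebook:
            parts.append(f"Documentation: {self.codebook}.")
        return " ".join(parts)


# Placeholder values, not a real dataset.
example = DatasetCitation(authors=["Doe, Jane"], title="Example Survey",
                          date="1979", edition="2nd ed")
print(example.format())  # Doe, Jane. Example Survey [data file]. 2nd ed. 1979.
```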
The indexation of datasets has been radically transformed by the
development of the web, as barriers to data sharing were substantially
reduced.
In this process, data archiving, sustainability and persistence have
become critical issues. Permanent digital object identifiers (or DOIs)
have been introduced for scientific articles to avoid broken links, as
website structures continuously evolved. In the early 2000s, pilot
programs started to allocate DOIs to datasets as well.
While it solves concrete issues of link sustainability, the creation of
data DOIs and norms of data citation is also part of a legitimization
process that assimilates datasets to standard scientific publications and
can draw from similar sources of motivation (like the bibliometric
indexes).
Accessible and findable datasets yield a significant citation
advantage. A 2021 study of 531,889 articles published by PLOS estimated
that there is a "25.36% relative gain in citation counts in general" for
a journal article with "a link to archived data in a public
repository".
Diffusion of data as supplementary material does not yield a
significant citation advantage, which suggests that "the citation
advantage of DAS [Data Availability Statement] is not as much related to
their mere presence, but to their contents".
As of 2022, the recognition of open scientific data is still an ongoing process. The leading reference software Zotero does not yet have a specific item type for datasets.
Reuse and economic impact
Within academic research, storage and redundancy have proven to be a
significant benefit of open scientific data. In contrast, non-open
scientific data is weakly preserved and can "be retrieved only with
considerable effort by the authors" if not completely lost.
Analysis of the uses of open scientific data runs into the same
issues as for any open content: while free, universal and indiscriminate
access has demonstrably expanded the scope, range and intensity of the
reception, it has also made it harder to track, due to the lack of
a transaction process.
These issues are further complicated by the novelty of data as a
scientific publication: "In practice, it can be difficult to monitor
data reuse, mainly because researchers rarely cite the repository".
In 2018, a report of the European Commission
estimated the cost of not opening scientific data in accordance with
the FAIR principles: it amounted to 10.2 billion euros annually in direct
impact and 16 billion euros in indirect impact over the entire innovation
economy.
Implementing open scientific data at a global scale "would have a
considerable impact on the time we spent manipulating data and the way
we store data."
Practices and data culture
The sharing of scientific data is rooted in scientific cultures or communities of practice.
As digital tools have become widespread, the infrastructures, the
practices and the common representations of research communities have
increasingly relied on shared meanings of what data is and what can be
done with it.
Pre-existing epistemic machineries can be more or less
predisposed to data sharing. Important factors may include shared values
(individualistic or collective), data ownership allocation and frequent
collaborations with external actors which may be reluctant to share
data.
The emergence of an open data culture
The development of scientific open data is not limited to scientific
research. It involves a diverse set of stakeholders: "Arguments for
sharing data come from many quarters: funding agencies—both public and
private—policy bodies such as national academies and funding councils,
journal publishers, educators, the public at large, and from researchers
themselves." As such, the movement for scientific open data largely intersects with more global movements for open data.
Standard definitions of open data used by a wide range of public and
private actors have been partly elaborated by researchers around
concrete scientific issues.
The concept of transparency has especially contributed to creating
convergences between open science, open data and open government. In
2015, the OECD described transparency as a common "rationale for open science and open data".
Christine Borgman has identified four major rationales for
sharing data commonly used across the entire regulatory and public
debate over scientific open data:
Research reproducibility: lack of reproducibility is frequently
attributed to deficiencies in research transparency and data analysis
process. Consequently, as "a rationale for sharing research data,
[research reproducibility] is powerful yet problematic". Reproducibility only applies to "certain kinds of research", mostly in regards to experimental sciences.
Public accessibility: the rationale that "products of public
funding should be available to the public" is "found in arguments for
open government".
While directly inspired by similar arguments made in favor of open
access to publications, its range is more limited as scientific open
data "has direct benefits to far fewer people, and those benefits vary
by stakeholder".
Research valorization: open scientific data may bring a substantial
value to the private sector. This argument is especially used to support
"the need for more repositories that can accept and curate research
data, for better tools and services to exploit data, and for other
investments in knowledge infrastructure".
Increased research and innovation: open scientific data may
significantly enhance the quality of private and public research. This
argument aims for "investing in knowledge infrastructure to sustain
research data, curated to high standards of professional practices".
Yet collaboration between the different actors and stakeholders of
the data lifecycle is partial. Even within academic institutions,
cooperation remains limited: "most researchers are making [data related
search] without consulting a data manager or librarian."
The global open data movement has partly lost its cohesiveness
and identity during the 2010s, as debates over data availability and
licensing have been overcome by domain specific issues: "When the focus
shifts from calling for access to data to creating data infrastructure
and putting data to work, the divergent goals of those who formed an
initial open data movement come clearly into view and managing the
tensions that emerge can be complex." The very generic scope of the open data definition, which aims to embrace a very wide set of preexisting data cultures,
does not take sufficiently into account the higher threshold of accessibility
and contextualization necessitated by scientific research: "open data in
the sense of being free for reuse is a necessary but not sufficient
condition for research purposes."
Ideal and implementation: the paradox of data sharing
Since the 2000s, surveys of scientific communities have underlined a
consistent discrepancy between the ideals of data sharing and their
implementation in practice: "When present-day researchers are asked
whether they are willing to share their data, most say yes, they are
willing to do so. When the same researchers are asked if they do release
their data, they typically acknowledge that they have not done so".
Open data culture does not emerge in a vacuum and has to contend with
preexisting cultures of scientific data and a range of systemic factors
that can discourage data sharing: "In some fields, scholars are actively
discouraged from reusing data. (…) Careers are made by charting
territory that was previously uncharted."
In 2011, 67% of 1329 scientists agreed that lack of data sharing is a "major impediment to progress in science", and yet "only about a third (36%) of the respondents agree that others can access their data easily".
In 2016, a survey of researchers in environmental science found
overwhelming support for easily accessible open data (99% as at least
somewhat important) and institutional mandates for open data (88%).
Yet, "even with willingness to share data there are discrepancies with
common practices, e.g. willingness to spend time and resources
preparing and up-loading data".
A 2022 study of 1792 data sharing statements from BioMed Central found
that fewer than 7% of the authors (123) actually provided the data upon
request.
The prevalence of accessible and findable data is even lower:
"Despite several decades of policy moves toward open access to data, the
few statistics available reflect low rates of data release or deposit". In a 2011 poll for Science,
only 7.6% of researchers shared their data on community repositories,
with local websites hosted by universities or laboratories being favored
instead. Consequently, "many bemoaned the lack of common metadata and archives as a main impediment to using and storing data".
According to Borgman, the paradox of data sharing is partly due
to the limitations of open data policies, which tend to focus on
"mandating or encouraging investigators to release their data" without
meeting the "expected demand for data or the infrastructure necessary to
support release and reuse".
Incentives and barriers to scientific open data
In 2022, Pujol Priego, Wareham and Romasanta stressed that incentives for
the sharing of scientific data were primarily collective and include
reproducibility, scientific efficiency, scientific quality, along with
more individual retributions such as personal credit.
Individual benefits include increased visibility: open datasets yield a
significant citation advantage but only when they have been shared on an
open repository.
Important barriers include the need to publish first, legal constraints and concerns about loss of credit or recognition. For individual researchers, datasets may be major assets to barter for "new jobs or new collaborations" and their publication may be difficult to justify unless they "get something of value in return".
Lack of familiarity with data sharing, rather than a straight
rejection of the principles of open science, is also ultimately a leading
obstacle. Several surveys in the early 2010s have shown that
researchers "rarely seek data from other investigators and (…) they
rarely are asked for their own data."
This creates a negative feedback loop, as researchers make little effort
to ensure data sharing, which in turn discourages effective use, whereas
"the heaviest demand for reusing data exists in fields with high mutual
dependence."
The reality of data reuse may also be underestimated as data is not
considered to be a prestigious publication and the original sources
are not quoted.
A 2021 empirical study of 531,889 articles published
by PLOS showed that soft incentives and encouragements have a limited
impact on data sharing: "journal policies that encourage rather than
require or mandate DAS [Data Availability Statement] have only a small
effect".
Legal status
The opening of scientific data has raised a variety of legal issues in
regard to ownership rights, copyrights, privacy and ethics. While it is
commonly considered that researchers "own the data they collect in the
course of their research", this "view is incorrect":
the creation of a dataset potentially involves the rights of numerous
additional actors such as institutions (research agencies, funders,
public bodies), associated data producers, and private citizens whose
personal data is included.
The legal situation of digital data has been consequently described as a
"bundle of rights" due to the fact that the "legal category of
"property" (...) is not a suitable model for dealing with the complexity
of data governance problems".
Copyright
Copyright was the primary focus of the legal literature of open scientific
data until the 2010s. The legality of data sharing was early on
identified as a crucial issue. In contrast with the sharing of scientific
publication, the main impediment was not copyright but uncertainty: "the
concept of 'data' [was] a new concept, created in the computer age,
while copyright law emerged at the time of printed publications."
In theory, copyright and author rights provisions do not apply to
simple collections of facts and figures. In practice, the notion of data
is much more expansive and could include protected content or creative
arrangement of non-copyrightable contents.
The status of data in international conventions on intellectual
property is ambiguous. According to Article 2 of the Berne
Convention, "every production in the literary, scientific and artistic
domain" is protected.
Yet, research data is often not an original creation entirely produced
by one or several authors, but rather a "collection of facts, typically
collated using automated or semiautomated instruments or scientific
equipment."
Consequently, there is no universal convention on data copyright and
debates over "the extent to which copyright applies" are still
prevalent, with different outcomes depending on the jurisdiction or the
specifics of the dataset.
This lack of harmonization stems logically from the novelty of
"research data" as a key concept of scientific research: "the concept of
'data' is a new concept, created in the computer age, while copyright
law emerged at the time of printed publications."
In the United States, the European Union and several other
jurisdictions, copyright laws have acknowledged a distinction between
data itself (which can be an unprotected "fact") and the compilation of
the data (which can be a creative arrangement).
This principle largely predates the contemporary policy debate over
scientific data, as the earliest court cases ruling in favor of
compilation rights go back to the 19th century.
In the United States, compilation rights have been defined in the Copyright Act of 1976
with an explicit mention of datasets: "a work formed by the collection
and assembling of pre-existing materials or of data" (Par 101). In its 1991 decision, Feist Publications, Inc., v. Rural Telephone Service Co.,
the Supreme Court clarified the extent and the limitations of
database copyrights, as the "assembling" should be demonstrably original
and the "raw facts" contained in the compilation are still unprotected.
Even in jurisdictions where the application of copyright
to data outputs remains unsettled and partly theoretical, it has
nevertheless created significant legal uncertainties. The frontier
between a set of raw facts and an original compilation is not clearly
delineated.
Although scientific organizations are usually well aware of copyright
laws, the complexity of data rights creates unprecedented challenges.
After 2010, national and supra-national jurisdictions have partly
changed their stance in regard to the copyright protection of research
data. As sharing is encouraged, scientific data has also been
acknowledged as an informal public good: "policymakers, funders, and
academic institutions are working to increase awareness that, while the
publications and knowledge derived from research data pertain to the
authors, research data needs to be considered a public good so that its
potential social and scientific value can be realised".
Database rights
The European Union provides one of the strongest intellectual property
frameworks for data, with a double layer of rights: copyrights for
original compilations (similarly to the United States) and sui generis database rights. Criteria for the originality of compilations have been harmonized across the member states by the 1996 Database Directive and by several major cases settled by the European Court of Justice such as Infopaq International A/S v Danske Dagblades Forening or Football Dataco Ltd et al. v Yahoo! UK Ltd.
Overall, it has been acknowledged that significant efforts in the
making of the dataset are not sufficient to claim compilation rights, as
the structure has to "express his creativity in an original manner". The Database Directive has also introduced an original framework of protection for datasets, the sui generis rights that are conferred to any dataset that required a "substantial investment".
While they last 15 years, sui generis rights have the potential to
become permanent, as they can be renewed for every update of the
dataset.
Due to their large scope in length and protection, sui generis
rights were initially not widely acknowledged in European
jurisprudence, which raised a high bar for their enforcement. This
cautious approach was reversed in the 2010s, as the 2013 decision Innoweb BV v Wegener ICT Media BV and Wegener Mediaventions strengthened the positions of database owners and condemned the reuse of non-protected data in web search engines.
The consolidation and expansion of database rights remain a
controversial topic in European regulations, as it is partly at odds
with the commitment of the European Union in favor of a data-driven
economy and open science.
While a few exceptions exist for scientific and pedagogic uses, they
are limited in scope (no rights for further reutilization) and they have
not been activated in all member states.
Ownership
Copyright issues with scientific datasets have been further complicated by
uncertainties regarding ownership. Research is largely a collaborative
activity that involves a wide range of contributions. Initiatives like
CRediT (Contributor Roles Taxonomy)
have identified 14 different roles, of which 4 are explicitly related
to data management (Formal Analysis, Investigation, Data curation and
Visualization).
In the United States, ownership of research data is usually
"determined by the employer of the researcher", with the principal
investigator acting as the caretaker of the data rather than the owner.
Until the development of research open data, US institutions have been
usually more reluctant to waive copyrights on data than on publications,
as they are considered strategic assets. In the European Union, there is no largely agreed framework on the ownership of data.
The additional rights of external stakeholders have also been
raised, especially in the context of medical research. Since the 1970s,
patients have claimed some form of ownership of the data produced in the
context of clinical trials, notably with important controversies
concerning "whether research subjects and patients actually own their
own tissue or DNA."
Privacy
Numerous scientific projects rely on the collection of data about persons, notably in
medical research and the social sciences. In such cases, any policy of
data sharing has to be necessarily balanced with the preservation and
protection of personal data.
Researchers and, most specifically, principal investigators have
been subjected to obligations of confidentiality in several
jurisdictions.
Health data has been increasingly regulated since the late 20th
century, either by law or by sectorial agreements. In 2014, the European
Medicines Agency introduced important changes to the sharing of
clinical trial data, in order to prevent the release of all personal
details and all commercially relevant information. Such evolutions of
European regulation "are likely to influence the global practice of
sharing clinical trial data as open data".
Research management plans and practices have to be open, transparent and confidential by design.
Free licenses
Open licenses have been the preferred legal framework to clear the
restrictions and ambiguities in the legal definition of scientific data.
In 2003, the Berlin Declaration called for a universal waiver of reuse
rights on scientific contributions that explicitly included "raw data
and metadata".
In contrast with the development of open licenses for
publications, which occurred over a short time frame, the creation of
licenses for open scientific data has been a complicated process.
Specific rights, like the sui generis database rights in the
European Union, or specific legal principles, like the distinction
between simple facts and original compilation, were not initially
anticipated. Until the 2010s, free licenses could paradoxically add more
restrictions to the reuse of datasets, especially with regard to
attribution (which is not required for non-copyrighted objects like raw facts): "in such cases, when no rights are attached to research data, then there is no ground for licencing the data".
To circumvent the issue, several institutions like the Harvard-MIT Data Center started to share the data in the Public Domain.
This approach ensures that no right is applied on non-copyrighted
items. Yet, the public domain and some associated tools like the Public Domain Mark are not properly defined legal contracts and vary significantly from one jurisdiction to another. First introduced in 2009, the Creative Commons Zero (or CC0) license was immediately contemplated for data licensing. It has since become "the recommended tool for releasing research data into the public domain".
In accordance with the principles of the Berlin Declaration, it is not a
license but a waiver, as the producer of the data "overtly, fully,
permanently, irrevocably and unconditionally waives, abandons, and
surrenders all of Affirmer's Copyright and Related Rights".
Alternative approaches have included the design of new free
licenses to disentangle the attribution stacking specific to database
rights. In 2009, the Open Knowledge Foundation published the Open Database License which has been adopted by major online projects like OpenStreetMap.
Since 2015, all the different Creative Commons licenses have been
updated to become fully effective on datasets, as database rights have
been explicitly anticipated in the 4.0 version.
Open scientific data management
Data management has recently become a primary focus of the policy and research debate on open scientific data. The influential FAIR principles are deliberately centered on the key features of "good data management" in a scientific context. In a research context, data management is frequently associated with data lifecycles.
Various models of lifecycles with different stages have been theorized by
institutions, infrastructures and scientific communities. However, "such
lifecycles are a simplification of real life, which is far less linear
and more iterative in practice."
Integration to the research workflow
In contrast with the broad incentives for data sharing included in the
early policies in favor of open scientific data, the complexity and the
underlying costs and requirements of scientific data management are
increasingly acknowledged: "Data sharing is difficult to do and to
justify by the return on investment."
Open data is not simply a supplementary task but has to be envisioned
throughout the entire research process as it "requires changes in
methods and practices of research."
The opening of research data creates a new settlement of costs
and benefits. Public data sharing introduces a new communication setting
that primarily contrasts with private data exchange with research
collaborators or partners. The collection, the purpose, and the
limitation of data have to be explicit as it is impossible to rely on
pre-existing informal knowledge: "the documentation and representations
are the only means of communicating between data creator and user."
Lack of proper documentation means that the burden of
recontextualization falls on the potential users and may render the
dataset useless.
Publication requires further verification regarding the ownership
of the data and the potential legal liability if the data is misused.
This clarification phase becomes even more complex in international
research projects that may overlap several jurisdictions.
Data sharing and applying open science principles also bring
significant long-term advantages that may not be immediately visible.
Documentation of the dataset helps clarify the chain of provenance and
ensure that the original data has not been significantly altered or that
any further processing is fully documented if this is the case. Publication under a free license also allows delegating tasks such as long-term preservation to external actors.
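One common way to support the chain of provenance described above is to record a cryptographic checksum alongside the dataset documentation, so that a later user can check whether a file still matches the version that was described. The following is a minimal sketch under stated assumptions: the record layout and the function names (`build_provenance_record`, `matches_record`) are hypothetical and do not correspond to any specific repository's metadata standard.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance_record(path, description, processing_steps):
    """Build a small provenance record (hypothetical layout) for one file."""
    path = Path(path)
    return {
        "file": path.name,
        "sha256": sha256_of_file(path),
        "description": description,
        # Every treatment applied to the original data is listed explicitly.
        "processing_steps": list(processing_steps),
    }


def matches_record(path, record):
    """Check that the file content still matches the recorded checksum."""
    return sha256_of_file(path) == record["sha256"]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "observations.csv"
        data_file.write_text("site,temperature\nA,12.5\n")
        record = build_provenance_record(
            data_file,
            description="Example temperature readings (illustrative only)",
            processing_steps=["removed rows with missing values"],
        )
        print(json.dumps(record, indent=2))
        print("unaltered:", matches_record(data_file, record))
```

A checksum only shows that a file has changed, not how or why; the list of processing steps remains necessary to document legitimate transformations.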
By the end of the 2010s, a new specialized literature on data
management for research had emerged to codify existing practices and
regulatory principles.
Storage and preservation
The availability of non-open scientific data decays rapidly: in 2014, a
retrospective study of biological datasets showed that "the odds of a
data set being reported as extant fell by 17% per year". Consequently, the "proportion of data sets that still existed dropped from 100% in 2011 to 33% in 1991". Data loss has also been singled out as a significant issue in major journals such as Nature or Science.
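Because the quoted figure concerns odds rather than probabilities, a 17% yearly decline in odds does not translate into a 17% yearly decline in the share of surviving datasets. The sketch below illustrates the conversion under a simplifying assumption of a constant yearly decline in odds. The starting probability is a hypothetical value chosen for illustration, and the code is not a reproduction of the study's statistical model.

```python
def probability_to_odds(probability):
    """Convert a probability in [0, 1) to odds."""
    return probability / (1.0 - probability)


def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def extant_probability(initial_probability, years, yearly_odds_decline=0.17):
    """Probability that a dataset is still extant after a number of years,
    assuming the odds shrink by a constant fraction each year."""
    odds = probability_to_odds(initial_probability)
    odds *= (1.0 - yearly_odds_decline) ** years
    return odds_to_probability(odds)


if __name__ == "__main__":
    # Hypothetical starting point: 90% of datasets available at publication.
    for years in (0, 5, 10, 20):
        share = extant_probability(0.90, years)
        print(f"after {years:2d} years: {share:.1%} still extant")
```

Under this simplified model the decline in probability is slow at first, when availability is high, and accelerates as the odds approach and fall below even.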
Surveys of research practices have consistently shown that
storage norms, infrastructures, and workflow remain unsatisfying in most
disciplines. The storage and preservation of scientific data have been
identified early on as critical issues, especially concerning observational data, which are considered essential to preserve because they are the most difficult to replicate. A 2017-2018 survey of 1372 researchers contacted through the American Geophysical Union shows that only between "a quarter and a fifth of the respondents" report good data storage practices.
Short-term and unsustainable storage remains widespread, with 61% of
the respondents storing most or all of their data on personal computers.
Due to their ease of use at an individual scale, unsustainable storage
solutions are viewed favorably in most disciplines: "This mismatch
between good practices and satisfaction may show that data storage is
less important to them than data collection and analysis".
The reference model of the Open Archival Information System (OAIS), revised in 2012,
states that scientific infrastructure should seek long-term
preservation, that is, "long enough to be concerned with the impacts of
changing technologies, including support for new media and data formats,
or with a changing user community".
Consequently, good practices of data management bear both on storage
(to materially preserve the data) and, even more crucially, on curation,
"to preserve knowledge about the data to facilitate reuse".
Data sharing on public repositories has contributed to mitigating
preservation risks due to the long-term commitment of data
infrastructures and the potential redundancy of open data. A 2021 study
of 50,000 data availability statements published in PLOS One showed that
80% of the datasets could be retrieved automatically, and 98% of the
datasets with a data DOI could be retrieved automatically or manually.
Moreover, accessibility did not decay significantly for older
publications: "URLs and DOIs make the data and code associated with
papers more likely to be available over time".
Significant benefits have not been found when the open data was not
correctly linked or documented: "Simply requiring that data be shared in
some form may not have the desired impact of making scientific data
FAIR, as studies have repeatedly demonstrated that many datasets that
are ostensibly shared may not actually be accessible."
Research data management can be laid out in a data management plan or DMP.
Data management plans were first introduced in 1966 for the specific
needs of aeronautical and engineering research, which already faced
increasingly complex data frictions.
These first examples were focused on material issues associated with
the access, transfer, and storage of the data: "Until the early 2000s,
DMPs were utilised in this manner: in limited fields, for projects of
great technical complexity, and for limited mid-study data collection
and processing purposes".
After 2000, the implementation of extensive research
infrastructure and the development of open science changed the scope and
the purpose of data management plans. Policy-makers, rather than
scientists, have been instrumental in this development: "The first
publications to provide general advice and guidance to researchers
around the creation of DMPs were published from 2009 following the
publications from JISC and the OECD (…) DMP use, we infer, has been
imposed onto the research community through external forces".
Empirical studies of data practices in research have "highlighted
the need for organizations to offer more formal training and assistance
in data management to scientists".
In a 2017-2018 international survey of 1372 scientists, most requests
for help and formalization were associated with data management plans:
"creating data management plans (33.3%); training on best practices in
data management (31.3%); assistance on creating metadata to describe
data or datasets (27.6%)".
The expansion of data collection and data analysis processes has
increasingly strained a large range of informal and non-codified data
practices.
The involvement of external stakeholders in research projects
creates significant potential tensions with the principles of sharing
open data. Contributions from commercial actors can especially rely on
some form of exclusivity and appropriation of the final research
results. In 2022, Pujol Priego, Wareham, and Romasanta described several
accommodation strategies to overcome these issues, such as data
modularity (with sharing limited to some part of the data) and time
delay (with year-long embargoes before the final release of the data).
The UNESCO
Recommendation on Open Science, approved in November 2021, defines open
science infrastructures as "shared research infrastructures that are
needed to support open science and serve the needs of different
communities".
Open science infrastructures have been recognized as a major factor in
the implementation and the development of data sharing policies.
Leading infrastructures for open scientific data include data repositories, data analysis platforms, indexes, digitized libraries, or digitized archives. Infrastructures ensure that individual researchers and institutions do
not entirely bear the costs of publishing, maintaining, and indexing
datasets. They are also critical stakeholders in the definition and
adoption of open data standards, especially regarding licensing or
documentation.
By the end of the 1990s, the creation of public scientific computing infrastructure became a major policy issue:
"The lack of infrastructure to support release and reuse was
acknowledged in some of the earliest policy reports on data sharing."
The first wave of web-based scientific projects in the 1990s and the
early 2000s revealed critical issues of sustainability. As funding was
allocated for a specific period, critical databases, online tools or
publishing platforms could hardly be maintained. Project managers were faced with a valley of death "between grant funding and ongoing operational funding".
After 2010, the consolidation and expansion of commercial scientific
infrastructure, such as the acquisition of the open repositories Digital Commons and SSRN by Elsevier, prompted further calls to secure "community-controlled infrastructure". In 2015, Cameron Neylon, Geoffrey Bilder and Jennifer Lin defined an influential series of Principles for Open Scholarly Infrastructure that has been endorsed by leading infrastructures such as Crossref, OpenCitations or Data Dryad.
By 2021, public services and infrastructures for research have largely
endorsed open science as an integral part of their activity and
identity: "open science is the dominant discourse to which new online
services for research refer." According to the 2021 Roadmap of the European Strategy Forum on Research Infrastructures
(ESFRI), major legacy infrastructures in Europe have embraced open
science principles: "Most of the Research Infrastructures on the ESFRI
Roadmap are at the forefront of Open Science movement and make important
contributions to the digital transformation by transforming the whole
research process according to the Open Science paradigm."
Open science infrastructures represent a higher level of
commitment to data sharing. They rely on significant and recurrent
investments to ensure that data is effectively maintained and documented
and "add value to data through metadata, provenance, classification,
standards for data structures, and migration".
Furthermore, infrastructures need to be integrated into the norms and
expected uses of the scientific communities they mean to serve: "The
most successful become reference collections that attract longer-term
funding and can set standards for their communities".
Maintaining open standards is one of the main challenges identified by
leading European open infrastructures, as it implies choosing among
competing standards in some cases, as well as ensuring that the standards
are correctly updated and accessible through APIs or other endpoints.
The conceptual definition of open science infrastructures has been influenced mainly by the analysis of Elinor Ostrom on the commons and, more specifically, on the knowledge commons. In accordance with Ostrom, Cameron Neylon
stresses that open infrastructures are not only characterized by the
management of a pool of shared resources but also by the elaboration of
joint governance and norms.
The diffusion of open scientific data also raises stringent issues of
governance. With regard to the determination of the ownership of the
data, the adoption of free licenses and the enforcement of privacy
regulations, "continual negotiation is necessary" and involves a
wide range of stakeholders.
Beyond their integration in specific scientific communities, open science infrastructures have strong ties with the open source
and the open data movements. 82% of the European infrastructures
surveyed by SPARC claim to have partially built on open source software and
53% have their entire technological infrastructure in open source.
Open science infrastructures preferably integrate standards from other
open science infrastructures. Among European infrastructures: "The most
commonly cited systems – and thus essential infrastructure for many –
are ORCID, Crossref, DOAJ, BASE, OpenAIRE, Altmetric, and Datacite, most of which are not-for-profit".
Open science infrastructures are thus part of an emerging "truly
interoperable Open Science commons" that holds the promise of
"researcher-centric, low-cost, innovative, and interoperable tools for
research, superior to the present, largely closed system."