From Wikipedia, the free encyclopedia
In library and archival science, digital preservation is a formal endeavor to ensure that digital information of continuing value remains accessible and usable. It involves planning, resource allocation, and application of preservation methods and technologies, and it combines policies, strategies and actions to ensure access to reformatted and "born-digital"
content, regardless of the challenges of media failure and
technological change. The goal of digital preservation is the accurate
rendering of authenticated content over time.
The Association for Library Collections and Technical Services Preservation and Reformatting Section of the American Library Association
defined digital preservation as a combination of "policies, strategies
and actions that ensure access to digital content over time." According to Harrod's Librarian Glossary,
digital preservation is the method of keeping digital material alive so
that it remains usable as technological advances render the original
hardware and software specifications obsolete.
The need for digital preservation mainly arises because of the relatively short lifespan of digital media. Widely used hard drives can become unusable in a few years due to a variety of reasons such as damaged spindle motors, and flash memory (found on SSDs, phones, USB flash drives,
and in memory cards such as SD, microSD, and CompactFlash cards) can
start to lose data around a year after its last use, depending on its
storage temperature and how much data has been written to it during its
lifetime. Currently, 5D optical data storage has the potential to store digital data for thousands of years. Archival disc-based
media is available, but it is only designed to last for 50 years and it
is a proprietary format, sold by just two Japanese companies, Sony and
Panasonic. M-DISC
is a DVD-based format that claims to retain data for 1,000 years, but
writing to it requires special optical disc drives, reading it requires
increasingly uncommon optical disc drives, and the company behind the
format went bankrupt. Data stored on LTO tapes requires periodic migration, as older tapes cannot be read by newer LTO tape drives. RAID
arrays could be used to protect against failure of single hard drives,
although care needs to be taken to not mix the drives of one array with
those of another.
Fundamentals
Appraisal
Archival appraisal (or, alternatively, selection)
refers to the process of identifying records and other materials to be
preserved by determining their permanent value. Several factors are
usually considered when making this decision.
It is a difficult and critical process because the remaining selected
records will shape researchers' understanding of that body of records,
or fonds. Appraisal is identified as A4.2 within the Chain of Preservation (COP) model created by the InterPARES 2 project. Archival appraisal is not the same as monetary appraisal, which determines fair market value.
Archival appraisal may be performed once or at the various stages of acquisition and processing. Macro appraisal,
a functional analysis of records at a high level, may be performed even
before the records have been acquired to determine which records to
acquire. More detailed, iterative appraisal may be performed while the
records are being processed.
Appraisal is performed on all archival materials, not just
digital. It has been proposed that, in the digital context, it might be
desirable to retain more records than have traditionally been retained
after appraisal of analog records, primarily due to a combination of the
declining cost of storage and the availability of sophisticated
discovery tools which will allow researchers to find value in records of
low information density.
In the analog context, these records may have been discarded or only a
representative sample kept. However, the selection, appraisal, and
prioritization of materials must be carefully considered in relation to
the ability of an organization to responsibly manage the totality of
these materials.
Often libraries, and to a lesser extent, archives, are offered
the same materials in several different digital or analog formats. They
prefer to select the format that they feel has the greatest potential
for long-term preservation of the content. The Library of Congress has created a set of recommended formats for long-term preservation. They would be used, for example, if the Library was offered items for copyright deposit directly from a publisher.
Identification (identifiers and descriptive metadata)
In
digital preservation and collection management, discovery and
identification of objects is aided by the use of assigned identifiers
and accurate descriptive metadata. An identifier
is a unique label that is used to reference an object or record,
usually manifested as a number or string of numbers and letters. As a
crucial element of metadata
to be included in a database record or inventory, it is used in tandem
with other descriptive metadata to differentiate objects and their
various instantiations.
Descriptive metadata refers to information about an object's content, such as title, creator, subject, and date.
Determination of the elements used to describe an object is
facilitated by the use of a metadata schema. Extensive descriptive
metadata about a digital object helps to minimize the risks of a digital
object becoming inaccessible.
Another common type of file identification is the filename.
Implementing a file naming protocol is essential to maintaining
consistency and efficient discovery and retrieval of objects in a
collection, and is especially applicable during digitization of analog
media. Using a file naming convention, such as the 8.3 filename or the Warez standard naming,
will ensure compatibility with other systems and facilitate migration
of data, and deciding between descriptive (containing descriptive words
and numbers) and non-descriptive (often randomly generated numbers) file
names is generally determined by the size and scope of a given
collection.
However, filenames are not good for semantic identification, because
they are non-permanent labels for a specific location on a system and
can be modified without affecting the bit-level profile of a digital
file.
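The descriptive-versus-non-descriptive naming choice above can be sketched in a few lines of Python. This is a hedged illustration only: the collection code, zero-padding widths, and function name are invented for the example, not taken from any standard.

```python
def digitized_name(collection: str, item: int, page: int, ext: str = "tif") -> str:
    """Build a predictable, sortable file name for a digitization project,
    e.g. 'coll01_000123_p004.tif'. Zero-padding keeps lexicographic order
    and numeric order in agreement, which aids retrieval and migration."""
    return f"{collection}_{item:06d}_p{page:03d}.{ext}"

# A non-descriptive scheme like this scales to large collections; a
# descriptive scheme would embed words (e.g. 'census_1901_p004.tif') instead.
name = digitized_name("coll01", 123, 4)
```

As the text notes, such names remain location labels, not semantic identifiers: renaming the file changes nothing at the bit level.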
Integrity
The cornerstone of digital preservation, "data integrity"
refers to the assurance that the data is "complete and unaltered in all
essential respects"; a program designed to maintain integrity aims to
"ensure data is recorded exactly as intended, and upon later retrieval,
ensure the data is the same as it was when it was originally recorded".
Unintentional changes to data are to be avoided, and responsible
strategies put in place to detect unintentional changes and respond
appropriately. However, digital preservation efforts may
necessitate modifications to content or metadata through
responsibly-developed procedures and by well-documented policies.
Organizations or individuals may choose to retain original,
integrity-checked versions of content and/or modified versions with
appropriate preservation metadata. Data integrity practices also apply
to modified versions, as their state of capture must be maintained and
resistant to unintentional modifications.
The integrity of a record can be preserved through bit-level
preservation, fixity checking, and capturing a full audit trail of all
preservation actions performed on the record. These strategies can
ensure protection against unauthorised or accidental alteration.
Fixity
File fixity
is the property of a digital file being fixed, or unchanged. File
fixity checking is the process of validating that a file has not changed
or been altered from a previous state. This effort is often enabled by the creation, validation, and management of checksums.
While checksums are the primary mechanism for monitoring fixity
at the individual file level, an important additional consideration for
monitoring fixity is file attendance. Whereas checksums identify if a
file has changed, file attendance identifies if a file in a designated
collection is newly created, deleted, or moved. Tracking and reporting
on file attendance is a fundamental component of digital collection
management and fixity.
Characterization
Characterization
of digital materials is the identification and description of what a
file is and of its defining technical characteristics often captured by technical metadata, which records its technical attributes like creation or production environment.
Sustainability
Digital sustainability encompasses a range of issues and concerns that contribute to the longevity of digital information.
Unlike traditional strategies, whether temporary fixes or supposedly
permanent solutions, digital sustainability implies a more active and continuous process.
Digital sustainability concentrates less on the solution and technology
and more on building an infrastructure and approach that is flexible
with an emphasis on interoperability, continued maintenance and continuous development. Digital sustainability incorporates activities in the present that will facilitate access and availability in the future.
The ongoing maintenance necessary to digital preservation is analogous
to the successful, centuries-old, community upkeep of the Uffington White Horse (according to Stuart M. Shieber) or the Ise Grand Shrine (according to Jeffrey Schnapp).
Renderability
Renderability
refers to the continued ability to use and access a digital object
while maintaining its inherent significant properties.
Physical media obsolescence
Physical media obsolescence
can occur when access to digital content requires external dependencies
that are no longer manufactured, maintained, or supported. External
dependencies can refer to hardware, software, or physical carriers. For example, DLT tape was used for backups and data preservation, but is no longer used.
Format obsolescence
File
format obsolescence can occur when adoption of new encoding formats
supersedes use of existing formats, or when associated presentation
tools are no longer readily available.
While the use of file formats will vary among archival
institutions given their capabilities, there is documented acceptance
among the field that chosen file formats should be "open, standard,
non-proprietary, and well-established" to enable long-term archival use.
Factors that should enter consideration when selecting sustainable file
formats include disclosure, adoption, transparency, self-documentation,
external dependencies, impact of patents, and technical protection
mechanisms.
Other considerations for selecting sustainable file formats include
"format longevity and maturity, adaptation in relevant professional
communities, incorporated information standards, and long-term
accessibility of any required viewing software". For example, the Smithsonian Institution Archives considers uncompressed TIFFs
to be "a good preservation format for born-digital and digitized still
images because of its maturity, wide adaptation in various communities,
and thorough documentation".
Formats proprietary to one software vendor are more likely to be affected by format obsolescence. Well-used standards such as Unicode and JPEG are more likely to be readable in future.
Significant properties
Significant
properties refer to the "essential attributes of a digital object which
affect its appearance, behavior, quality and usability" and which "must
be preserved over time for the digital object to remain accessible and
meaningful."
"Proper understanding of the significant properties of digital
objects is critical to establish best practice approaches to digital
preservation. It assists appraisal and selection, processes in which
choices are made about which significant properties of digital objects
are worth preserving; it helps the development of preservation metadata,
the assessment of different preservation strategies and informs future
work on developing common standards across the preservation community."
Authenticity
Whether
analog or digital, archives strive to maintain records as trustworthy
representations of what was originally received. Authenticity has been
defined as ". . . the trustworthiness of a record as a record; i.e., the
quality of a record that is what it purports to be and that is free
from tampering or corruption". Authenticity should not be confused with accuracy;
an inaccurate record may be acquired by an archives and have its
authenticity preserved. The content and meaning of that inaccurate
record will remain unchanged.
A combination of policies, security procedures, and documentation
can be used to ensure and provide evidence that the meaning of the
records has not been altered while in the archives' custody.
Access
Digital
preservation efforts are largely to enable decision-making in the
future. Should an archive or library choose a particular strategy to
enact, the content and associated metadata must persist to allow for
actions to be taken or not taken at the discretion of the controlling
party.
Preservation metadata
Preservation metadata
is a key enabler for digital preservation, and includes technical
information for digital objects, information about a digital object's
components and its computing environment, as well as information that
documents the preservation process and underlying rights basis. It
allows organizations or individuals to understand the chain of custody. Preservation Metadata: Implementation Strategies (PREMIS)
is the de facto standard that defines the implementable, core
preservation metadata needed by most repositories and institutions. It
includes guidelines and recommendations for its usage, and has developed
shared community vocabularies.
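The kind of event documentation PREMIS calls for can be sketched as an append-only log of simple records. The structure below is loosely modeled on, and does not conform to, the PREMIS Event entity; the field names and identifiers are illustrative only:

```python
from datetime import datetime, timezone

def preservation_event(obj_id: str, event_type: str, outcome: str, agent: str) -> dict:
    """A simplified event record inspired by PREMIS (eventType,
    eventDateTime, eventOutcome, linked agent). Not the real schema."""
    return {
        "objectIdentifier": obj_id,
        "eventType": event_type,            # e.g. "ingestion", "fixity check"
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventOutcome": outcome,            # e.g. "success"
        "linkingAgentIdentifier": agent,
    }

# The chain of custody is then an append-only sequence of such events.
log = [preservation_event("urn:example:001", "ingestion", "success", "archivist-01")]
log.append(preservation_event("urn:example:001", "fixity check", "success", "scheduled-job"))
```

Each preservation action (ingest, fixity check, migration) appends one event, so the object's history can be reconstructed and verified later.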
Intellectual foundations
Preserving Digital Information (1996)
The challenges of long-term preservation of digital information have been recognized by the archival community for years. In December 1994, the Research Libraries Group
(RLG) and Commission on Preservation and Access (CPA) formed a Task
Force on Archiving of Digital Information with the main purpose of
investigating what needed to be done to ensure long-term preservation
and continued access to the digital records. The final report published
by the Task Force (Garrett, J. and Waters, D., ed. (1996). "Preserving
digital information: Report of the task force on archiving of digital
information.")
became a fundamental document in the field of digital preservation that
helped set out key concepts, requirements, and challenges.
The Task Force proposed development of a national system of
digital archives that would take responsibility for long-term storage
and access to digital information; introduced the concept of trusted
digital repositories and defined their roles and responsibilities;
identified five features of digital information integrity (content,
fixity, reference, provenance, and context) that were subsequently
incorporated into a definition of Preservation Description Information
in the Open Archival Information System Reference Model; and defined
migration as a crucial function of digital archives. The concepts and
recommendations outlined in the report laid a foundation for subsequent
research and digital preservation initiatives.
OAIS
To
standardize digital preservation practice and provide a set of
recommendations for preservation program implementation, the Reference
Model for an Open Archival Information System (OAIS)
was developed, and published in 2012. OAIS is concerned with all
technical aspects of a digital object's life cycle: ingest, archival
storage, data management, administration, access and preservation
planning.
The model also addresses metadata issues and recommends that five
types of metadata be attached to a digital object: reference
(identification) information, provenance (including preservation
history), context, fixity (authenticity indicators), and representation
(formatting, file structure, and what "imparts meaning to an object's
bitstream").
Trusted Digital Repository Model
In March 2000, the Research Libraries Group (RLG) and Online Computer Library Center
(OCLC) began a collaboration to establish attributes of a digital
repository for research organizations, building on and incorporating the
emerging international standard of the Reference Model for an Open
Archival Information System (OAIS). In 2002, they published "Trusted
Digital Repositories: Attributes and Responsibilities." In that document
a "Trusted Digital Repository" (TDR) is defined as "one whose mission
is to provide reliable, long-term access to managed digital resources to
its designated community, now and in the future." The TDR must include
the following seven attributes: compliance with the reference model for
an Open Archival Information System (OAIS), administrative
responsibility, organizational viability, financial sustainability,
technological and procedural suitability, system security, procedural
accountability. The Trusted Digital Repository Model outlines
relationships among these attributes. The report also recommended the
collaborative development of digital repository certifications, models
for cooperative networks, and sharing of research and information on
digital preservation with regard to intellectual property rights.
In 2004 Henry M. Gladney proposed another approach to digital
object preservation that called for the creation of "Trustworthy Digital
Objects" (TDOs). TDOs are digital objects that can speak to their own
authenticity since they incorporate a record maintaining their use and
change history, which allows the future users to verify that the
contents of the object are valid.
InterPARES
International
Research on Permanent Authentic Records in Electronic Systems
(InterPARES) is a collaborative research initiative led by the
University of British Columbia that is focused on addressing issues of
long-term preservation of authentic digital records. The research is
being conducted by focus groups from various institutions in North
America, Europe, Asia, and Australia, with an objective of developing
theories and methodologies that provide the basis for strategies,
standards, policies, and procedures necessary to ensure the
trustworthiness, reliability, and accuracy of digital records over time.
Under the direction of archival science professor Luciana Duranti,
the project began in 1999 with the first phase, InterPARES 1, which ran
to 2001 and focused on establishing requirements for authenticity of
inactive records generated and maintained in large databases and
document management systems created by government agencies.
InterPARES 2 (2002–2007) concentrated on issues of reliability,
accuracy and authenticity of records throughout their whole life cycle,
and examined records produced in dynamic environments in the course of
artistic, scientific and online government activities.
The third five-year phase (InterPARES 3) was initiated in 2007. Its
goal is to utilize theoretical and methodological knowledge generated by
InterPARES and other preservation research projects for developing
guidelines, action plans, and training programs on long-term
preservation of authentic records for small and medium-sized archival
organizations.
Challenges
Society's
heritage has been presented on many different materials, including
stone, vellum, bamboo, silk, and paper. Now a large quantity of
information exists in digital forms, including emails, blogs, social
networking websites, national elections websites, web photo albums, and
sites which change their content over time.
With digital media it is easier to create content and keep it
up-to-date, but at the same time there are many challenges in the
preservation of this content, both technical and economic.
Unlike traditional analog objects such as books or photographs
where the user has unmediated access to the content, a digital object
always needs a software environment to render it. These environments
keep evolving and changing at a rapid pace, threatening the continuity
of access to the content.
Physical storage media, data formats, hardware, and software all become
obsolete over time, posing significant threats to the survival of the
content. This process can be referred to as digital obsolescence.
In the case of born-digital
content (e.g., institutional archives, websites, electronic audio and
video content, born-digital photography and art, research data sets,
observational data), the enormous and growing quantity of content
presents significant scaling issues to digital preservation efforts.
Rapidly changing technologies can hinder digital preservationists' work
and techniques due to outdated and antiquated machines or technology.
This has become a common problem and one that is a constant worry for a
digital archivist—how to prepare for the future.
Digital content can also present challenges to preservation
because of its complex and dynamic nature, e.g., interactive Web pages, virtual reality and gaming environments, learning objects, and social media sites. With many emergent technologies there are substantial difficulties in maintaining the authenticity, fixity,
and integrity of objects over time, stemming from limited experience
with the particular digital storage medium; and while particular
technologies may prove to be more robust in terms of storage capacity,
there are issues in securing a framework of measures to ensure that the
object remains fixed while in stewardship.
For the preservation of software as digital content, a specific challenge is that the source code is typically unavailable, as commercial software is normally distributed only in compiled binary form. Without the source code, adaptation (porting) to modern computing hardware or operating systems is usually impossible, so the original hardware and software context must be emulated instead. Another potential challenge for software preservation is copyright law, which often prohibits the bypassing of copy protection mechanisms (Digital Millennium Copyright Act) in cases where software has become an orphaned work (abandonware).
An exemption from the United States Digital Millennium Copyright Act to
permit bypassing copy protection was approved in 2003 for a period of 3
years for the Internet Archive, which created an archive of "vintage software" as a way to preserve it. The exemption was renewed in 2006 and, as of 27 October 2009, has been indefinitely extended pending further rulemakings "for the purpose of preservation or archival reproduction of published digital works by a library or archive". The GitHub Archive Program has stored all of GitHub's open source code in a secure vault at Svalbard, on the frozen Norwegian island of Spitsbergen, as part of the Arctic World Archive, with the code stored as QR codes.
Another challenge surrounding preservation of digital content
resides in the issue of scale. The amount of digital information being
created along with the "proliferation of format types"
makes creating trusted digital repositories with adequate and
sustainable resources a challenge. The Web is only one example of what
might be considered the "data deluge". For example, the Library of Congress amassed 170 billion tweets between 2006 and 2010, totaling 133.2 terabytes, with each tweet composed of 50 fields of metadata.
The economic challenges of digital preservation are also great.
Preservation programs require significant up front investment to create,
along with ongoing costs for data ingest, data management, data
storage, and staffing. One of the key strategic challenges to such
programs is the fact that, while they require significant current and
ongoing funding, their benefits accrue largely to future generations.
Layers of archiving
The various levels of security may be represented as three layers: the "hot" (accessible online repositories) and "warm" (e.g. Internet Archive) layers both have the weakness of being founded upon electronics; both would be wiped out in a repeat of the powerful 19th-century geomagnetic storm known as the "Carrington Event". The Arctic World Archive, stored on specially developed film coated with silver halide with a lifespan of 500+ years, represents a more secure snapshot of data, with archiving intended at five-year intervals.
Strategies
In 2006, the Online Computer Library Center developed a four-point strategy for the long-term preservation of digital objects that consisted of:
- Assessing the risks for loss of content posed by technology
variables such as commonly used proprietary file formats and software
applications.
- Evaluating the digital content objects to determine what type and degree of format conversion or other preservation actions should be applied.
- Determining the appropriate metadata needed for each object type and how it is associated with the objects.
- Providing access to the content.
There are several additional strategies that individuals and
organizations may use to actively combat the loss of digital
information.
Refreshing
Refreshing is the transfer of data between two types of the same storage medium so there are no bitrot changes or alteration of data. For example, transferring census data from an old preservation CD to a new one. This strategy may need to be combined with migration when the software or hardware
required to read the data is no longer available or is unable to
understand the format of the data. Refreshing will likely always be
necessary due to the deterioration of physical media.
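The copy-and-verify step at the heart of refreshing can be sketched as follows. This is a minimal illustration (the function name is invented); the key point is that a refresh must be confirmed bit-identical so no data is altered in transit:

```python
import hashlib
import shutil
from pathlib import Path

def refresh(src: Path, dst: Path) -> None:
    """Copy a file to a new carrier and verify the copy is bit-identical,
    so that refreshing introduces no alteration of the data."""
    shutil.copyfile(src, dst)
    digest = lambda p: hashlib.sha256(p.read_bytes()).hexdigest()
    if digest(src) != digest(dst):
        raise IOError(f"refresh of {src} produced a non-identical copy")
```

In practice the source would sit on the aging medium (e.g. the old CD) and the destination on its replacement, with the outcome recorded in the preservation metadata.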
Migration
Migration
is the transferring of data to newer system environments (Garrett et
al., 1996). This may include conversion of resources from one file format to another (e.g., conversion of Microsoft Word to PDF or OpenDocument) or from one operating system to another (e.g., Windows to Linux)
so the resource remains fully accessible and functional. Two
significant problems face migration as a plausible method of digital
preservation in the long term. Because digital objects
are subject to a state of near-continuous change, migration may cause
problems in relation to authenticity, and migration has proven to be
time-consuming and expensive for "large collections of heterogeneous
objects, which would need constant monitoring and intervention".
Migration can be a very useful strategy for preserving data stored on
external storage media (e.g. CDs, USB flash drives, and 3.5" floppy
disks). These types of devices are generally not recommended for
long-term use, and the data can become inaccessible due to media and
hardware obsolescence or degradation.
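A toy illustration of format migration is converting tabular data from CSV to JSON. This stands in for the real conversions the text mentions (e.g. Word to PDF), which require dedicated tools; the function name here is invented:

```python
import csv
import io
import json

def migrate_csv_to_json(csv_text: str) -> str:
    """Convert tabular data from CSV to JSON: a toy format migration.
    In a real repository, each migration step would be recorded in the
    object's preservation metadata to support authenticity claims."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)
```

The original file would normally be retained alongside the migrated copy, so that a flawed conversion can be detected and redone later.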
Replication
Creating duplicate copies of data on one or more systems is called replication.
Data that exists as a single copy in only one location is highly
vulnerable to software or hardware failure, intentional or accidental
alteration, and environmental catastrophes like fire, flooding, etc.
Digital data is more likely to survive if it is replicated in several
locations. Replicated data may introduce difficulties in refreshing,
migration, versioning, and access control since the data is located in multiple places.
Understanding digital preservation means comprehending how
digital information is produced and reproduced. Because digital
information (e.g., a file) can be exactly replicated down to the bit
level, it is possible to create identical copies of data. Exact
duplicates allow archives and libraries to manage, store, and provide
access to identical copies of data across multiple systems and/or
environments.
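The replicate-and-verify pattern can be sketched as below. This is a minimal, single-machine illustration (function name invented); real replication systems such as LOCKSS distribute copies across independent institutions and networks:

```python
import hashlib
import shutil
from pathlib import Path

def replicate(src: Path, destinations: list) -> dict:
    """Copy one file to several storage locations and report, per copy,
    whether the replica is bit-identical to the original."""
    ref = hashlib.sha256(src.read_bytes()).hexdigest()
    report = {}
    for dest_dir in destinations:
        target = Path(dest_dir) / src.name
        shutil.copyfile(src, target)
        report[str(target)] = hashlib.sha256(target.read_bytes()).hexdigest() == ref
    return report
```

Because the copies are exact at the bit level, the same fixity manifest can be checked against every replica, which is what makes multi-site management tractable.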
Emulation
Emulation
is the replicating of functionality of an obsolete system. According to
van der Hoeven, "Emulation does not focus on the digital object, but on
the hard- and software environment in which the object is rendered. It
aims at (re)creating the environment in which the digital object was
originally created." Examples include emulating an Atari 2600 on a Windows system or emulating WordPerfect 1.0 on a Macintosh. Emulators
may be built for applications, operating systems, or hardware
platforms. Emulation has been a popular strategy for retaining the
functionality of old video game systems, such as with the MAME project. The feasibility of emulation as a catch-all solution has been debated in the academic community. (Granger, 2000)
Raymond A. Lorie has suggested a Universal Virtual Computer (UVC) could be used to run any software in the future on a yet unknown platform.
The UVC strategy uses a combination of emulation and migration. The UVC
strategy has not yet been widely adopted by the digital preservation
community.
Jeff Rothenberg, a major proponent of emulation for digital preservation in libraries, working in partnership with the Koninklijke Bibliotheek and the Nationaal Archief of the Netherlands,
developed a software program called Dioscuri, a modular emulator that
succeeds in running MS-DOS, WordPerfect 5.1, DOS games, and more.
Another example of emulation as a form of digital preservation can be seen in the case of Emory University and Salman Rushdie's papers. Rushdie donated an outdated computer to the Emory University library,
which was so old that the library was unable to extract papers from the
hard drive. In order to procure the papers, the library emulated the old
software system and was able to take the papers off his old computer.
Encapsulation
This
method maintains that preserved objects should be self-describing,
virtually "linking content with all of the information required for it
to be deciphered and understood".
The files associated with the digital object would have details of how
to interpret that object by using "logical structures called
'containers' or 'wrappers' to provide a relationship between all
information components that could be used in future development of emulators, viewers or converters through machine readable specifications". The method of encapsulation is usually applied to collections that will go unused for long periods of time.
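The container-and-wrapper idea can be sketched with a plain ZIP package holding the object together with a machine-readable metadata file. This is an invented minimal example; real encapsulation formats include BagIt bags and METS packages:

```python
import json
import zipfile
from pathlib import Path

def encapsulate(obj: Path, metadata: dict, container: Path) -> None:
    """Bundle a digital object with the information needed to interpret
    it into one self-describing container. The metadata travels with the
    object, so a future reader need not consult an external system."""
    with zipfile.ZipFile(container, "w") as z:
        z.write(obj, arcname=obj.name)
        z.writestr("metadata.json", json.dumps(metadata, indent=2))
```

The metadata file would typically record the object's format, provenance, and rendering requirements, the "information required for it to be deciphered and understood".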
Persistent archives concept
Developed by the San Diego Supercomputer Center and funded by the National Archives and Records Administration,
this method requires the development of comprehensive and extensive
infrastructure that enables "the preservation of the organisation of
collection as well as the objects that make up that collection,
maintained in a platform independent form".
A persistent archive includes both the data constituting the digital
object and the context that defines the provenance, authenticity,
and structure of the digital entities.
This allows for the replacement of hardware or software components with
minimal effect on the preservation system. This method can be based on
virtual data grids and resembles the OAIS Information Model (specifically the Archival Information Package).
Metadata attachment
Metadata
is data on a digital file that includes information on creation, access
rights, restrictions, preservation history, and rights management. Metadata attached to digital files may be affected by file format obsolescence. ASCII is considered to be the most durable format for metadata because it is widespread, backwards compatible when used with Unicode,
and utilizes human-readable characters, not numeric codes. It retains
the information itself, but not the structure in which that information is presented. For
higher functionality, SGML or XML should be used. Both markup languages are stored in ASCII format, but contain tags that denote structure and format.
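The contrast is easy to see in a minimal XML metadata record: the tags carry structure that a flat ASCII key-value listing would lose. The element names below are illustrative only, not a real schema:

```python
import xml.etree.ElementTree as ET

# Build a small metadata record whose nesting and element names denote
# structure, while the serialized form remains plain readable text.
record = ET.Element("record")
ET.SubElement(record, "title").text = "Census returns, 1901"
ET.SubElement(record, "creator").text = "General Register Office"
ET.SubElement(record, "rights").text = "Public domain"
xml_text = ET.tostring(record, encoding="unicode")
```

Because `xml_text` is ordinary text, it inherits ASCII's durability while remaining parseable into its original structure by any future XML tool.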
Preservation repository assessment and certification
A
few of the major frameworks for digital preservation repository
assessment and certification are described below. A more detailed list
is maintained by the U.S. Center for Research Libraries.
Specific tools and methodologies
TRAC
In 2007, CRL/OCLC published Trustworthy Repositories Audit & Certification: Criteria & Checklist (TRAC),
a document allowing digital repositories to assess their capability to
reliably store, migrate, and provide access to digital content. TRAC is
based upon existing standards and best practices for trustworthy digital
repositories and incorporates a set of 84 audit and certification
criteria arranged in three sections: Organizational Infrastructure;
Digital Object Management; and Technologies, Technical Infrastructure,
and Security.
TRAC "provides tools for the audit, assessment, and potential
certification of digital repositories, establishes the documentation
requirements required for audit, delineates a process for certification,
and establishes appropriate methodologies for determining the soundness
and sustainability of digital repositories".
DRAMBORA
Digital Repository Audit Method Based On Risk Assessment (DRAMBORA), introduced by the Digital Curation Centre (DCC) and DigitalPreservationEurope (DPE) in 2007, offers a methodology and a toolkit for digital repository risk assessment. The tool enables repositories to either conduct the assessment in-house (self-assessment) or to outsource the process.
The DRAMBORA process is arranged in six stages and concentrates
on the definition of mandate, characterization of asset base,
identification of risks and the assessment of likelihood and potential
impact of risks on the repository. The auditor is required to describe
and document the repository's role, objectives, policies, activities and
assets, in order to identify and assess the risks associated with these
activities and assets and define appropriate measures to manage them.
European Framework for Audit and Certification of Digital Repositories
The European Framework for Audit and Certification of Digital Repositories
was defined in a memorandum of understanding signed in July 2010
between Consultative Committee for Space Data Systems (CCSDS), Data Seal
of Approval (DSA) Board and German Institute for Standardization (DIN) "Trustworthy Archives – Certification" Working Group.
The framework is intended to help organizations in obtaining
appropriate certification as a trusted digital repository and
establishes three increasingly demanding levels of assessment:
- Basic Certification: self-assessment using 16 criteria of the Data Seal of Approval (DSA).
- Extended Certification: Basic Certification and additional
externally reviewed self-audit against ISO 16363 or DIN 31644
requirements.
- Formal Certification: validation of the self-certification with a third-party official audit based on ISO 16363 or DIN 31644.
nestor catalogue of criteria
A German initiative, nestor (the Network of Expertise in Long-Term Storage of Digital Resources) sponsored by the German Ministry of Education and Research,
developed a catalogue of criteria for trusted digital repositories in
2004. In 2008 the second version of the document was published. The
catalogue, aiming primarily at German cultural heritage and higher
education institutions, establishes guidelines for planning,
implementing, and self-evaluation of trustworthy long-term digital
repositories.
The nestor catalogue of criteria conforms to the OAIS
reference model terminology and consists of three sections covering
topics related to Organizational Framework, Object Management, and
Infrastructure and Security.
PLANETS Project
In 2002 the Preservation and Long-term Access through Networked Services (PLANETS) project, part of the EU's Sixth Framework Programme for Research and Technological Development, addressed core digital preservation challenges. The primary goal of PLANETS was to build practical services and tools to help ensure long-term access to digital cultural and scientific assets. The PLANETS project ended on May 31, 2010; its outputs are now sustained by the follow-on organisation, the Open Planets Foundation.
On October 7, 2014 the Open Planets Foundation announced that it would
be renamed the Open Preservation Foundation to align with the
organization's current direction.
PLATTER
Planning
Tool for Trusted Electronic Repositories (PLATTER) is a tool released
by DigitalPreservationEurope (DPE) to help digital repositories in
identifying their self-defined goals and priorities in order to gain
trust from the stakeholders.
PLATTER is intended to be used as a complementary tool to
DRAMBORA, NESTOR, and TRAC. It is based on ten core principles for
trusted repositories and defines nine Strategic Objective Plans,
covering such areas as acquisition, preservation and dissemination of
content, finance, staffing, succession planning, technical
infrastructure, data and metadata specifications, and disaster planning.
The tool enables repositories to develop and maintain documentation
required for an audit.
ISO 16363
A system for the "audit and certification of trustworthy digital repositories" was developed by the Consultative Committee for Space Data Systems (CCSDS) and published as ISO standard 16363 on 15 February 2012.
Extending the OAIS reference model, and based largely on the TRAC
checklist, the standard was designed for all types of digital
repositories. It provides a detailed specification of criteria against
which the trustworthiness of a digital repository can be evaluated.
The CCSDS Repository Audit and Certification Working Group also
developed and submitted a second standard, defining operational
requirements for organizations intending to provide repository auditing
and certification as specified in ISO 16363.
This standard was published as ISO 16919 – "requirements for bodies
providing audit and certification of candidate trustworthy digital
repositories" – on 1 November 2014.
Best practices
Although
preservation strategies vary for different types of materials and
between institutions, adhering to nationally and internationally
recognized standards and practices is a crucial part of digital
preservation activities. Best or recommended practices define strategies
and procedures that may help organizations to implement existing
standards or provide guidance in areas where no formal standards have
been developed.
Best practices in digital preservation continue to evolve and may
encompass processes that are performed on content prior to or at the
point of ingest into a digital repository as well as processes performed
on preserved files post-ingest over time. Best practices may also apply
to the process of digitizing analog material and may include the
creation of specialized metadata (such as technical, administrative and
rights metadata) in addition to standard descriptive metadata. The
preservation of born-digital content may include format transformations
to facilitate long-term preservation or to provide better access.
No one institution can afford to develop all of the software tools needed to ensure the accessibility of digital materials over the long term. This raises the problem of maintaining a repository of shared tools. The Library of Congress maintained such a repository for years, until that role was assumed by the Community Owned Digital Preservation Tool Registry.
Audio preservation
Various best practices and guidelines for digital audio preservation have been developed, including:
- Guidelines on the Production and Preservation of Digital Audio Objects IASA-TC 04 (2009),
which sets out the international standards for optimal audio signal
extraction from a variety of audio source materials, for analogue to
digital conversion and for target formats for audio preservation
- Capturing Analog Sound for Digital Preservation: Report of a
Roundtable Discussion of Best Practices for Transferring Analog Discs
and Tapes (2006),
which defined procedures for reformatting sound from analog to digital
and provided recommendations for best practices for digital preservation
- Digital Audio Best Practices (2006) prepared by the
Collaborative Digitization Program Digital Audio Working Group, which
covers best practices and provides guidance both on digitizing existing
analog content and on creating new digital audio resources
- Sound Directions: Best Practices for Audio Preservation (2007) published by the Sound Directions Project,
which describes the audio preservation workflows and recommended best
practices and has been used as the basis for other projects and
initiatives
- Documents developed by the International Association of Sound and Audiovisual Archives (IASA), the European Broadcasting Union (EBU), the Library of Congress, and the Digital Library Federation (DLF).
The Audio Engineering Society
(AES) also issues a variety of standards and guidelines relating to the
creation of archival audio content and technical metadata.
Moving image preservation
The
term "moving images" includes analog film and video and their
born-digital forms: digital video, digital motion picture materials, and
digital cinema. As analog videotape and film become obsolete,
digitization has become a key preservation strategy, although many
archives do continue to perform photochemical preservation of film
stock.
"Digital preservation" has a double meaning for audiovisual
collections: analog originals are preserved through digital
reformatting, with the resulting digital files preserved; and
born-digital content is collected, most often in proprietary formats
that pose problems for future digital preservation.
There is currently no broadly accepted standard target digital preservation format for analog moving images.
The complexity of digital video, as well as the varying needs and
capabilities of archival institutions, are reasons why no
"one-size-fits-all" format standard for long-term preservation exists
for digital video, as there is for some other types of digital records
(e.g., word-processing documents converted to PDF/A, or images to TIFF).
Library and archival institutions, such as the Library of Congress and New York University,
have made significant efforts to preserve moving images; however, "a
national movement to preserve video has not yet materialized". The preservation of audiovisual materials "requires much more than merely putting objects in cold storage": moving image media must be projected and played, moved and shown, and born-digital materials require a similar approach.
The following resources offer information on analog to digital reformatting and preserving born-digital audiovisual content.
- The Library of Congress tracks the sustainability of digital formats, including moving images.
- The Digital Dilemma 2: Perspectives from Independent Filmmakers, Documentarians and Nonprofit Audiovisual Archives (2012).
The section on nonprofit archives reviews common practices on digital
reformatting, metadata, and storage. There are four case studies.
- Federal Agencies Digitization Guidelines Initiative (FADGI).
Started in 2007, this is a collaborative effort by federal agencies to
define common guidelines, methods, and practices for digitizing
historical content. As part of this, two working groups are studying
issues specific to two major areas, Still Image and Audio Visual.
- PrestoCenter publishes general audiovisual information and advice at
a European level. Its online library has research and white papers on
digital preservation costs and formats.
- The Association of Moving Image Archivists (AMIA) sponsors conferences, symposia, and events on all aspects of moving image preservation, including digital. The AMIA Tech Review contains articles reflecting current thoughts and practices from the archivists' perspectives. Video Preservation for the Millennia (2012), published in the AMIA Tech Review, details the various strategies and ideas behind the current state of video preservation.
- The National Archives of Australia
produced the Preservation Digitisation Standards which set out the
technical requirements for digitisation outputs produced under the
National Digitisation Plan. This includes video and audio formats, as
well as non-audiovisual formats.
- The Smithsonian Institution Archives
published guidelines regarding file formats used for the long-term
preservation of electronic records, which are regarded as open,
standard, non-proprietary, and well-established. The guidelines are used
for video and audio formats, and other non-audiovisual materials.
Codecs and containers
Moving images require a codec for the decoding process; therefore, determining a codec is essential to digital preservation. In "A Primer on Codecs for Moving Image and Sound Archives: 10 Recommendations for Codec Selection and Management"
written by Chris Lacinak and published by AudioVisual Preservation
Solutions, Lacinak stresses the importance of archivists choosing the
correct codec as this can "impact the ability to preserve the digital
object". Therefore, the codec selection process is critical, "whether dealing with born digital content, reformatting older content, or converting analog materials".
Lacinak's ten recommendations for codec selection and management are
the following: adoption, disclosure, transparency, external
dependencies, documentation and metadata, pre-planning, maintenance,
obsolescence monitoring, maintenance of the original, and avoidance of
unnecessary trans-coding or re-encoding.
To date there is no consensus among the archival community as to which
standard codec should be used for the digitization of analog video
and the long-term preservation of digital video, nor is there a single
"right" codec for a digital object; each archival institution must "make
the decision as part of an overall preservation strategy".
A digital container format or wrapper is also required for moving images and must be chosen carefully just like the codec.
According to an international survey conducted in 2010 of over 50
institutions involved with film and video reformatting, "the three main
choices for preservation products were AVI, QuickTime (.MOV) or MXF (Material Exchange Format)". These are just a few examples of containers. The National Archives and Records Administration
(NARA) has chosen the AVI wrapper as its standard container format for
several reasons including that AVI files are compatible with numerous
open source tools such as VLC.
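Identification of such container formats typically relies on signature ("magic") bytes at known offsets in the file. The sketch below is a deliberately simplified illustration of that idea; production identification tools such as DROID or ffprobe use full format-signature databases rather than these three checks.

```python
def identify_container(header: bytes) -> str:
    """Guess a moving-image container format from its opening bytes.

    Simplified sketch: real tools check complete format signatures,
    not just these magic numbers.
    """
    # AVI: a RIFF file whose form type (bytes 8-11) is "AVI ".
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "AVI"
    # QuickTime/MP4: ISO base media family, "ftyp" box at offset 4.
    if header[4:8] == b"ftyp":
        return "QuickTime/MP4"
    # MXF: files begin with a SMPTE Universal Label, prefix 06 0E 2B 34.
    if header[:4] == b"\x06\x0e\x2b\x34":
        return "MXF"
    return "unknown"

print(identify_container(b"RIFF\x24\x00\x00\x00AVI LIST"))  # AVI
```

Because the signature sits in the first few bytes, an archive can verify that a file really is the container its extension claims without decoding any video.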
Uncertainty about which formats will or will not become obsolete
or become the future standard makes it difficult to commit to one codec
and one container. Choosing a format should "be a trade-off for which the best quality requirements and long-term sustainability are ensured".
Considerations for content creators
By
considering the following steps, content creators and archivists can
ensure better accessibility and preservation of moving images in the
long term:
- Create uncompressed video if possible. While this does create large files, their quality will be retained; the storage required must be considered with this approach.
- If uncompressed video is not possible, use lossless rather than lossy compression: losslessly compressed data can be restored exactly to the original, while lossy compression discards data and quality is lost.
- Use higher bit rates (this affects the resolution of the image and the size of the file).
- Use technical and descriptive metadata.
- Use containers and codecs that are stable and widely used within the archival and digital preservation communities.
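The storage implications of the first recommendation can be estimated with simple arithmetic: the data rate of uncompressed video is width × height × bits per pixel × frame rate. A sketch of that calculation, using standard-definition PAL 8-bit 4:2:2 video as an illustrative example:

```python
def uncompressed_video_rate(width, height, bits_per_pixel, fps):
    """Return the data rate in megabits per second for uncompressed video."""
    return width * height * bits_per_pixel * fps / 1_000_000

# Standard-definition PAL, 8-bit 4:2:2 sampling (16 bits/pixel), 25 fps.
rate = uncompressed_video_rate(720, 576, 16, 25)

# 1 TB = 8,000,000 megabits; divide by the rate to get seconds of video.
hours_per_terabyte = 8_000_000 / (rate * 3600)
print(f"{rate:.0f} Mbit/s; one TB holds about {hours_per_terabyte:.1f} hours")
```

Even at standard definition the rate is on the order of 166 Mbit/s, which is why storage planning must accompany any decision to preserve uncompressed video.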
Email preservation
Email poses special challenges for preservation: email client software
varies widely; there is no common structure for email messages; email
often communicates sensitive information; individual email accounts may
contain business and personal messages intermingled; and email may
include attached documents in a variety of file formats. Email messages
can also carry viruses or have spam content. While email transmission
is standardized, there is no formal standard for the long-term
preservation of email messages.
Approaches to preserving email may vary according to the purpose
for which it is being preserved. For businesses and government entities,
email preservation may be driven by the need to meet retention and
supervision requirements for regulatory compliance and to allow for
legal discovery. (Additional information about email archiving
approaches for business and institutional purposes may be found under
the separate article, Email archiving.)
For research libraries and archives, the preservation of email that is
part of born-digital or hybrid archival collections has as its goal
ensuring its long-term availability as part of the historical and
cultural record.
Several projects have developed tools and methodologies for email
preservation based on various strategies (normalizing email into XML
format, migrating email to a new version of the software, and emulating
email environments): Memories Using Email (MUSE), Collaborative Electronic Records Project (CERP), E-Mail Collection And Preservation (EMCAP), PeDALS Email Extractor Software (PeDALS), and XML Electronic Normalizing of Archives tool (XENA).
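The XML-normalization strategy used by projects such as CERP and XENA can be illustrated with a short sketch using Python's standard email and XML libraries. The element names below are ad hoc and do not reproduce any project's actual schema; the sample message is invented.

```python
import email
from email import policy
import xml.etree.ElementTree as ET

# Illustrative sketch of the normalization strategy: the element names
# are ad hoc, not the actual CERP/EMCAP/XENA XML schemas.
raw = (b"From: alice@example.org\r\n"
       b"To: bob@example.org\r\n"
       b"Subject: Grant report\r\n"
       b"Date: Mon, 1 Mar 2010 09:00:00 +0000\r\n"
       b"\r\n"
       b"Draft attached.\r\n")

msg = email.message_from_bytes(raw, policy=policy.default)

root = ET.Element("message")
for header in ("From", "To", "Subject", "Date"):
    ET.SubElement(root, header.lower()).text = msg[header]
ET.SubElement(root, "body").text = (
    msg.get_body(preferencelist=("plain",)).get_content())

print(ET.tostring(root, encoding="unicode"))
```

Once normalized, the message no longer depends on any particular email client: headers, body, and structure are carried in a self-describing text format that generic XML tools can read.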
Some best practices and guidelines for email preservation can be found in the following resources:
- Curating E-Mails: A Life-cycle Approach to the Management and Preservation of E-mail Messages (2006) by Maureen Pennock.
- Technology Watch Report 11-01: Preserving Email (2011) by Christopher J Prom.
- Best Practices: Email Archiving by Jo Maitland.
Video game preservation
In 2007 the Keeping Emulation Environments Portable (KEEP) project, part of the EU's Seventh Framework Programme for Research and Technological Development, developed tools and methodologies to keep digital software objects
available in their original context. Digital software objects such as video games may be lost because of digital obsolescence and the non-availability of required legacy hardware or operating system software; such software is referred to as abandonware. Because the source code is often no longer available,
emulation is the only preservation option. KEEP provided an
emulation framework to help the creation of such emulators. KEEP was
developed by Vincent Joguin, first launched in February 2009, and was
coordinated by Elisabeth Freyre of the French National Library.
A community project, MAME,
aims to emulate any historic computer game, including arcade games,
console games and the like, at a hardware level, for future archiving.
In January 2012 the POCOS project funded by JISC organised a
workshop on the preservation of gaming environments and virtual worlds.
Personal archiving
There are many things consumers and artists can do themselves to help care for their collections at home.
- The Software Preservation Society is a group of computer
enthusiasts that concentrates on finding old software disks (mostly
games) and taking snapshots of the disks in a format that can be
preserved for the future.
- "Resource Center: Caring For Your Treasures" by American Institute
for Conservation of Historic and Artistic Works details simple
strategies for artists and consumers to care for and preserve their work
themselves.
The Library of Congress also hosts a list for the self-preserver,
which points to programs and guidelines from other institutions that
help users preserve social media, email, and other formats, as well as
general guidelines (such as caring for CDs).
Some of the programs listed include:
- HTTrack: a software tool which allows the user to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to their computer.
- Muse: Muse (short for Memories Using Email) is a program, run by Stanford University, that helps users revive memories using their long-term email archives.
Scientific research
In 2020, researchers reported in a preprint that they found "176 Open Access journals
that, through lack of comprehensive and open archives, vanished from
the Web between 2000-2019, spanning all major research disciplines and
geographic regions of the world" and that in 2019 only about a third of
the 14,068 DOAJ-indexed journals ensured the long-term preservation of their content.
Some of the scientific research output is not located at the scientific
journal's website but on other sites like source-code repositories such
as GitLab. The Internet Archive archived many – but not all – of the lost academic publications and makes them available on the Web.
According to an analysis by the Internet Archive "18 per cent of all
open access articles since 1945, over three million, are not
independently archived by us or another preservation organization, other
than the publishers themselves". Sci-Hub does academic archiving outside the bounds of contemporary copyright law and also provides access to academic works that do not have an open access license.
Digital Building Preservation
"The creation of a 3D model of a historical building needs a lot of effort."
Recent advances in technology have led to the development of 3-D
rendered buildings in virtual space. Traditionally, buildings in video
games had to be rendered via code, and many game studios have produced
highly detailed renderings (see Assassin's Creed).
But because most preservationists are not teams of highly capable
professional coders, universities have begun developing methods based on
3-D laser scanning. Such work was attempted by the National Taiwan University of Science and Technology
in 2009. Their goal was "to build as-built 3D computer models of a
historical building, the Don Nan-Kuan House, to fulfill the need of
digital preservation."
With considerable success, they were able to scan the Don Nan-Kuan
House with bulky 10 kg (22 lb) cameras, requiring only minor touch-ups
where the scanners were not detailed enough. More recently, in 2018, in Calw,
Germany, a team conducted a scanning of the historic Church of St.
Peter and Paul by collecting data via laser scanning and photogrammetry.
"The current church's tower is about 64 m high, and its architectonic
style is neo-gothic of the late nineteenth century. This church counts
with a main nave, a chorus and two lateral naves in each side with
tribunes in height. The church shows a rich history, which is visible in
the different elements and architectonic styles used. Two small windows
between the choir and the tower are the oldest parts preserved, which
date to thirteenth century. The church was reconstructed and extended
during the sixteenth (expansion of the nave) and seventeenth centuries
(construction of tribunes), after the destruction caused by the Thirty
Years' War (1618-1648). However, the church was again burned by the
French Army under General Mélac at the end of the seventeenth century.
The current organ and pulpit are preserved from this time. In the late
nineteenth century, the church was rebuilt and the old dome Welsch was
replaced by the current neo-gothic tower. Other works from this period
are the upper section of the pulpit, the choir seats and the organ case.
The stained-glass windows of the choir are from the late nineteenth and
early twentieth centuries, while some of the nave's windows are from
middle of the twentieth century. Second World War having ended, some
neo-gothic elements were replaced by pure gothic ones, such as the altar
of the church, and some drawings on the walls and ceilings."
This architectural variety presented both a challenge and an
opportunity to combine different technologies in a large space, with
the goal of high-resolution capture. The results were good and are
available to view online.
Education
The Digital Preservation Outreach and Education (DPOE) program, part of
the Library of Congress, serves to foster the preservation of digital content
through a collaborative network of instructors and collection management
professionals working in cultural heritage institutions. Composed of
Library of Congress staff, the National Trainer Network, the DPOE
Steering Committee, and a community of Digital Preservation Education
Advocates, as of 2013 the DPOE has 24 working trainers across the six
regions of the United States.
In 2010 the DPOE conducted an assessment, reaching out to archivists,
librarians, and other information professionals around the country. A
working group of DPOE instructors then developed a curriculum based on the assessment results and other similar digital preservation curricula designed by other training programs, such as LYRASIS, Educopia Institute, MetaArchive Cooperative, University of North Carolina, DigCCurr (Digital Curation Curriculum) and Cornell University-ICPSR
Digital Preservation Management Workshops. The resulting core
principles are also modeled on the principles outlined in "A Framework
of Guidance for Building Good Digital Collections" by the National Information Standards Organization (NISO).
In Europe, Humboldt-Universität zu Berlin and King's College London offer a joint program in Digital Curation that emphasizes both digital humanities and the technologies necessary for long term curation. The MSc in Information Management and Preservation (Digital) offered by the HATII at the University of Glasgow has been running since 2005 and is the pioneering program in the field.
Examples of initiatives
A number of open source products have been developed to assist with digital preservation, including Archivematica, DSpace, Fedora Commons, OPUS, SobekCM and EPrints. The commercial sector also offers digital preservation software tools, such as Ex Libris Ltd.'s Rosetta,
Preservica's Cloud, Standard and Enterprise Editions, CONTENTdm,
Digital Commons, Equella, intraLibrary, Open Repository and Vital.
Large-scale initiatives
Many
research libraries and archives have begun or are about to begin
large-scale digital preservation initiatives (LSDIs). The main players
in LSDIs are cultural institutions, commercial companies such as Google
and Microsoft, and non-profit groups including the Open Content Alliance (OCA), the Million Book Project (MBP), and HathiTrust. The primary motivation of these groups is to expand access to scholarly resources.
Approximately 30 cultural entities, including the 12-member Committee on Institutional Cooperation
(CIC), have signed digitization agreements with either Google or
Microsoft. Several of these cultural entities are participating in the
Open Content Alliance and the Million Book Project. Some libraries are
involved in only one initiative and others have diversified their
digitization strategies through participation in multiple initiatives.
The three main reasons for library participation in LSDIs are: access,
preservation, and research and development. It is hoped that digital
preservation will ensure that library materials remain accessible for
future generations. Libraries have a responsibility to guarantee perpetual access
for their materials and a commitment to archive their digital
materials. Libraries plan to use digitized copies as backups for works
in case they go out of print, deteriorate, or are lost or damaged.
Arctic World Archive
The Arctic World Archive is a facility for data preservation of historical and cultural data from several countries, including open source code.