Software cracking (known as "breaking" mostly in the 1980s) is the act of removing copy protection from software. Copy protection can be removed by applying a specific crack. A crack can mean any tool that enables breaking software protection, a stolen product key, or a guessed password. Cracking software generally involves circumventing licensing and usage restrictions on commercial software by illegal methods. These methods can include modifying code directly through disassembling and bit editing, sharing stolen product keys, or developing software to generate activation keys. Examples of cracking are applying a patch, or creating reverse-engineered serial number generators known as keygens, thus bypassing software registration and payments, or converting a trial/demo version of the software into fully functioning software without paying for it. Software cracking contributes to the rise of online piracy, where pirated software is distributed to end users through file-sharing channels such as BitTorrent, one-click hosting (OCH) sites, or Usenet downloads, or as bundles of the original software with cracks or keygens.
Some of these tools are called keygens, patches, loaders, or no-disc cracks. A keygen is a handmade product serial number generator that often offers the ability to generate working serial numbers in the user's own name. A patch is a small computer program that modifies the machine code of another program. This has the advantage for a cracker of not having to include a large executable in a release when only a few bytes are changed. A loader modifies the startup flow of a program; it does not remove the protection but circumvents it. A well-known example of a loader is a trainer used to cheat in games. Fairlight pointed out in one of their .nfo files that these types of cracks are not allowed for warez scene game releases. A nukewar has shown that for a release to be a valid crack, the protection must not kick in at any point.
Software cracking is closely related to reverse engineering because the process of attacking a copy protection technology is similar to the process of reverse engineering. The distribution of cracked copies is illegal in most countries. There have been lawsuits over cracking software, and it might be legal to use cracked software in certain circumstances. Educational resources for reverse engineering and software cracking are, however, legal and available in the form of crackme programs.
History
Software is inherently expensive to produce but cheap to duplicate and
distribute. Therefore, software producers generally tried to implement
some form of copy protection
before releasing it to the market. In 1984, Laind Huntsman, the head of
software development for Formaster, a software protection company,
commented that "no protection system has remained uncracked by
enterprising programmers for more than a few months". In 2001, Dan S. Wallach, a professor from Rice University, argued that "those determined to bypass copy-protection have always found ways to do so – and always will".
Most of the early software crackers were computer hobbyists who
often formed groups that competed against each other in the cracking and
spreading of software. Breaking a new copy protection scheme as quickly
as possible was often regarded as an opportunity to demonstrate one's
technical superiority rather than a possibility of money-making.
Software crackers usually did not benefit materially from their actions
and their motivation was the challenge itself of removing the
protection.
Some low-skilled hobbyists would take already cracked software and edit various unencrypted strings of text in it to change the messages a game would display to the player, often to something considered vulgar. Uploading the altered copies to file-sharing networks provided a source of laughs for adult users. The cracker groups of the 1980s started to advertise themselves and their skills by attaching animated screens known as crack intros to the software programs they cracked and released.
Once the technical competition had expanded from the challenges of
cracking to the challenges of creating visually stunning intros, the
foundations for a new subculture known as demoscene
were established. Demoscene started to separate itself from the illegal
"warez scene" during the 1990s and is now regarded as a completely
different subculture. Many software crackers later grew into extremely capable software reverse engineers; the deep knowledge of assembly required to crack protections enables them to reverse engineer drivers in order to port them from binary-only drivers for Windows to drivers with source code for Linux and other free operating systems. Also, because music and game intros were such an integral part of gaming, the music formats and graphics became very popular when hardware became affordable for the home user.
With the rise of the Internet,
software crackers developed secretive online organizations. In the
latter half of the nineties, one of the most respected sources of
information about "software protection reversing" was Fravia's website.
In 2017, a group of software crackers started a project to preserve Apple II software by removing the copy protection.
+HCU
The High Cracking University (+HCU) was founded by Old Red Cracker (+ORC), considered a genius of reverse engineering and a legendary figure in Reverse Code Engineering
(RCE), to advance research into RCE. He had also taught and authored
many papers on the subject, and his texts are considered classics in the
field and are mandatory reading for students of RCE.
The addition of the "+" sign in front of the nickname of a
reverser signified membership in the +HCU. Amongst the students of +HCU
were the top of the elite Windows reversers worldwide.
+HCU published a new reverse engineering problem annually and a small
number of respondents with the best replies qualified for an
undergraduate position at the university.
+Fravia was a professor at +HCU. Fravia's website was known as
"+Fravia's Pages of Reverse Engineering" and he used it to challenge
programmers as well as the wider society to "reverse engineer" the
"brainwashing of a corrupt and rampant materialism". In its heyday, his
website received millions of visitors per year and its influence was
"widespread". On his site, +Fravia also maintained a database of the tutorials generated by +HCU students for posterity.
Nowadays most of the graduates of +HCU have migrated to Linux and
few have remained as Windows reversers. The information at the
university has been rediscovered by a new generation of researchers and
practitioners of RCE who have started new research projects in the
field.
Methods
The
most common software crack is the modification of an application's
binary to cause or prevent a specific key branch in the program's
execution. This is accomplished by reverse engineering the compiled program code using a debugger such as SoftICE, OllyDbg, GDB, or MacsBug until the software cracker reaches the subroutine that contains the primary method of protecting the software (or by disassembling an executable file with a program such as IDA). The binary is then modified using the debugger or a hex editor such as HIEW or monitor in a manner that replaces a prior branching opcode with its complement or a NOP opcode so the key branch will either always execute a specific subroutine
or skip over it. Almost all common software cracks are a variation of
this type. A region of code that must not be entered is often called a
"bad boy" while one that should be followed is a "good boy".
Proprietary software developers are constantly developing techniques such as code obfuscation, encryption, and self-modifying code to make binary modification increasingly difficult.
Even with these measures being taken, developers struggle to combat
software cracking. This is because it is very common for a professional
to publicly release a simple cracked EXE or Retrium Installer for public
download, eliminating the need for inexperienced users to crack the
software themselves.
A specific example of this technique is a crack that removes the
expiration period from a time-limited trial of an application. These
cracks are usually programs that alter the program executable and sometimes the .dll or .so linked to the application; the process of altering the original binary files is called patching. Similar cracks are available for software that requires a hardware dongle. A company can also break the copy protection of programs that they have legally purchased but that are licensed
to particular hardware, so that there is no risk of downtime due to
hardware failure (and, of course, no need to restrict oneself to running
the software on bought hardware only).
Another method is the use of special software such as CloneCD
to scan for the use of a commercial copy protection application. After
discovering the software used to protect the application, another tool
may be used to remove the copy protection from the software on the CD or DVD. This may enable another program such as Alcohol 120%, CloneDVD, Game Jackal, or Daemon Tools
to copy the protected software to a user's hard disk. Popular
commercial copy protection applications which may be scanned for include
SafeDisc and StarForce.
In other cases, it might be possible to decompile a program in order to get access to the original source code or code on a level higher than machine code. This is often possible with scripting languages and languages utilizing JIT compilation. An example is cracking (or debugging) on the .NET platform where one might consider manipulating CIL to achieve one's needs. Java's bytecode
also works in a similar fashion in which there is an intermediate
language before the program is compiled to run on the platform dependent
machine code.
Advanced reverse engineering for protections such as SecuROM, SafeDisc, StarForce, or Denuvo
requires a cracker, or many crackers to spend much more time studying
the protection, eventually finding every flaw within the protection
code, and then coding their own tools to "unwrap" the protection
automatically from executable (.EXE) and library (.DLL) files.
There are a number of sites on the Internet that let users download cracks produced by warez groups for popular games and applications (although at the risk of acquiring malicious software that is sometimes distributed via such sites). Although these cracks are used by legitimate buyers of software, they can also be used by people who have downloaded or otherwise obtained unauthorized copies (often through P2P networks).
Software piracy
Software
cracking led to the distribution of pirated software around the world
(software piracy). It was estimated that the United States lost US$2.3
billion in business application software in 1996. Software piracy was especially prevalent in African, Asian, Eastern European, and Latin American countries. In certain countries such as Indonesia, Pakistan,
Kuwait, China, and El Salvador, 90% of the software used was pirated.
A disk image is a snapshot of a storage device's structure and data, typically stored in one or more computer files on another storage device.
Traditionally, disk images were bit-by-bit copies of every sector
on a hard disk often created for digital forensic purposes, but it is
now common to only copy allocated data to reduce storage space. Compression and deduplication are commonly used to reduce the size of the image file set.
Disk imaging is done for a variety of purposes including digital forensics, cloud computing, system administration, as part of a backup strategy, and legacy emulation as part of a digital preservation strategy.
Disk images can be made in a variety of formats depending on the purpose. Virtual disk images (such as VHD and VMDK) are intended to be used for cloud computing; ISO images are intended to emulate optical media; and raw disk images are used for forensic purposes. Proprietary formats are typically used by disk imaging software.
Despite the benefits of disk imaging, the storage costs can be high, management can be difficult, and images can be time-consuming to create.
Background
Disk images were originally (in the late 1960s) used for backup and disk cloning of mainframe disk media. Early ones were as small as 5 megabytes and as large as 330 megabytes, and the copy medium was magnetic tape, which ran as large as 200 megabytes per reel.
Disk images became much more popular when floppy disk media became
popular, where replication or storage of an exact structure was
necessary and efficient, especially in the case of copy protected floppy disks.
Disk image creation is called disk imaging and is often time
consuming, even with a fast computer, because the entire disk must be
copied.
Typically, disk imaging requires a third party disk imaging program or
backup software. The software required varies according to the type of
disk image that needs to be created. For example, RawWrite and WinImage
create floppy disk image files for MS-DOS and Microsoft Windows. In Unix or similar systems the dd program can be used to create raw disk images. Apple Disk Copy can be used on Classic Mac OS and macOS systems to create and write disk image files.
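As a concrete illustration of the raw-imaging step described above, the following minimal sketch invokes dd from Python to copy a whole device into an image file. The device and output paths are placeholders, "status=progress" is a GNU dd option, and reading a whole device normally requires elevated privileges.

```python
# Minimal sketch: create a raw disk image by invoking dd from Python.
# /dev/sdX and the output path are placeholders.
import subprocess

source_device = "/dev/sdX"           # placeholder source device
image_path = "/mnt/backup/disk.img"  # placeholder destination file

subprocess.run(
    ["dd", f"if={source_device}", f"of={image_path}",
     "bs=4M", "status=progress"],
    check=True,
)
```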
Authoring software for CDs/DVDs such as Nero Burning ROM can generate and load disk images for optical media. A virtual disk writer or virtual burner
is a computer program that emulates an actual disc authoring device
such as a CD writer or DVD writer. Instead of writing data to an actual
disc, it creates a virtual disk image.
A virtual burner, by definition, appears as a disc drive in the system
with writing capabilities (as opposed to conventional disc authoring
programs that can create virtual disk images), thus allowing software
that can burn discs to create virtual discs.
Uses
Digital forensics
Forensic
imaging is the process of creating a bit-by-bit copy of the data on the
drive, including files, metadata, volume information, filesystems and
their structure.
Often, these images are also hashed to verify their integrity and that
they have not been altered since being created. Unlike disk imaging for
other purposes, digital forensic applications take a bit-by-bit copy to
ensure forensic soundness. The purpose of imaging the disk is not only to discover evidence preserved in digital information but also to examine the drive to gather clues about how the crime was committed.
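The integrity hashing mentioned above can be illustrated with a short sketch that computes a SHA-256 digest of an image file in fixed-size chunks; the digest recorded at acquisition time can later be recomputed to show that the image has not been altered. The file name is a placeholder.

```python
# Minimal sketch: hash a disk image in chunks so that the digest recorded at
# acquisition time can later be used to verify the image is unaltered.
import hashlib

def hash_image(path: str, chunk_size: int = 1024 * 1024) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

print(hash_image("evidence_disk.img"))  # placeholder image path
```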
Virtualization
Creating a virtual disk image of optical media or a hard disk drive is typically done to make the content available to one or more virtual machines. Virtual machines emulate a CD/DVD drive by reading an ISO image. This can also be faster than reading from the physical optical medium, and there are fewer issues with wear and tear. A hard disk drive or solid-state drive in a virtual machine is implemented as a disk image (e.g., the VHD format used by Microsoft's Hyper-V, the VDI format used by Oracle Corporation's VirtualBox, the VMDK format used for VMware virtual machines, or the QCOW format used by QEMU).
Virtual hard disk images tend to be stored as either a collection of
files (where each one is typically 2GB in size), or as a single file.
Virtual machines treat the image set as a physical drive.
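As a small illustration of working with these formats, the following sketch (paths are placeholders; the qemu-img tool must be installed) converts a raw image into the QCOW2 format used by QEMU and then prints information about the result.

```python
# Minimal sketch: convert a raw disk image to QCOW2 with qemu-img,
# then inspect the result (format, virtual size, actual disk usage).
import subprocess

subprocess.run(
    ["qemu-img", "convert", "-f", "raw", "-O", "qcow2",
     "disk.img", "disk.qcow2"],
    check=True,
)
subprocess.run(["qemu-img", "info", "disk.qcow2"], check=True)
```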
Rapid deployment of systems
Educational institutions and businesses often need to buy or replace computer systems in large numbers. Disk imaging is commonly used to rapidly deploy the same configuration across workstations.
Disk imaging software is used to create an image of a
completely-configured system (such an image is sometimes called a golden
image). This image is then written to a computer's hard disk (which is sometimes described as restoring an image).
Network-based image deployment
Image restoration can be done using network-based image deployment. This method uses a PXE
server to boot an operating system over a computer network that
contains the necessary components to image or restore storage media in a
computer. This is usually used in conjunction with a DHCP server to automate the configuration of network parameters including IP addresses. Multicasting, broadcasting or unicasting tend to be used to restore an image to many computers simultaneously. These approaches do not work well if one or more computers experience packet loss. As a result, some imaging solutions use the BitTorrent protocol to overcome this problem.
Network-based image deployment reduces the need to maintain and
update individual systems manually. Imaging is also easier than
automated setup methods because an administrator does not need to have
knowledge of the prior configuration to copy it.
A disk image contains all files and data (i.e., file attributes and the file fragmentation state). For this reason, it is also used for backing up optical media (CDs and DVDs, etc.), and allows exact and efficient recovery after experimenting with modifications to a system or virtual machine. Disk imaging can also be used to quickly restore an entire system to an operational state after a disaster.
Digital preservation
Libraries and museums are typically required to archive and digitally preserve information without altering it in any manner. Emulators
frequently use disk images to emulate floppy disks that have been
preserved. This is usually simpler to program than accessing a real
floppy drive (particularly if the disks are in a format not supported by
the host operating system), and allows a large library of software to
be managed. Emulation also allows existing disk images to be put into a
usable form even though the data contained in the image is no longer
readable without emulation.
Limitations
Disk imaging is time-consuming, the space requirements are high, and reading from an image can be slower than reading from the disk directly because of a performance overhead.
Another limitation can be the lack of access to the software required to read the contents of the image. For example, prior to Windows 8, third-party software was required to mount disk images. When imaging multiple computers with only minor differences, much data is duplicated unnecessarily, wasting space.
Speed and failure
Disk
imaging can be slow, especially for older storage devices. A typical
4.7 GB DVD can take an average of 18 minutes to duplicate.
Floppy disks read and write much slower than hard disks. Therefore,
despite their small size, it can take several minutes to copy a single
disk. In some cases, disk imaging can fail due to bad sectors or
physical wear and tear on the source device. Unix utilities (such as dd) are not designed to cope with failures, causing the disk image creation process to fail. When data recovery is the end goal, it is instead recommended to use more specialised tools (such as ddrescue).
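The difference between a plain copy and a recovery-oriented tool can be sketched as follows: read the source in fixed blocks and, when a block cannot be read, substitute zeros and keep going rather than aborting. This is a deliberately simplified illustration; real tools such as ddrescue also keep a map of bad regions and retry them in later passes. Paths are placeholders.

```python
# Simplified sketch of a fault-tolerant copy: zero-fill any block that cannot
# be read instead of aborting the whole imaging run.
import os

BLOCK = 64 * 1024  # copy granularity; unreadable blocks are zero-filled

def rescue_copy(src_path: str, dst_path: str) -> int:
    bad_blocks = 0
    src = os.open(src_path, os.O_RDONLY)
    try:
        size = os.lseek(src, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                length = min(BLOCK, size - offset)
                try:
                    os.lseek(src, offset, os.SEEK_SET)
                    chunk = os.read(src, length)
                except OSError:
                    chunk = b""          # this block could not be read
                    bad_blocks += 1
                # pad short or failed reads so the image stays aligned
                chunk = chunk.ljust(length, b"\x00")
                dst.write(chunk)
                offset += length
    finally:
        os.close(src)
    return bad_blocks

# Example (placeholders): print(rescue_copy("/dev/sdX", "rescued.img"))
```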
Logical failures occur when the hard drive devices are functional but the user or operating system cannot retrieve or access the data stored on them. Logical failures can occur due to corruption of the engineering chip, lost partitions, firmware failure, or failures during formatting/re-installation.
Data recovery can range from a simple task to a significant technical challenge, which is why there are software companies specialized in this field.
About
The
most common data recovery scenarios involve an operating system
failure, malfunction of a storage device, logical failure of storage
devices, accidental damage or deletion, etc. (typically, on a
single-drive, single-partition,
single-OS system), in which case the ultimate goal is simply to copy
all important files from the damaged media to another new drive. This
can be accomplished by booting from a Live CD, DVD, or USB drive instead of the corrupted drive in question. Many Live
CDs or DVDs provide a means to mount the system drive and backup drives
or removable media, and to move the files from the system drive to the
backup media with a file manager or optical disc authoring software. Such cases can often be mitigated by disk partitioning and consistently storing valuable data files (or copies of them) on a different partition from the replaceable OS system files.
Another scenario involves a drive-level failure, such as a compromised file system or drive partition, or a hard disk drive failure.
In any of these cases, the data is not easily read from the media
devices. Depending on the situation, solutions involve repairing the logical file system, partition table, or master boot record; updating the firmware; or drive recovery techniques ranging from software-based recovery of corrupted data, to hardware- and software-based recovery of damaged service areas (also known as the hard disk drive's "firmware"), to hardware replacement on a physically damaged drive, which allows for the extraction of data to a new drive. If a drive recovery is necessary, the
drive itself has typically failed permanently, and the focus is rather
on a one-time recovery, salvaging whatever data can be read.
In a third scenario, files have been accidentally "deleted"
from a storage medium by the users. Typically, the contents of deleted
files are not removed immediately from the physical drive; instead,
references to them in the directory structure are removed, and the space occupied by the deleted data is made available for later overwriting. To end users, deleted files are not discoverable through a standard file manager, but the deleted data still technically exists on the physical drive. In the meantime, the original file contents remain, often in several disconnected fragments, and may be recoverable if not overwritten by other data files.
The term "data recovery" is also used in the context of forensic applications or espionage, where data which have been encrypted,
hidden, or deleted, rather than damaged, are recovered. Sometimes data present in a computer is encrypted or hidden, for reasons such as a virus attack, and can only be recovered by computer forensics experts.
Physical damage
A
wide variety of failures can cause physical damage to storage media,
which may result from human errors and natural disasters. CD-ROMs
can have their metallic substrate or dye layer scratched off; hard
disks can suffer from a multitude of mechanical failures, such as head crashes, PCB failure, and failed motors; tapes can simply break.
Physical damage to a hard drive, even in cases where a head crash
has occurred, does not necessarily mean there will be a permanent loss
of data. The techniques employed by many professional data recovery
companies can typically salvage most, if not all, of the data that had
been lost when the failure occurred.
Of course, there are exceptions to this, such as cases where severe damage to the hard drive platters
may have occurred. However, if the hard drive can be repaired and a
full image or clone created, then the logical file structure can be
rebuilt in most instances.
Most physical damage cannot be repaired by end users. For
example, opening a hard disk drive in a normal environment can allow
airborne dust to settle on the platter and become caught between the
platter and the read/write head. During normal operation, read/write heads float 3 to 6 nanometers
above the platter surface, and the average dust particles found in a
normal environment are typically around 30,000 nanometers in diameter.
When these dust particles get caught between the read/write heads and
the platter, they can cause new head crashes that further damage the
platter and thus compromise the recovery process. Furthermore, end users
generally do not have the hardware or technical expertise required to
make these repairs. Consequently, data recovery companies are often
employed to salvage important data with the more reputable ones using class 100 dust- and static-free cleanrooms.
Recovery techniques
Recovering
data from physically damaged hardware can involve multiple techniques.
Some damage can be repaired by replacing parts in the hard disk. This
alone may make the disk usable, but there may still be logical damage. A
specialized disk-imaging procedure is used to recover every readable
bit from the surface. Once this image is acquired and saved on a
reliable medium, the image can be safely analyzed for logical damage and
will possibly allow much of the original file system to be
reconstructed.
Hardware repair
A common misconception is that a damaged printed circuit board
(PCB) may be simply replaced during recovery procedures by an identical
PCB from a healthy drive. While this may work in rare circumstances on
hard disk drives manufactured before 2003, it will not work on newer
drives. Electronics boards of modern drives usually contain
drive-specific adaptation data
(generally a map of bad sectors and tuning parameters) and other
information required to properly access data on the drive. Replacement
boards often need this information to effectively recover all of the
data. The replacement board may need to be reprogrammed. Some
manufacturers (Seagate, for example) store this information on a serial EEPROM chip, which can be removed and transferred to the replacement board.
Each hard disk drive has what is called a system area or service area;
this portion of the drive, which is not directly accessible to the end
user, usually contains the drive's firmware and adaptive data that help the
drive operate within normal parameters.
One function of the system area is to log defective sectors within the
drive; essentially telling the drive where it can and cannot write
data.
The sector lists are also stored on various chips attached to the
PCB, and they are unique to each hard disk drive. If the data on the
PCB do not match what is stored on the platter, then the drive will not
calibrate properly. In most cases the drive heads will click because they are unable to find the data matching what is stored on the PCB.
Logical damage
The term "logical damage" refers to situations in which the error is
not a problem in the hardware and requires software-level solutions.
Corrupt partitions and file systems, media errors
In some cases, data on a hard disk drive can be unreadable due to damage to the partition table or file system,
or to (intermittent) media errors. In the majority of these cases, at
least a portion of the original data can be recovered by repairing the
damaged partition table or file system using specialized data recovery
software such as Testdisk; software like ddrescue
can image media despite intermittent errors, and image raw data when
there is partition table or file system damage. This type of data
recovery can be performed by people without expertise in drive hardware
as it requires no special physical equipment or access to platters.
Sometimes data can be recovered using relatively simple methods and tools; more serious cases can require expert intervention, particularly if parts of files are irrecoverable. Data carving is the recovery of parts of damaged files using knowledge of their structure.
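A deliberately simplified sketch of data carving is shown below: the raw image is scanned for JPEG start and end markers and each candidate file is saved. Real carvers use far more format knowledge and handle fragmentation; the image path is a placeholder and the whole image is read into memory for brevity.

```python
# Simplified data-carving sketch: find JPEG start (FF D8 FF) and end (FF D9)
# markers in a raw image and save each candidate file.
SOI = b"\xff\xd8\xff"   # JPEG start-of-image marker
EOI = b"\xff\xd9"       # JPEG end-of-image marker

def carve_jpegs(image_path: str) -> int:
    with open(image_path, "rb") as f:
        data = f.read()
    count = 0
    start = data.find(SOI)
    while start != -1:
        end = data.find(EOI, start)
        if end == -1:
            break
        with open(f"carved_{count:04d}.jpg", "wb") as out:
            out.write(data[start:end + len(EOI)])
        count += 1
        start = data.find(SOI, end + len(EOI))
    return count

print(carve_jpegs("disk.img"), "candidate JPEGs carved")
```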
After data has been physically overwritten on a hard disk drive, it is generally assumed that the previous data can no longer be recovered. In 1996, Peter Gutmann, a computer scientist, presented a paper that suggested overwritten data could be recovered through the use of magnetic force microscopy. In 2001, he presented another paper on a similar topic.
To guard against this type of data recovery, Gutmann and Colin Plumb
designed a method of irreversibly scrubbing data, known as the Gutmann method and used by several disk-scrubbing software packages.
Substantial criticism has followed, primarily dealing with the
lack of any concrete examples of significant amounts of overwritten data
being recovered.
Gutmann's article contains a number of errors and inaccuracies,
particularly regarding information about how data is encoded and
processed on hard drives.
Although Gutmann's theory may be correct, there is no practical evidence that overwritten data can be recovered, and research indicates that overwritten data cannot be recovered.
Solid-state drives (SSDs) overwrite data differently from hard disk drives (HDDs), which makes at least some of their data easier to recover. Most SSDs use flash memory to store data in pages and blocks, referenced by logical block addresses (LBAs) that are managed by the flash translation layer
(FTL). When the FTL modifies a sector it writes the new data to another
location and updates the map so the new data appear at the target LBA.
This leaves the pre-modification data in place, with possibly many
generations, and recoverable by data recovery software.
Lost, deleted, and formatted data
Sometimes, data present in physical drives (internal or external hard disks, pen drives, etc.) is lost, deleted, or formatted due to circumstances such as a virus attack, accidental deletion, or accidental use of SHIFT+DELETE. In these cases, data recovery software is used to recover or restore the data files.
Logical bad sector
Among the logical failures of hard disks, a logical bad sector is the most common fault that leaves data unreadable. Sometimes it is
possible to sidestep error detection even in software, and perhaps with
repeated reading and statistical analysis recover at least some of the
underlying stored data. Sometimes prior knowledge of the data stored and
the error detection and correction codes can be used to recover even
erroneous data. However, if the underlying physical drive is degraded
badly enough, at least the hardware surrounding the data must be
replaced, or it might even be necessary to apply laboratory techniques
to the physical recording medium. Each of the approaches is
progressively more expensive, and as such progressively more rarely
sought.
Eventually, if the final, physical storage medium has indeed been
disturbed badly enough, recovery will not be possible using any means;
the information has irreversibly been lost.
Remote data recovery
Recovery
experts do not always need to have physical access to the damaged
hardware. When the lost data can be recovered by software techniques,
they can often perform the recovery using remote access software over
the Internet, LAN or other connection to the physical location of the
damaged media. The process is essentially no different from what the
end user could perform by themselves.
Remote recovery requires a stable connection with an adequate
bandwidth. However, it is not applicable where access to the hardware is
required, as in cases of physical damage.
Four phases of data recovery
Usually,
there are four phases when it comes to successful data recovery, though
that can vary depending on the type of data corruption and recovery
required.
Phase 1
Repair the hard disk drive
The hard drive is repaired in order to get it running in some form,
or at least in a state suitable for reading the data from it. For
example, if heads are bad they need to be changed; if the PCB is faulty
then it needs to be fixed or replaced; if the spindle motor is bad the
platters and heads should be moved to a new drive.
Phase 2
Image the drive to a new drive or a disk image file
When a hard disk drive fails, getting the data off the drive is the top priority. The longer a faulty drive is used, the
more likely further data loss is to occur. Creating an image of the
drive will ensure that there is a secondary copy of the data on another
device, on which it is safe to perform testing and recovery procedures
without harming the source.
Phase 3
Logical recovery of files, partition, MBR and filesystem structures
After the drive has been cloned to a new drive, the clone can be used to attempt the retrieval of lost data. If the drive has failed logically,
there are a number of reasons for that. Using the clone it may be
possible to repair the partition table or master boot record (MBR) in order to read the file system's data structure and retrieve stored data.
Phase 4
Repair damaged files that were retrieved
Data damage can be caused when, for example, a file is written to a
sector on the drive that has been damaged. This is the most common cause
in a failing drive, meaning that data needs to be reconstructed to
become readable. Corrupted documents can be recovered by several
software methods or by manually reconstructing the document using a hex
editor.
Restore disk
The Windows
operating system can be reinstalled on a computer that is already
licensed for it. The reinstallation can be done by downloading the
operating system or by using a "restore disk" provided by the computer
manufacturer. Eric Lundgren was fined and sentenced to U.S. federal
prison in April 2018 for producing 28,000 restore disks and intending to
distribute them for about 25 cents each as a convenience to computer
repair shops.
Data recovery cannot always be done on a running system. As a result, a boot disk, live CD, live USB, or other type of live distro containing a minimal operating system can be used instead. Examples include:
Knoppix: contains utilities for data recovery under Linux
SystemRescueCD: an Arch Linux based live CD, useful for repairing unbootable computer systems and retrieving data after a system crash
Windows Preinstallation Environment
(WinPE): A customizable Windows Boot DVD (made by Microsoft and
distributed for free). Can be modified to boot to any of the programs
listed.
Consistency checkers
CHKDSK: a consistency checker for DOS and Windows systems
The Coroner's Toolkit: a suite of utilities for assisting in forensic analysis of a UNIX system after a break-in
The Sleuth Kit:
also known as TSK, a suite of forensic analysis tools developed by
Brian Carrier for UNIX, Linux and Windows systems. TSK includes the
Autopsy forensic browser.
Data sanitization involves the secure and permanent erasure of
sensitive data from datasets and media to guarantee that no residual
data can be recovered even through extensive forensic analysis. Data sanitization has a wide range of applications but is mainly used for clearing out end-of-life
electronic devices or for the sharing and use of large datasets that
contain sensitive information. The main strategies for erasing personal
data from devices are physical destruction, cryptographic erasure, and
data erasure. While the term data sanitization may lead some to believe
that it only includes data on electronic media, the term also broadly
covers physical media, such as paper copies. These data types are termed
soft for electronic files and hard for physical media paper copies.
Data sanitization methods are also applied for the cleaning of sensitive
data, such as through heuristic-based methods, machine-learning based
methods, and k-source anonymity.
This erasure is necessary as an increasing amount of data is moving to online storage, which poses a privacy risk when a device is resold to another individual. The importance of data
sanitization has risen in recent years as private information is
increasingly stored in an electronic format and larger, more complex
datasets are being utilized to distribute private information.
Electronic storage has expanded and enabled more private data to be
stored. Therefore, more advanced and thorough data sanitization techniques are required to ensure that no data is left on a device once it is no longer in use. Technological tools that enable the transfer of large amounts of data also allow more private data to be shared. Especially with the increasing popularity of cloud-based information sharing and storage, data sanitization methods that ensure all shared data is cleaned have become a significant concern. Therefore, it is only sensible that governments and private industry create and enforce data sanitization policies to prevent data loss or other security incidents.
Data sanitization policy in public and private sectors
While
the practice of data sanitization is common knowledge in most technical
fields, it is not consistently understood across all levels of business
and government. Thus, a comprehensive data sanitization policy is required in government contracting and private industry in order to avoid the possible loss of data, leaking of state secrets to adversaries, disclosure of proprietary technologies, and possible exclusion from contract competition by government agencies.
With the increasingly connected world, it has become even more
critical that governments, companies, and individuals follow specific
data sanitization protocols to ensure that the confidentiality of
information is sustained throughout its lifecycle. This step is critical
to the core Information Security triad of Confidentiality, Integrity,
and Availability. This CIA Triad
is especially relevant to those who operate as government contractors
or handle other sensitive private information. To this end, government
contractors must follow specific data sanitization policies and use
these policies to enforce the National Institute of Standards and Technology recommended guidelines for Media Sanitization covered in NIST Special Publication 800-88.
This is especially relevant for any government work which requires CUI (Controlled Unclassified Information) or above and is required by DFARS Clause 252.204-7012, Safeguarding Covered Defense Information and Cyber Incident Reporting.
While private industry may not be required to follow NIST 800-88
standards for data sanitization, it is typically considered to be a best
practice across industries with sensitive data. To further compound the
issue, the ongoing shortage of cyber specialists and confusion on
proper cyber hygiene has created a skill and funding gap for many
government contractors.
However, failure to follow these recommended sanitization
policies may result in severe consequences, including losing data,
leaking state secrets to adversaries, losing proprietary technologies,
and preventing contract competition by government agencies.
Therefore, the government contractor community must ensure its data
sanitization policies are well defined and follow NIST guidelines for
data sanitization. Additionally, while data sanitization may seem to focus on electronic “soft copy” data, other data sources such as “hard copy” documents must be addressed in the same sanitization policies.
Data sanitization trends
To examine the existing instances of data sanitization policies and determine the impacts of not developing, utilizing, or following these policy guidelines and recommendations, research data was coalesced not only from the government contracting sector but also from other critical industries such as Defense, Energy, and Transportation. These were
selected as they typically also fall under government regulations, and
therefore NIST (National Institute of Standards and Technology)
guidelines and policies would also apply in the United States. Primary
Data is from the study performed by an independent research company
Coleman Parkes Research in August 2019.
This research project targeted many different senior cyber executives
and policy makers while surveying over 1,800 senior stakeholders. The
data from Coleman Parkes shows that 96% of organizations have a data
sanitization policy in place; however, in the United States, only 62% of
respondents felt that the policy is communicated well across the
business. Additionally, it reveals that remote and contract workers were
the least likely to comply with data sanitization policies. This trend
has become a more pressing issue as many government contractors and
private companies have been working remotely due to the Covid-19
pandemic. This trend is likely to continue after the return to normal working conditions.
On June 26, 2021, a basic Google search for “data lost due to
non-sanitization” returned over 20 million results. These included articles on data breaches and the loss of business, military secrets and proprietary data losses, PHI (Protected Health Information), PII (Personally Identifiable Information), and many articles on performing essential data sanitization. Many of
these articles also point to existing data sanitization and security
policies of companies and government entities, such as the U.S.
Environmental Protection Agency, "Sample Policy and Guidance Language
for Federal Media Sanitization".
Based on these articles and NIST 800-88 recommendations, depending on
its data security level or categorization, data should be:
Cleared – Provides a basic level of data sanitization by overwriting data sectors to remove any previous data remnants that a basic format would not remove. Again, the focus is on electronic media. This method is typically utilized if the media is going to be re-used within the organization at a similar data security level.
Purged – May use physical (degaussing) or logical (sector overwrite) methods to make the target media unreadable. Typically utilized when media is no longer needed and is at a lower data security level.
Destroyed – Permanently renders the data irretrievable and is
commonly used when media is leaving an organization or has reached its
end of life, i.e., paper shredding or hard drive/media crushing and
incineration. This method is typically utilized for media containing
highly sensitive information and state secrets which could cause grave
damage to national security or to the privacy and safety of individuals.
Data sanitization road blocks
The International Information Systems Security Certification Consortium
2020 Cyber Workforce study shows that the global cybersecurity industry
still has over 3.12 million unfilled positions due to a skills
shortage.
Therefore, those with the correct skillset to implement NIST 800-88 in
policies may come at a premium labor rate. In addition, staffing and funding need to be adjusted to meet policy needs so that these sanitization methods can be properly implemented in tandem with appropriate data-level categorization to improve data security outcomes and reduce data loss.
In order to ensure the confidentiality of customer and client data,
government and private industry must create and follow concrete data
sanitization policies which align with best practices, such as those
outlined in NIST 800-88. Without consistent and enforced policy
requirements, the data will be at increased risk of compromise. To
achieve this, entities must allow for a cybersecurity wage premium to
attract qualified talent. In order to prevent the loss of data, including proprietary data, personal information, trade secrets, and classified information, it is only logical to follow best practices.
Data sanitization policy best practices
A data sanitization policy must be comprehensive, covering data levels and the correlating sanitization methods, and must include all forms of media, both soft and hard copy. Categories of data should also be defined so that appropriate sanitization levels are specified under the policy and every level of data can be aligned with the appropriate sanitization method. For example, controlled unclassified information on electronic storage devices may be cleared or purged, but those devices storing secret or top secret classified materials should be physically destroyed.
Any data sanitization policy should be enforceable and show what
department and management structure has the responsibility to ensure
data is sanitized accordingly. This policy will require a high-level
management champion (typically the Chief Information Security Officer or
another C-suite equivalent) for the process and to define
responsibilities and penalties for parties at all levels. The policy champion's role includes defining concepts such as the Information System Owner and Information Owner to establish the chain of responsibility for data creation and eventual sanitization.
The CISO or other policy champion should also ensure funding is
allocated to additional cybersecurity workers to implement and enforce
policy compliance. Auditing requirements are also typically included to
prove media destruction and should be managed by these additional staff.
For small businesses and those without a broad cyber background, resources are available in the form of editable data sanitization policy templates. Many groups such as the IDSC (International Data Sanitization Consortium) provide these free of charge on their website https://www.datasanitization.org/.
Without training in data security and sanitization principles, it
is unfeasible to expect users to comply with the policy. Therefore, the
Sanitization Policy should include a matrix of instruction and
frequency by job category to ensure that users, at every level,
understand their part in complying with the policy. This task should be
easy to accomplish as most government contractors are already required
to perform annual Information Security training for all employees.
Therefore, additional content can be added to ensure data sanitization
policy compliance.
Sanitizing devices
The
primary use of data sanitization is for the complete clearing of
devices and destruction of all sensitive data once the storage device is
no longer in use or is transferred to another Information system. This is an essential stage in the Data Security Lifecycle (DSL)
and Information Lifecycle Management (ILM). Both are approaches for
ensuring privacy and data management throughout the usage of an
electronic device, as it ensures that all data is destroyed and
unrecoverable when devices reach the end of their lifecycle.
There are three main methods of data sanitization for complete
erasure of data: physical destruction, cryptographic erasure, and data
erasure.
All three erasure methods aim to ensure that deleted data cannot be
accessed even through advanced forensic methods, which maintains the
privacy of individuals’ data even after the mobile device is no longer
in use.
Physical destruction
Physical erasure involves the manual destruction of stored data. This method uses mechanical shredders or degaussers
to shred devices, such as phones, computers, hard drives, and printers,
into small individual pieces. Different data security levels require different levels of destruction.
Degaussing is most commonly used on hard disk drives
(HDDs), and involves the utilization of high energy magnetic fields to
permanently disrupt the functionality and memory storage of the device.
When data is exposed to this strong magnetic field, any memory storage
is neutralized and can not be recovered or used again. Degaussing does
not apply to solid state disks
(SSDs), as the data is not stored using magnetic methods. When particularly sensitive data is involved, it is typical to utilize processes such as paper pulping, special burns, and solid-state conversion. This ensures proper destruction of all sensitive media, including paper, hard and soft copy media, optical media, and specialized computing hardware.
Physical destruction often ensures that data is completely erased
and cannot be used again. However, the physical by-products of mechanical shredding can be damaging to the environment, although a recent trend of increasing the amount of e-waste material recovered by e-cycling has helped to minimize the environmental impact. Furthermore, once data
is physically destroyed, it can no longer be resold or used again.
Cryptographic erasure
Cryptographic erasure involves the destruction of the secure key or passphrase that is used to protect stored information. Data encryption involves the generation of a secure key that enables only authorized parties to gain access to the data that is stored. The permanent erasure of this key ensures that the private data stored can no longer be accessed. Cryptographic erasure is commonly provided by the manufacturer of the device itself, as encryption software is often built into the device. Encryption
with key erasure involves encrypting all sensitive material in a way
that requires a secure key to decrypt the information when it needs to
be used. When the information needs to be deleted, the secure key
can be erased. This provides a greater ease of use, and a speedier data
wipe, than other software methods because it involves one deletion of
secure information rather than each individual file.
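The principle can be illustrated with a short sketch using the third-party cryptography package's Fernet interface: data is only ever stored in encrypted form, so discarding every copy of the key is equivalent to erasing the data. This only demonstrates the idea at application level; self-encrypting drives implement it in firmware.

```python
# Minimal sketch of the cryptographic-erasure principle: destroying the key
# renders the remaining ciphertext useless.
from cryptography.fernet import Fernet

key = Fernet.generate_key()                     # the key protecting the data
ciphertext = Fernet(key).encrypt(b"sensitive record")

# Normal use: whoever holds the key can read the data back.
assert Fernet(key).decrypt(ciphertext) == b"sensitive record"

# Cryptographic erasure: discard every copy of the key. The ciphertext may
# remain on the medium, but it can no longer be decrypted.
key = None
```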
Cryptographic erasure is often used for data storage that does
not contain as much private information since there is a possibility
that errors can occur due to manufacturing failures or human error
during the process of key destruction. This creates a wider range of
possible results of data erasure. This method allows for data to
continue to be stored on the device and does not require that the device
be completely erased. This way, the device can be resold again to
another individual or company since the physical integrity of the device
itself is maintained. However, this assumes that the level of data encryption on the device is resistant to future attacks. For instance, a hard drive utilizing cryptographic erasure with a 128-bit AES key may be secure now, but in five years it may be common to break this level of encryption. Therefore, the level of data security should be declared in a data sanitization policy to future-proof the process.
Data erasure
The process of data erasure involves masking all information at the byte level through the insertion of random 0s and 1s on all sectors of the electronic equipment that is no longer in use.
This software-based method ensures that all previously stored data is completely hidden and unrecoverable, which ensures full data sanitization. The efficacy and accuracy of this sanitization method can also be analyzed through auditable reports.
Data erasure often ensures complete sanitization while also
maintaining the physical integrity of the electronic equipment so that
the technology can be resold or reused. This ability to recycle
technological devices makes data erasure a more environmentally sound
version of data sanitization. This method is also the most accurate and
comprehensive since the efficacy of the data masking can be tested
afterwards to ensure complete deletion. However, data erasure through
software based mechanisms requires more time compared to other methods.
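A minimal sketch of the overwrite idea is shown below: a file's contents are overwritten in place with random bytes and the file is then removed. Commercial erasure products operate on whole devices, may perform multiple passes, and produce audit reports; note also that on SSDs the flash translation layer can leave older copies of the data in place, so file-level overwriting is not a guarantee there. The file name is a placeholder.

```python
# Minimal sketch of software-based data erasure: overwrite a file in place
# with random bytes, flush it to the device, then remove it.
import os

def overwrite_and_delete(path: str, block: int = 1024 * 1024) -> None:
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        written = 0
        while written < size:
            chunk = os.urandom(min(block, size - written))
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the overwrite down to the storage device
    os.remove(path)

overwrite_and_delete("old_customer_export.csv")  # placeholder file name
```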
Secure erase
A number of storage device command sets include a command that, when passed to the device, causes it to perform a built-in sanitization procedure. The following command sets define such a standard command:
ATA (including SATA) defines a Security Erase command. Two levels of thoroughness are defined.
SCSI (including SAS and other physical connections) defines a SANITIZE command.
Opal Storage Specification specifies a command set for self-encrypting drives and cryptographic erase, available in addition to command-set methods.
The drive usually performs fast cryptographic erasure when data is
encrypted, and a slower data erasure by overwriting otherwise. SCSI allows for asking for a specific type of erasure.
If implemented correctly, the built-in sanitization feature is
sufficient to render data unrecoverable. The NIST approves of the use of
this feature.
There have been a few reported instances of failures to erase some or
all data due to buggy firmware, sometimes readily apparent in a sector
editor.
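A hedged sketch of invoking such a built-in erase from a host is shown below, using the hdparm utility to issue the ATA Security Erase sequence (a temporary password is set, then referenced by the erase command). The flags shown are those commonly documented for hdparm and should be verified against the installed version; the operation is destructive and the device path is a placeholder.

```python
# Hedged sketch: trigger the ATA Security Erase feature via hdparm.
# Flags as commonly documented; verify against your hdparm version and drive
# before use -- this permanently erases the device. /dev/sdX is a placeholder.
import subprocess

device = "/dev/sdX"    # placeholder; double-check before running
password = "tmppass"   # throwaway password required by the ATA security feature

subprocess.run(
    ["hdparm", "--user-master", "u", "--security-set-pass", password, device],
    check=True,
)
subprocess.run(
    ["hdparm", "--user-master", "u", "--security-erase", password, device],
    check=True,
)
```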
Necessity of data sanitization
There has been increased usage of mobile devices, Internet of Things (IoT) technologies, cloud-based storage systems, portable electronic devices, and various other electronic methods to store sensitive information; therefore, implementing effective erasure methods once a device is no longer in use has become crucial to protecting sensitive data.
Due to the increased usage of electronic devices in general and the
increased storage of private information on these electronic devices,
the need for data sanitization has been much more urgent in recent
years.
There are also specific methods of sanitization that do not fully clean devices of private data, which can prove to be problematic. For example, some remote wiping methods on mobile devices are vulnerable to outside attacks, and their efficacy depends on the individual software system installed.
Remote wiping involves sending a wireless command to the device when it
has been lost or stolen that directs the device to completely wipe out
all data. While this method can be very beneficial, it also has several
drawbacks. For example, the remote wiping method can be manipulated by
attackers to signal the process when it is not yet necessary. This
results in incomplete data sanitization. If attackers do gain access to
the storage on the device, the user risks exposing all private
information that was stored.
Cloud computing and storage has become an increasingly popular
method of data storage and transfer. However, there are certain privacy
challenges associated with cloud computing that have not been fully
explored.
Cloud computing is vulnerable to various attacks, such as code injection, path traversal attacks, and resource depletion, because of the shared pool structure of these new techniques. These cloud storage
models require specific data sanitization methods to combat these
issues. If data is not properly removed from cloud storage models, it
opens up the possibility for security breaches at multiple levels.
Risks posed by inadequate data-set sanitization
Inadequate
data sanitization methods can result in two main problems: a breach of
private information and compromises to the integrity of the original
dataset. If data sanitization methods are unsuccessful at removing all
sensitive information, it poses the risk of leaking this information to
attackers.
Numerous studies have been conducted to optimize ways of preserving
sensitive information. Some data sanitization methods are highly sensitive to distinct points that have no closeness to other data points. This type of data sanitization is very precise and can detect anomalies even if the poisoned data point is relatively close to true data. Another method of data sanitization also removes outliers in the data, but does so in a more general way: it detects the general trend of the data, discards any data that strays from it, and is able to target anomalies even when they are inserted as a group.
In general, data sanitization techniques use algorithms to detect
anomalies and remove any suspicious points that may be poisoned data or
sensitive information.
Furthermore, data sanitization methods may remove useful,
non-sensitive information, which then renders the sanitized dataset less
useful and altered from the original. There have been iterations of
common data sanitization techniques that attempt to correct the issue of
the loss of original dataset integrity. In particular, Liu, Xuan, Wen,
and Song offered a new algorithm for data sanitization called the
Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF)
method.
There is often a lot of emphasis that is put into protecting the
privacy of users, so this method brings a new perspective that focuses
on also protecting the integrity of the data. It functions in a way that
has three main advantages: it learns to optimize the process of
sanitization by only cleaning the item with the highest conflict count,
keeps parts of the dataset with highest utility, and also analyzes the
conflict degree of the sensitive material. Robust research was conducted
on the efficacy and usefulness of this new technique to reveal the ways
that it can benefit in maintaining the integrity of the dataset. This
new technique first pinpoints the specific parts of the dataset that are possibly poisoned data and then uses algorithms to weigh the tradeoff between how useful that data is and whether it should be removed. This is a new way of data sanitization that takes into account the utility of the data before it is immediately discarded.
Applications of data sanitization
Data
sanitization methods are also implemented for privacy preserving data
mining, association rule hiding, and blockchain-based secure information
sharing. These methods involve the transfer and analysis of large
datasets that contain private information. This private information
needs to be sanitized before being made available online so that
sensitive material is not exposed. Data sanitization is used to ensure
privacy is maintained in the dataset, even when it is being analyzed.
Privacy preserving data mining
Privacy Preserving Data Mining (PPDM) is the process of data mining
while maintaining privacy of sensitive material. Data mining involves
analyzing large datasets to gain new information and draw conclusions.
PPDM has a wide range of uses and is an integral step in the transfer or
use of any large data set containing sensitive material.
Data sanitization is an integral step to privacy preserving data
mining because private datasets need to be sanitized before they can be
utilized by individuals or companies for analysis. The aim of privacy
preserving data mining is to ensure that private information cannot be
leaked or accessed by attackers and sensitive data is not traceable to
individuals that have submitted the data.
Privacy preserving data mining aims to maintain this level of privacy
for individuals while also maintaining the integrity and functionality
of the original dataset.
In order for the dataset to be used, necessary aspects of the original
data need to be protected during the process of data sanitization. This
balance between privacy and utility has been the primary goal of data
sanitization methods.
One approach to achieve this optimization of privacy and utility
is through encrypting and decrypting sensitive information using a
process called key generation.
After the data is sanitized, key generation is used to ensure that this
data is secure and cannot be tampered with. Approaches such as the
Rider optimization Algorithm (ROA), also called Randomized ROA (RROA)
use these key generation strategies to find the optimal key so that data
can be transferred without leaking sensitive information.
Some versions of key generation have also been optimized to fit
larger datasets. For example, a novel, method-based Privacy Preserving
Distributed Data Mining strategy is able to increase privacy and hide
sensitive material through key generation. This version of sanitization
allows large amounts of material to be sanitized. For companies that are
seeking to share information with several different groups, this
methodology may be preferred over original methods that take much longer
to process.
Certain models of data sanitization delete or add information to
the original database in an effort to preserve the privacy of each
subject. These heuristic-based algorithms are becoming more popular, especially in the field of association rule mining. Heuristic methods involve specific algorithms that use pattern hiding, rule hiding, and sequence hiding to keep specific information hidden. This type of data hiding can be used to cover wide patterns in data, but is not as effective for protecting specific information. Heuristic-based methods are not as well suited to sanitizing large datasets; however, recent developments in the heuristics-based field have analyzed ways to tackle this problem. An example is the MR-OVnTSA approach, a heuristics-based sensitive pattern hiding approach for big data, introduced by Shivani Sharma and Durga Toshniwal.
This approach uses a heuristics based method called the ‘MapReduce
Based Optimum Victim Item and Transaction Selection Approach’, also
called MR-OVnTSA, that aims to reduce the loss of important data while
removing and hiding sensitive information. It takes advantage of
algorithms that compare steps and optimize sanitization.
An important goal of PPDM is to strike a balance between
maintaining the privacy of users that have submitted the data while also
enabling developers to make full use of the dataset. Many measures of
PPDM directly modify the dataset and create a new version that makes the
original unrecoverable. It strictly erases any sensitive information
and makes it inaccessible for attackers.
Association rule mining
One
type of data sanitization is rule based PPDM, which uses defined
computer algorithms to clean datasets. Association rule hiding is the
process of data sanitization as applied to transactional databases.
Transactional databases are the general term for data storage used to
record transactions as organizations conduct their business. Examples
include shipping payments, credit card payments, and sales orders. One survey analyzes fifty-four different methods of data sanitization and presents four major findings about their trends.
Certain new methods of data sanitization rely on machine and deep learning. There are various weaknesses in the current use of data sanitization: many methods are not intricate or detailed enough to protect against more specific data attacks.
This effort to maintain privacy while mining important data is referred to as privacy-preserving data mining. Machine learning develops methods
that are more adapted to different types of attacks and can learn to
face a broader range of situations. Deep learning
is able to simplify the data sanitization methods and run these
protective measures in a more efficient and less time consuming way.
There have also been hybrid models that utilize both rule-based and machine or deep learning methods to achieve a balance between the two techniques.
Blockchain-based secure information sharing
Browser
backed cloud storage systems are heavily reliant on data sanitization
and are becoming an increasingly popular route of data storage.
Furthermore, the ease of usage is important for enterprises and
workplaces that use cloud storage for communication and collaboration.
Blockchain
is used to record and transfer information in a secure way and data
sanitization techniques are required to ensure that this data is
transferred more securely and accurately. It is especially applicable for those working in supply chain management and may be useful for those looking to optimize the supply chain process. For example, the Whale Optimization Algorithm (WOA) uses a method of
secure key generation to ensure that information is shared securely
through the blockchain technique.
The need to improve blockchain methods is becoming increasingly
relevant as the global level of development increases and becomes more
electronically dependent.
Industry specific applications
Healthcare
The
healthcare industry is an important sector that relies heavily on data
mining and use of datasets to store confidential information about
patients. The use of electronic storage has also been increasing in
recent years, which requires more comprehensive research and
understanding of the risks that it may pose. Currently, data mining and
storage techniques are only able to store limited amounts of
information. This reduces the efficacy of data storage and increases the
costs of storing data. New advanced methods of storing and mining data
that involve cloud based systems are becoming increasingly popular as
they are able to both mine and store larger amounts of information.