
Saturday, September 21, 2024

Software cracking

From Wikipedia, the free encyclopedia

Software cracking (known as "breaking" mostly in the 1980s) is the act of removing copy protection from software. Copy protection can be removed by applying a specific crack, where a crack can mean any tool that breaks software protection, a stolen product key, or a guessed password. Cracking software generally involves circumventing licensing and usage restrictions on commercial software by illegal methods. These methods can include modifying code directly through disassembling and bit editing, sharing stolen product keys, or developing software to generate activation keys. Examples of cracks include applying a patch, creating reverse-engineered serial number generators known as keygens to bypass software registration and payments, and converting a trial/demo version of the software into fully functioning software without paying for it. Software cracking contributes to the rise of online piracy, where pirated software is distributed to end users through filesharing networks like BitTorrent, one-click hosting (OCH) sites, or Usenet downloads, or through bundles of the original software with cracks or keygens.

Some of these tools are called keygens, patches, loaders, or no-disc cracks. A keygen is a handmade product serial number generator that often offers the ability to generate working serial numbers in your own name. A patch is a small computer program that modifies the machine code of another program. This has the advantage for a cracker of not having to include a large executable in a release when only a few bytes are changed. A loader modifies the startup flow of a program; it does not remove the protection but circumvents it. A well-known example of a loader is a trainer used to cheat in games. Fairlight pointed out in one of their .nfo files that these types of cracks are not allowed for warez scene game releases. Nukewars have shown that, for a release to count as a valid crack, the protection must not kick in at any point.

Software cracking is closely related to reverse engineering, because the process of attacking a copy protection technology is similar to the process of reverse engineering. The distribution of cracked copies is illegal in most countries, and there have been lawsuits over cracking software, though it might be legal to use cracked software in certain circumstances. Educational resources for reverse engineering and software cracking are, however, legal and available in the form of crackme programs.

History

Software is inherently expensive to produce but cheap to duplicate and distribute. Therefore, software producers generally tried to implement some form of copy protection before releasing software to the market. In 1984, Laind Huntsman, the head of software development for Formaster, a software protection company, commented that "no protection system has remained uncracked by enterprising programmers for more than a few months". In 2001, Dan S. Wallach, a professor at Rice University, argued that "those determined to bypass copy-protection have always found ways to do so – and always will".

Most early software crackers were computer hobbyists who often formed groups that competed against each other in cracking and spreading software. Breaking a new copy protection scheme as quickly as possible was often regarded as an opportunity to demonstrate one's technical superiority rather than as a money-making possibility. Software crackers usually did not benefit materially from their actions; their motivation was the challenge itself of removing the protection. Some low-skilled hobbyists would take already cracked software and edit various unencrypted strings of text in it to change the messages a game would show a player, often to something considered vulgar. Uploading the altered copies on file-sharing networks provided a source of laughs for adult users.

The cracker groups of the 1980s started to advertise themselves and their skills by attaching animated screens known as crack intros to the software programs they cracked and released. Once the technical competition had expanded from the challenges of cracking to the challenges of creating visually stunning intros, the foundations for a new subculture known as the demoscene were established. The demoscene started to separate itself from the illegal "warez scene" during the 1990s and is now regarded as a completely different subculture. Many software crackers have since grown into extremely capable software reverse engineers; the deep knowledge of assembly required to crack protections enables them to reverse engineer drivers in order to port them from binary-only drivers for Windows to drivers with source code for Linux and other free operating systems. Also, because music and intros were such an integral part of gaming, the associated music formats and graphics became very popular when hardware became affordable for home users.

With the rise of the Internet, software crackers developed secretive online organizations. In the latter half of the nineties, one of the most respected sources of information about "software protection reversing" was Fravia's website.

In 2017, a group of software crackers started a project to preserve Apple II software by removing the copy protection.

+HCU

The High Cracking University (+HCU) was founded by Old Red Cracker (+ORC), considered a genius of reverse engineering and a legendary figure in Reverse Code Engineering (RCE), to advance research into RCE. He also taught and authored many papers on the subject, and his texts are considered classics in the field and mandatory reading for students of RCE.

The addition of the "+" sign in front of a reverser's nickname signified membership in the +HCU. Among the students of +HCU were the top elite Windows reversers worldwide. +HCU published a new reverse engineering problem annually, and a small number of respondents with the best replies qualified for an undergraduate position at the university.

+Fravia was a professor at +HCU. Fravia's website was known as "+Fravia's Pages of Reverse Engineering" and he used it to challenge programmers as well as the wider society to "reverse engineer" the "brainwashing of a corrupt and rampant materialism". In its heyday, his website received millions of visitors per year and its influence was "widespread". On his site, +Fravia also maintained a database of the tutorials generated by +HCU students for posterity.

Nowadays most of the graduates of +HCU have migrated to Linux and few have remained as Windows reversers. The information at the university has been rediscovered by a new generation of researchers and practitioners of RCE who have started new research projects in the field.

Methods

The most common software crack is the modification of an application's binary to cause or prevent a specific key branch in the program's execution. This is accomplished by reverse engineering the compiled program code using a debugger such as SoftICE, OllyDbg, GDB, or MacsBug until the software cracker reaches the subroutine that contains the primary method of protecting the software (or by disassembling an executable file with a program such as IDA). The binary is then modified using the debugger or a hex editor such as HIEW or monitor in a manner that replaces a prior branching opcode with its complement or a NOP opcode so the key branch will either always execute a specific subroutine or skip over it. Almost all common software cracks are a variation of this type. A region of code that must not be entered is often called a "bad boy" while one that should be followed is a "good boy".
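To make the mechanics concrete, here is a minimal, hypothetical sketch in Python of the patching step described above. The file name, offset, and choice of replacement instruction are illustrative assumptions; in a real crack the offset would be found with a debugger or disassembler, as noted.

```python
# Minimal sketch: patch a conditional branch in a copy of a binary.
# The file name and offset are hypothetical -- in practice they are
# located with a debugger or disassembler as described above.

PATCH_OFFSET = 0x1A2B      # hypothetical file offset of the key branch
JE_OPCODE    = 0x74        # x86 "jump if equal" (short form)
JMP_OPCODE   = 0xEB        # x86 unconditional short jump
NOP_OPCODE   = 0x90        # x86 no-operation (alternative patch)

with open("target.exe", "rb") as f:
    data = bytearray(f.read())

if data[PATCH_OFFSET] == JE_OPCODE:
    # Invert the branch (JE -> JMP) so the "good boy" path always runs;
    # NOPing the branch out instead would make execution fall through.
    data[PATCH_OFFSET] = JMP_OPCODE
    with open("target_patched.exe", "wb") as f:
        f.write(data)
    print("patched")
else:
    print("unexpected byte: wrong offset or already patched")
```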

Proprietary software developers are constantly developing techniques such as code obfuscation, encryption, and self-modifying code to make binary modification increasingly difficult. Even with these measures, developers struggle to combat software cracking, because it is common for a skilled cracker to publicly release a simple cracked EXE or installer for download, eliminating the need for inexperienced users to crack the software themselves.

A specific example of this technique is a crack that removes the expiration period from a time-limited trial of an application. These cracks are usually programs that alter the program executable, and sometimes the .dll or .so files linked to the application; the process of altering the original binary files is called patching. Similar cracks are available for software that requires a hardware dongle. A company can also break the copy protection of programs that it has legally purchased but that are licensed to particular hardware, so that there is no risk of downtime due to hardware failure (and, of course, no need to restrict itself to running the software on the purchased hardware only).

Another method is the use of special software such as CloneCD to scan for the use of a commercial copy protection application. After discovering the software used to protect the application, another tool may be used to remove the copy protection from the software on the CD or DVD. This may enable another program such as Alcohol 120%, CloneDVD, Game Jackal, or Daemon Tools to copy the protected software to a user's hard disk. Popular commercial copy protection applications which may be scanned for include SafeDisc and StarForce.

In other cases, it might be possible to decompile a program in order to get access to the original source code, or to code on a level higher than machine code. This is often possible with scripting languages and languages utilizing JIT compilation. An example is cracking (or debugging) on the .NET platform, where one might consider manipulating CIL to achieve one's needs. Java bytecode works in a similar fashion: there is an intermediate language before the program is compiled into platform-dependent machine code.
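Python's own bytecode illustrates the same point. The standard library's dis module disassembles a function to its intermediate form, where a hypothetical license check remains far more legible than native machine code would be:

```python
import dis

def check_license(key: str) -> bool:
    # Hypothetical trial check: the comparison below remains clearly
    # visible in the intermediate bytecode.
    return key == "SECRET-1234"

dis.dis(check_license)
# The output shows LOAD_CONST 'SECRET-1234' followed by a compare --
# the constant and the branch are far easier to find here than in
# optimized machine code.
```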

Advanced reverse engineering for protections such as SecuROM, SafeDisc, StarForce, or Denuvo requires a cracker, or many crackers, to spend much more time studying the protection, eventually finding every flaw within the protection code, and then coding their own tools to "unwrap" the protection automatically from executable (.EXE) and library (.DLL) files.

There are a number of sites on the Internet that let users download cracks produced by warez groups for popular games and applications (although at the risk of acquiring malicious software that is sometimes distributed via such sites). Although these cracks are used by legal buyers of software, they can also be used by people who have downloaded or otherwise obtained unauthorized copies (often through P2P networks).

Software piracy

Software cracking led to the distribution of pirated software around the world (software piracy). It was estimated that the United States lost US$2.3 billion in business application software in 1996. Piracy rates were especially high in African, Asian, East European, and Latin American countries. In certain countries, such as Indonesia, Pakistan, Kuwait, China, and El Salvador, 90% of the software used was pirated.

Disk image

From Wikipedia, the free encyclopedia

A disk image is a snapshot of a storage device's structure and data, typically stored in one or more computer files on another storage device.

Traditionally, disk images were bit-by-bit copies of every sector on a hard disk, often created for digital forensic purposes, but it is now common to copy only allocated data to reduce storage space. Compression and deduplication are commonly used to reduce the size of the image file set.

Disk imaging is done for a variety of purposes, including digital forensics, cloud computing, system administration, backup, and legacy emulation as part of a digital preservation strategy. Disk images can be made in a variety of formats depending on the purpose. Virtual disk images (such as VHD and VMDK) are intended for cloud computing, ISO images are intended to emulate optical media, and raw disk images are used for forensic purposes. Proprietary formats are typically used by disk imaging software.

Despite the benefits of disk imaging, storage costs can be high, management can be difficult, and images can be time-consuming to create.

Background

Disk images were originally (in the late 1960s) used for backup and disk cloning of mainframe disk media. Early ones were as small as 5 megabytes and as large as 330 megabytes, and the copy medium was magnetic tape, which ran as large as 200 megabytes per reel. Disk images became much more popular when floppy disk media became popular, where replication or storage of an exact structure was necessary and efficient, especially in the case of copy protected floppy disks.

Disk image creation is called disk imaging and is often time-consuming, even with a fast computer, because the entire disk must be copied. Typically, disk imaging requires third-party disk imaging or backup software; the software required varies according to the type of disk image to be created. For example, RawWrite and WinImage create floppy disk image files for MS-DOS and Microsoft Windows. On Unix and similar systems, the dd program can be used to create raw disk images. Apple Disk Copy can be used on Classic Mac OS and macOS systems to create and write disk image files.
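As an illustration of what raw imaging tools like dd do, the following Python sketch copies a source device block by block into an image file. The device path is an example, and reading raw devices requires administrative privileges:

```python
# Minimal sketch of raw disk imaging, in the spirit of dd:
# copy the source device block by block into an image file.

SRC = "/dev/sdb"          # example source device (Unix-like systems)
DST = "sdb.img"           # raw image file to create
BLOCK = 1024 * 1024       # 1 MiB per read

with open(SRC, "rb") as src, open(DST, "wb") as dst:
    while True:
        chunk = src.read(BLOCK)
        if not chunk:          # end of device reached
            break
        dst.write(chunk)
```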

Authoring software for CDs/DVDs such as Nero Burning ROM can generate and load disk images for optical media. A virtual disk writer or virtual burner is a computer program that emulates an actual disc authoring device such as a CD or DVD writer. Instead of writing data to an actual disc, it creates a virtual disk image. A virtual burner, by definition, appears as a disc drive with writing capabilities in the system (as opposed to conventional disc authoring programs that can create virtual disk images), thus allowing software that can burn discs to create virtual discs.

Uses

Digital forensics

Forensic imaging is the process of creating a bit-by-bit copy of the data on a drive, including files, metadata, volume information, filesystems and their structure. Often, these images are also hashed to verify their integrity and prove that they have not been altered since being created. Unlike disk imaging for other purposes, digital forensic applications take a bit-by-bit copy to ensure forensic soundness. The purpose of imaging the disk is not only to discover evidence preserved in digital information but also to examine the drive for clues about how the crime was committed.
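A minimal sketch of the integrity-hashing step, using Python's standard hashlib; the image file name is an example. The digest recorded at acquisition time can be compared against a later re-hash to show the image is unchanged:

```python
import hashlib

def image_digest(path: str, block: int = 1 << 20) -> str:
    """Hash a disk image in chunks so arbitrarily large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            h.update(chunk)
    return h.hexdigest()

# Record the digest when the image is acquired; re-hashing later and
# comparing the values proves the image has not been altered since.
print(image_digest("evidence.img"))
```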

Virtualization

Creating a virtual disk image of optical media or a hard disk drive is typically done to make the content available to one or more virtual machines. Virtual machines emulate a CD/DVD drive by reading an ISO image. This can also be faster than reading from the physical optical medium, and there are fewer issues with wear and tear. A hard disk drive or solid-state drive in a virtual machine is implemented as a disk image (e.g., the VHD format used by Microsoft's Hyper-V, the VDI format used by Oracle's VirtualBox, the VMDK format used for VMware virtual machines, or the QCOW format used by QEMU). Virtual hard disk images tend to be stored either as a collection of files (where each one is typically 2 GB in size) or as a single file. Virtual machines treat the image set as a physical drive.

Rapid deployment of systems

Educational institutions and businesses often need to buy or replace computer systems in large numbers. Disk imaging is commonly used to rapidly deploy the same configuration across workstations. Disk imaging software is used to create an image of a completely configured system (such an image is sometimes called a golden image). This image is then written to each computer's hard disk (which is sometimes described as restoring an image).

Network-based image deployment

Image restoration can be done using network-based image deployment. This method uses a PXE server to boot, over a computer network, an operating system that contains the necessary components to image or restore storage media in a computer. It is usually used in conjunction with a DHCP server to automate the configuration of network parameters, including IP addresses. Multicasting, broadcasting, or unicasting tend to be used to restore an image to many computers simultaneously. These approaches do not work well if one or more computers experience packet loss, so some imaging solutions use the BitTorrent protocol to overcome the problem.

Network-based image deployment reduces the need to maintain and update individual systems manually. Imaging is also easier than automated setup methods because an administrator does not need to have knowledge of the prior configuration to copy it.

Backup strategy

A disk image contains all files and data (i.e., file attributes and the file fragmentation state). For this reason, it is also used for backing up optical media (CDs, DVDs, etc.), and it allows exact and efficient recovery after experimenting with modifications to a system or virtual machine. Disk imaging can typically be used to quickly restore an entire system to an operational state after a disaster.

Digital preservation

Libraries and museums are typically required to archive and digitally preserve information without altering it in any manner. Emulators frequently use disk images to emulate floppy disks that have been preserved. This is usually simpler to program than accessing a real floppy drive (particularly if the disks are in a format not supported by the host operating system), and allows a large library of software to be managed. Emulation also allows existing disk images to be put into a usable form even though the data contained in the image is no longer readable without emulation.

Limitations

Disk imaging is time-consuming, its space requirements are high, and reading from an image can be slower than reading from the disk directly because of performance overhead.

Another limitation can be lack of access to the software required to read the contents of the image; for example, prior to Windows 8, third-party software was required to mount disk images. When imaging multiple computers with only minor differences, much data is duplicated unnecessarily, wasting space.

Speed and failure

Disk imaging can be slow, especially for older storage devices. A typical 4.7 GB DVD can take an average of 18 minutes to duplicate. Floppy disks read and write much more slowly than hard disks, so despite their small size, it can take several minutes to copy a single disk. In some cases, disk imaging can fail due to bad sectors or physical wear and tear on the source device. Unix utilities such as dd are not designed to cope with such failures, causing the disk image creation process to fail. When data recovery is the end goal, it is recommended to use more specialised tools instead (such as ddrescue).

Data recovery

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Data_recovery

In computing, data recovery is the process of retrieving deleted, inaccessible, lost, corrupted, damaged, or formatted data from secondary storage, removable media, or files, when the data stored in them cannot be accessed in the usual way. The data is most often salvaged from storage media such as internal or external hard disk drives (HDDs), solid-state drives (SSDs), USB flash drives, magnetic tapes, CDs, DVDs, RAID subsystems, and other electronic devices. Recovery may be required due to physical damage to the storage devices or logical damage to the file system that prevents it from being mounted by the host operating system (OS).

Logical failures occur when the hard drive devices are functional but the user or automated OS cannot retrieve or access the data stored on them. Logical failures can occur due to corruption of the engineering chip, lost partitions, firmware failure, or failures during formatting or re-installation.

Data recovery can range from a simple task to a substantial technical challenge, which is why specific software companies specialize in this field.

About

The most common data recovery scenarios involve an operating system failure, malfunction of a storage device, logical failure of storage devices, accidental damage or deletion, etc. (typically on a single-drive, single-partition, single-OS system), in which case the ultimate goal is simply to copy all important files from the damaged media to another new drive. This can be accomplished by booting from a Live CD, DVD, or USB drive instead of the corrupted drive in question. Many Live CDs and DVDs provide a means to mount the system drive and backup drives or removable media, and to move the files from the system drive to the backup media with a file manager or optical disc authoring software. Such cases can often be mitigated by disk partitioning and consistently storing valuable data files (or copies of them) on a different partition from the replaceable OS system files.

Another scenario involves a drive-level failure, such as a compromised file system or drive partition, or a hard disk drive failure. In any of these cases, the data is not easily read from the media. Depending on the situation, solutions range from repairing the logical file system, partition table, or master boot record, or updating the firmware, through software-based recovery of corrupted data and hardware- and software-based recovery of damaged service areas (also known as the hard disk drive's "firmware"), to hardware replacement on a physically damaged drive, which allows the extraction of data to a new drive. If drive recovery is necessary, the drive itself has typically failed permanently, and the focus is on a one-time recovery, salvaging whatever data can be read.

In a third scenario, files have been accidentally "deleted" from a storage medium by the user. Typically, the contents of deleted files are not removed immediately from the physical drive; instead, references to them in the directory structure are removed, and the space they occupy is made available for later overwriting. To end users, deleted files are not discoverable through a standard file manager, but the deleted data still technically exists on the physical drive. In the meantime, the original file contents remain, often in several disconnected fragments, and may be recoverable if not overwritten by other data.

The term "data recovery" is also used in the context of forensic applications or espionage, where data which have been encrypted, hidden, or deleted, rather than damaged, are recovered. Sometimes data present in the computer gets encrypted or hidden due to reasons like virus attacks which can only be recovered by some computer forensic experts.

Physical damage

A wide variety of failures can cause physical damage to storage media, which may result from human errors and natural disasters. CD-ROMs can have their metallic substrate or dye layer scratched off; hard disks can suffer from a multitude of mechanical failures, such as head crashes, PCB failure, and failed motors; tapes can simply break.

Physical damage to a hard drive, even in cases where a head crash has occurred, does not necessarily mean there will be a permanent loss of data. The techniques employed by many professional data recovery companies can typically salvage most, if not all, of the data that had been lost when the failure occurred.

Of course, there are exceptions to this, such as cases where severe damage to the hard drive platters may have occurred. However, if the hard drive can be repaired and a full image or clone created, then the logical file structure can be rebuilt in most instances.

Most physical damage cannot be repaired by end users. For example, opening a hard disk drive in a normal environment can allow airborne dust to settle on the platter and become caught between the platter and the read/write head. During normal operation, read/write heads float 3 to 6 nanometers above the platter surface, and the average dust particles found in a normal environment are typically around 30,000 nanometers in diameter. When these dust particles get caught between the read/write heads and the platter, they can cause new head crashes that further damage the platter and thus compromise the recovery process. Furthermore, end users generally do not have the hardware or technical expertise required to make these repairs. Consequently, data recovery companies are often employed to salvage important data with the more reputable ones using class 100 dust- and static-free cleanrooms.

Recovery techniques

Recovering data from physically damaged hardware can involve multiple techniques. Some damage can be repaired by replacing parts in the hard disk. This alone may make the disk usable, but there may still be logical damage. A specialized disk-imaging procedure is used to recover every readable bit from the surface. Once this image is acquired and saved on a reliable medium, the image can be safely analyzed for logical damage and will possibly allow much of the original file system to be reconstructed.

Hardware repair

Media that has suffered a catastrophic electronic failure requires data recovery in order to salvage its contents.

A common misconception is that a damaged printed circuit board (PCB) may be simply replaced during recovery procedures by an identical PCB from a healthy drive. While this may work in rare circumstances on hard disk drives manufactured before 2003, it will not work on newer drives. Electronics boards of modern drives usually contain drive-specific adaptation data (generally a map of bad sectors and tuning parameters) and other information required to properly access data on the drive. Replacement boards often need this information to effectively recover all of the data. The replacement board may need to be reprogrammed. Some manufacturers (Seagate, for example) store this information on a serial EEPROM chip, which can be removed and transferred to the replacement board.

Each hard disk drive has what is called a system area or service area; this portion of the drive, which is not directly accessible to the end user, usually contains the drive's firmware and adaptive data that help the drive operate within normal parameters. One function of the system area is to log defective sectors within the drive, essentially telling the drive where it can and cannot write data.

The sector lists are also stored on various chips attached to the PCB, and they are unique to each hard disk drive. If the data on the PCB do not match what is stored on the platter, then the drive will not calibrate properly. In most cases the drive heads will click because they are unable to find the data matching what is stored on the PCB.

Logical damage


The term "logical damage" refers to situations in which the error is not a problem in the hardware and requires software-level solutions.

Corrupt partitions and file systems, media errors

In some cases, data on a hard disk drive can be unreadable due to damage to the partition table or file system, or to (intermittent) media errors. In the majority of these cases, at least a portion of the original data can be recovered by repairing the damaged partition table or file system using specialized data recovery software such as TestDisk; software like ddrescue can image media despite intermittent errors, and can image raw data when there is partition table or file system damage. This type of data recovery can be performed by people without expertise in drive hardware, as it requires no special physical equipment or access to platters.

Sometimes data can be recovered using relatively simple methods and tools; more serious cases can require expert intervention, particularly if parts of files are irrecoverable. Data carving is the recovery of parts of damaged files using knowledge of their structure.
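As a rough illustration of data carving, the sketch below scans a raw image for JPEG files by their structural markers alone, ignoring the file system. File names are examples, and real carvers also handle fragmentation, which this does not:

```python
# Minimal sketch of data carving: find JPEGs in a raw image by their
# start-of-image (FF D8 FF) and end-of-image (FF D9) markers, with no
# reference to any file system. Contiguous files only.

SOI, EOI = b"\xff\xd8\xff", b"\xff\xd9"

with open("disk.img", "rb") as f:
    data = f.read()

pos, count = 0, 0
while (start := data.find(SOI, pos)) != -1:
    end = data.find(EOI, start)
    if end == -1:
        break
    with open(f"carved_{count}.jpg", "wb") as out:
        out.write(data[start:end + 2])   # include the EOI marker
    count += 1
    pos = end + 2
print(f"carved {count} candidate JPEGs")
```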

Overwritten data

After data has been physically overwritten on a hard disk drive, it is generally assumed that the previous data are no longer possible to recover. In 1996, Peter Gutmann, a computer scientist, presented a paper that suggested overwritten data could be recovered through the use of magnetic force microscopy. In 2001, he presented another paper on a similar topic. To guard against this type of data recovery, Gutmann and Colin Plumb designed a method of irreversibly scrubbing data, known as the Gutmann method and used by several disk-scrubbing software packages.

Substantial criticism has followed, primarily concerning the lack of any concrete examples of significant amounts of overwritten data being recovered. Gutmann's article contains a number of errors and inaccuracies, particularly regarding how data is encoded and processed on hard drives. Although Gutmann's theory may be correct, there is no practical evidence that overwritten data can be recovered, and research supports the conclusion that it cannot.

Solid-state drives (SSDs) overwrite data differently from hard disk drives (HDDs), which makes at least some of their data easier to recover. Most SSDs use flash memory to store data in pages and blocks, referenced by logical block addresses (LBAs) which are managed by the flash translation layer (FTL). When the FTL modifies a sector, it writes the new data to another location and updates the map so the new data appears at the target LBA. This leaves the pre-modification data in place, possibly across many generations, and recoverable by data recovery software.
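The following toy Python model (an illustrative assumption, not any vendor's actual FTL) shows why this happens: each write to the same LBA lands on a fresh physical page, leaving the stale page behind until garbage collection:

```python
# Toy model of a flash translation layer (FTL): writes to the same LBA
# go to fresh physical pages, and old pages keep stale copies until
# garbage collection -- which is why pre-modification data may remain
# recoverable.

class ToyFTL:
    def __init__(self):
        self.pages = []        # physical pages, append-only until GC
        self.mapping = {}      # LBA -> index of current physical page

    def write(self, lba: int, data: bytes):
        self.pages.append(data)              # always a fresh page
        self.mapping[lba] = len(self.pages) - 1

    def read(self, lba: int) -> bytes:
        return self.pages[self.mapping[lba]]

ftl = ToyFTL()
ftl.write(0, b"old secret")
ftl.write(0, b"new data")    # LBA 0 now maps to the new page...
print(ftl.read(0))           # b'new data'
print(ftl.pages[0])          # b'old secret' -- still physically present
```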

Lost, deleted, and formatted data

Sometimes, data on physical drives (internal/external hard disks, pen drives, etc.) is lost, deleted, or formatted due to circumstances such as a virus attack, accidental deletion, or accidental use of Shift+Delete. In these cases, data recovery software is used to recover or restore the data files.

Logical bad sector

Among the logical failures of hard disks, a logical bad sector is the most common fault leaving data unreadable. It is sometimes possible to sidestep error detection even in software, and perhaps, with repeated reading and statistical analysis, to recover at least some of the underlying stored data. Sometimes prior knowledge of the data stored and of the error detection and correction codes can be used to recover even erroneous data. However, if the underlying physical drive is degraded badly enough, at least the hardware surrounding the data must be replaced, or it might even be necessary to apply laboratory techniques to the physical recording medium. Each of these approaches is progressively more expensive, and as such progressively more rarely sought.
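A minimal sketch of the "repeated reading and statistical analysis" idea: re-read the marginal sector several times and take a bitwise majority vote, on the assumption that read errors are random while the underlying signal is not. The sample reads are illustrative:

```python
# Majority-vote recovery across repeated reads of the same sector:
# at each byte offset, keep the most common value seen.

from collections import Counter

def majority_vote(reads: list[bytes]) -> bytes:
    assert reads and all(len(r) == len(reads[0]) for r in reads)
    result = bytearray()
    for i in range(len(reads[0])):
        # Most common byte value at this offset across all reads.
        result.append(Counter(r[i] for r in reads).most_common(1)[0][0])
    return bytes(result)

# Example: three noisy reads of the same (illustrative) 4-byte sector.
reads = [b"\x10\x20\x30\x40", b"\x10\x21\x30\x40", b"\x10\x20\x30\x41"]
print(majority_vote(reads).hex())   # "10203040"
```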

Eventually, if the final, physical storage medium has indeed been disturbed badly enough, recovery will not be possible using any means; the information has irreversibly been lost.

Remote data recovery

Recovery experts do not always need to have physical access to the damaged hardware. When the lost data can be recovered by software techniques, they can often perform the recovery using remote access software over the Internet, LAN or other connection to the physical location of the damaged media. The process is essentially no different from what the end user could perform by themselves.

Remote recovery requires a stable connection with adequate bandwidth. However, it is not applicable where access to the hardware is required, as in cases of physical damage.

Four phases of data recovery

Usually, there are four phases when it comes to successful data recovery, though that can vary depending on the type of data corruption and recovery required.

Phase 1
Repair the hard disk drive
The hard drive is repaired in order to get it running in some form, or at least in a state suitable for reading the data from it. For example, if heads are bad they need to be changed; if the PCB is faulty then it needs to be fixed or replaced; if the spindle motor is bad the platters and heads should be moved to a new drive.
Phase 2
Image the drive to a new drive or a disk image file
When a hard disk drive fails, getting the data off the drive is the top priority. The longer a faulty drive is used, the more likely further data loss is to occur. Creating an image of the drive ensures that there is a secondary copy of the data on another device, on which it is safe to perform testing and recovery procedures without harming the source.
Phase 3
Logical recovery of files, partition, MBR and filesystem structures
Once the drive has been cloned to a new drive, it is suitable to attempt retrieval of the lost data. If the drive failed logically, there are a number of possible causes. Using the clone, it may be possible to repair the partition table or master boot record (MBR) in order to read the file system's data structure and retrieve stored data.
Phase 4
Repair damaged files that were retrieved
Data damage can be caused when, for example, a file is written to a sector of the drive that has been damaged. This is the most common cause of damage in a failing drive, and it means the data needs to be reconstructed to become readable. Corrupted documents can be recovered by several software methods or by manually reconstructing the document with a hex editor.

Restore disk

The Windows operating system can be reinstalled on a computer that is already licensed for it. The reinstallation can be done by downloading the operating system or by using a "restore disk" provided by the computer manufacturer. Eric Lundgren was fined and sentenced to U.S. federal prison in April 2018 for producing 28,000 restore disks and intending to distribute them for about 25 cents each as a convenience to computer repair shops.

List of data recovery software

Bootable

Data recovery cannot always be done on a running system. As a result, a boot disk, live CD, live USB, or other type of live distro containing a minimal operating system is often used.

Imaging tools

  • Clonezilla: a free disk cloning, disk imaging, data recovery, and deployment boot disk
  • dd: common byte-by-byte cloning tool found on Unix-like systems
  • ddrescue: an open-source tool similar to dd but with the ability to skip over and subsequently retry bad blocks on failing storage devices
  • Team Win Recovery Project: a free and open-source recovery system for Android devices

Data sanitization

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Data_sanitization

Data sanitization involves the secure and permanent erasure of sensitive data from datasets and media to guarantee that no residual data can be recovered even through extensive forensic analysis. Data sanitization has a wide range of applications but is mainly used for clearing out end-of-life electronic devices or for the sharing and use of large datasets that contain sensitive information. The main strategies for erasing personal data from devices are physical destruction, cryptographic erasure, and data erasure. While the term data sanitization may lead some to believe that it only includes data on electronic media, the term also broadly covers physical media, such as paper copies. These data types are termed soft for electronic files and hard for physical media paper copies. Data sanitization methods are also applied for the cleaning of sensitive data, such as through heuristic-based methods, machine-learning based methods, and k-source anonymity.

This erasure is necessary as an increasing amount of data is moving to online storage, which poses a privacy risk should a device be resold to another individual. The importance of data sanitization has risen in recent years as private information is increasingly stored in electronic format and larger, more complex datasets are used to distribute private information. Electronic storage has expanded, enabling more private data to be stored, and therefore requires more advanced and thorough data sanitization techniques to ensure that no data is left on a device once it is no longer in use. Technological tools that enable the transfer of large amounts of data also allow more private data to be shared. Especially with the increasing popularity of cloud-based information sharing and storage, data sanitization methods that ensure all shared data is cleaned have become a significant concern. It is therefore only sensible that governments and private industry create and enforce data sanitization policies to prevent data loss or other security incidents.

Data sanitization policy in public and private sectors

While the practice of data sanitization is common knowledge in most technical fields, it is not consistently understood across all levels of business and government. Thus, a comprehensive data sanitization policy is needed in government contracting and private industry to avoid the possible loss of data, leaking of state secrets to adversaries, disclosure of proprietary technologies, and possibly being barred from contract competition by government agencies.


In an increasingly connected world, it has become even more critical that governments, companies, and individuals follow specific data sanitization protocols to ensure that the confidentiality of information is sustained throughout its lifecycle. This step is critical to the core information security triad of Confidentiality, Integrity, and Availability. The CIA triad is especially relevant to those who operate as government contractors or handle other sensitive private information. To this end, government contractors must follow specific data sanitization policies and use these policies to enforce the National Institute of Standards and Technology's recommended guidelines for media sanitization, covered in NIST Special Publication 800-88. This is especially relevant for any government work involving CUI (Controlled Unclassified Information) or above, and is required by DFARS Clause 252.204-7012, Safeguarding Covered Defense Information and Cyber Incident Reporting. While private industry may not be required to follow NIST 800-88 standards for data sanitization, it is typically considered a best practice across industries with sensitive data. To further compound the issue, the ongoing shortage of cyber specialists and confusion about proper cyber hygiene have created a skill and funding gap for many government contractors.

Failure to follow these recommended sanitization policies may result in severe consequences, including losing data, leaking state secrets to adversaries, losing proprietary technologies, and being barred from contract competition by government agencies. The government contractor community must therefore ensure its data sanitization policies are well defined and follow NIST guidelines. Additionally, while the core focus of data sanitization may seem to be electronic "soft copy" data, other data sources such as "hard copy" documents must be addressed in the same sanitization policies.

To examine existing instances of data sanitization policies and determine the impacts of not developing, utilizing, or following these policy guidelines and recommendations, research data was coalesced not only from the government contracting sector but also from other critical industries such as defense, energy, and transportation. These were selected because they typically also fall under government regulations, so NIST (National Institute of Standards and Technology) guidelines and policies would also apply in the United States. Primary data comes from a study performed by the independent research company Coleman Parkes Research in August 2019, which targeted senior cyber executives and policy makers and surveyed over 1,800 senior stakeholders. The Coleman Parkes data shows that 96% of organizations have a data sanitization policy in place; however, in the United States, only 62% of respondents felt that the policy is communicated well across the business. It also reveals that remote and contract workers were the least likely to comply with data sanitization policies. This has become a more pressing issue as many government contractors and private companies have been working remotely due to the Covid-19 pandemic, and the trend is likely to continue after the return to normal working conditions.

On June 26, 2021, a basic Google search for "data lost due to non-sanitization" returned over 20 million results. These included articles on data breaches and the loss of business, military secrets and proprietary data losses, PHI (Protected Health Information), PII (Personally Identifiable Information), and many articles on performing essential data sanitization. Many of these articles also point to existing data sanitization and security policies of companies and government entities, such as the U.S. Environmental Protection Agency's "Sample Policy and Guidance Language for Federal Media Sanitization". Based on these articles and NIST 800-88 recommendations, depending on its data security level or categorization, data should be:

  • Cleared – Provide a basic level of data sanitization by overwriting data sectors to remove any previous data remnants that a basic format would not include. Again, the focus is on electronic media. This method is typically utilized if the media is going to be re-used within the organization at a similar data security level.
  • Purged – May use physical (degaussing) or logical (sector overwrite) methods to make the target media unreadable. Typically utilized when media is no longer needed and is at a lower data security level.
  • Destroyed – Permanently renders the data irretrievable and is commonly used when media is leaving an organization or has reached its end of life, i.e., paper shredding or hard drive/media crushing and incineration. This method is typically utilized for media containing highly sensitive information and state secrets which could cause grave damage to national security or to the privacy and safety of individuals.

Data sanitization road blocks

The International Information Systems Security Certification Consortium's 2020 Cyber Workforce study shows that the global cybersecurity industry still has over 3.12 million unfilled positions due to a skills shortage. Therefore, those with the correct skill set to implement NIST 800-88 in policy may come at a premium labor rate. In addition, staffing and funding need to adjust to meet policy needs in order to properly implement these sanitization methods, in tandem with appropriate data-level categorization, to improve data security outcomes and reduce data loss. To ensure the confidentiality of customer and client data, government and private industry must create and follow concrete data sanitization policies which align with best practices, such as those outlined in NIST 800-88. Without consistent and enforced policy requirements, the data will be at increased risk of compromise; to avoid this, entities must allow for a cybersecurity wage premium to attract qualified talent. To prevent the loss of proprietary data, personal information, trade secrets, and classified information, it is only logical to follow best practices.

Data sanitization policy best practices


A data sanitization policy must be comprehensive and include data levels and corresponding sanitization methods, covering all forms of media, both soft and hard copy. Categories of data should also be defined so that appropriate sanitization levels are specified under the policy, and every level of data can be aligned with the appropriate sanitization method. For example, controlled unclassified information on electronic storage devices may be cleared or purged, but devices storing secret or top secret classified materials should be physically destroyed.

Any data sanitization policy should be enforceable and show which department and management structure has the responsibility to ensure data is sanitized accordingly. Such a policy requires a high-level management champion (typically the Chief Information Security Officer or another C-suite equivalent) to own the process and to define responsibilities and penalties for parties at all levels. This champion will also define concepts such as the information system owner and information owner, establishing the chain of responsibility for data creation and eventual sanitization. The CISO or other policy champion should also ensure funding is allocated for additional cybersecurity workers to implement and enforce policy compliance; auditing requirements, typically included to prove media destruction, should be managed by these staff. For small businesses and those without a broad cyber background, resources are available in the form of editable data sanitization policy templates; groups such as the IDSC (International Data Sanitization Consortium) provide these free of charge on their website https://www.datasanitization.org/.

Without training in data security and sanitization principles, it is unrealistic to expect users to comply with the policy. Therefore, the sanitization policy should include a matrix of instruction and frequency by job category to ensure that users at every level understand their part in complying with it. This should be easy to accomplish, as most government contractors are already required to perform annual information security training for all employees, to which additional content can be added to ensure data sanitization policy compliance.

Sanitizing devices

The primary use of data sanitization is the complete clearing of devices and destruction of all sensitive data once a storage device is no longer in use or is transferred to another information system. This is an essential stage in the Data Security Lifecycle (DSL) and Information Lifecycle Management (ILM). Both are approaches for ensuring privacy and data management throughout the use of an electronic device, ensuring that all data is destroyed and unrecoverable when the device reaches the end of its lifecycle.

There are three main methods of data sanitization for complete erasure of data: physical destruction, cryptographic erasure, and data erasure. All three aim to ensure that deleted data cannot be accessed even through advanced forensic methods, which maintains the privacy of individuals' data even after the device is no longer in use.

Physical destruction


Physical destruction involves the manual destruction of stored data. This method uses mechanical shredders or degaussers to destroy devices such as phones, computers, hard drives, and printers by reducing them to small pieces. Varying data security levels require different levels of destruction.

Degaussing is most commonly used on hard disk drives (HDDs) and involves using high-energy magnetic fields to permanently disrupt the functionality and memory storage of the device. When data is exposed to this strong magnetic field, any memory storage is neutralized and cannot be recovered or used again. Degaussing does not apply to solid-state drives (SSDs), as their data is not stored magnetically. When particularly sensitive data is involved, it is typical to use processes such as paper pulping, special burns, and solid state conversion, ensuring proper destruction of all sensitive media, including paper, hard and soft copy media, optical media, and specialized computing hardware.

Physical destruction often ensures that data is completely erased and cannot be used again. However, the physical by-products of mechanical shredding can be damaging to the environment, although a recent trend of increasing the amount of e-waste material recovered through e-cycling has helped to minimize the environmental impact. Furthermore, once a device is physically destroyed, it can no longer be resold or used again.

Cryptographic erasure

Cryptographic erasure involves the destruction of the secure key or passphrase used to protect stored information. Data encryption involves the development of a secure key that enables only authorized parties to access the stored data. Permanently erasing this key ensures that the private data it protected can no longer be accessed. Cryptographic erasure is commonly provided by manufacturers of the device itself, as encryption software is often built into the device. Encryption with key erasure involves encrypting all sensitive material in a way that requires a secure key to decrypt the information when it needs to be used; when the information needs to be deleted, the secure key can simply be erased. This provides greater ease of use, and a speedier data wipe, than other software methods, because it involves one deletion of secure information rather than of each individual file.
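A minimal sketch of encryption with key erasure, using the third-party Python cryptography package; the data and the key-destruction step are illustrative. Once the only copy of the key is gone, the ciphertext on disk is unreadable without wiping each file:

```python
# Sketch of encryption with key erasure, using the "cryptography"
# package (pip install cryptography). Destroying the key renders the
# stored ciphertext unreadable; no per-file overwriting is needed.

from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b"sensitive records")

# ... ciphertext sits on disk; only the key protects it ...

key = None   # "cryptographic erasure": destroy the only copy of the key

try:
    # Any other key fails to decrypt the stranded ciphertext.
    Fernet(Fernet.generate_key()).decrypt(ciphertext)
except InvalidToken:
    print("data unrecoverable without the original key")
```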

Cryptographic erasure is often used for data storage that does not contain as much private information, since there is a possibility that errors can occur due to manufacturing failures or human error during the process of key destruction, creating a wider range of possible outcomes. This method allows data to continue to be stored on the device and does not require that the device be completely erased; the device can thus be resold to another individual or company, since its physical integrity is maintained. However, this assumes that the level of data encryption on the device is resistant to future encryption attacks. For instance, a hard drive utilizing cryptographic erasure with a 128-bit AES key may be secure now, but in five years it may be common to break that level of encryption. Therefore the required level of data security should be declared in a data sanitization policy to future-proof the process.

Data erasure

The process of data erasure involves overwriting all information at the byte level with random 0s and 1s on every sector of the electronic equipment that is no longer in use. This software-based method ensures that all previously stored data is completely overwritten and unrecoverable, which ensures full data sanitization. The efficacy and accuracy of this sanitization method can also be analyzed through auditable reports.

Data erasure often ensures complete sanitization while also maintaining the physical integrity of the equipment, so that the technology can be resold or reused. This ability to recycle devices makes data erasure a more environmentally sound form of data sanitization. It is also the most accurate and comprehensive method, since the efficacy of the overwriting can be tested afterwards to ensure complete deletion. However, software-based data erasure requires more time than other methods.
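A minimal sketch of software-based erasure at the file level, assuming a conventional hard disk (SSDs may remap writes, and real products add verification passes and audit reports):

```python
import os

# Overwrite a file's contents in place with random bytes, then flush
# to media. Real erasure tools work at the device level, perform
# verified multi-pass wipes, and produce audit reports; this is an
# HDD-oriented illustration only. The file name is an example.

def overwrite_file(path: str, passes: int = 3, block: int = 1 << 20):
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(block, remaining)
                f.write(os.urandom(n))   # random 0s and 1s
                remaining -= n
            f.flush()
            os.fsync(f.fileno())         # force the write to media

overwrite_file("obsolete_records.db")
```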

Secure erase

A number of storage devices support a command that, when passed to the device, causes it to perform a built-in sanitization procedure. The following command sets define such a standard command:

  • ATA (including SATA) defines a Security Erase command. Two levels of thoroughness are defined.
  • SCSI (including SAS and other physical connections) defines a SANITIZE command.
  • NVMe defines formatting with secure erase.
  • Opal Storage Specification specifies a command set for self-encrypting drives and cryptographic erase, available in addition to command-set methods.

When data is encrypted, the drive usually performs fast cryptographic erasure; otherwise it performs a slower data erasure by overwriting. SCSI allows asking for a specific type of erasure.
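On Linux, the ATA Security Erase feature is commonly invoked through the hdparm utility; the sketch below drives it from Python. The device path and password are placeholders, the commands are destructive, and the drive must not be in a "frozen" security state:

```python
import subprocess

# Issue the ATA Security Erase sequence via hdparm (Linux, run as
# root). DESTRUCTIVE: the device path and password are placeholders.
# A temporary security password must be set before the erase command
# is accepted; the erase clears it again on completion.

DEV, PW = "/dev/sdX", "temp-pass"

subprocess.run(
    ["hdparm", "--user-master", "u", "--security-set-pass", PW, DEV],
    check=True,
)
subprocess.run(
    ["hdparm", "--user-master", "u", "--security-erase", PW, DEV],
    check=True,
)
```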

If implemented correctly, the built-in sanitization feature is sufficient to render data unrecoverable, and NIST approves of its use. However, there have been a few reported instances of failures to erase some or all data due to buggy firmware, sometimes readily apparent in a sector editor.

Necessity of data sanitization

There has been increased usage of mobile devices, Internet of Things (IoT) technologies, cloud-based storage systems, portable electronic devices, and various other electronic methods of storing sensitive information; implementing effective erasure methods once a device is no longer in use has therefore become crucial to protect sensitive data. Given the increased use of electronic devices and the growing amount of private information stored on them, the need for data sanitization has become much more urgent in recent years.

There are also specific methods of sanitization that do not fully clean devices of private data, which can prove problematic. For example, some remote wiping methods on mobile devices are vulnerable to outside attacks, and their efficacy depends on each individual software system installed. Remote wiping involves sending a wireless command to a device that has been lost or stolen, directing it to completely wipe all data. While this method can be very beneficial, it also has drawbacks. For example, the remote wiping process can be triggered by attackers when it is not yet necessary, resulting in incomplete data sanitization; if attackers then gain access to the storage on the device, the user risks exposing all the private information that was stored.

Cloud computing and storage has become an increasingly popular method of data storage and transfer. However, certain privacy challenges associated with cloud computing have not been fully explored. Cloud computing is vulnerable to various attacks, such as code injection, path traversal attacks, and resource depletion, because of the shared pool structure of these new techniques. These cloud storage models require specific data sanitization methods to combat these issues; if data is not properly removed from cloud storage models, it opens up the possibility of security breaches at multiple levels.

Risks posed by inadequate data-set sanitization

Inadequate data sanitization methods can result in two main problems: a breach of private information and compromises to the integrity of the original dataset. If sanitization fails to remove all sensitive information, it poses the risk of leaking this information to attackers. Numerous studies have been conducted to optimize ways of preserving sensitive information. Some data sanitization methods are highly sensitive to distinct points that lie far from the rest of the data; this type of sanitization is very precise and can detect anomalies even if a poisoned data point is relatively close to true data. Another method removes outliers more generally: it detects the general trend of the data, discards anything that strays from it, and is able to target anomalies even when they are inserted as a group. In general, data sanitization techniques use algorithms to detect anomalies and remove any suspicious points that may be poisoned data or sensitive information.
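As a generic, simplified instance of such anomaly-based sanitization (not any specific published method), the sketch below drops points whose z-score marks them as far from the general trend:

```python
# Toy anomaly-based sanitization: drop points whose z-score exceeds a
# threshold, i.e., points far from the general trend of the data.
# Real systems use far more sophisticated detectors, per the text.

from statistics import mean, stdev

def sanitize(values: list[float], threshold: float = 2.5) -> list[float]:
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= threshold * s]

# Illustrative data: ten inliers near 10.0 plus one suspicious point.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 87.0]
print(sanitize(data))   # the 87.0 outlier is removed
```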

Furthermore, data sanitization methods may remove useful, non-sensitive information, which renders the sanitized dataset less useful and altered from the original. There have been iterations of common data sanitization techniques that attempt to correct this loss of dataset integrity. In particular, Liu, Xuan, Wen, and Song offered a new algorithm for data sanitization called the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) method. Much emphasis is usually put on protecting the privacy of users, so this method brings a new perspective that also focuses on protecting the integrity of the data. It has three main advantages: it optimizes the sanitization process by cleaning only the item with the highest conflict count, it keeps the parts of the dataset with the highest utility, and it analyzes the conflict degree of the sensitive material. Robust research was conducted on the efficacy and usefulness of this technique, revealing the ways it can benefit the integrity of the dataset. The technique first pinpoints the specific parts of the dataset that may be poisoned data, and then uses computed trade-offs of utility to decide whether they should be removed. This is a new form of data sanitization that takes the utility of the data into account before it is immediately discarded.

Applications of data sanitization

Data sanitization methods are also implemented for privacy preserving data mining, association rule hiding, and blockchain-based secure information sharing. These methods involve the transfer and analysis of large datasets that contain private information. This private information needs to be sanitized before being made available online so that sensitive material is not exposed. Data sanitization is used to ensure privacy is maintained in the dataset, even when it is being analyzed.

Privacy preserving data mining

Privacy Preserving Data Mining (PPDM) is the process of data mining while maintaining privacy of sensitive material. Data mining involves analyzing large datasets to gain new information and draw conclusions. PPDM has a wide range of uses and is an integral step in the transfer or use of any large data set containing sensitive material.

Data sanitization is an integral step in privacy preserving data mining because private datasets need to be sanitized before they can be utilized by individuals or companies for analysis. The aim of privacy preserving data mining is to ensure that private information cannot be leaked or accessed by attackers and that sensitive data is not traceable to the individuals who submitted it. At the same time, privacy preserving data mining aims to maintain the integrity and functionality of the original dataset: for the dataset to remain usable, necessary aspects of the original data must be preserved during sanitization. This balance between privacy and utility has been the primary goal of data sanitization methods.

One approach to achieving this balance of privacy and utility is to encrypt and decrypt sensitive information using a process called key generation. After the data is sanitized, key generation is used to ensure that the data is secure and cannot be tampered with. Approaches such as the Rider Optimization Algorithm (ROA), also called Randomized ROA (RROA), use key generation strategies to search for an optimal key so that data can be transferred without leaking sensitive information.
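
Below is a minimal sketch of the encrypt-after-sanitizing step, using the Python cryptography library. A plain random key stands in for the optimization-based key search of ROA/RROA, and the sample record is invented:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Stand-in for an optimization-based key search: a random symmetric key.
key = Fernet.generate_key()
cipher = Fernet(key)

# A hypothetical record that has already been sanitized (no direct identifiers).
sanitized_record = b'{"age_band": "30-39", "diagnosis_code": "E11"}'

token = cipher.encrypt(sanitized_record)      # protect the data in transit
assert cipher.decrypt(token) == sanitized_record
```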

Some versions of key generation have also been optimized for larger datasets. For example, one novel Privacy Preserving Distributed Data Mining strategy is able to increase privacy and hide sensitive material through key generation, and this version of sanitization allows large amounts of material to be sanitized. For companies seeking to share information with several different groups, this methodology may be preferred over earlier methods that take much longer to process.

Certain models of data sanitization delete or add information to the original database in an effort to preserve the privacy of each subject. These heuristic-based algorithms are becoming more popular, especially in the field of association rule mining. Heuristic methods use specific algorithms for pattern hiding, rule hiding, and sequence hiding to keep particular information hidden. This type of data hiding can cover broad patterns in data, but is less effective at protecting specific pieces of information. Heuristic-based methods are also less suited to sanitizing large datasets; however, recent developments in the field have examined ways to tackle this problem. One example is the MR-OVnTSA approach, a heuristics-based sensitive-pattern-hiding approach for big data introduced by Shivani Sharma and Durga Toshniwal. This approach uses a heuristic method called the 'MapReduce Based Optimum Victim Item and Transaction Selection Approach', also called MR-OVnTSA, which aims to reduce the loss of important data while removing and hiding sensitive information; it relies on algorithms that compare candidate victim items and transactions in order to optimize the sanitization step. A simplified sketch of this style of victim selection follows below.
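
The following toy sketch shows support-reduction rule hiding in the spirit of victim-item and transaction selection; it illustrates the general heuristic style only, not the MR-OVnTSA algorithm itself:

```python
import math

def reduce_support(transactions, sensitive, min_support):
    """Delete one item of the sensitive itemset from enough supporting
    transactions that its support falls below min_support (a fraction
    of all transactions)."""
    sanitized = [set(t) for t in transactions]
    supporting = [t for t in sanitized if sensitive <= t]
    target = math.ceil(min_support * len(sanitized)) - 1  # max tolerated support
    # Victim heuristic: modify the shortest transactions first, since
    # they contribute to the fewest other (non-sensitive) patterns.
    supporting.sort(key=len)
    for t in supporting[: len(supporting) - target]:
        t.discard(next(iter(sensitive)))  # drop an arbitrary victim item
    return sanitized

txns = [{"bread", "milk"}, {"bread", "milk", "eggs"},
        {"milk", "eggs"}, {"bread", "milk"}]
print(reduce_support(txns, {"bread", "milk"}, min_support=0.5))
```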

An important goal of PPDM is to strike a balance between maintaining the privacy of the users who submitted the data and enabling developers to make full use of the dataset. Many PPDM measures directly modify the dataset, creating a new version from which the original is unrecoverable: sensitive information is strictly erased and made inaccessible to attackers.

Association rule mining

One type of data sanitization is rule-based PPDM, which uses defined computer algorithms to clean datasets. Association rule hiding is the process of data sanitization as applied to transactional databases, the general term for data storage used to record transactions as organizations conduct their business. Examples include shipping payments, credit card payments, and sales orders. One survey analyzes fifty-four different methods of data sanitization and presents four major findings about trends in the field.

Certain new methods of data sanitization rely on machine learning and deep learning. There are various weaknesses in current data sanitization practice: many methods are not intricate or detailed enough to protect against more specific data attacks. This effort to maintain privacy while retaining important data is referred to as privacy-preserving data mining. Machine learning can develop methods that are better adapted to different types of attacks and can learn to face a broader range of situations, while deep learning can simplify data sanitization methods and run these protective measures in a more efficient and less time-consuming way.
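
As one generic example of a learned sanitization filter (assuming scikit-learn is available; this is not a method from any work cited above), an isolation forest can be fit to a dataset to flag and drop suspected poisoned records:

```python
import numpy as np
from sklearn.ensemble import IsolationForest  # pip install scikit-learn

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (500, 4)),    # legitimate records
                  rng.normal(6, 0.5, (15, 4))])  # suspected poisoned records

# The forest learns what "normal" records look like and isolates the rest.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(data)   # +1 = keep, -1 = anomalous
sanitized = data[labels == 1]
```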

There have also been hybrid models that utilize both rule-based and machine learning methods to achieve a balance between the two techniques.

Blockchain-based secure information sharing

Blockchain-backed cloud storage systems are heavily reliant on data sanitization and are becoming an increasingly popular route of data storage. Furthermore, ease of use is important for enterprises and workplaces that use cloud storage for communication and collaboration.

Blockchain is used to record and transfer information in a secure way, and data sanitization techniques are required to ensure that this data is transferred securely and accurately. It is especially applicable to supply chain management and may be useful for those looking to optimize the supply chain process. For example, the Whale Optimization Algorithm (WOA) uses a method of secure key generation to ensure that information is shared securely through the blockchain. The need to improve blockchain methods is becoming increasingly relevant as the world grows more electronically dependent.
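
A minimal sketch of the tamper-evidence property that makes blockchains attractive for secure information sharing follows; it illustrates hash chaining only, not the WOA key-generation scheme, and the supply-chain records are invented:

```python
import hashlib
import json
import time

def make_block(records, prev_hash):
    """Link a batch of (already sanitized) records to the previous block;
    altering any earlier record would change every later hash."""
    block = {
        "timestamp": time.time(),
        "records": records,
        "prev_hash": prev_hash,
    }
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()
    ).hexdigest()
    return block

genesis = make_block(["shipment 001 received"], prev_hash="0" * 64)
nxt = make_block(["shipment 001 dispatched"], prev_hash=genesis["hash"])
```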

Industry specific applications

Healthcare

The healthcare industry is an important sector that relies heavily on data mining and on datasets that store confidential information about patients. The use of electronic storage has also been increasing in recent years, which requires more comprehensive research into and understanding of the risks it may pose. Currently, data mining and storage techniques can hold only limited amounts of information, which reduces the efficacy of data storage and increases its cost. New, more advanced cloud-based methods of storing and mining data are becoming increasingly popular, as they can both mine and store larger amounts of information.

Operator (computer programming)

From Wikipedia, the free encyclopedia