Monday, October 16, 2023

Space Shuttle program

From Wikipedia, the free encyclopedia
 
Space Shuttle program
Program overview
Country: United States
Organization: NASA
Purpose: Crewed orbital flight
Status: Completed
Program history
Cost: US$196 billion (2011)
Duration: 1972–1986, 1989–2011
First flight
First crewed flight
Last flight
Successes: 133
Failures: 2 (STS-51-L and STS-107)
Partial failures: 1 (STS-83)
Launch site(s)
Vehicle information
Crewed vehicle(s): Space Shuttle orbiter
Launch vehicle(s): Space Shuttle

The Space Shuttle program was the fourth human spaceflight program carried out by the U.S. National Aeronautics and Space Administration (NASA), which accomplished routine transportation for Earth-to-orbit crew and cargo from 1981 to 2011. Its official name, Space Transportation System (STS), was taken from a 1969 plan for a system of reusable spacecraft of which it was the only item funded for development. It flew 135 missions and carried 355 astronauts from 16 countries, many on multiple trips.

The Space Shuttle—composed of an orbiter launched with two reusable solid rocket boosters and a disposable external fuel tank—carried up to eight astronauts and up to 50,000 lb (23,000 kg) of payload into low Earth orbit (LEO). When its mission was complete, the orbiter would reenter the Earth's atmosphere and land like a glider at either the Kennedy Space Center or Edwards Air Force Base.

The Shuttle is the only winged crewed spacecraft to have achieved orbit and landing, and the first reusable crewed space vehicle that made multiple flights into orbit.[a] Its missions involved carrying large payloads to various orbits, including to the International Space Station (ISS), providing crew rotation for the space station, and performing service missions on the Hubble Space Telescope. The orbiter also recovered satellites and other payloads (e.g., from the ISS) from orbit and returned them to Earth, though its use in this capacity was rare. Each vehicle was designed with a projected lifespan of 100 launches, or 10 years' operational life. The shuttles were originally sold on the promise of over 150 launches across a 15-year operational span, with a 'launch per month' expected at the peak of the program, but extensive delays in the development of the International Space Station meant that such peak demand for frequent flights never materialized.

Background

Various shuttle concepts had been explored since the late 1960s. The program formally commenced in 1972 and became the sole focus of NASA's human spaceflight operations after the last of the Apollo, Skylab, and Apollo–Soyuz programs concluded in 1975. The Shuttle was originally conceived of and presented to the public in 1972 as a 'Space Truck' which would, among other things, be used to build a United States space station in low Earth orbit during the 1980s and then be replaced by a new vehicle by the early 1990s. The stalled plans for a U.S. space station evolved into the International Space Station and were formally initiated in 1983 by President Ronald Reagan, but the ISS suffered from long delays, design changes and cost overruns, which forced the service life of the Space Shuttle to be extended several times until it was finally retired in 2011, having served twice as long as originally designed. In 2004, according to President George W. Bush's Vision for Space Exploration, use of the Space Shuttle was to be focused almost exclusively on completing assembly of the ISS, which was far behind schedule at that point.

The first experimental orbiter, Enterprise, was a high-altitude glider launched from the back of a specially modified Boeing 747 and used only for the initial Approach and Landing Tests (ALT). Enterprise's first test flight was on February 18, 1977, only five years after the Shuttle program was formally initiated, and led to the launch of the first space-worthy shuttle, Columbia, on April 12, 1981, on STS-1. The Space Shuttle program finished with its last mission, STS-135 flown by Atlantis, in July 2011, retiring the final Shuttle in the fleet. The Space Shuttle program formally ended on August 31, 2011.

Conception and development

Early U.S. space shuttle concepts

NASA began studies of Space Shuttle designs as early as October 1968, before the Apollo 11 Moon landing in 1969. The early studies were denoted "Phase A"; the more detailed and specific "Phase B" studies followed in June 1970. The primary intended use of the Space Shuttle was supporting the future space station, ferrying a minimum crew of four and about 20,000 pounds (9,100 kg) of cargo, and being able to be rapidly turned around for future flights.

Two designs emerged as front-runners. One was designed by engineers at the Manned Spaceflight Center, and championed especially by George Mueller. This was a two-stage system with delta-winged spacecraft, and generally complex. An attempt to simplify it was made in the form of the DC-3, designed by Maxime Faget, who had designed the Mercury capsule among other vehicles. Numerous designs from a variety of commercial companies were also offered but generally fell by the wayside as each NASA lab pushed for its own version.

All of this was taking place in the midst of other NASA teams proposing a wide variety of post-Apollo missions, a number of which would cost as much as Apollo or more. As each of these projects fought for funding, the NASA budget was at the same time being severely constrained. Three were eventually presented to Vice President Agnew in 1969. The shuttle project rose to the top, largely due to tireless campaigning by its supporters. By 1970 the shuttle had been selected as the one major project for the short-term post-Apollo time frame.

When funding for the program came into question, there were concerns that the project might be canceled. This led to an effort to interest the US Air Force in using the shuttle for their missions as well. The Air Force was mildly interested but demanded a much larger vehicle, far larger than the original concepts, which NASA accepted since it was also beneficial to their own plans. To lower the development costs of the resulting designs, boosters were added, a throw-away fuel tank was adopted, and many other changes were made that greatly lowered the reusability and greatly added to the vehicle and operational costs. With the Air Force's assistance, the system emerged in its operational form.

Program history

President Richard Nixon (right) with NASA Administrator James Fletcher in January 1972, three months before Congress approved funding for the Shuttle program
Shuttle approach and landing test crews, 1976

All Space Shuttle missions were launched from the Kennedy Space Center (KSC) in Florida. Some civilian and military circumpolar space shuttle missions were planned for Vandenberg AFB in California. However, the use of Vandenberg AFB for space shuttle missions was canceled after the Challenger disaster in 1986. The weather criteria used for launch included, but were not limited to: precipitation, temperatures, cloud cover, lightning forecast, wind, and humidity. The Shuttle was not launched under conditions where it could have been struck by lightning.

The first fully functional orbiter was Columbia (designated OV-102), built in Palmdale, California. It was delivered to Kennedy Space Center (KSC) on March 25, 1979, and was first launched on April 12, 1981—the 20th anniversary of Yuri Gagarin's space flight—with a crew of two.

Challenger (OV-099) was delivered to KSC in July 1982, Discovery (OV-103) in November 1983, Atlantis (OV-104) in April 1985 and Endeavour in May 1991. Challenger was originally built and used as a Structural Test Article (STA-099), but was converted to a complete orbiter when this was found to be less expensive than converting Enterprise from its Approach and Landing Test configuration into a spaceworthy vehicle.

On April 24, 1990, Discovery carried the Hubble Space Telescope into space during STS-31.

In the course of 135 missions flown, two orbiters (Columbia and Challenger) suffered catastrophic accidents, with the loss of all crew members, totaling 14 astronauts.

The accidents led to national-level inquiries and detailed analysis of why they occurred. Each was followed by a significant pause while changes were made before the Shuttles returned to flight. After the Challenger disaster in January 1986, flights resumed 32 months later with the launch of STS-26 on September 29, 1988. After the Columbia disaster in 2003, the Shuttle did not fly for more than two years, returning to flight in July 2005 with the STS-114 mission.

The longest Shuttle mission was STS-80 lasting 17 days, 15 hours. The final flight of the Space Shuttle program was STS-135 on July 8, 2011.

Since the Shuttle's retirement in 2011, many of its original duties are performed by an assortment of government and private vessels. The European Automated Transfer Vehicle (ATV) supplied the ISS between 2008 and 2015. Classified military missions are being flown by the US Air Force's uncrewed spaceplane, the X-37B. By 2012, cargo to the International Space Station was already being delivered commercially under NASA's Commercial Resupply Services by SpaceX's partially reusable Dragon spacecraft, followed by Orbital Sciences' Cygnus spacecraft in late 2013. Crew service to the ISS is currently provided by the Russian Soyuz and, since 2020, the SpaceX Dragon 2 crew capsule, launched on the company's reusable Falcon 9 rocket as part of NASA's Commercial Crew Development program. Boeing is also developing its Starliner capsule for ISS crew service, but the capsule has been delayed since its December 2019 uncrewed test flight was unsuccessful. For missions beyond low Earth orbit, NASA is building the Space Launch System and the Orion spacecraft, part of the Artemis program.

The NASA Administrator addresses the crowd at the Spacelab arrival ceremony in February 1982. On the podium with him are then-Vice President George Bush, the director general of the European Space Agency (ESA), Eric Quistgaard, and the director of Kennedy Space Center, Richard G. Smith
"President Ronald Reagan chats with NASA astronauts Henry Hartsfield and Ken Mattingly on the runway as first lady Nancy Reagan inspects the nose of Space Shuttle Columbia following its Independence Day landing at Edwards Air Force Base on July 4, 1982."[9]
STS-3 lands in March 1982

Accomplishments

Galileo floating free in space after release from Space Shuttle Atlantis, 1989
Space Shuttle Endeavour docked with the International Space Station (ISS), 2011

Space Shuttle missions have included:

Budget

Space Shuttle Atlantis takes flight on the STS-27 mission on December 2, 1988. The Shuttle took about 8.5 minutes to accelerate to a speed of over 27,000 km/h (17,000 mph) and achieve orbit.
A drag chute is deployed by Endeavour as it completes a mission of almost 17 days in space on Runway 22 at Edwards Air Force Base in southern California. Landing occurred at 1:46 pm (EST), March 18, 1995.

Early during development of the Space Shuttle, NASA had estimated that the program would cost $7.45 billion ($43 billion in 2011 dollars, adjusting for inflation) in development/non-recurring costs, and $9.3 million ($54 million in 2011 dollars) per flight. Early estimates for the cost to deliver payload to low Earth orbit were as low as $118 per pound ($260/kg) of payload ($635/lb or $1,400/kg in 2011 dollars), based on marginal or incremental launch costs, and assuming a 65,000-pound (30,000 kg) payload capacity and 50 launches per year. A more realistic projection of 12 flights per year for the 15-year service life combined with the initial development costs would have resulted in a total cost projection for the program of roughly $54 billion (in 2011 dollars).

The total cost of the actual 30-year service life of the Shuttle program through 2011, adjusted for inflation, was $196 billion. The exact breakdown into non-recurring and recurring costs is not available, but, according to NASA, the average cost to launch a Space Shuttle as of 2011 was about $450 million per mission.

NASA's budget for 2005 allocated 30%, or $5 billion, to space shuttle operations; this was decreased in 2006 to a request of $4.3 billion. Non-launch costs account for a significant part of the program budget: for example, during fiscal years 2004 to 2006, NASA spent around $13 billion on the Space Shuttle program, even though the fleet was grounded in the aftermath of the Columbia disaster and there were a total of three launches during this period of time. In fiscal year 2009, NASA's budget allocated $2.98 billion to the program for 5 launches, including $490 million for "program integration", $1.03 billion for "flight and ground operations", and $1.46 billion for "flight hardware" (which includes maintenance of orbiters, engines, and the external tank between flights).

Per-launch costs can be measured by dividing the total cost over the life of the program (including buildings, facilities, training, salaries, etc.) by the number of launches. With 135 missions, and the total cost of US$192 billion (in 2010 dollars), this gives approximately $1.4 billion per launch over the life of the Shuttle program. A 2017 study found that carrying one kilogram of cargo to the ISS on the Shuttle cost $272,000 in 2017 dollars, twice the cost of Cygnus and three times that of Dragon.
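The division described above can be checked with a few lines of Python. This is only a sketch using the two figures quoted in the paragraph; the function and constant names are illustrative and do not come from any NASA source.

```python
# Average cost per launch = total program cost / number of launches.
# Both inputs are the figures quoted in the text above; the names used
# here (cost_per_launch, TOTAL_COST_USD, MISSIONS) are illustrative only.

def cost_per_launch(total_cost_usd: float, launches: int) -> float:
    """Return the average cost per launch in US dollars."""
    if launches <= 0:
        raise ValueError("launches must be positive")
    return total_cost_usd / launches


TOTAL_COST_USD = 192e9  # total program cost in 2010 dollars, as quoted
MISSIONS = 135          # missions flown

average = cost_per_launch(TOTAL_COST_USD, MISSIONS)
print(f"Average cost per launch: ${average / 1e9:.2f} billion")
```

Because the total includes facilities, training, and salaries, this average is far higher than the roughly $450 million that NASA quoted as the cost of an individual mission.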

NASA used a management philosophy known as success-oriented management during the Space Shuttle program, which historian Alex Roland described in the aftermath of the Columbia disaster as "hoping for the best". Success-oriented management has since been studied by several analysts.

Accidents

In the course of 135 missions flown, two orbiters were destroyed, with loss of crew totalling 14 astronauts:

  • Challenger – lost 73 seconds after liftoff, STS-51-L, January 28, 1986
  • Columbia – lost approximately 16 minutes before its expected landing, STS-107, February 1, 2003

There was also one abort-to-orbit and some fatal accidents on the ground during launch preparations.

STS-51-L (Challenger, 1986)

In 1986, Challenger disintegrated one minute and 13 seconds after liftoff.

Close-up video footage of Challenger during its final launch on January 28, 1986, clearly shows that the problems began due to an O-ring failure on the right solid rocket booster (SRB). The hot plume of gas leaking from the failed joint caused the collapse of the external tank, which then resulted in the orbiter's disintegration due to high aerodynamic stress. The accident resulted in the loss of all seven astronauts on board. Endeavour (OV-105) was built to replace Challenger (using structural spare parts originally intended for the other orbiters) and delivered in May 1991; it was first launched a year later.

After the loss of Challenger, NASA grounded the Space Shuttle program for over two years, making numerous safety changes recommended by the Rogers Commission Report, which included a redesign of the SRB joint that failed in the Challenger accident. Other safety changes included a new escape system for use when the orbiter was in controlled flight, improved landing gear tires and brakes, and the reintroduction of pressure suits for Shuttle astronauts (these had been discontinued after STS-4; astronauts wore only coveralls and oxygen helmets from that point on until the Challenger accident). The Shuttle program continued in September 1988 with the launch of Discovery on STS-26.

The accidents did not just affect the technical design of the orbiter, but also NASA. Quoting some recommendations made by the post-Challenger Rogers commission:

Recommendation I – The faulty Solid Rocket Motor joint and seal must be changed. This could be a new design eliminating the joint or a redesign of the current joint and seal. ... the Administrator of NASA should request the National Research Council to form an independent Solid Rocket Motor design oversight committee to implement the Commission's design recommendations and oversee the design effort.
Recommendation II – The Shuttle Program Structure should be reviewed. ... NASA should encourage the transition of qualified astronauts into agency management positions.
Recommendation III – NASA and the primary shuttle contractors should review all Criticality 1, 1R, 2, and 2R items and hazard analyses.
Recommendation IV – NASA should establish an Office of Safety, Reliability and Quality Assurance to be headed by an Associate Administrator, reporting directly to the NASA Administrator.
Recommendation VI – NASA must take actions to improve landing safety. The tire, brake and nosewheel system must be improved.
Recommendation VII – Make all efforts to provide a crew escape system for use during controlled gliding flight.
Recommendation VIII – The nation's reliance on the shuttle as its principal space launch capability created a relentless pressure on NASA to increase the flight rate ... NASA must establish a flight rate that is consistent with its resources.

STS-107 (Columbia, 2003)

Space Shuttle Discovery as it approaches the International Space Station during STS-114 on July 28, 2005. This was the Shuttle's "return to flight" mission after the Columbia disaster

The Shuttle program operated accident-free for seventeen years and 88 missions after the Challenger disaster, until Columbia broke up on reentry, killing all seven crew members, on February 1, 2003. The ultimate cause of the accident was a piece of foam separating from the external tank moments after liftoff and striking the leading edge of the orbiter's left wing, puncturing one of the reinforced carbon-carbon (RCC) panels that covered the wing edge and protected it during reentry. As Columbia reentered the atmosphere at the end of an otherwise normal mission, hot gas penetrated the wing and destroyed it from the inside out, causing the orbiter to lose control and disintegrate.

After the Columbia disaster, the International Space Station operated on a skeleton crew of two for more than two years and was serviced primarily by Russian spacecraft. While the "Return to Flight" mission STS-114 in 2005 was successful, a similar piece of foam from a different portion of the tank was shed. Although the debris did not strike Discovery, the program was grounded once again for this reason.

The second "Return to Flight" mission, STS-121, launched on July 4, 2006, at 14:37 (EDT). Two previous launch attempts were scrubbed because of lingering thunderstorms and high winds around the launch pad, and the launch took place despite objections from its chief engineer and safety head. A five-inch (13 cm) crack in the foam insulation of the external tank gave cause for concern; however, the Mission Management Team gave the go for launch. This mission increased the ISS crew to three. Discovery touched down successfully on July 17, 2006, at 09:14 (EDT) on Runway 15 at Kennedy Space Center.

Following the success of STS-121, all subsequent missions were completed without major foam problems, and the construction of the ISS was completed. During the STS-118 mission in August 2007, the orbiter was again struck by a foam fragment on liftoff, but the damage was minimal compared to that sustained by Columbia.

The Columbia Accident Investigation Board, in its report, noted the reduced risk to the crew when a Shuttle flew to the International Space Station (ISS), as the station could be used as a safe haven for the crew awaiting rescue in the event that damage to the orbiter on ascent made it unsafe for reentry. The board recommended that for the remaining flights, the Shuttle always orbit with the station. Prior to STS-114, NASA Administrator Sean O'Keefe declared that all future flights of the Space Shuttle would go to the ISS, precluding the possibility of executing the final Hubble Space Telescope servicing mission which had been scheduled before the Columbia accident, despite the fact that millions of dollars worth of upgrade equipment for Hubble were ready and waiting in NASA warehouses. Many dissenters, including astronauts, asked NASA management to reconsider allowing the mission, but initially the administrator stood firm. On October 31, 2006, NASA announced approval of the launch of Atlantis for the fifth and final shuttle servicing mission to the Hubble Space Telescope, scheduled for August 28, 2008. However, SM4/STS-125 eventually launched in May 2009.

One impact of Columbia was that future crewed launch vehicles, namely the Ares I, had a special emphasis on crew safety compared to other considerations.

Retirement

The Space Shuttle retirement was announced in January 2004. President George W. Bush announced his Vision for Space Exploration, which called for the retirement of the Space Shuttle once it completed construction of the ISS. To ensure the ISS was properly assembled, the contributing partners determined the need for 16 remaining assembly missions in March 2006. One additional Hubble Space Telescope servicing mission was approved in October 2006. Originally, STS-134 was to be the final Space Shuttle mission. However, the Columbia disaster resulted in additional orbiters being prepared for launch on need in the event of a rescue mission. As Atlantis was prepared for the final launch-on-need mission, the decision was made in September 2010 that it would fly as STS-135 with a four-person crew that could remain at the ISS in the event of an emergency. STS-135 launched on July 8, 2011, and landed at the KSC on July 21, 2011, at 5:57 a.m. EDT (09:57 UTC). From then until the launch of Crew Dragon Demo-2 on May 30, 2020, the US launched its astronauts aboard Russian Soyuz spacecraft.

Following each orbiter's final flight, it was processed to make it safe for display. The OMS and RCS presented the primary dangers due to their toxic hypergolic propellant, and most of their components were permanently removed to prevent any dangerous outgassing. Atlantis is on display at the Kennedy Space Center Visitor Complex, Florida, Discovery is at the Udvar-Hazy Center, Virginia, Endeavour is on display at the California Science Center in Los Angeles, and Enterprise is displayed at the Intrepid Sea, Air & Space Museum in New York. Components from the orbiters were transferred to the US Air Force, ISS program, and Russian and Canadian governments. The engines were removed to be used on the Space Launch System, and spare RS-25 nozzles were attached for display purposes.

Atlantis being greeted by a crowd after its final landing
Atlantis after its final landing, marking the end of the Space Shuttle Program

Preservation

Space Shuttle Discovery at the Udvar-Hazy Center

Out of the five fully functional shuttle orbiters built, three remain. Enterprise, which was used for atmospheric test flights but not for orbital flight, had many parts taken out for use on the other orbiters. It was later visually restored and was on display at the National Air and Space Museum's Steven F. Udvar-Hazy Center until April 19, 2012. Enterprise was moved to New York City in April 2012 to be displayed at the Intrepid Sea, Air & Space Museum, whose Space Shuttle Pavilion opened on July 19, 2012. Discovery replaced Enterprise at the National Air and Space Museum's Steven F. Udvar-Hazy Center. Atlantis formed part of the Space Shuttle Exhibit at the Kennedy Space Center visitor complex and has been on display there since June 29, 2013, following its refurbishment.

On October 14, 2012, Endeavour completed an unprecedented 12 mi (19 km) drive on city streets from Los Angeles International Airport to the California Science Center, where it has been on display in a temporary hangar since late 2012. The transport from the airport took two days and required major street closures, the removal of over 400 city trees, and extensive work to raise power lines, level the street, and temporarily remove street signs, lamp posts, and other obstacles. Hundreds of volunteers, and fire and police personnel, helped with the transport. Large crowds of spectators waited on the streets to see the shuttle as it passed through the city. Endeavour, along with the last flight-qualified external tank (ET-94), is currently on display at the California Science Center's Samuel Oschin Pavilion (in a horizontal orientation) until the completion of the Samuel Oschin Air and Space Center (a planned addition to the California Science Center). Once moved, it will be permanently displayed in launch configuration, complete with genuine solid rocket boosters and external tank.

Crew modules

Spacehab module
Ten people inside Spacelab Module in the Shuttle bay in June 1995, celebrating the docking of the Space Shuttle and Mir.

One area of Space Shuttle applications is an expanded crew. Crews of up to eight have been flown in the Orbiter, but it could have held at least a crew of ten. Various proposals for filling the payload bay with additional passengers were also made as early as 1979. One proposal by Rockwell provided seating for 74 passengers in the Orbiter payload bay, with support for three days in Earth orbit. With a smaller 64 seat orbiter, costs for the late 1980s would be around US$1.5 million per seat per launch. The Rockwell passenger module had two decks, four seats across on top and two on the bottom, including a 25-inch (63.5 cm) wide aisle and extra storage space.

Another design was Space Habitation Design Associates' 1983 proposal for 72 passengers in the Space Shuttle payload bay. Passengers were located in 6 sections, each with windows and its own loading ramp at launch, and with seats in different configurations for launch and landing. Another proposal was based on the Spacelab habitation modules, which provided 32 seats in the payload bay in addition to those in the cockpit area.

There were some efforts to analyze commercial operation of STS. Using the NASA figure for average cost to launch a Space Shuttle as of 2011 at about $450 million per mission, the cost per seat for the 74-seat module envisioned by Rockwell came to about $6 million, not including the regular crew. Some passenger modules used hardware similar to existing equipment, such as the tunnel, which was also needed for Spacehab and Spacelab.
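The per-seat figure follows from dividing the quoted launch cost by the number of passenger seats. The sketch below reproduces that arithmetic for the 74-seat Rockwell module; applying the same 2011 launch cost to the 72-seat Space Habitation Design Associates proposal is an extrapolation made here for comparison only and is not a figure given in the text. All names in the code are illustrative.

```python
# Cost per passenger seat = launch cost / passenger seats.
# The launch cost and seat counts are the figures quoted in the text;
# pairing the 2011 launch cost with the 72-seat proposal is an assumption
# made here purely for comparison.

LAUNCH_COST_USD = 450e6  # NASA's quoted average cost per mission (2011)

PROPOSALS = {
    "Rockwell (74 seats)": 74,
    "Space Habitation Design Associates (72 seats)": 72,
}


def cost_per_seat(launch_cost_usd: float, seats: int) -> float:
    """Return the launch cost shared equally among passenger seats."""
    if seats <= 0:
        raise ValueError("seats must be positive")
    return launch_cost_usd / seats


per_seat = {name: cost_per_seat(LAUNCH_COST_USD, n) for name, n in PROPOSALS.items()}

for name, cost in per_seat.items():
    print(f"{name}: ${cost / 1e6:.2f} million per seat")
```

The regular flight crew is excluded from the seat count, matching the assumption stated in the paragraph above.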

Successors

During the three decades of operation, various follow-ons and replacements for the STS Space Shuttle were partially developed but not finished.

Examples of possible future space vehicles to supplement or supplant STS:

One effort in the direction of space transportation was the Reusable Launch Vehicle (RLV) program, initiated in 1994 by NASA. This led to work on the X-33 and X-34 vehicles. NASA spent about US$1 billion on developing the X-33, hoping for it to be in operation by 2005. Another program around the turn of the millennium was the Space Launch Initiative, which was a next-generation launch initiative.

The Space Launch Initiative program was started in 2001, and in late 2002 it evolved into two programs, the Orbital Space Plane (OSP) Program and the Next Generation Launch Technology program. OSP was oriented towards providing access to the International Space Station.

Other vehicles that would have taken over some of the Shuttle's responsibilities were the HL-20 Personnel Launch System or the NASA X-38 of the Crew Return Vehicle program, which were primarily for getting people down from the ISS. The X-38 was cancelled in 2002, and the HL-20 was cancelled in 1993. Several other programs in this area existed, such as the Station Crew Return Alternative Module (SCRAM) and Assured Crew Return Vehicle (ACRV).

According to the 2004 Vision for Space Exploration, NASA's next human spaceflight program was to be the Constellation program with its Ares I and Ares V launch vehicles and the Orion spacecraft; however, the Constellation program was never fully funded, and in early 2010 the Obama administration asked Congress to instead endorse a plan with heavy reliance on the private sector for delivering cargo and crew to LEO.

The Commercial Orbital Transportation Services (COTS) program began in 2006 with the purpose of creating commercially operated uncrewed cargo vehicles to service the ISS. The first of these vehicles, SpaceX Dragon, became operational in 2012, and the second, Orbital Sciences' Cygnus, did so in 2014.

The Commercial Crew Development (CCDev) program was initiated in 2010 with the purpose of creating commercially operated crewed spacecraft capable of delivering at least four crew members to the ISS, staying docked for 180 days and then returning them to Earth. These spacecraft, such as SpaceX's Dragon 2 and Boeing's CST-100 Starliner, were expected to become operational around 2020. On the Crew Dragon Demo-2 mission, SpaceX's Dragon 2 sent astronauts to the ISS, restoring America's human launch capability. The first operational SpaceX mission launched on November 15, 2020, at 7:27:17 p.m. ET, carrying four astronauts to the ISS.

Although the Constellation program was canceled, it has been replaced with the very similar Artemis program. The Orion spacecraft has been left virtually unchanged from its previous design. The planned Ares V rocket has been replaced with the smaller Space Launch System (SLS), which is planned to launch both Orion and other necessary hardware. Exploration Flight Test-1 (EFT-1), an uncrewed test flight of the Orion spacecraft, launched on December 5, 2014, on a Delta IV Heavy rocket.

Artemis 1 was the first flight of the SLS and was launched as a test of the completed Orion and SLS system. During the mission, an uncrewed Orion capsule spent 10 days in a 57,000-kilometer (31,000-nautical-mile) distant retrograde orbit around the Moon before returning to Earth. Artemis 2, the first crewed mission of the program, will launch four astronauts in 2024 on a free-return flyby of the Moon at a distance of 8,520 kilometers (4,600 nautical miles). After Artemis 2, the Power and Propulsion Element of the Lunar Gateway and three components of an expendable lunar lander are planned to be delivered on multiple launches from commercial launch service providers. Artemis 3 is planned to launch in 2025 aboard an SLS Block 1 rocket and will use the minimalist Gateway and expendable lander to achieve the first crewed lunar landing of the program. The flight is planned to touch down in the lunar south pole region, with two astronauts staying there for about one week.

Assets and transition plan

Atlantis about 30 minutes after final touchdown

The Space Shuttle program occupied over 654 facilities, used over 1.2 million line items of equipment, and employed over 5,000 people. The total value of equipment was over $12 billion. Shuttle-related facilities represented over a quarter of NASA's inventory. There were over 1,200 active suppliers to the program throughout the United States. NASA's transition plan had the program operating through 2010 with a transition and retirement phase lasting through 2015. During this time, the Ares I and Orion as well as the Altair Lunar Lander were to be under development, although these programs have since been canceled.

In the 2010s, the two major programs for human spaceflight were the Commercial Crew Program and the Artemis program. Kennedy Space Center Launch Complex 39A is, for example, used to launch Falcon Heavy and Falcon 9.

Criticism

The partial reusability of the Space Shuttle was one of the primary design requirements during its initial development. The technical decisions that dictated the orbiter's return and re-use reduced the per-launch payload capabilities. The original intention was to compensate for this lower payload with lower per-launch costs and a high launch frequency. However, the actual costs of a Space Shuttle launch were higher than initially predicted, and the Space Shuttle never flew the 24 missions per year that NASA had initially predicted.

The Space Shuttle was originally intended as a launch vehicle to deploy satellites, which it was primarily used for on the missions prior to the Challenger disaster. NASA's pricing, which was below cost, was lower than expendable launch vehicles; the intention was that the high volume of Space Shuttle missions would compensate for early financial losses. The improvement of expendable launch vehicles and the transition away from commercial payloads on the Space Shuttle resulted in expendable launch vehicles becoming the primary deployment option for satellites. A key customer for the Space Shuttle was the National Reconnaissance Office (NRO) responsible for spy satellites. The existence of NRO's connection was classified through 1993, and secret considerations of NRO payload requirements led to lack of transparency in the program. The proposed Shuttle-Centaur program, cancelled in the wake of the Challenger disaster, would have pushed the spacecraft beyond its operational capacity.

The fatal Challenger and Columbia disasters demonstrated the safety risks of the Space Shuttle that could result in the loss of the crew. The spaceplane design of the orbiter limited the abort options, as the abort scenarios required the controlled flight of the orbiter to a runway or to allow the crew to egress individually, rather than the abort escape options on the Apollo and Soyuz space capsules. Early safety analyses advertised by NASA engineers and management predicted the chance of a catastrophic failure resulting in the death of the crew as ranging from 1 in 100 launches to as rare as 1 in 100,000. Following the loss of two Space Shuttle missions, the risks for the initial missions were reevaluated, and the chance of a catastrophic loss of the vehicle and crew was found to be as high as 1 in 9. NASA management was criticized afterwards for accepting increased risk to the crew in exchange for higher mission rates. Both the Challenger and Columbia reports explained that NASA culture had failed to keep the crew safe by not objectively evaluating the potential risks of the missions.

Support vehicles

Many other vehicles were used in support of the Space Shuttle program, mainly terrestrial transportation vehicles.

Crawler-transporter No.2 ("Franz") in a December 2004 road test after track shoe replacement
Atlantis being prepared to be mated to the Shuttle Carrier Aircraft using the Mate-Demate Device following STS-44.
MV Freedom Star was a NASA recovery ship for the Space Shuttle Solid Rocket Boosters

RAID

From Wikipedia, the free encyclopedia

RAID (/reɪd/; "redundant array of inexpensive disks" or "redundant array of independent disks") is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

Data is distributed across the drives in one of several ways, referred to as RAID levels, depending on the required level of redundancy and performance. The different schemes, or data distribution layouts, are named by the word "RAID" followed by a number, for example RAID 0 or RAID 1. Each scheme, or RAID level, provides a different balance among the key goals: reliability, availability, performance, and capacity. RAID levels greater than RAID 0 provide protection against unrecoverable sector read errors, as well as against failures of whole physical drives.

History

The term "RAID" was invented by David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987. In their June 1988 paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)", presented at the SIGMOD Conference, they argued that the top-performing mainframe disk drives of the time could be beaten on performance by an array of the inexpensive drives that had been developed for the growing personal computer market. Although failures would rise in proportion to the number of drives, by configuring for redundancy, the reliability of an array could far exceed that of any large single drive.

Although not yet using that terminology, the technologies of the five levels of RAID named in the June 1988 paper were used in various products prior to the paper's publication, including the following:

  • Mirroring (RAID 1) was well established in the 1970s including, for example, Tandem NonStop Systems.
  • In 1977, Norman Ken Ouchi at IBM filed a patent disclosing what was subsequently named RAID 4.
  • Around 1983, DEC began shipping subsystem mirrored RA8X disk drives (now known as RAID 1) as part of its HSC50 subsystem.
  • In 1986, Clark et al. at IBM filed a patent disclosing what was subsequently named RAID 5.
  • Around 1988, the Thinking Machines' DataVault used error correction codes (now known as RAID 2) in an array of disk drives. A similar approach was used in the early 1960s on the IBM 353.

Industry manufacturers later redefined the RAID acronym to stand for "redundant array of independent disks".

Overview

Many RAID levels employ an error protection scheme called "parity", a widely used method in information technology to provide fault tolerance in a given set of data. Most use simple XOR, but RAID 6 uses two separate parities based respectively on addition and multiplication in a particular Galois field or Reed–Solomon error correction.
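
As an illustration of the simple XOR case, the following minimal sketch computes a parity block over a stripe and uses it to rebuild one missing block. The function names and the four-byte blocks are hypothetical and chosen only for the example; real implementations work on fixed-size chunks inside the controller or driver.

```python
from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """Return the byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing block from the survivors and the parity."""
    return xor_parity(surviving + [parity])


# Hypothetical stripe of three data blocks.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Pretend the drive holding data[1] has failed.
recovered = rebuild_missing([data[0], data[2]], parity)
print(recovered == data[1])
```

Because XOR is its own inverse, the same routine serves for both computing parity and reconstructing a lost block, which is why single-parity levels can tolerate exactly one missing drive.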

RAID can also provide data security with solid-state drives (SSDs) without the expense of an all-SSD system. For example, a fast SSD can be mirrored with a mechanical drive. For this configuration to provide a significant speed advantage, an appropriate controller is needed that uses the fast SSD for all read operations. Adaptec calls this "hybrid RAID".

Standard levels

Storage servers with 24 hard disk drives each and built-in hardware RAID controllers supporting various RAID levels

Originally, there were five standard levels of RAID, but many variations have evolved, including several nested levels and many non-standard levels (mostly proprietary). RAID levels and their associated data formats are standardized by the Storage Networking Industry Association (SNIA) in the Common RAID Disk Drive Format (DDF) standard:

  • RAID 0 consists of block-level striping, but no mirroring or parity. Compared to a spanned volume, the capacity of a RAID 0 volume is the same; it is the sum of the capacities of the drives in the set. But because striping distributes the contents of each file among all drives in the set, the failure of any drive causes the entire RAID 0 volume and all files to be lost. In comparison, a spanned volume preserves the files on the unfailing drives. The benefit of RAID 0 is that the throughput of read and write operations to any file is multiplied by the number of drives because, unlike spanned volumes, reads and writes are done concurrently. The cost is increased vulnerability to drive failures—since any drive in a RAID 0 setup failing causes the entire volume to be lost, the average failure rate of the volume rises with the number of attached drives.
  • RAID 1 consists of data mirroring, without parity or striping. Data is written identically to two or more drives, thereby producing a "mirrored set" of drives. Thus, any read request can be serviced by any drive in the set. If a request is broadcast to every drive in the set, it can be serviced by the drive that accesses the data first (depending on its seek time and rotational latency), improving performance. Sustained read throughput, if the controller or software is optimized for it, approaches the sum of throughputs of every drive in the set, just as for RAID 0. Actual read throughput of most RAID 1 implementations is slower than the fastest drive. Write throughput is always slower because every drive must be updated, and the slowest drive limits the write performance. The array continues to operate as long as at least one drive is functioning.
  • RAID 2 consists of bit-level striping with dedicated Hamming-code parity. All disk spindle rotation is synchronized and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive. This level is of historical significance only; although it was used on some early machines (for example, the Thinking Machines CM-2), as of 2014 it is not used by any commercially available system.
  • RAID 3 consists of byte-level striping with dedicated parity. All disk spindle rotation is synchronized and data is striped such that each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive. Although implementations exist, RAID 3 is not commonly used in practice.
  • RAID 4 consists of block-level striping with dedicated parity. This level was previously used by NetApp, but has now been largely replaced by a proprietary implementation of RAID 4 with two parity disks, called RAID-DP. The main advantage of RAID 4 over RAID 2 and 3 is I/O parallelism: in RAID 2 and 3, a single read I/O operation requires reading the whole group of data drives, while in RAID 4 one I/O read operation does not have to spread across all data drives. As a result, more I/O operations can be executed in parallel, improving the performance of small transfers.
  • RAID 5 consists of block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives, requiring all drives but one to be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks. Like all single-parity concepts, large RAID 5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild (see "Increasing rebuild time and failure probability" section, below). Rebuilding an array requires reading all data from all disks, opening a chance for a second drive failure and the loss of the entire array.
  • RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems, as large-capacity drives take longer to restore. RAID 6 requires a minimum of four disks. As with RAID 5, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced. With a RAID 6 array, using drives from multiple sources and manufacturers, it is possible to mitigate most of the problems associated with RAID 5. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 instead of RAID 5. RAID 10 also minimizes these problems.
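
The capacity and fault-tolerance trade-offs described in the list above can be summarized in a short sketch. It assumes equally sized drives; the function name is hypothetical, and the two-drive minimum used for RAID 0 is an assumption of the example rather than something stated above.

```python
def raid_summary(level: int, drives: int, drive_size_tb: float) -> tuple[float, int]:
    """Return (usable capacity in TB, drive failures always tolerated).

    Covers only RAID 0, 1, 5 and 6 with equally sized drives.
    """
    # Minimum drive counts: RAID 5 and RAID 6 follow the text above;
    # the value for RAID 0 is an assumption made for this sketch.
    minimum = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} drives")

    if level == 0:
        return drives * drive_size_tb, 0        # striping only, no redundancy
    if level == 1:
        return drive_size_tb, drives - 1        # every drive holds a full copy
    if level == 5:
        return (drives - 1) * drive_size_tb, 1  # one drive's worth of parity
    return (drives - 2) * drive_size_tb, 2      # RAID 6: two drives' worth of parity


for level in (0, 1, 5, 6):
    print(level, raid_summary(level, drives=4, drive_size_tb=2.0))
```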

Nested (hybrid) RAID

In what was originally termed hybrid RAID, many storage controllers allow RAID levels to be nested. The elements of a RAID may be either individual drives or arrays themselves. Arrays are rarely nested more than one level deep.

The final array is known as the top array. When the top array is RAID 0 (such as in RAID 1+0 and RAID 5+0), most vendors omit the "+" (yielding RAID 10 and RAID 50, respectively).

  • RAID 0+1: creates two stripes and mirrors them. If a single drive fails, one of the mirrors has failed, and the array is effectively running as RAID 0 with no redundancy. A rebuild carries significantly higher risk than with RAID 1+0, because all the data from all the drives in the remaining stripe has to be read rather than just the data from one drive, increasing the chance of an unrecoverable read error (URE) and significantly extending the rebuild window.
  • RAID 1+0: (see: RAID 10) creates a striped set from a series of mirrored drives. The array can sustain multiple drive losses so long as no mirror loses all its drives.
  • JBOD RAID N+N: With JBOD (just a bunch of disks), it is possible to concatenate not only disks but also volumes such as RAID sets. With larger drive capacities, write delay and rebuild time increase dramatically (especially, as described above, with RAID 5 and RAID 6). Splitting a larger RAID N set into smaller subsets and concatenating them with linear JBOD reduces write and rebuild time. If a hardware RAID controller cannot nest linear JBOD with RAID N, linear JBOD can be achieved with OS-level software RAID in combination with separate RAID N subset volumes created within one or more hardware RAID controllers. Besides a drastic speed increase, this provides another substantial advantage: a linear JBOD can be started with a small set of disks and later expanded with disks of different sizes (over time, larger disks become available on the market). A further advantage concerns disaster recovery: if one RAID N subset fails, the data on the other RAID N subsets is not lost, which reduces restore time.
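
The difference in resilience between RAID 0+1 and RAID 1+0 can be made concrete by enumerating failure combinations. The sketch below assumes a particular drive numbering (mirror pairs (0,1), (2,3), ... for RAID 1+0, and two stripes made of the first and second half of the drives for RAID 0+1) and treats a RAID 0+1 stripe as lost as soon as any of its drives fails. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def raid10_survives(failed: set[int], drives: int) -> bool:
    """RAID 1+0: data is lost only if both drives of some mirror pair fail."""
    return all(not {d, d + 1} <= failed for d in range(0, drives, 2))


def raid01_survives(failed: set[int], drives: int) -> bool:
    """RAID 0+1: data survives only if at least one stripe has no failed drive."""
    half = drives // 2
    stripe_a_intact = all(d >= half for d in failed)
    stripe_b_intact = all(d < half for d in failed)
    return stripe_a_intact or stripe_b_intact


def survival_fraction(survives, drives: int, failures: int) -> Fraction:
    """Fraction of all equally likely failure sets that the array survives."""
    cases = [set(c) for c in combinations(range(drives), failures)]
    return Fraction(sum(survives(c, drives) for c in cases), len(cases))


for n in (4, 6, 8):
    print(n, survival_fraction(raid10_survives, n, 2),
          survival_fraction(raid01_survives, n, 2))
```

Under these assumptions, RAID 1+0 survives a larger share of two-drive failures than RAID 0+1 at every array size, and the gap widens as drives are added.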

Non-standard levels

Many configurations other than the basic numbered RAID levels are possible, and many companies, organizations, and groups have created their own non-standard configurations, in many cases designed to meet the specialized needs of a small niche group. Such configurations include the following:

  • Linux MD RAID 10 provides a general RAID driver that in its "near" layout defaults to a standard RAID 1 with two drives, and a standard RAID 1+0 with four drives; however, it can include any number of drives, including odd numbers. With its "far" layout, MD RAID 10 can run both striped and mirrored, even with only two drives in f2 layout; this runs mirroring with striped reads, giving the read performance of RAID 0. Regular RAID 1, as provided by Linux software RAID, does not stripe reads, but can perform reads in parallel.
  • Hadoop has a RAID system that generates a parity file by xor-ing a stripe of blocks in a single HDFS file.
  • BeeGFS, the parallel file system, has internal striping (comparable to file-based RAID0) and replication (comparable to file-based RAID10) options to aggregate throughput and capacity of multiple servers and is typically based on top of an underlying RAID to make disk failures transparent.
  • Declustered RAID scatters dual (or more) copies of the data across all disks (possibly hundreds) in a storage subsystem, while holding back enough spare capacity to allow for a few disks to fail. The scattering is based on algorithms which give the appearance of arbitrariness. When one or more disks fail the missing copies are rebuilt into that spare capacity, again arbitrarily. Because the rebuild is done from and to all the remaining disks, it operates much faster than with traditional RAID, reducing the overall impact on clients of the storage system.

Implementations

The distribution of data across multiple drives can be managed either by dedicated computer hardware or by software. A software solution may be part of the operating system, part of the firmware and drivers supplied with a standard drive controller (so-called "hardware-assisted software RAID"), or it may reside entirely within the hardware RAID controller.

Hardware-based

Hardware RAID controllers can be configured through card BIOS or Option ROM before an operating system is booted, and after the operating system is booted, proprietary configuration utilities are available from the manufacturer of each controller. Unlike the network interface controllers for Ethernet, which can usually be configured and serviced entirely through the common operating system paradigms like ifconfig in Unix, without a need for any third-party tools, each manufacturer of each RAID controller usually provides their own proprietary software tooling for each operating system that they deem to support, ensuring a vendor lock-in, and contributing to reliability issues.

For example, in FreeBSD, in order to access the configuration of Adaptec RAID controllers, users are required to enable the Linux compatibility layer and use the Linux tooling from Adaptec, potentially compromising the stability, reliability and security of their setup, especially when taking the long-term view.

Some other operating systems have implemented their own generic frameworks for interfacing with any RAID controller, and provide tools for monitoring RAID volume status, as well as facilitation of drive identification through LED blinking, alarm management and hot spare disk designations from within the operating system without having to reboot into card BIOS. For example, this was the approach taken by OpenBSD in 2005 with its bio(4) pseudo-device and the bioctl utility, which provide volume status, and allow LED/alarm/hotspare control, as well as the sensors (including the drive sensor) for health monitoring; this approach has subsequently been adopted and extended by NetBSD in 2007 as well.

Software-based

Software RAID implementations are provided by many modern operating systems. Software RAID can be implemented as:

  • A layer that abstracts multiple devices, thereby providing a single virtual device (such as the Linux kernel's md and OpenBSD's softraid)
  • A more generic logical volume manager (provided with most server-class operating systems such as Veritas or LVM)
  • A component of the file system (such as ZFS, Spectrum Scale or Btrfs)
  • A layer that sits above any file system and provides parity protection to user data (such as RAID-F)

Some advanced file systems are designed to organize data across multiple storage devices directly, without needing the help of a third-party logical volume manager:

  • ZFS supports the equivalents of RAID 0, RAID 1, RAID 5 (RAID-Z1) single-parity, RAID 6 (RAID-Z2) double-parity, and a triple-parity version (RAID-Z3) also referred to as RAID 7. As it always stripes over top-level vdevs, it supports equivalents of the 1+0, 5+0, and 6+0 nested RAID levels (as well as striped triple-parity sets) but not other nested combinations. ZFS is the native file system on Solaris and illumos, and is also available on FreeBSD and Linux. Open-source ZFS implementations are actively developed under the OpenZFS umbrella project.
  • Spectrum Scale, initially developed by IBM for media streaming and scalable analytics, supports declustered RAID protection schemes up to n+3. A particularity is the dynamic rebuilding priority which runs with low impact in the background until a data chunk hits n+0 redundancy, in which case this chunk is quickly rebuilt to at least n+1. On top, Spectrum Scale supports metro-distance RAID 1.
  • Btrfs supports RAID 0, RAID 1 and RAID 10 (RAID 5 and 6 are under development).
  • XFS was originally designed to provide an integrated volume manager that supports concatenating, mirroring and striping of multiple physical storage devices. However, the implementation of XFS in the Linux kernel lacks the integrated volume manager.

Many operating systems provide RAID implementations, including the following:

  • Hewlett-Packard's OpenVMS operating system supports RAID 1. The mirrored disks, called a "shadow set", can be in different locations to assist in disaster recovery.
  • Apple's macOS and macOS Server support RAID 0, RAID 1, and RAID 1+0.
  • FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5, and all nestings via GEOM modules and ccd.
  • Linux's md supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6, and all nestings. Certain reshaping/resizing/expanding operations are also supported.
  • Microsoft Windows supports RAID 0, RAID 1, and RAID 5 using various software implementations. Logical Disk Manager, introduced with Windows 2000, allows for the creation of RAID 0, RAID 1, and RAID 5 volumes by using dynamic disks, but this was limited only to professional and server editions of Windows until the release of Windows 8. Windows XP can be modified to unlock support for RAID 0, 1, and 5. Windows 8 and Windows Server 2012 introduced a RAID-like feature known as Storage Spaces, which also allows users to specify mirroring, parity, or no redundancy on a folder-by-folder basis. These options are similar to RAID 1 and RAID 5, but are implemented at a higher abstraction level.
  • NetBSD supports RAID 0, 1, 4, and 5 via its software implementation, named RAIDframe.
  • OpenBSD supports RAID 0, 1 and 5 via its software implementation, named softraid.

If a boot drive fails, the system has to be sophisticated enough to be able to boot from the remaining drive or drives. For instance, consider a computer whose disk is configured as RAID 1 (mirrored drives); if the first drive in the array fails, then a first-stage boot loader might not be sophisticated enough to attempt loading the second-stage boot loader from the second drive as a fallback. The second-stage boot loader for FreeBSD is capable of loading a kernel from such an array.

Firmware- and driver-based

A SATA 3.0 controller that provides RAID functionality through proprietary firmware and drivers

Software-implemented RAID is not always compatible with the system's boot process, and it is generally impractical for desktop versions of Windows. However, hardware RAID controllers are expensive and proprietary. To fill this gap, inexpensive "RAID controllers" were introduced that do not contain a dedicated RAID controller chip, but simply a standard drive controller chip with proprietary firmware and drivers. During early bootup, the RAID is implemented by the firmware and, once the operating system has been more completely loaded, the drivers take over control. Consequently, such controllers may not work when driver support is not available for the host operating system. An example is Intel Rapid Storage Technology, implemented on many consumer-level motherboards.

Because some minimal hardware support is involved, this implementation is also called "hardware-assisted software RAID", "hybrid model" RAID, or even "fake RAID". If RAID 5 is supported, the hardware may provide a hardware XOR accelerator. An advantage of this model over the pure software RAID is that—if using a redundancy mode—the boot drive is protected from failure (due to the firmware) during the boot process even before the operating system's drivers take over.

Integrity

Data scrubbing (referred to in some environments as patrol read) involves periodic reading and checking by the RAID controller of all the blocks in an array, including those not otherwise accessed. This detects bad blocks before use. Data scrubbing checks for bad blocks on each storage device in an array, but also uses the redundancy of the array to recover bad blocks on a single drive and to reassign the recovered data to spare blocks elsewhere on the drive.
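
A scrub pass over a two-way mirror can be sketched as follows. This is a toy model under stated assumptions: each drive is a Python list of blocks, an unreadable block is represented by None, and a repair simply writes the good copy back in place. It does not model the reassignment to spare blocks that a real drive performs, and the function name is hypothetical.

```python
from __future__ import annotations


def scrub_mirror(drive_a: list[bytes | None], drive_b: list[bytes | None]) -> int:
    """Read every block on both drives, repair unreadable ones, return the repair count."""
    repaired = 0
    for i, (a, b) in enumerate(zip(drive_a, drive_b)):
        if a is None and b is None:
            # No redundancy left for this block: the scrub cannot recover it.
            raise RuntimeError(f"block {i} is unreadable on both drives")
        if a is None:
            drive_a[i] = b  # rewrite from the good copy
            repaired += 1
        elif b is None:
            drive_b[i] = a
            repaired += 1
    return repaired


a = [b"x", None, b"z"]
b = [b"x", b"y", None]
print(scrub_mirror(a, b), a == b)
```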

Frequently, a RAID controller is configured to "drop" a component drive (that is, to assume a component drive has failed) if the drive has been unresponsive for eight seconds or so; this might cause the array controller to drop a good drive because that drive has not been given enough time to complete its internal error recovery procedure. Consequently, using consumer-marketed drives with RAID can be risky, and so-called "enterprise class" drives limit this error recovery time to reduce risk. Western Digital's desktop drives used to have a specific fix. A utility called WDTLER.exe limited a drive's error recovery time. The utility enabled TLER (time limited error recovery), which limits the error recovery time to seven seconds. Around September 2009, Western Digital disabled this feature in their desktop drives (such as the Caviar Black line), making such drives unsuitable for use in RAID configurations. However, Western Digital enterprise class drives are shipped from the factory with TLER enabled. Similar technologies are used by Seagate, Samsung, and Hitachi. For non-RAID usage, an enterprise class drive with a short error recovery timeout that cannot be changed is therefore less suitable than a desktop drive. In late 2010, the Smartmontools program began supporting the configuration of ATA Error Recovery Control, allowing the tool to configure many desktop class hard drives for use in RAID setups.

While RAID may protect against physical drive failure, the data is still exposed to operator, software, hardware, and virus destruction. Many studies cite operator fault as a common source of malfunction, such as a server operator replacing the incorrect drive in a faulty RAID, and disabling the system (even temporarily) in the process.

An array can be overwhelmed by a catastrophic failure that exceeds its recovery capacity, and the entire array is at risk of physical damage by fire, natural disaster, and human forces; backups, however, can be stored off site. An array is also vulnerable to controller failure because it is not always possible to migrate it to a new, different controller without data loss.

Weaknesses

Correlated failures

In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates the assumptions of independent, identical rate of failure amongst drives; failures are in fact statistically correlated. In practice, the chances for a second failure before the first has been recovered (causing data loss) are higher than the chances for random failures. In a study of about 100,000 drives, the probability of two drives in the same cluster failing within one hour was four times larger than predicted by the exponential statistical distribution—which characterizes processes in which events occur continuously and independently at a constant average rate. The probability of two failures in the same 10-hour period was twice as large as predicted by an exponential distribution.
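
For reference, the baseline that the study compares against can be written down directly. Under the independence assumption, each drive fails at a constant rate of 1/MTBF, so the probability that at least one of n remaining drives fails within a window of t hours is 1 - exp(-n·t/MTBF). The sketch below evaluates this formula; the drive count and MTBF are hypothetical illustration values, and the function name is not taken from any library.

```python
import math


def second_failure_probability(remaining_drives: int,
                               window_hours: float,
                               mtbf_hours: float) -> float:
    """P(at least one remaining drive fails within the window), assuming
    independent drives with exponentially distributed lifetimes."""
    combined_rate = remaining_drives / mtbf_hours
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-combined_rate * window_hours)


# Hypothetical array: 7 surviving drives, each with an MTBF of 1,000,000 hours.
for window in (1, 10, 24):
    print(window, second_failure_probability(7, window, 1_000_000))
```

The observations quoted above indicate that real clusters exceed this baseline, so figures from the independent model are better read as a lower bound than as a prediction.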

Unrecoverable read errors during rebuild

Unrecoverable read errors (URE) present as sector read failures, also known as latent sector errors (LSE). The associated media assessment measure, unrecoverable bit error (UBE) rate, is typically guaranteed to be less than one bit in 10^15 for enterprise-class drives (SCSI, FC, SAS or SATA), and less than one bit in 10^14 for desktop-class drives (IDE/ATA/PATA or SATA). Increasing drive capacities and large RAID 5 instances have led to the maximum error rates being insufficient to guarantee a successful recovery, due to the high likelihood of such an error occurring on one or more remaining drives during a RAID set rebuild. When rebuilding, parity-based schemes such as RAID 5 are particularly prone to the effects of UREs as they affect not only the sector where they occur, but also reconstructed blocks using that sector for parity computation.
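
A rough worst-case estimate shows why these rates matter during a rebuild. The sketch below treats the quoted limit as the actual per-bit error probability and assumes that bit errors are independent; both are simplifying assumptions, and the 12 TB figure (for example, three surviving 4 TB drives in a four-drive RAID 5) is hypothetical.

```python
import math


def ure_probability(bytes_read: float, bit_error_rate: float) -> float:
    """P(at least one unrecoverable bit error) when reading bytes_read bytes,
    assuming independent errors at the given per-bit rate."""
    bits = bytes_read * 8
    # Computes 1 - (1 - rate) ** bits in a numerically stable way.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


rebuild_bytes = 12e12  # hypothetical: 12 TB must be read to rebuild the array

desktop = ure_probability(rebuild_bytes, 1e-14)     # one bit in 10^14
enterprise = ure_probability(rebuild_bytes, 1e-15)  # one bit in 10^15
print(f"desktop-class:    {desktop:.2f}")
print(f"enterprise-class: {enterprise:.2f}")
```

With these inputs the estimate is roughly 0.62 for the desktop-class limit and roughly 0.09 for the enterprise-class limit. Since the quoted rates are upper bounds, drives that perform better than their specification would give lower figures.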

Double-protection parity-based schemes, such as RAID 6, attempt to address this issue by providing redundancy that allows double-drive failures; as a downside, such schemes suffer from elevated write penalty—the number of times the storage medium must be accessed during a single write operation. Schemes that duplicate (mirror) data in a drive-to-drive manner, such as RAID 1 and RAID 10, have a lower risk from UREs than those using parity computation or mirroring between striped sets. Data scrubbing, as a background process, can be used to detect and recover from UREs, effectively reducing the risk of them happening during RAID rebuilds and causing double-drive failures. The recovery of UREs involves remapping of affected underlying disk sectors, utilizing the drive's sector remapping pool; in case of UREs detected during background scrubbing, data redundancy provided by a fully operational RAID set allows the missing data to be reconstructed and rewritten to a remapped sector.

Increasing rebuild time and failure probability

Drive capacity has grown at a much faster rate than transfer speed, and error rates have only fallen a little in comparison. Therefore, larger-capacity drives may take hours if not days to rebuild, during which time other drives may fail or as yet undetected read errors may surface. The rebuild time is also limited if the entire array is still in operation at reduced capacity. Given an array with only one redundant drive (which applies to RAID levels 3, 4 and 5, and to "classic" two-drive RAID 1), a second drive failure would cause complete failure of the array. Even though individual drives' mean time between failure (MTBF) has increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, have increased over time.
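
The relationship between capacity, transfer speed and rebuild time is simple arithmetic, sketched below. The drive size, the sustained throughput, and the share of bandwidth left over for the rebuild while the array keeps serving requests are all hypothetical illustration values.

```python
def rebuild_hours(capacity_tb: float,
                  throughput_mb_s: float,
                  rebuild_share: float = 1.0) -> float:
    """Hours needed to rewrite one drive of the given capacity.

    rebuild_share is the fraction of the drive's throughput that is
    available to the rebuild (1.0 means the array is otherwise idle).
    """
    capacity_bytes = capacity_tb * 1e12
    effective_bytes_per_s = throughput_mb_s * 1e6 * rebuild_share
    return capacity_bytes / effective_bytes_per_s / 3600


# Hypothetical 16 TB drive sustaining 200 MB/s.
print(round(rebuild_hours(16, 200), 1))                     # idle array
print(round(rebuild_hours(16, 200, rebuild_share=0.5), 1))  # array still in use
```

In this example the rebuild takes about 22 hours on an idle array and about 44 hours when half of the bandwidth is reserved for normal operation.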

Some commentators have declared that RAID 6 is only a "band aid" in this respect, because it only kicks the problem a little further down the road. However, according to the 2006 NetApp study of Berriman et al., the chance of failure decreases by a factor of about 3,800 (relative to RAID 5) for a proper implementation of RAID 6, even when using commodity drives. Nevertheless, if the currently observed technology trends remain unchanged, in 2019 a RAID 6 array will have the same chance of failure as its RAID 5 counterpart had in 2010.

Mirroring schemes such as RAID 10 have a bounded recovery time as they require the copy of a single failed drive, compared with parity schemes such as RAID 6, which require the copy of all blocks of the drives in an array set. Triple parity schemes, or triple mirroring, have been suggested as one approach to improve resilience to an additional drive failure during this large rebuild time.

Atomicity

A system crash or other interruption of a write operation can result in states where the parity is inconsistent with the data due to non-atomicity of the write process, such that the parity cannot be used for recovery in the case of a disk failure. This is commonly termed the write hole, which is a known data corruption issue in older and low-end RAIDs, caused by interrupted destaging of writes to disk. The write hole can be addressed in a few ways:

  • Write-ahead logging.
    • Hardware RAID systems use an onboard nonvolatile cache for this purpose.
    • mdadm can use a dedicated journaling device (to avoid performance penalty, typically, SSDs and NVMs are preferred) for this purpose.
  • Write intent logging. mdadm uses a "write-intent-bitmap". If it finds any location marked as incompletely written at startup, it resyncs them. It closes the write hole but does not protect against loss of in-transit data, unlike a full WAL.
  • Partial parity. mdadm can save a "partial parity" that, when combined with modified chunks, recovers the original parity. This closes the write hole, but again does not protect against loss of in-transit data.
  • Dynamic stripe size. RAID-Z ensures that each block is its own stripe, so every block is complete. COW transactional semantics guard metadata associated with stripes. The downside is IO fragmentation.
  • Avoiding overwriting used stripes. bcachefs, which uses a copying garbage collector, chooses this option. COW again protects references to striped data.
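
The write hole itself can be reproduced in a few lines. The following toy model, with hypothetical names and a stripe of two data blocks plus XOR parity, updates one data block, optionally crashes before the parity is rewritten, optionally resynchronizes the parity at startup (in the spirit of the write-intent approach listed above), and then loses the drive holding the block that was never modified.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def simulate(crash_before_parity: bool, resync_after_crash: bool) -> bytes:
    """Return the content rebuilt for block d1 after its drive is lost."""
    d0, d1 = b"old0", b"keep"
    parity = xor(d0, d1)

    # Update d0. The data write and the parity write are separate steps.
    d0 = b"new0"
    if not crash_before_parity:
        parity = xor(d0, d1)

    # Startup after the crash: recompute parity for the stripe marked dirty.
    if crash_before_parity and resync_after_crash:
        parity = xor(d0, d1)

    # The drive holding d1 now fails; rebuild d1 from d0 and the parity.
    return xor(d0, parity)


print(simulate(crash_before_parity=False, resync_after_crash=False))
print(simulate(crash_before_parity=True, resync_after_crash=False))
print(simulate(crash_before_parity=True, resync_after_crash=True))
```

Only the second call returns something other than the original content of d1: the stale parity silently corrupts a block that was never written to, which is what makes the failure mode hard to notice.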

The write hole is a little-understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher Jim Gray wrote "Update in Place is a Poison Apple" during the early days of relational database commercialization.

Write-cache reliability

There are concerns about write-cache reliability, specifically regarding devices equipped with a write-back cache, which is a caching system that reports the data as written as soon as it is written to cache, as opposed to when it is written to the non-volatile medium. If the system experiences a power loss or other major failure, the data may be irrevocably lost from the cache before reaching the non-volatile storage. For this reason good write-back cache implementations include mechanisms, such as redundant battery power, to preserve cache contents across system failures (including power failures) and to flush the cache at system restart time.
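
The risk can be illustrated with a toy model of a write-back cache. The class below is a hypothetical sketch and does not correspond to any real controller interface: a write is acknowledged as soon as it is in the cache, and a power loss discards the cache unless it is battery-backed, in which case the contents are flushed at restart.

```python
class WriteBackCache:
    """Toy write-back cache in front of a non-volatile 'disk' dictionary."""

    def __init__(self, battery_backed: bool) -> None:
        self.battery_backed = battery_backed
        self.cache: dict[int, bytes] = {}
        self.disk: dict[int, bytes] = {}

    def write(self, block: int, data: bytes) -> str:
        # Acknowledged immediately, before the data reaches the disk.
        self.cache[block] = data
        return "ack"

    def flush(self) -> None:
        """Destage all cached writes to the non-volatile medium."""
        self.disk.update(self.cache)
        self.cache.clear()

    def power_loss_and_restart(self) -> None:
        if not self.battery_backed:
            self.cache.clear()  # volatile cache contents are gone
        self.flush()            # a preserved cache is flushed at restart


for backed in (False, True):
    c = WriteBackCache(battery_backed=backed)
    c.write(0, b"data")
    c.power_loss_and_restart()
    print(backed, c.disk.get(0))
```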

Solvent effects

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Sol...