In computing, a context switch is the process of storing the state of a process or thread, so that it can be restored and resume execution at a later point, and then restoring a different, previously saved, state. This allows multiple processes to share a single central processing unit (CPU) and is an essential feature of a multiprogramming or multitasking operating system.
In a traditional CPU, each process – a program in execution – uses
the various CPU registers to store data and hold the current state of
the running process. However, in a multitasking operating system, the operating system switches between processes or threads to give the appearance of multiple processes executing simultaneously. For every switch, the operating system must save the state of the currently running process and then load the saved state of the next process to run on the CPU. This sequence of operations, saving the state of the running process and loading the state of the next, is called a context switch.
The precise meaning of the phrase "context switch" varies. In a
multitasking context, it refers to the process of storing the system
state for one task, so that task can be paused and another task resumed.
A context switch can also occur as the result of an interrupt, such as when a task needs to access disk storage, freeing up CPU time for other tasks. Some operating systems also require a context switch to move between user mode and kernel mode tasks. The process of context switching can have a negative impact on system performance.
Cost
Context switches are usually computationally intensive, and much of the design of operating systems is devoted to optimizing the use of context switches. Switching from one process to another requires a certain amount of time for administration: saving and loading registers and memory maps, updating various tables and lists, and so on. What is actually involved in a context switch depends on the architecture, the operating system, and the number of resources shared (threads that belong to the same process share many resources, whereas unrelated non-cooperating processes share few).
For example, in the Linux kernel, context switching involves loading the corresponding process control block
(PCB) stored in the PCB table in the kernel stack to retrieve
information about the state of the new process. CPU state information
including the registers, stack pointer, and program counter as well as memory management information like segmentation tables and page tables
(unless the old process shares the memory with the new) are loaded from
the PCB for the new process. To avoid incorrect address translation in
the case of the previous and current processes using different memory,
the translation lookaside buffer
(TLB) must be flushed. This hurts performance because, immediately after most context switches, every memory reference misses in the TLB, since it is empty.
Furthermore, analogous context switching happens between user threads, notably green threads, and is often very lightweight, saving and restoring minimal context. In extreme cases, such as switching between goroutines in Go, a context switch is equivalent to a coroutine yield, which is only marginally more expensive than a subroutine call.
Switching cases
There are three potential triggers for a context switch:
Multitasking
Most commonly, within some scheduling
scheme, one process must be switched out of the CPU so another process
can run. This context switch can be triggered by the process making
itself unrunnable, such as by waiting for an I/O or synchronization operation to complete. On a pre-emptive multitasking
system, the scheduler may also switch out processes that are still
runnable. To prevent other processes from being starved of CPU time,
pre-emptive schedulers often configure a timer interrupt to fire when a
process exceeds its time slice. This interrupt ensures that the scheduler will gain control to perform a context switch.
Interrupt handling
Modern architectures are interrupt driven. This means that if the CPU requests data from a disk, for example, it does not need to busy-wait until the read is over; it can issue the request (to the I/O device) and continue with some other task. When the read is over, the CPU can be interrupted (by hardware in this case, which sends an interrupt request to the programmable interrupt controller, or PIC) and presented with the data. For interrupts, a program called an interrupt handler is installed, and it is the interrupt handler that handles the interrupt from the disk.
When an interrupt occurs, the hardware automatically switches a
part of the context (at least enough to allow the handler to return to
the interrupted code). The handler may save additional context,
depending on details of the particular hardware and software designs.
Often only a minimal part of the context is changed in order to minimize
the amount of time spent handling the interrupt. The kernel
does not spawn or schedule a special process to handle interrupts, but
instead the handler executes in the (often partial) context established
at the beginning of interrupt handling. Once interrupt servicing is
complete, the context in effect before the interrupt occurred is
restored so that the interrupted process can resume execution in its
proper state.
User and kernel mode switching
When the system transitions between user mode and kernel mode, a context switch is not necessary; a mode transition
is not by itself a context switch. However, depending on the operating
system, a context switch may also take place at this time.
Steps
The state of the currently executing process must be saved so it can be restored when rescheduled for execution.
The process state includes all the registers that the process may be using, especially the program counter, plus any other operating system specific data that may be necessary. This is usually stored in a data structure called a process control block (PCB) or switchframe.
The PCB might be stored on a per-process stack in kernel memory (as opposed to the user-mode call stack), or there may be some specific operating system-defined data structure for this information. A handle to the PCB is added to a queue of processes that are ready to run, often called the ready queue.
Since the operating system has effectively suspended the
execution of one process, it can then switch context by choosing a
process from the ready queue and restoring its PCB. In doing so, the
program counter from the PCB is loaded, and thus execution can continue
in the chosen process. Process and thread priority can influence which
process is chosen from the ready queue (i.e., it may be a priority queue).
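A minimal sketch of this bookkeeping in Python, using a heap so the ready queue acts as a priority queue. The PCB fields and function names here are simplified illustrations, not any real kernel's structures:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PCB:
    # Lower number = higher priority; heapq always pops the smallest first.
    priority: int
    pid: int = field(compare=False)
    program_counter: int = field(compare=False, default=0)
    registers: dict = field(compare=False, default_factory=dict)

ready_queue = []

def make_ready(pcb):
    # Add a handle to the suspended process's PCB to the ready queue.
    heapq.heappush(ready_queue, pcb)

def dispatch():
    # Choose the highest-priority ready process and "restore" its PCB.
    return heapq.heappop(ready_queue)

make_ready(PCB(priority=5, pid=1, program_counter=100))
make_ready(PCB(priority=1, pid=2, program_counter=200))
make_ready(PCB(priority=3, pid=3, program_counter=300))

chosen = dispatch()
print(chosen.pid)  # pid 2: it has the highest priority (lowest number)
```

Execution would then continue from `chosen.program_counter`, exactly as described above.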
Examples
The details vary depending on the architecture and operating system, but these are common scenarios.
No context switch needed
Consider a general arithmetic addition operation A = B + 1. The instruction is stored in the instruction register, and the program counter is incremented. A and B are read from memory and stored in registers R1 and R2 respectively. In this case, B + 1 is calculated and written to R1 as the final answer. This operation involves only sequential reads and writes, with no waits for function calls or I/O, so no context switch or wait takes place in this case.
Context switch caused by interrupt
Suppose a process A is running and a timer interrupt occurs. The user
registers — program counter, stack pointer, and status register — of
process A are then implicitly saved by the CPU onto the kernel stack of
A. Then, the hardware switches to kernel mode and jumps into the interrupt handler for the operating system to take over. The operating system then
calls the switch() routine to first save the
general-purpose user registers of A onto A's kernel stack, then it saves
A's current kernel register values into the PCB of A, restores kernel
registers from the PCB of process B, and switches context, that is,
changes kernel stack pointer to point to the kernel stack of process B.
The operating system then returns from interrupt. The hardware then
loads user registers from B's kernel stack, switches to user mode, and
starts running process B from B's program counter.
Performance
Context switching itself has a cost in performance, due to running the task scheduler, TLB flushes, and indirectly due to sharing the CPU cache between multiple tasks. Switching between threads of a single process can be faster than between two separate processes because threads share the same virtual memory maps, so a TLB flush is not necessary.
The time to switch between two separate processes is called the process switching latency. The time to switch between two threads of the same process is called the thread switching latency. The time from when a hardware interrupt is generated to when the interrupt is serviced is called the interrupt latency.
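Thread switching latency can be estimated with a simple ping-pong experiment. The sketch below bounces control between two Python threads and divides the elapsed time by the number of switches; because it goes through the interpreter and its Event objects, the printed figure is only a rough upper bound, not a precise measurement:

```python
import threading
import time

N = 10_000                     # number of round trips
a_turn = threading.Event()
b_turn = threading.Event()

def player_b():
    for _ in range(N):
        b_turn.wait()          # sleep until the main thread hands over
        b_turn.clear()
        a_turn.set()           # hand control back, forcing a thread switch

t = threading.Thread(target=player_b)
t.start()

start = time.perf_counter()
for _ in range(N):
    b_turn.set()               # wake the other thread
    a_turn.wait()              # block until it answers
    a_turn.clear()
elapsed = time.perf_counter() - start
t.join()

# Each round trip contains two thread switches.
latency = elapsed / (2 * N)
print(f"approximate thread switch latency: {latency * 1e6:.1f} microseconds")
```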
Switching between two processes in a single address space operating system can be faster than switching between two processes in an operating system with private per-process address spaces.
Hardware vs. software
Context switching can be performed primarily by software or hardware. Some processors, like the Intel 80386 and its successors, have hardware support for context switches, by making use of a special data segment designated the task state segment (TSS). A task switch can be explicitly triggered with a CALL or JMP instruction targeted at a TSS descriptor in the global descriptor table. It can occur implicitly when an interrupt or exception is triggered if there is a task gate in the interrupt descriptor table (IDT). When a task switch occurs, the CPU can automatically load the new state from the TSS.
As with other tasks performed in hardware, one would expect this
to be rather fast; however, mainstream operating systems, including Windows and Linux, do not use this feature. This is mainly due to two reasons:
Hardware context switching does not save all the registers (only general-purpose registers, not floating-point registers — although the TS bit is automatically turned on in the CR0 control register, resulting in a fault when executing floating-point instructions and giving the OS the opportunity to save and restore the floating-point state as needed).
Associated performance issues, e.g., software context switching can
be selective and store only those registers that need storing, whereas
hardware context switching stores nearly all registers whether they are
required or not.
Modern desktop operating systems are capable of handling large numbers of different processes at the same time. For example, Fedora Linux can simultaneously run the KDE Plasma 6 desktop environment, Firefox, KCalc, the built-in calendar, GNU nano, GIMP, and the VLC media player. Microsoft Windows 1.01, released in 1985, could multitask the MS-DOS Executive and Calculator programs.
In computing, multitasking is the concurrent execution of multiple tasks (also known as processes) over a certain period of time. New tasks can interrupt already started ones before they finish,
instead of waiting for them to end. As a result, a computer executes
segments of multiple tasks in an interleaved manner, while the tasks
share common processing resources such as central processing units (CPUs) and main memory.
Multitasking automatically interrupts the running program, saving its
state (partial results, memory contents and computer register contents)
and loading the saved state of another program and transferring control
to it. This "context switch" may be initiated at fixed time intervals (pre-emptive multitasking), or the running program may be coded to signal to the supervisory software when it can be interrupted (cooperative multitasking).
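Cooperative multitasking can be illustrated with Python generators, where each yield plays the role of the running program signalling the supervisory software that it can be interrupted. This is a minimal sketch; the names run and task are invented for illustration:

```python
from collections import deque

trace = []

def task(name, steps):
    # Each yield is the task voluntarily ceding the processor (cooperative).
    for i in range(steps):
        trace.append(f"{name}{i}")
        yield

def run(tasks):
    # A round-robin supervisor: resume each task until its next yield.
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)            # run the task to its next yield point
            ready.append(t)    # still runnable: rejoin the ready queue
        except StopIteration:
            pass               # task finished; drop it

run([task("A", 3), task("B", 2)])
print(trace)  # interleaved execution: ['A0', 'B0', 'A1', 'B1', 'A2']
```

A task that never yields would monopolize this scheduler, which is precisely the weakness of cooperative multitasking discussed below.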
Multitasking does not require parallel execution of multiple tasks at exactly the same time; instead, it allows more than one task to advance over a given period of time. Even on multiprocessor computers, multitasking allows many more tasks to be run than there are CPUs.
Multitasking is a common feature of computer operating systems
since at least the 1960s. It allows more efficient use of the computer
hardware; when a program is waiting for some external event such as a
user input or an input/output transfer with a peripheral to complete, the central processor can still be used with another program. In a time-sharing
system, multiple human operators use the same processor as if it were dedicated to their use, while behind the scenes the computer is serving many users by multitasking their individual programs. In multiprogramming systems, a task runs until it must wait for an external event or until the operating system's scheduler forcibly swaps the running task out of the CPU. Real-time systems, such as those designed to control industrial robots, require timely processing; a single processor might be shared between calculations of machine movement, communications, and user interface.
Often multitasking operating systems include measures to change
the priority of individual tasks, so that important jobs receive more
processor time than those considered less significant. Depending on the
operating system, a task might be as large as an entire application
program, or might be made up of smaller threads that carry out portions of the overall program.
A processor intended for use with multitasking operating systems
may include special hardware to securely support multiple tasks, such as
memory protection, and protection rings that ensure the supervisory software cannot be damaged or subverted by user-mode program errors.
The term "multitasking" has become an international term, as the
same word is used in many other languages such as German, Italian,
Dutch, Romanian, Czech, Danish and Norwegian.
Multiprogramming
In the early days of computing, CPU time was expensive, and peripherals
were very slow. When the computer ran a program that needed access to a
peripheral, the central processing unit (CPU) would have to stop
executing program instructions while the peripheral processed the data.
This was usually very inefficient. Multiprogramming is a computing
technique that enables multiple programs to be concurrently loaded and
executed into a computer's memory, allowing the CPU to switch between
them swiftly. This optimizes CPU utilization by keeping it engaged with
the execution of tasks, particularly useful when one program is waiting
for I/O operations to complete.
The Bull Gamma 60,
initially designed in 1957 and first released in 1960, was the first
computer designed with multiprogramming in mind. Its architecture
featured a central memory and a Program Distributor feeding up to
twenty-five autonomous processing units with code and data, and allowing
concurrent operation of multiple clusters.
Another such computer was the LEO III, first released in 1961. During batch processing,
several different programs were loaded in the computer memory, and the
first one began to run. When the first program reached an instruction
waiting for a peripheral, the context of this program was stored away,
and the second program in memory was given a chance to run. The process
continued until all programs finished running.
Multiprogramming gives no guarantee that a program will run in a
timely manner. Indeed, the first program may very well run for hours
without needing access to a peripheral. As there were no users waiting
at an interactive terminal, this was no problem: users handed in a deck
of punched cards to an operator, and came back a few hours later for
printed results. Multiprogramming greatly reduced wait times when
multiple batches were being processed.
Cooperative multitasking
Early multitasking systems used applications that voluntarily ceded
time to one another. This approach, which was eventually supported by
many computer operating systems,
is known today as cooperative multitasking. Although it is now rarely
used in larger systems except for specific applications such as CICS or the JES2 subsystem, cooperative multitasking was once the only scheduling scheme employed by Microsoft Windows and classic Mac OS to enable multiple applications to run simultaneously. Cooperative multitasking is still used today on RISC OS systems.
As a cooperatively multitasked system relies on each process
regularly giving up time to other processes on the system, one poorly
designed program can consume all of the CPU time for itself, either by
performing extensive calculations or by busy waiting; both would cause the whole system to hang. In a server environment, this is a hazard that makes the entire environment unacceptably fragile.
Preemptive multitasking
Kubuntu (KDE Plasma 5) with four virtual desktops running multiple programs at the same time
Preemptive multitasking allows the computer system to more reliably
guarantee to each process a regular "slice" of operating time. It also
allows the system to deal rapidly with important external events like
incoming data, which might require the immediate attention of one or
another process. Operating systems were developed to take advantage of hardware capabilities such as timer interrupts and run multiple processes preemptively.
Preemptive multitasking was implemented in the PDP-6 Monitor and Multics in 1964, in OS/360 MFT in 1967, and in Unix in 1969, and was available in some operating systems for computers as small as DEC's PDP-8; it is a core feature of all Unix-like operating systems, such as Linux, Solaris and BSD with its derivatives, as well as modern versions of Windows.
Possibly the earliest preemptive multitasking OS available to home users was Microware's OS-9, available for computers based on the Motorola 6809 such as the TRS-80 Color Computer 2, with the operating system supplied by Tandy as an upgrade for disk-equipped systems. Sinclair QDOS on the Sinclair QL followed in 1984, but it was not a big success. Commodore's Amiga
was released the following year, offering a combination of multitasking
and multimedia capabilities. Microsoft made preemptive multitasking a
core feature of their flagship operating system in the early 1990s when
developing Windows NT 3.1 and then Windows 95. In 1988 Apple offered A/UX as a UNIX System V-based alternative to the Classic Mac OS. In 2001 Apple switched to the NeXTSTEP-influenced Mac OS X.
A similar model is used in Windows 9x and the Windows NT family, where native 32-bit applications are multitasked preemptively. 64-bit editions of Windows, both for the x86-64 and Itanium
architectures, no longer support legacy 16-bit applications, and thus
provide preemptive multitasking for all supported applications.
Real time
Another reason for multitasking was in the design of real-time computing systems, where a number of possibly unrelated external activities need to be controlled by a single processor system. In such systems a hierarchical interrupt system is coupled with process prioritization to ensure that key activities are given a greater share of available process time.
Multithreading
Threads
were born from the idea that the most efficient way for cooperating
processes to exchange data would be to share their entire memory space.
Thus, threads are effectively processes that run in the same memory
context and share other resources with their parent processes, such as open files. Threads are described as lightweight processes because switching between threads does not involve changing the memory context.
While threads are scheduled preemptively, some operating systems provide a variant to threads, named fibers,
that are scheduled cooperatively. On operating systems that do not
provide fibers, an application may implement its own fibers using
repeated calls to worker functions. Fibers are even more lightweight
than threads, and somewhat easier to program with, although they tend to
lose some or all of the benefits of threads on machines with multiple processors.
Memory protection
Essential to any multitasking system is the safe and effective sharing of access to system resources. Access to memory must be strictly managed to ensure that no process can inadvertently or deliberately read or write to memory locations outside the process's address space. This is done for the purpose of general system stability and data integrity, as well as data security.
In general, memory access management is a responsibility of the
operating system kernel, in combination with hardware mechanisms that
provide supporting functionalities, such as a memory management unit
(MMU). If a process attempts to access a memory location outside its
memory space, the MMU denies the request and signals the kernel to take
appropriate actions; this usually results in forcibly terminating the
offending process. Depending on the software and kernel design and the
specific error in question, the user may receive an access violation
error message such as "segmentation fault".
In a well designed and correctly implemented multitasking system,
a given process can never directly access memory that belongs to
another process. An exception to this rule is in the case of shared
memory; for example, in the System V
inter-process communication mechanism the kernel allocates memory to be
mutually shared by multiple processes. Such features are often used by
database management software such as PostgreSQL.
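The idea of kernel-allocated memory mutually shared by processes can be sketched with Python's multiprocessing.shared_memory module, a POSIX-style analogue of the System V mechanism. For brevity this sketch attaches to the segment twice from one process rather than from a genuinely separate process; the mechanism is the same, since any process that knows the segment's name may attach:

```python
from multiprocessing import shared_memory

# One side creates the named segment; the kernel backs it with shared pages.
creator = shared_memory.SharedMemory(create=True, size=16)
try:
    # A second attachment by name, as another process would do.
    attached = shared_memory.SharedMemory(name=creator.name)
    creator.buf[:5] = b"hello"        # write through one mapping
    data = bytes(attached.buf[:5])    # read through the other mapping
    print(data)
    attached.close()
finally:
    creator.close()
    creator.unlink()   # remove the segment, like shmctl(IPC_RMID)
```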
Inadequate memory protection mechanisms, either due to flaws in
their design or poor implementations, allow for security vulnerabilities
that may be potentially exploited by malicious software.
Memory swapping
Use of a swap file
or swap partition is a way for the operating system to provide more
memory than is physically available by keeping portions of the primary
memory in secondary storage.
While multitasking and memory swapping are two completely unrelated
techniques, they are very often used together, as swapping memory allows
more tasks to be loaded at the same time. Typically, a multitasking
system allows another process to run when the running process hits a
point where it has to wait for some portion of memory to be reloaded
from secondary storage.
Programming
Over the years, multitasking systems have been refined. Modern
operating systems generally include detailed mechanisms for prioritizing
processes, while symmetric multiprocessing has introduced new complexities and capabilities.
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then
be solved at the same time. There are several different forms of
parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
Parallelism vs concurrency
In computer science, parallelism and concurrency are two different things: a parallel program uses multiple CPU cores,
each core performing a task independently. On the other hand,
concurrency enables a program to deal with multiple tasks even on a
single CPU core; the core switches between tasks (i.e. threads)
without necessarily completing each one. A program can have both,
neither or a combination of parallelism and concurrency characteristics.
Parallel computers can be roughly classified according to the
level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids
use multiple computers to work on the same task. Specialized parallel
computer architectures are sometimes used alongside traditional
processors, for accelerating specific tasks.
In some cases parallelism is transparent to the programmer, such
as in bit-level or instruction-level parallelism, but explicitly parallel algorithms, particularly those that use concurrency, are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchronization between the different subtasks are typically some of the greatest obstacles to getting optimal parallel program performance.
A theoretical upper bound on the speed-up of a single program as a result of parallelization is given by Amdahl's law, which states that it is limited by the fraction of time for which the parallelization can be utilised.
Background
Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit on one computer. Only one instruction may execute at a time—after that instruction is finished, the next one is executed.
Parallel computing, on the other hand, uses multiple processing
elements simultaneously to solve a problem. This is accomplished by
breaking the problem into independent parts so that each processing
element can execute its part of the algorithm simultaneously with the
others. The processing elements can be diverse and include resources
such as a single computer with multiple processors, several networked
computers, specialized hardware, or any combination of the above. Historically parallel computing was used for scientific computing and
the simulation of scientific problems, particularly in the natural and engineering sciences, such as meteorology. This led to the design of parallel hardware and software, and high performance computing.
Frequency scaling was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime
of a program is equal to the number of instructions multiplied by the
average time per instruction. Maintaining everything else constant,
increasing the clock frequency decreases the average time it takes to
execute an instruction. An increase in frequency thus decreases runtime
for all compute-bound programs. However, power consumption P by a chip is given by the equation P = C × V² × F, where C is the capacitance being switched per clock cycle (proportional to the number of transistors whose inputs change), V is voltage, and F is the processor frequency (cycles per second). Increases in frequency increase the amount of power used in a processor. Increasing processor power consumption led ultimately to Intel's May 8, 2004 cancellation of its Tejas and Jayhawk processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm.
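The trade-off in the power equation can be checked with a few lines of Python; the capacitance, voltage, and frequency values below are purely illustrative, not taken from any real chip:

```python
def dynamic_power(c, v, f):
    # P = C * V^2 * F: switched capacitance, supply voltage squared, frequency.
    return c * v ** 2 * f

base = dynamic_power(1e-9, 1.2, 2e9)       # an arbitrary baseline design
faster = dynamic_power(1e-9, 1.2, 4e9)     # double the frequency alone
lower_v = dynamic_power(1e-9, 0.9, 2e9)    # lower the voltage at the same frequency

print(faster / base)    # 2.0: power scales linearly with frequency
print(lower_v / base)   # ~0.5625: quadratic payoff from reducing voltage
```

The quadratic dependence on V is why multi-core designs, which can run more, slower, lower-voltage cores, became the preferred way to spend a transistor budget.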
To deal with the problem of power consumption and overheating, the major central processing unit (CPU or processor) manufacturers started to produce power-efficient
processors with multiple cores. The core is the computing unit of the
processor and in multi-core processors each core is independent and can
access the same memory concurrently. Multi-core processors have brought parallel computing to desktop computers.
Thus parallelization of serial programs has become a mainstream
programming task. In 2012 quad-core processors became standard for desktop computers, while servers had 10+ core processors. Moore's law predicted that the number of cores per processor would double every 18–24 months. By 2023 some processors had over a hundred cores. Some designs have a mix of performance and efficiency cores (such as ARM's big.LITTLE design) due to thermal and design constraints.
An operating system
can ensure that different tasks and user programs are run in parallel
on the available cores. However, for a serial software program to take
full advantage of the multi-core architecture the programmer needs to
restructure and parallelize the code. A speed-up of application software runtime will no longer be achieved through frequency scaling; instead, programmers will need to parallelize their software code to take advantage of the increasing computing power of multicore architectures.
Relevant laws
A graphical representation of Amdahl's law demonstrates the theoretical maximum speedup of an overall system and the concept of diminishing returns. If exactly 50% of the work can be parallelized, the best possible speedup is 2 times. If 95% of the work can be parallelized, the best possible speedup is 20 times. According to the law, even with an infinite number of processors, the speedup is constrained by the unparallelizable portion.
Assume that a task has two independent parts, A and B. Part B takes roughly 25% of the time of the whole computation. By working very hard, one may be able to make this part 5 times faster, but this only reduces the time for the whole computation by a little. In contrast, one may need to perform less work to make part A twice as fast. This will make the computation much faster than by optimizing part B, even though part B's speedup is greater by ratio (5 times versus 2 times).
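These figures follow directly from Amdahl's formula, speedup = 1 / ((1 - p) + p/s), where p is the fraction of the work that is improved and s is the factor by which that fraction is sped up. A short Python check of both the 50%/95% figures and the A/B example:

```python
def amdahl_speedup(p, s):
    # p: improvable fraction of the runtime; s: speedup applied to it.
    return 1 / ((1 - p) + p / s)

# 50% parallelizable -> at most 2x; 95% -> at most ~20x, even with
# infinitely many processors (s -> infinity).
print(amdahl_speedup(0.50, float("inf")))  # 2.0
print(amdahl_speedup(0.95, float("inf")))  # ~20.0

# The A/B example: part B is 25% of the runtime, part A the other 75%.
speedup_b = amdahl_speedup(0.25, 5)   # B made 5x faster -> only 1.25x overall
speedup_a = amdahl_speedup(0.75, 2)   # A made 2x faster -> 1.6x overall
print(speedup_a > speedup_b)          # True: optimizing A pays off more
```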
Optimally, the speedup
from parallelization would be linear—doubling the number of processing
elements should halve the runtime, and doubling it a second time should
again halve the runtime. However, very few parallel algorithms achieve
optimal speedup. Most of them have a near-linear speedup for small
numbers of processing elements, which flattens out into a constant value
for large numbers of processing elements.
The maximum potential speedup of an overall system can be calculated by Amdahl's law. Amdahl's Law indicates that optimal performance improvement is achieved
by balancing enhancements to both parallelizable and non-parallelizable
components of a task. Furthermore, it reveals that increasing the
number of processors yields diminishing returns, with negligible speedup
gains beyond a certain point.
Amdahl's Law has limitations, including its assumption of a fixed workload, its neglect of inter-process communication and synchronization overheads, its primary focus on the computational aspect, and its disregard for extrinsic factors such as data persistence, I/O operations, and memory access overheads.
Understanding data dependencies is fundamental in implementing parallel algorithms. No program can run more quickly than the longest chain of dependent calculations (known as the critical path),
since calculations that depend upon prior calculations in the chain
must be executed in order. However, most algorithms do not consist of
just a long chain of dependent calculations; there are usually
opportunities to execute independent calculations in parallel.
Let Pi and Pj be two program segments. Bernstein's conditions describe when the two are independent and can be executed in parallel. For Pi, let Ii be all of the input variables and Oi the output variables, and likewise for Pj. Pi and Pj are independent if they satisfy:
Ij ∩ Oi = ∅,
Ii ∩ Oj = ∅,
Oi ∩ Oj = ∅.
Violation of the first condition introduces a flow dependency,
corresponding to the first segment producing a result used by the second
segment. The second condition represents an anti-dependency, when the
second segment produces a variable needed by the first segment. The
third and final condition represents an output dependency: when two
segments write to the same location, the result comes from the logically
last executed segment.
Consider the following functions, which demonstrate several kinds of dependencies:
1: function Dep(a, b)
2: c := a * b
3: d := 3 * c
4: end function
In this example, instruction 3 cannot be executed before (or even in
parallel with) instruction 2, because instruction 3 uses a result from
instruction 2. It violates condition 1, and thus introduces a flow
dependency.
1: function NoDep(a, b)
2: c := a * b
3: d := 3 * b
4: e := a + b
5: end function
In this example, there are no dependencies between the instructions, so they can all be run in parallel.
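Because Bernstein's conditions are simple set intersections, they can be checked mechanically. The sketch below, whose function name is invented for illustration, applies them to instructions 2 and 3 of the two functions above:

```python
def bernstein_independent(in1, out1, in2, out2):
    # Independent iff there is no flow, anti-, or output dependency.
    no_flow   = not (out1 & in2)   # segment 1 writes nothing segment 2 reads
    no_anti   = not (in1 & out2)   # segment 2 writes nothing segment 1 reads
    no_output = not (out1 & out2)  # the two segments write disjoint locations
    return no_flow and no_anti and no_output

# Dep: instruction 2 (c := a * b) and instruction 3 (d := 3 * c).
dep = bernstein_independent({"a", "b"}, {"c"}, {"c"}, {"d"})
print(dep)     # False: flow dependency through c

# NoDep: instruction 2 (c := a * b) and instruction 3 (d := 3 * b).
nodep = bernstein_independent({"a", "b"}, {"c"}, {"b"}, {"d"})
print(nodep)   # True: the instructions may run in parallel
```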
Bernstein's conditions do not allow memory to be shared between
different processes. For that, some means of enforcing an ordering
between accesses is necessary, such as semaphores, barriers or some other synchronization method.
Race conditions, mutual exclusion, synchronization, and parallel slowdown
Subtasks in a parallel program are often called threads. Some parallel computer architectures use smaller, lightweight versions of threads known as fibers, while others use bigger versions known as processes. However, "threads" is generally accepted as a generic term for subtasks. Threads will often need synchronized access to an object or other resource, for example when they must update a variable
that is shared between them. Without synchronization, the instructions
between the two threads may be interleaved in any order. For example,
consider the following program:
Thread A                        Thread B
1A: Read variable V             1B: Read variable V
2A: Add 1 to variable V         2B: Add 1 to variable V
3A: Write back to variable V    3B: Write back to variable V
If instruction 1B is executed between 1A and 3A, or if instruction 1A
is executed between 1B and 3B, the program will produce incorrect data.
This is known as a race condition. The programmer must use a lock to provide mutual exclusion.
A lock is a programming language construct that allows one thread to
take control of a variable and prevent other threads from reading or
writing it, until that variable is unlocked. The thread holding the lock
is free to execute its critical section
(the section of a program that requires exclusive access to some
variable), and to unlock the data when it is finished. Therefore, to
guarantee correct program execution, the above program can be rewritten
to use locks:
Thread A                        Thread B
1A: Lock variable V             1B: Lock variable V
2A: Read variable V             2B: Read variable V
3A: Add 1 to variable V         3B: Add 1 to variable V
4A: Write back to variable V    4B: Write back to variable V
5A: Unlock variable V           5B: Unlock variable V
One thread will successfully lock variable V, while the other thread will be locked out—unable
to proceed until V is unlocked again. This guarantees correct execution
of the program. Locks may be necessary to ensure correct program
execution when threads must serialize access to resources, but their use
can greatly slow a program and may affect its reliability.
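The locked increment above can be sketched with Python's threading module. The thread count and iteration count here are arbitrary choices for illustration:

```python
import threading

V = 0
lock = threading.Lock()

def increment(times):
    global V
    for _ in range(times):
        with lock:   # lock, read, add, write back, unlock (steps 1A-5A)
            V += 1   # no other thread can interleave within this block

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(V)  # 20000; without the lock, interleaved updates could be lost
```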
Locking multiple variables using non-atomic locks introduces the possibility of program deadlock. An atomic lock
locks multiple variables all at once. If it cannot lock all of them, it
does not lock any of them. If two threads each need to lock the same
two variables using non-atomic locks, it is possible that one thread
will lock one of them and the second thread will lock the second
variable. In such a case, neither thread can complete, and deadlock
results.
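One common way to avoid such deadlock without atomic multi-variable locks is to acquire locks in a fixed global order, so no two threads can ever each hold one lock while waiting for the other. A minimal sketch (the variable names and ordering rule are illustrative):

```python
import threading

locks = {"x": threading.Lock(), "y": threading.Lock()}
completed = []

def update_both(a, b):
    # Acquire locks in a fixed global order (alphabetical by variable name),
    # so two threads needing the same pair can never deadlock.
    first, second = sorted((a, b))
    with locks[first]:
        with locks[second]:
            completed.append((a, b))  # critical section touching both

t1 = threading.Thread(target=update_both, args=("x", "y"))
t2 = threading.Thread(target=update_both, args=("y", "x"))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(completed))  # 2: both threads finish; no deadlock
```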
Many parallel programs require that their subtasks act in synchrony. This requires the use of a barrier. Barriers are typically implemented using a lock or a semaphore. One class of algorithms, known as lock-free and wait-free algorithms,
altogether avoids the use of locks and barriers. However, this approach
is generally difficult to implement and requires correctly designed
data structures.
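Python's threading module provides a barrier directly; the sketch below (the phase structure and thread count are illustrative) shows three subtasks that all finish phase 1 before any begins phase 2:

```python
import threading

barrier = threading.Barrier(3)  # all 3 subtasks must arrive before any proceeds
results = []

def worker(n):
    results.append(("phase1", n))
    barrier.wait()              # block until every thread has finished phase 1
    results.append(("phase2", n))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# The first three entries are all phase 1; the last three are all phase 2.
print([r[0] for r in results])
```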
Not all parallelization results in speed-up. Generally, as a task
is split up into more and more threads, those threads spend an
ever-increasing portion of their time communicating with each other or
waiting on each other for access to resources. Once the overhead from resource contention or communication dominates
the time spent on other computation, further parallelization (that is,
splitting the workload over even more threads) increases rather than
decreases the amount of time required to finish. This problem, known as parallel slowdown, can be improved in some cases by software analysis and redesign.
Fine-grained, coarse-grained, and embarrassing parallelism
Applications are often classified according to how often their
subtasks need to synchronize or communicate with each other. An
application exhibits fine-grained parallelism if its subtasks must
communicate many times per second; it exhibits coarse-grained
parallelism if they do not communicate many times per second, and it
exhibits embarrassing parallelism if they rarely or never have to communicate. Embarrassingly parallel applications are considered the easiest to parallelize.
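An embarrassingly parallel workload splits trivially because its subtasks never communicate. A hedged sketch using a thread pool (the workload function and worker count are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    # Each input is processed independently, with no communication
    # between subtasks, so the work distributes trivially.
    return seed * seed

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```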
Michael J. Flynn created one of the earliest classification systems for parallel (and sequential) computers and programs, now known as Flynn's taxonomy.
Flynn classified programs and computers by whether they were operating
using a single set or multiple sets of instructions, and whether or not
those instructions were using a single set or multiple sets of data.
The single-instruction-single-data (SISD) classification is
equivalent to an entirely sequential program. The
single-instruction-multiple-data (SIMD) classification is analogous to
doing the same operation repeatedly over a large data set. This is
commonly done in signal processing
applications. Multiple-instruction-single-data (MISD) is a rarely used
classification. While computer architectures to deal with this were
devised (such as systolic arrays),
few applications that fit this class materialized.
Multiple-instruction-multiple-data (MIMD) programs are by far the most
common type of parallel programs.
According to David A. Patterson and John L. Hennessy,
"Some machines are hybrids of these categories, of course, but this
classic model has survived because it is simple, easy to understand, and
gives a good first approximation. It is also—perhaps because of its
understandability—the most widely used scheme."
Disadvantages
Parallel computing can incur significant overhead in practice,
primarily due to the costs associated with merging data from multiple
processes. Specifically, inter-process communication and synchronization
can lead to overheads that are substantially higher—often by two or
more orders of magnitude—compared to processing the same data on a
single thread. Therefore, the overall improvement should be carefully evaluated.
Taiwania 3 of Taiwan, a parallel supercomputing device that joined COVID-19 research
From the advent of very-large-scale integration
(VLSI) computer-chip fabrication technology in the 1970s until about
1986, speed-up in computer architecture was driven by doubling computer word size—the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the
processor must execute to perform an operation on variables whose sizes
are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers,
the processor must first add the 8 lower-order bits from each integer
using the standard addition instruction, then add the 8 higher-order
bits using an add-with-carry instruction and the carry bit
from the lower order addition; thus, an 8-bit processor requires two
instructions to complete a single operation, where a 16-bit processor
would be able to complete the operation with a single instruction.
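The two-instruction 16-bit addition described above can be simulated in Python, with register widths modeled by bit masks (the function name is illustrative):

```python
def add16_on_8bit(a, b):
    """Add two 16-bit integers using only 8-bit operations, as an 8-bit
    processor would: low bytes first, then high bytes plus the carry."""
    lo = (a & 0xFF) + (b & 0xFF)             # standard 8-bit add instruction
    carry = lo >> 8                          # carry bit out of the low add
    hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry  # add-with-carry
    return ((hi & 0xFF) << 8) | (lo & 0xFF)  # result wraps modulo 2**16

print(hex(add16_on_8bit(0x12FF, 0x0001)))  # 0x1300; carry reaches the high byte
```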
Historically, 4-bit
microprocessors were replaced with 8-bit, then 16-bit, then 32-bit
microprocessors. This trend generally came to an end with the
introduction of 32-bit processors, which has been a standard in
general-purpose computing for two decades. Not until the early 2000s,
with the advent of x86-64 architectures, did 64-bit processors become commonplace.
A canonical processor without pipeline. It takes five clock cycles to complete one instruction and thus the processor can issue subscalar performance (IPC = 0.2 < 1).
A computer program is, in essence, a stream of instructions executed
by a processor. Without instruction-level parallelism, a processor can
only issue less than one instruction per clock cycle (IPC < 1). These processors are known as subscalar processors. These instructions can be re-ordered
and combined into groups which are then executed in parallel without
changing the result of the program. This is known as instruction-level
parallelism. Advances in instruction-level parallelism dominated
computer architecture from the mid-1980s until the mid-1990s.
A canonical five-stage pipelined
processor. In the best case scenario, it takes one clock cycle to
complete one instruction and thus the processor can issue scalar
performance (IPC = 1).
All modern processors have multi-stage instruction pipelines.
Each stage in the pipeline corresponds to a different action the
processor performs on that instruction in that stage; a processor with
an N-stage pipeline can have up to N different instructions at different stages of completion and thus can issue one instruction per clock cycle (IPC = 1). These processors are known as scalar processors. The canonical example of a pipelined processor is a RISC
processor, with five stages: instruction fetch (IF), instruction decode
(ID), execute (EX), memory access (MEM), and register write back (WB).
The Pentium 4 processor had a 35-stage pipeline.
A canonical five-stage pipelined
processor with two execution units. In the best case scenario, it takes
one clock cycle to complete two instructions and thus the processor can
issue superscalar performance (IPC = 2 > 1).
Most modern processors also have multiple execution units. They usually combine this feature with pipelining and thus can issue more than one instruction per clock cycle (IPC > 1). These processors are known as superscalar processors. Superscalar processors differ from multi-core processors
in that the several execution units are not entire processors (i.e.
processing units). Instructions can be grouped together only if there is
no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.
Task parallelism is the characteristic of a parallel program that
"entirely different calculations can be performed on either the same or
different sets of data". This contrasts with data parallelism, where the same calculation is
performed on the same or different sets of data. Task parallelism
involves the decomposition of a task into sub-tasks and then allocating
each sub-task to a processor for execution. The processors would then
execute these sub-tasks concurrently and often cooperatively. Task
parallelism does not usually scale with the size of a problem.
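Task parallelism can be sketched with Python's concurrent.futures: two entirely different calculations run concurrently over the same data set (the function names and data are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

data = [3, 1, 4, 1, 5, 9, 2, 6]

def compute_sum(xs):   # sub-task 1: one kind of calculation
    return sum(xs)

def compute_max(xs):   # sub-task 2: an entirely different calculation
    return max(xs)

with ThreadPoolExecutor() as pool:
    # Each sub-task is allocated to a worker; they execute concurrently.
    f_sum = pool.submit(compute_sum, data)
    f_max = pool.submit(compute_max, data)
print(f_sum.result(), f_max.result())  # 31 9
```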
Superword level parallelism
Superword level parallelism is a vectorization technique based on loop unrolling and basic block vectorization. It is distinct from loop vectorization algorithms in that it can exploit parallelism of inline code, such as manipulating coordinates, color channels or in loops unrolled by hand.
Hardware
Memory and communication
Main memory in a parallel computer is either shared memory (shared between all processing elements in a single address space), or distributed memory (in which each processing element has its own local address space). Distributed memory refers to the fact that the memory is logically
distributed, but often implies that it is physically distributed as
well. Distributed shared memory and memory virtualization
combine the two approaches, where the processing element has its own
local memory and access to the memory on non-local processors. Accesses
to local memory are typically faster than accesses to non-local memory.
On supercomputers, a distributed shared memory space can be implemented using a programming model such as PGAS.
This model allows processes on one compute node to transparently
access the remote memory of another compute node. All compute nodes are
also connected to an external shared memory system via a high-speed
interconnect such as InfiniBand; this external shared memory system is known as a burst buffer, which is typically built from arrays of non-volatile memory physically distributed across multiple I/O nodes.
A logical view of a non-uniform memory access
(NUMA) architecture. Processors in one directory can access that
directory's memory with less latency than they can access memory in the
other directory's memory.
Computer architectures in which each element of main memory can be accessed with equal latency and bandwidth are known as uniform memory access (UMA) systems. Typically, that can be achieved only by a shared memory system, in which the memory is not physically distributed. A system that does not have this property is known as a non-uniform memory access (NUMA) architecture. Distributed memory systems have non-uniform memory access.
Computer systems make use of caches—small
and fast memories located close to the processor which store temporary
copies of memory values (nearby in both the physical and logical sense).
Parallel computer systems have difficulties with caches that may store
the same value in more than one location, with the possibility of
incorrect program execution. These computers require a cache coherency system, which keeps track of cached values and strategically purges them, thus ensuring correct program execution. Bus snooping
is one of the most common methods for keeping track of which values are
being accessed (and thus should be purged). Designing large,
high-performance cache coherence systems is a very difficult problem in
computer architecture. As a result, shared memory computer architectures
do not scale as well as distributed memory systems do.
Processor–processor and processor–memory communication can be
implemented in hardware in several ways, including via shared (either
multiported or multiplexed) memory, a crossbar switch, a shared bus or an interconnect network of a myriad of topologies including star, ring, tree, hypercube, fat hypercube (a hypercube with more than one processor at a node), or n-dimensional mesh.
Parallel computers based on interconnected networks need to have some kind of routing
to enable the passing of messages between nodes that are not directly
connected. The medium used for communication between the processors is
likely to be hierarchical in large multiprocessor machines.
Classes of parallel computers
Parallel computers can be roughly classified according to the level
at which the hardware supports parallelism. This classification is
broadly analogous to the distance between basic computing nodes. These
are not mutually exclusive; for example, clusters of symmetric
multiprocessors are relatively common.
A multi-core processor is a processor that includes multiple processing units (called "cores") on the same chip. This processor differs from a superscalar processor, which includes multiple execution units
and can issue multiple instructions per clock cycle from one
instruction stream (thread); in contrast, a multi-core processor can
issue multiple instructions per clock cycle from multiple instruction
streams. IBM's Cell microprocessor, designed for use in the Sony PlayStation 3,
is a prominent multi-core processor. Each core in a multi-core
processor can potentially be superscalar as well—that is, on every clock
cycle, each core can issue multiple instructions from one thread.
Simultaneous multithreading (of which Intel's Hyper-Threading
is the best known) was an early form of pseudo-multi-coreism. A
processor capable of concurrent multithreading includes multiple
execution units in the same processing unit—that is, it has a superscalar
architecture—and can issue multiple instructions per clock cycle from multiple threads. Temporal multithreading on the other hand includes a single execution unit in the same processing unit and can issue one instruction at a time from multiple threads.
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. Because of the small size of the processors and the significant
reduction in the requirements for bus bandwidth achieved by large
caches, such symmetric multiprocessors are extremely cost-effective,
provided that a sufficient amount of memory bandwidth exists.
A distributed computer (also known as a distributed memory
multiprocessor) is a distributed memory computer system in which the
processing elements are connected by a network. Distributed computers
are highly scalable. The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them. The same system may be characterized both as "parallel" and
"distributed"; the processors in a typical distributed system run
concurrently in parallel.
A cluster is a group of loosely coupled computers that work together
closely, so that in some respects they can be regarded as a single
computer. Clusters are composed of multiple standalone machines connected by a
network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network. Beowulf technology was originally developed by Thomas Sterling and Donald Becker. 87% of all Top500 supercomputers are clusters; the remainder are massively parallel processors, explained below.
Because grid computing systems (described below) can easily
handle embarrassingly parallel problems, modern clusters are typically
designed to handle more difficult problems—problems that require nodes
to share intermediate results with each other more often. This requires a
high bandwidth and, more importantly, a low-latency
interconnection network. Many historic and current supercomputers use
customized high-performance network hardware specifically designed for
cluster computing, such as the Cray Gemini network. As of 2014, most current supercomputers use some off-the-shelf standard network hardware, often Myrinet, InfiniBand, or Gigabit Ethernet.
A massively parallel processor (MPP) is a single computer with many
networked processors. MPPs have many of the same characteristics as
clusters, but MPPs have specialized interconnect networks (whereas
clusters use commodity hardware for networking). MPPs also tend to be
larger than clusters, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating
system and application. Each subsystem communicates with the others via a
high-speed interconnect."
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet
to work on a given problem. Because of the low bandwidth and extremely
high latency available on the Internet, distributed computing typically
deals only with embarrassingly parallel problems.
Most grid computing applications use middleware
(software that sits between the operating system and the application to
manage network resources and standardize the software interface). The
most common grid computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often volunteer computing software makes use of "spare cycles", performing computations at times when a computer is idling.
The ubiquity of the Internet and high-bandwidth networks enabled cloud computing,
a model where massively parallel resources are provided as a service.
This paradigm abstracts the underlying hardware, allowing users to
access virtualized clusters for scalable workloads without managing
physical infrastructure.
Modern distributed ledger protocols apply parallel computing principles to overcome the sequential bottlenecks of traditional blockchains. By sharding
the state space, newer consensus architectures allow for "massively
parallel transaction processing". In this model, utilized by protocols
such as Cerberus, independent transactions are treated as parallel tasks
that can be executed simultaneously on different nodes, rather than
being processed serially in a single global block.
Specialized parallel computers
Within parallel computing, there are specialized parallel devices that remain niche areas of interest. While not domain-specific, they tend to be applicable to only a few classes of parallel problems.
Reconfigurable computing with field-programmable gate arrays
AMD's decision to open its HyperTransport technology to third-party vendors has become the enabling technology for high-performance reconfigurable computing. According to Michael R. D'Amour, Chief Operating Officer of DRC
Computer Corporation, "when we first walked into AMD, they called us
'the socket stealers.' Now they call us their partners."
General-purpose computing on graphics processing units (GPGPU)
General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data parallel operations—particularly linear algebra matrix operations.
In the early days, GPGPU programs used the normal graphics APIs
for executing programs. However, several new programming languages and
platforms have been built to do general purpose computation on GPUs with
both Nvidia and AMD releasing programming environments with CUDA and Stream SDK respectively. Other GPU programming languages include BrookGPU, PeakStream, and RapidMind. Nvidia has also released specific products for computation in their Tesla series. The technology consortium Khronos Group has released the OpenCL specification, which is a framework for writing programs that execute across platforms consisting of CPUs and GPUs. AMD, Apple, Intel, Nvidia and others are supporting OpenCL.
Because an ASIC is (by definition) specific to a given
application, it can be fully optimized for that application. As a
result, for a given application, an ASIC tends to outperform a
general-purpose computer. However, ASICs are created by UV photolithography. This process requires a mask set, which can be extremely expensive. A mask set can cost over a million US dollars. (The smaller the transistors required for the chip, the more expensive
the mask will be.) Meanwhile, performance increases in general-purpose
computing over time (as described by Moore's law) tend to wipe out these gains in only one or two chip generations. High initial cost, and the tendency to be overtaken by
Moore's-law-driven general-purpose computing, has rendered ASICs
unfeasible for most parallel computing applications. However, some have
been built. One example is the peta-FLOPS RIKEN MDGRAPE-3 machine, which uses custom ASICs for molecular dynamics simulation.
A vector processor is a CPU or computer system that can execute the
same instruction on large sets of data. Vector processors have
high-level operations that work on linear arrays of numbers or vectors.
An example vector operation is A = B × C, where A, B, and C are each 64-element vectors of 64-bit floating-point numbers. They are closely related to Flynn's SIMD classification.
Cray
computers became famous for their vector-processing computers in the
1970s and 1980s. However, vector processors—both as CPUs and as full
computer systems—have generally disappeared. Modern processor instruction sets do include some vector processing instructions, such as Freescale Semiconductor's AltiVec and Intel's Streaming SIMD Extensions (SSE).
Concurrent programming languages, libraries, APIs, and parallel programming models (such as algorithmic skeletons)
have been created for programming parallel computers. These can
generally be divided into classes based on the assumptions they make
about the underlying memory architecture—shared memory, distributed
memory, or shared distributed memory. Shared memory programming
languages communicate by manipulating shared memory variables.
Distributed memory uses message passing. POSIX Threads and OpenMP are two of the most widely used shared memory APIs, whereas Message Passing Interface (MPI) is the most widely used message-passing system API. One concept used in programming parallel programs is the future concept, where one part of a program promises to deliver a required datum to another part of a program at some future time.
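The future concept is directly supported by several of these APIs. A minimal Python sketch (the function standing in for the long-running computation is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def expensive_datum():
    # Stands in for a long-running computation whose result another
    # part of the program will need at some future time.
    return 42

with ThreadPoolExecutor() as pool:
    future = pool.submit(expensive_datum)  # a promise of a datum to come
    # ... other work can proceed here while expensive_datum runs ...
    value = future.result()                # blocks until the datum is ready
print(value)  # 42
```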
Efforts to standardize parallel programming include an open standard called OpenHMPP
for hybrid multi-core parallel programming. The OpenHMPP
directive-based programming model offers a syntax to efficiently offload
computations on hardware accelerators and to optimize data movement
to/from the hardware memory using remote procedure calls.
The rise of consumer GPUs has led to support for compute kernels, either in graphics APIs (referred to as compute shaders), in dedicated APIs (such as OpenCL), or in other language extensions.
Automatic parallelization of a sequential program by a compiler
is the "holy grail" of parallel computing, especially with the
aforementioned limit of processor frequency. Despite decades of work by
compiler researchers, automatic parallelization has had only limited
success.
As a computer system grows in complexity, the mean time between failures usually decreases. Application checkpointing
is a technique whereby the computer system takes a "snapshot" of the
application—a record of all current resource allocations and variable
states, akin to a core dump;
this information can be used to restore the program if the computer
should fail. Application checkpointing means that the program has to
restart from only its last checkpoint rather than the beginning. While
checkpointing provides benefits in a variety of situations, it is
especially useful in highly parallel systems with a large number of
processors used in high performance computing.
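A hedged sketch of application checkpointing using Python's pickle module; the state layout and the in-memory buffer standing in for stable storage are illustrative:

```python
import io
import pickle

def run(state, steps, checkpoint_buf):
    for _ in range(steps):
        state["iteration"] += 1
        state["total"] += state["iteration"]
        # Snapshot all current variable state after each step,
        # overwriting the previous checkpoint.
        checkpoint_buf.seek(0)
        pickle.dump(state, checkpoint_buf)
    return state

buf = io.BytesIO()
run({"iteration": 0, "total": 0}, 5, buf)

# After a failure, the program restarts from the last checkpoint
# rather than from the beginning.
buf.seek(0)
restored = pickle.load(buf)
print(restored)  # {'iteration': 5, 'total': 15}
```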
Algorithmic methods
As parallel computers become larger and faster, we are now able to
solve problems that had previously taken too long to run. Fields as
varied as bioinformatics (for protein folding and sequence analysis) and economics have taken advantage of parallel computing. Common types of problems in parallel computing applications include:
Parallel computing can also be applied to the design of fault-tolerant computer systems, particularly via lockstep systems performing the same operation in parallel. This provides redundancy in case one component fails, and also allows automatic error detection and error correction if the results differ. These methods can be used to help prevent single-event upsets caused by transient errors. Although additional measures may be required in embedded or specialized
systems, this method can provide a cost-effective approach to achieve
n-modular redundancy in commercial off-the-shelf systems.
In 1957, Compagnie des Machines Bull announced the first computer architecture specifically designed for parallelism, the Gamma 60. It utilized a fork-join model
and a "Program Distributor" to dispatch and collect data to and from
independent processing units connected to a central memory.
In April 1958, Stanley Gill (Ferranti) discussed parallel programming and the need for branching and waiting. Also in 1958, IBM researchers John Cocke and Daniel Slotnick discussed the use of parallelism in numerical calculations for the first time. Burroughs Corporation introduced the D825 in 1962, a four-processor computer that accessed up to 16 memory modules through a crossbar switch. In 1967, Amdahl and Slotnick published a debate about the feasibility
of parallel processing at American Federation of Information Processing
Societies Conference. It was during this debate that Amdahl's law was coined to define the limit of speed-up due to parallelism.
In 1969, Honeywell introduced its first Multics system, a symmetric multiprocessor system capable of running up to eight processors in parallel. C.mmp, a multi-processor project at Carnegie Mellon University
in the 1970s, was among the first multiprocessors with more than a few
processors. The first bus-connected multiprocessor with snooping caches
was the Synapse N+1 in 1984.
SIMD parallel computers can be traced back to the 1970s. The motivation behind early SIMD computers was to amortize the gate delay of the processor's control unit over multiple instructions. In 1964, Slotnick had proposed building a massively parallel computer for the Lawrence Livermore National Laboratory. His design, funded by the US Air Force, was the earliest SIMD parallel-computing effort, ILLIAC IV. The key to its design was a fairly high parallelism, with up to
256 processors, which allowed the machine to work on large datasets in
what would later be known as vector processing.
However, ILLIAC IV was called "the most infamous of supercomputers",
because the project was only one-fourth completed, but took 11 years and
cost almost four times the original estimate. When it was finally ready to run its first real application in 1976, it
was outperformed by existing commercial supercomputers such as the Cray-1.
Biological brain as massively parallel computer
In the early 1970s, at the MIT Computer Science and Artificial Intelligence Laboratory, Marvin Minsky and Seymour Papert started developing the Society of Mind theory, which views the biological brain as a massively parallel computer. In 1986, Minsky published The Society of Mind, which claims that "mind is formed from many little agents, each mindless by itself". The theory attempts to explain how what we call intelligence could be a
product of the interaction of non-intelligent parts. Minsky says that
the biggest source of ideas about the theory came from his work in
trying to create a machine that uses a robotic arm, a video camera, and a
computer to build with children's blocks.
Similar models (which also view the biological brain as a
massively parallel computer, i.e., the brain is made up of a
constellation of independent or semi-independent agents) were also
described by: