A Medley of Potpourri

Monday, November 7, 2022

Supercomputer architecture

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Supercomputer_architecture

A SGI Altix supercomputer with 23,000 processors at the CINES facility in France

Approaches to supercomputer architecture have taken dramatic turns since the earliest systems were introduced in the 1960s. Early supercomputer architectures pioneered by Seymour Cray relied on compact innovative designs and local parallelism to achieve superior computational peak performance. However, in time the demand for increased computational power ushered in the age of massively parallel systems.

While the supercomputers of the 1970s used only a few processors, in the 1990s, machines with thousands of processors began to appear and by the end of the 20th century, massively parallel supercomputers with tens of thousands of "off-the-shelf" processors were the norm. Supercomputers of the 21st century can use over 100,000 processors (some being graphic units) connected by fast connections.

Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers. The large amount of heat generated by a system may also have other effects, such as reducing the lifetime of other system components. There have been diverse approaches to heat management, from pumping Fluorinert through the system, to a hybrid liquid-air cooling system or air cooling with normal air conditioning temperatures.

Systems with a massive number of processors generally take one of two paths: in one approach, e.g., in grid computing the processing power of a large number of computers in distributed, diverse administrative domains, is opportunistically used whenever a computer is available. In another approach, a large number of processors are used in close proximity to each other, e.g., in a computer cluster. In such a centralized massively parallel system the speed and flexibility of the interconnect becomes very important, and modern supercomputers have used various approaches ranging from enhanced Infiniband systems to three-dimensional torus interconnects.

Context and overview

Since the late 1960s the growth in the power and proliferation of supercomputers has been dramatic, and the underlying architectural directions of these systems have taken significant turns. While the early supercomputers relied on a small number of closely connected processors that accessed shared memory, the supercomputers of the 21st century use over 100,000 processors connected by fast networks.

Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers. Seymour Cray's "get the heat out" motto was central to his design philosophy and has continued to be a key issue in supercomputer architectures, e.g., in large-scale experiments such as Blue Waters. The large amount of heat generated by a system may also have other effects, such as reducing the lifetime of other system components.

An IBM HS22 blade

There have been diverse approaches to heat management, e.g., the Cray 2 pumped Fluorinert through the system, while System X used a hybrid liquid-air cooling system and the Blue Gene/P is air-cooled with normal air conditioning temperatures. The heat from the Aquasar supercomputer is used to warm a university campus.

The heat density generated by a supercomputer has a direct dependence on the processor type used in the system, with more powerful processors typically generating more heat, given similar underlying semiconductor technologies. While early supercomputers used a few fast, closely packed processors that took advantage of local parallelism (e.g., pipelining and vector processing), in time the number of processors grew, and computing nodes could be placed further away,e.g., in a computer cluster, or could be geographically dispersed in grid computing.As the number of processors in a supercomputer grows, "component failure rate" begins to become a serious issue. If a supercomputer uses thousands of nodes, each of which may fail once per year on the average, then the system will experience several node failures each day.

As the price/performance of general purpose graphic processors (GPGPUs) has improved, a number of petaflop supercomputers such as Tianhe-I and Nebulae have started to rely on them. However, other systems such as the K computer continue to use conventional processors such as SPARC-based designs and the overall applicability of GPGPUs in general purpose high performance computing applications has been the subject of debate, in that while a GPGPU may be tuned to score well on specific benchmarks its overall applicability to everyday algorithms may be limited unless significant effort is spent to tune the application towards it. However, GPUs are gaining ground and in 2012 the Jaguar supercomputer was transformed into Titan by replacing CPUs with GPUs.

As the number of independent processors in a supercomputer increases, the way they access data in the file system and how they share and access secondary storage resources becomes prominent. Over the years a number of systems for distributed file management were developed, e.g., the IBM General Parallel File System, BeeGFS, the Parallel Virtual File System, Hadoop, etc. A number of supercomputers on the TOP100 list such as the Tianhe-I use Linux's Lustre file system.

Early systems with a few processors

The CDC 6600 series of computers were very early attempts at supercomputing and gained their advantage over the existing systems by relegating work to peripheral devices, freeing the CPU (Central Processing Unit) to process actual data. With the Minnesota FORTRAN compiler the 6600 could sustain 500 kiloflops on standard mathematical operations.

The cylindrical shape of the early Cray computers centralized access, keeping distances short and uniform.

Other early supercomputers such as the Cray 1 and Cray 2 that appeared afterwards used a small number of fast processors that worked in harmony and were uniformly connected to the largest amount of shared memory that could be managed at the time.

These early architectures introduced parallel processing at the processor level, with innovations such as vector processing, in which the processor can perform several operations during one clock cycle, rather than having to wait for successive cycles.

In time, as the number of processors increased, different architectural issues emerged. Two issues that need to be addressed as the number of processors increases are the distribution of memory and processing. In the distributed memory approach, each processor is physically packaged close with some local memory. The memory associated with other processors is then "further away" based on bandwidth and latency parameters in non-uniform memory access.

In the 1960s pipelining was viewed as an innovation, and by the 1970s the use of vector processors had been well established. By the 1980s, many supercomputers used parallel vector processors.

The relatively small number of processors in early systems, allowed them to easily use a shared memory architecture, which allows processors to access a common pool of memory. In the early days a common approach was the use of uniform memory access (UMA), in which access time to a memory location was similar between processors. The use of non-uniform memory access (NUMA) allowed a processor to access its own local memory faster than other memory locations, while cache-only memory architectures (COMA) allowed for the local memory of each processor to be used as cache, thus requiring coordination as memory values changed.

As the number of processors increases, efficient interprocessor communication and synchronization on a supercomputer becomes a challenge. A number of approaches may be used to achieve this goal. For instance, in the early 1980s, in the Cray X-MP system, shared registers were used. In this approach, all processors had access to shared registers that did not move data back and forth but were only used for interprocessor communication and synchronization. However, inherent challenges in managing a large amount of shared memory among many processors resulted in a move to more distributed architectures.

Massive centralized parallelism

A Blue Gene/L cabinet showing the stacked blades, each holding many processors

During the 1980s, as the demand for computing power increased, the trend to a much larger number of processors began, ushering in the age of massively parallel systems, with distributed memory and distributed file systems, given that shared memory architectures could not scale to a large number of processors. Hybrid approaches such as distributed shared memory also appeared after the early systems.

The computer clustering approach connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast, private local area network. The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as by and large one cohesive computing unit, e.g. via a single system image concept.

Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer-to-peer or grid computing which also use many nodes, but with a far more distributed nature. By the 21st century, the TOP500 organization's semiannual list of the 500 fastest supercomputers often includes many clusters, e.g. the world's fastest in 2011, the K computer with a distributed memory, cluster architecture.

When a large number of local semi-independent computing nodes are used (e.g. in a cluster architecture) the speed and flexibility of the interconnect becomes very important. Modern supercomputers have taken different approaches to address this issue, e.g. Tianhe-1 uses a proprietary high-speed network based on the Infiniband QDR, enhanced with FeiTeng-1000 CPUs. On the other hand, the Blue Gene/L system uses a three-dimensional torus interconnect with auxiliary networks for global communications. In this approach each node is connected to its six nearest neighbors. A similar torus was used by the Cray T3E.

Massive centralized systems at times use special-purpose processors designed for a specific application, and may use field-programmable gate arrays (FPGA) chips to gain performance by sacrificing generality. Examples of special-purpose supercomputers include Belle, Deep Blue, and Hydra, for playing chess, Gravity Pipe for astrophysics, MDGRAPE-3 for protein structure computation molecular dynamics and Deep Crack, for breaking the DES cipher.

Massive distributed parallelism

Example architecture of a geographically disperse computing system connecting many nodes over a network

Grid computing uses a large number of computers in distributed, diverse administrative domains. It is an opportunistic approach which uses resources whenever they are available. An example is BOINC a volunteer-based, opportunistic grid system. Some BOINC applications have reached multi-petaflop levels by using close to half a million computers connected on the internet, whenever volunteer resources become available. However, these types of results often do not appear in the TOP500 ratings because they do not run the general purpose Linpack benchmark.

Although grid computing has had success in parallel task execution, demanding supercomputer applications such as weather simulations or computational fluid dynamics have remained out of reach, partly due to the barriers in reliable sub-assignment of a large number of tasks as well as the reliable availability of resources at a given time.

In quasi-opportunistic supercomputing a large number of geographically disperse computers are orchestrated with built-in safeguards. The quasi-opportunistic approach goes beyond volunteer computing on a highly distributed systems such as BOINC, or general grid computing on a system such as Globus by allowing the middleware to provide almost seamless access to many computing clusters so that existing programs in languages such as Fortran or C can be distributed among multiple computing resources.

Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing. The quasi-opportunistic approach enables the execution of demanding applications within computer grids by establishing grid-wise resource allocation agreements; and fault tolerant message passing to abstractly shield against the failures of the underlying resources, thus maintaining some opportunism, while allowing a higher level of control.

21st-century architectural trends

A person walking between the racks of a Cray XE6 supercomputer

The air-cooled IBM Blue Gene supercomputer architecture trades processor speed for low power consumption so that a larger number of processors can be used at room temperature, by using normal air-conditioning. The second-generation Blue Gene/P system has processors with integrated node-to-node communication logic. It is energy-efficient, achieving 371 MFLOPS/W.

The K computer is a water-cooled, homogeneous processor, distributed memory system with a cluster architecture. It uses more than 80,000 SPARC64 VIIIfx processors, each with eight cores, for a total of over 700,000 cores—almost twice as many as any other system. It comprises more than 800 cabinets, each with 96 computing nodes (each with 16 GB of memory), and 6 I/O nodes. Although it is more powerful than the next five systems on the TOP500 list combined, at 824.56 MFLOPS/W it has the lowest power to performance ratio of any current major supercomputer system. The follow up system for the K computer, called the PRIMEHPC FX10 uses the same six-dimensional torus interconnect, but still only one processor per node.

Unlike the K computer, the Tianhe-1A system uses a hybrid architecture and integrates CPUs and GPUs. It uses more than 14,000 Xeon general-purpose processors and more than 7,000 Nvidia Tesla general-purpose graphics processing units (GPGPUs) on about 3,500 blades. It has 112 computer cabinets and 262 terabytes of distributed memory; 2 petabytes of disk storage is implemented via Lustre clustered files. Tianhe-1 uses a proprietary high-speed communication network to connect the processors. The proprietary interconnect network was based on the Infiniband QDR, enhanced with Chinese made FeiTeng-1000 CPUs. In the case of the interconnect the system is twice as fast as the Infiniband, but slower than some interconnects on other supercomputers.

The limits of specific approaches continue to be tested, as boundaries are reached through large scale experiments, e.g., in 2011 IBM ended its participation in the Blue Waters petaflops project at the University of Illinois. The Blue Waters architecture was based on the IBM POWER7 processor and intended to have 200,000 cores with a petabyte of "globally addressable memory" and 10 petabytes of disk space. The goal of a sustained petaflop led to design choices that optimized single-core performance, and hence a lower number of cores. The lower number of cores was then expected to help performance on programs that did not scale well to a large number of processors. The large globally addressable memory architecture aimed to solve memory address problems in an efficient manner, for the same type of programs. Blue Waters had been expected to run at sustained speeds of at least one petaflop, and relied on the specific water-cooling approach to manage heat. In the first four years of operation, the National Science Foundation spent about $200 million on the project. IBM released the Power 775 computing node derived from that project's technology soon thereafter, but effectively abandoned the Blue Waters approach.

Architectural experiments are continuing in a number of directions, e.g. the Cyclops64 system uses a "supercomputer on a chip" approach, in a direction away from the use of massive distributed processors.^[60]^[61] Each 64-bit Cyclops64 chip contains 80 processors, and the entire system uses a globally addressable memory architecture. The processors are connected with non-internally blocking crossbar switch and communicate with each other via global interleaved memory. There is no data cache in the architecture, but half of each SRAM bank can be used as a scratchpad memory. Although this type of architecture allows unstructured parallelism in a dynamically non-contiguous memory system, it also produces challenges in the efficient mapping of parallel algorithms to a many-core system.

Program optimization

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Program_optimization

In computer science, program optimization, code optimization, or software optimization, is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources. In general, a computer program may be optimized so that it executes more rapidly, or to make it capable of operating with less memory storage or other resources, or draw less power.

General

Although the word "optimization" shares the same root as "optimal", it is rare for the process of optimization to produce a truly optimal system. A system can generally be made optimal not in absolute terms, but only with respect to a given quality metric, which may be in contrast with other possible metrics. As a result, the optimized system will typically only be optimal in one application or for one audience. One might reduce the amount of time that a program takes to perform some task at the price of making it consume more memory. In an application where memory space is at a premium, one might deliberately choose a slower algorithm in order to use less memory. Often there is no "one size fits all" design which works well in all cases, so engineers make trade-offs to optimize the attributes of greatest interest. Additionally, the effort required to make a piece of software completely optimal – incapable of any further improvement – is almost always more than is reasonable for the benefits that would be accrued; so the process of optimization may be halted before a completely optimal solution has been reached. Fortunately, it is often the case that the greatest improvements come early in the process.

Even for a given quality metric (such as execution speed), most methods of optimization only improve the result; they have no pretense of producing optimal output. Superoptimization is the process of finding truly optimal output.

Levels of optimization

Optimization can occur at a number of levels. Typically the higher levels have greater impact, and are harder to change later on in a project, requiring significant changes or a complete rewrite if they need to be changed. Thus optimization can typically proceed via refinement from higher to lower, with initial gains being larger and achieved with less work, and later gains being smaller and requiring more work. However, in some cases overall performance depends on performance of very low-level portions of a program, and small changes at a late stage or early consideration of low-level details can have outsized impact. Typically some consideration is given to efficiency throughout a project – though this varies significantly – but major optimization is often considered a refinement to be done late, if ever. On longer-running projects there are typically cycles of optimization, where improving one area reveals limitations in another, and these are typically curtailed when performance is acceptable or gains become too small or costly.

As performance is part of the specification of a program – a program that is unusably slow is not fit for purpose: a video game with 60 Hz (frames-per-second) is acceptable, but 6 frames-per-second is unacceptably choppy – performance is a consideration from the start, to ensure that the system is able to deliver sufficient performance, and early prototypes need to have roughly acceptable performance for there to be confidence that the final system will (with optimization) achieve acceptable performance. This is sometimes omitted in the belief that optimization can always be done later, resulting in prototype systems that are far too slow – often by an order of magnitude or more – and systems that ultimately are failures because they architecturally cannot achieve their performance goals, such as the Intel 432 (1981); or ones that take years of work to achieve acceptable performance, such as Java (1995), which only achieved acceptable performance with HotSpot (1999). The degree to which performance changes between prototype and production system, and how amenable it is to optimization, can be a significant source of uncertainty and risk.

Design level

At the highest level, the design may be optimized to make best use of the available resources, given goals, constraints, and expected use/load. The architectural design of a system overwhelmingly affects its performance. For example, a system that is network latency-bound (where network latency is the main constraint on overall performance) would be optimized to minimize network trips, ideally making a single request (or no requests, as in a push protocol) rather than multiple roundtrips. Choice of design depends on the goals: when designing a compiler, if fast compilation is the key priority, a one-pass compiler is faster than a multi-pass compiler (assuming same work), but if speed of output code is the goal, a slower multi-pass compiler fulfills the goal better, even though it takes longer itself. Choice of platform and programming language occur at this level, and changing them frequently requires a complete rewrite, though a modular system may allow rewrite of only some component – for example, a Python program may rewrite performance-critical sections in C. In a distributed system, choice of architecture (client-server, peer-to-peer, etc.) occurs at the design level, and may be difficult to change, particularly if all components cannot be replaced in sync (e.g., old clients).

Algorithms and data structures

Given an overall design, a good choice of efficient algorithms and data structures, and efficient implementation of these algorithms and data structures comes next. After design, the choice of algorithms and data structures affects efficiency more than any other aspect of the program. Generally data structures are more difficult to change than algorithms, as a data structure assumption and its performance assumptions are used throughout the program, though this can be minimized by the use of abstract data types in function definitions, and keeping the concrete data structure definitions restricted to a few places.

For algorithms, this primarily consists of ensuring that algorithms are constant O(1), logarithmic O(log n), linear O(n), or in some cases log-linear O(n log n) in the input (both in space and time). Algorithms with quadratic complexity O(n²) fail to scale, and even linear algorithms cause problems if repeatedly called, and are typically replaced with constant or logarithmic if possible.

Beyond asymptotic order of growth, the constant factors matter: an asymptotically slower algorithm may be faster or smaller (because simpler) than an asymptotically faster algorithm when they are both faced with small input, which may be the case that occurs in reality. Often a hybrid algorithm will provide the best performance, due to this tradeoff changing with size.

A general technique to improve performance is to avoid work. A good example is the use of a fast path for common cases, improving performance by avoiding unnecessary work. For example, using a simple text layout algorithm for Latin text, only switching to a complex layout algorithm for complex scripts, such as Devanagari. Another important technique is caching, particularly memoization, which avoids redundant computations. Because of the importance of caching, there are often many levels of caching in a system, which can cause problems from memory use, and correctness issues from stale caches.

Source code level

Beyond general algorithms and their implementation on an abstract machine, concrete source code level choices can make a significant difference. For example, on early C compilers, while(1) was slower than for(;;) for an unconditional loop, because while(1) evaluated 1 and then had a conditional jump which tested if it was true, while for (;;) had an unconditional jump . Some optimizations (such as this one) can nowadays be performed by optimizing compilers. This depends on the source language, the target machine language, and the compiler, and can be both difficult to understand or predict and changes over time; this is a key place where understanding of compilers and machine code can improve performance. Loop-invariant code motion and return value optimization are examples of optimizations that reduce the need for auxiliary variables and can even result in faster performance by avoiding round-about optimizations.

Build level

Between the source and compile level, directives and build flags can be used to tune performance options in the source code and compiler respectively, such as using preprocessor defines to disable unneeded software features, optimizing for specific processor models or hardware capabilities, or predicting branching, for instance. Source-based software distribution systems such as BSD's Ports and Gentoo's Portage can take advantage of this form of optimization.

Compile level

Use of an optimizing compiler tends to ensure that the executable program is optimized at least as much as the compiler can predict.

Assembly level

At the lowest level, writing code using an assembly language, designed for a particular hardware platform can produce the most efficient and compact code if the programmer takes advantage of the full repertoire of machine instructions. Many operating systems used on embedded systems have been traditionally written in assembler code for this reason. Programs (other than very small programs) are seldom written from start to finish in assembly due to the time and cost involved. Most are compiled down from a high level language to assembly and hand optimized from there. When efficiency and size are less important large parts may be written in a high-level language.

With more modern optimizing compilers and the greater complexity of recent CPUs, it is harder to write more efficient code than what the compiler generates, and few projects need this "ultimate" optimization step.

Much of the code written today is intended to run on as many machines as possible. As a consequence, programmers and compilers don't always take advantage of the more efficient instructions provided by newer CPUs or quirks of older models. Additionally, assembly code tuned for a particular processor without using such instructions might still be suboptimal on a different processor, expecting a different tuning of the code.

Typically today rather than writing in assembly language, programmers will use a disassembler to analyze the output of a compiler and change the high-level source code so that it can be compiled more efficiently, or understand why it is inefficient.

Run time

Just-in-time compilers can produce customized machine code based on run-time data, at the cost of compilation overhead. This technique dates to the earliest regular expression engines, and has become widespread with Java HotSpot and V8 for JavaScript. In some cases adaptive optimization may be able to perform run time optimization exceeding the capability of static compilers by dynamically adjusting parameters according to the actual input or other factors.

Profile-guided optimization is an ahead-of-time (AOT) compilation optimization technique based on run time profiles, and is similar to a static "average case" analog of the dynamic technique of adaptive optimization.

Self-modifying code can alter itself in response to run time conditions in order to optimize code; this was more common in assembly language programs.

Some CPU designs can perform some optimizations at run time. Some examples include out-of-order execution, speculative execution, instruction pipelines, and branch predictors. Compilers can help the program take advantage of these CPU features, for example through instruction scheduling.

Platform dependent and independent optimizations

Code optimization can be also broadly categorized as platform-dependent and platform-independent techniques. While the latter ones are effective on most or all platforms, platform-dependent techniques use specific properties of one platform, or rely on parameters depending on the single platform or even on the single processor. Writing or producing different versions of the same code for different processors might therefore be needed. For instance, in the case of compile-level optimization, platform-independent techniques are generic techniques (such as loop unrolling, reduction in function calls, memory efficient routines, reduction in conditions, etc.), that impact most CPU architectures in a similar way. A great example of platform-independent optimization has been shown with inner for loop, where it was observed that a loop with an inner for loop performs more computations per unit time than a loop without it or one with an inner while loop. Generally, these serve to reduce the total instruction path length required to complete the program and/or reduce total memory usage during the process. On the other hand, platform-dependent techniques involve instruction scheduling, instruction-level parallelism, data-level parallelism, cache optimization techniques (i.e., parameters that differ among various platforms) and the optimal instruction scheduling might be different even on different processors of the same architecture.

Strength reduction

Computational tasks can be performed in several different ways with varying efficiency. A more efficient version with equivalent functionality is known as a strength reduction. For example, consider the following C code snippet whose intention is to obtain the sum of all integers from 1 to N:

int i, sum = 0;
for (i = 1; i <= N; ++i) {
  sum += i;
}
printf("sum: %d\n", sum);

This code can (assuming no arithmetic overflow) be rewritten using a mathematical formula like:

int sum = N * (1 + N) / 2;
printf("sum: %d\n", sum);

The optimization, sometimes performed automatically by an optimizing compiler, is to select a method (algorithm) that is more computationally efficient, while retaining the same functionality. See algorithmic efficiency for a discussion of some of these techniques. However, a significant improvement in performance can often be achieved by removing extraneous functionality.

Optimization is not always an obvious or intuitive process. In the example above, the "optimized" version might actually be slower than the original version if N were sufficiently small and the particular hardware happens to be much faster at performing addition and looping operations than multiplication and division.

Trade-offs

In some cases, however, optimization relies on using more elaborate algorithms, making use of "special cases" and special "tricks" and performing complex trade-offs. A "fully optimized" program might be more difficult to comprehend and hence may contain more faults than unoptimized versions. Beyond eliminating obvious antipatterns, some code level optimizations decrease maintainability.

Optimization will generally focus on improving just one or two aspects of performance: execution time, memory usage, disk space, bandwidth, power consumption or some other resource. This will usually require a trade-off – where one factor is optimized at the expense of others. For example, increasing the size of cache improves run time performance, but also increases the memory consumption. Other common trade-offs include code clarity and conciseness.

There are instances where the programmer performing the optimization must decide to make the software better for some operations but at the cost of making other operations less efficient. These trade-offs may sometimes be of a non-technical nature – such as when a competitor has published a benchmark result that must be beaten in order to improve commercial success but comes perhaps with the burden of making normal usage of the software less efficient. Such changes are sometimes jokingly referred to as pessimizations.

Bottlenecks

Optimization may include finding a bottleneck in a system – a component that is the limiting factor on performance. In terms of code, this will often be a hot spot – a critical part of the code that is the primary consumer of the needed resource – though it can be another factor, such as I/O latency or network bandwidth.

In computer science, resource consumption often follows a form of power law distribution, and the Pareto principle can be applied to resource optimization by observing that 80% of the resources are typically used by 20% of the operations. In software engineering, it is often a better approximation that 90% of the execution time of a computer program is spent executing 10% of the code (known as the 90/10 law in this context).

More complex algorithms and data structures perform well with many items, while simple algorithms are more suitable for small amounts of data — the setup, initialization time, and constant factors of the more complex algorithm can outweigh the benefit, and thus a hybrid algorithm or adaptive algorithm may be faster than any single algorithm. A performance profiler can be used to narrow down decisions about which functionality fits which conditions.

In some cases, adding more memory can help to make a program run faster. For example, a filtering program will commonly read each line and filter and output that line immediately. This only uses enough memory for one line, but performance is typically poor, due to the latency of each disk read. Caching the result is similarly effective, though also requiring larger memory use.

When to optimize

Optimization can reduce readability and add code that is used only to improve the performance. This may complicate programs or systems, making them harder to maintain and debug. As a result, optimization or performance tuning is often performed at the end of the development stage.

Donald Knuth made the following two statements on optimization:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%"

(He also attributed the quote to Tony Hoare several years later, although this might have been an error as Hoare disclaims having coined the phrase.)

"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"

"Premature optimization" is a phrase used to describe a situation where a programmer lets performance considerations affect the design of a piece of code. This can result in a design that is not as clean as it could have been or code that is incorrect, because the code is complicated by the optimization and the programmer is distracted by optimizing.

When deciding whether to optimize a specific part of the program, Amdahl's Law should always be considered: the impact on the overall program depends very much on how much time is actually spent in that specific part, which is not always clear from looking at the code without a performance analysis.

A better approach is therefore to design first, code from the design and then profile/benchmark the resulting code to see which parts should be optimized. A simple and elegant design is often easier to optimize at this stage, and profiling may reveal unexpected performance problems that would not have been addressed by premature optimization.

In practice, it is often necessary to keep performance goals in mind when first designing software, but the programmer balances the goals of design and optimization.

Modern compilers and operating systems are so efficient that the intended performance increases often fail to materialize. As an example, caching data at the application level that is again cached at the operating system level does not yield improvements in execution. Even so, it is a rare case when the programmer will remove failed optimizations from production code. It is also true that advances in hardware will more often than not obviate any potential improvements, yet the obscuring code will persist into the future long after its purpose has been negated.

Macros

Optimization during code development using macros takes on different forms in different languages.

In some procedural languages, such as C and C++, macros are implemented using token substitution. Nowadays, inline functions can be used as a type safe alternative in many cases. In both cases, the inlined function body can then undergo further compile-time optimizations by the compiler, including constant folding, which may move some computations to compile time.

In many functional programming languages macros are implemented using parse-time substitution of parse trees/abstract syntax trees, which it is claimed makes them safer to use. Since in many cases interpretation is used, that is one way to ensure that such computations are only performed at parse-time, and sometimes the only way.

Lisp originated this style of macro, and such macros are often called "Lisp-like macros." A similar effect can be achieved by using template metaprogramming in C++.

In both cases, work is moved to compile-time. The difference between C macros on one side, and Lisp-like macros and C++ template metaprogramming on the other side, is that the latter tools allow performing arbitrary computations at compile-time/parse-time, while expansion of C macros does not perform any computation, and relies on the optimizer ability to perform it. Additionally, C macros do not directly support recursion or iteration, so are not Turing complete.

As with any optimization, however, it is often difficult to predict where such tools will have the most impact before a project is complete.

Automated and manual optimization

Optimization can be automated by compilers or performed by programmers. Gains are usually limited for local optimization, and larger for global optimizations. Usually, the most powerful optimization is to find a superior algorithm.

Optimizing a whole system is usually undertaken by programmers because it is too complex for automated optimizers. In this situation, programmers or system administrators explicitly change code so that the overall system performs better. Although it can produce better efficiency, it is far more expensive than automated optimizations. Since many parameters influence the program performance, the program optimization space is large. Meta-heuristics and machine learning are used to address the complexity of program optimization.

Use a profiler (or performance analyzer) to find the sections of the program that are taking the most resources – the bottleneck. Programmers sometimes believe they have a clear idea of where the bottleneck is, but intuition is frequently wrong. Optimizing an unimportant piece of code will typically do little to help the overall performance.

When the bottleneck is localized, optimization usually starts with a rethinking of the algorithm used in the program. More often than not, a particular algorithm can be specifically tailored to a particular problem, yielding better performance than a generic algorithm. For example, the task of sorting a huge list of items is usually done with a quicksort routine, which is one of the most efficient generic algorithms. But if some characteristic of the items is exploitable (for example, they are already arranged in some particular order), a different method can be used, or even a custom-made sort routine.

After the programmer is reasonably sure that the best algorithm is selected, code optimization can start. Loops can be unrolled (for lower loop overhead, although this can often lead to lower speed if it overloads the CPU cache), data types as small as possible can be used, integer arithmetic can be used instead of floating-point, and so on. (See algorithmic efficiency article for these and other techniques.)

Performance bottlenecks can be due to language limitations rather than algorithms or data structures used in the program. Sometimes, a critical part of the program can be re-written in a different programming language that gives more direct access to the underlying machine. For example, it is common for very high-level languages like Python to have modules written in C for greater speed. Programs already written in C can have modules written in assembly. Programs written in D can use the inline assembler.

Rewriting sections "pays off" in these circumstances because of a general "rule of thumb" known as the 90/10 law, which states that 90% of the time is spent in 10% of the code, and only 10% of the time in the remaining 90% of the code. So, putting intellectual effort into optimizing just a small part of the program can have a huge effect on the overall speed – if the correct part(s) can be located.

Manual optimization sometimes has the side effect of undermining readability. Thus code optimizations should be carefully documented (preferably using in-line comments), and their effect on future development evaluated.

The program that performs an automated optimization is called an optimizer. Most optimizers are embedded in compilers and operate during compilation. Optimizers can often tailor the generated code to specific processors.

Today, automated optimizations are almost exclusively limited to compiler optimization. However, because compiler optimizations are usually limited to a fixed set of rather general optimizations, there is considerable demand for optimizers which can accept descriptions of problem and language-specific optimizations, allowing an engineer to specify custom optimizations. Tools that accept descriptions of optimizations are called program transformation systems and are beginning to be applied to real software systems such as C++.

Some high-level languages (Eiffel, Esterel) optimize their programs by using an intermediate language.

Grid computing or distributed computing aims to optimize the whole system, by moving tasks from computers with high usage to computers with idle time.

Time taken for optimization

Sometimes, the time taken to undertake optimization therein itself may be an issue.

Optimizing existing code usually does not add new features, and worse, it might add new bugs in previously working code (as any change might). Because manually optimized code might sometimes have less "readability" than unoptimized code, optimization might impact maintainability of it as well. Optimization comes at a price and it is important to be sure that the investment is worthwhile.

An automatic optimizer (or optimizing compiler, a program that performs code optimization) may itself have to be optimized, either to further improve the efficiency of its target programs or else speed up its own operation. A compilation performed with optimization "turned on" usually takes longer, although this is usually only a problem when programs are quite large.

In particular, for just-in-time compilers the performance of the run time compile component, executing together with its target code, is the key to improving overall execution speed.

Computer performance

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Computer_performance

In computing, computer performance is the amount of useful work accomplished by a computer system. Outside of specific contexts, computer performance is estimated in terms of accuracy, efficiency and speed of executing computer program instructions. When it comes to high computer performance, one or more of the following factors might be involved:

Short response time for a given piece of work.
High throughput (rate of processing work).
Low utilization of computing resource(s).
- Fast (or highly compact) data compression and decompression.
High availability of the computing system or application.
High bandwidth.
Short data transmission time.

Technical and non-technical definitions

The performance of any computer system can be evaluated in measurable, technical terms, using one or more of the metrics listed above. This way the performance can be

Compared relative to other systems or the same system before/after changes
In absolute terms, e.g. for fulfilling a contractual obligation

Whilst the above definition relates to a scientific, technical approach, the following definition given by Arnold Allen would be useful for a non-technical audience:

The word performance in computer performance means the same thing that performance means in other contexts, that is, it means "How well is the computer doing the work it is supposed to do?"

As an aspect of software quality

Computer software performance, particularly software application response time, is an aspect of software quality that is important in human–computer interactions.

Performance engineering

Performance engineering within systems engineering encompasses the set of roles, skills, activities, practices, tools, and deliverables applied at every phase of the systems development life cycle which ensures that a solution will be designed, implemented, and operationally supported to meet the performance requirements defined for the solution.

Performance engineering continuously deals with trade-offs between types of performance. Occasionally a CPU designer can find a way to make a CPU with better overall performance by improving one of the aspects of performance, presented below, without sacrificing the CPU's performance in other areas. For example, building the CPU out of better, faster transistors.

However, sometimes pushing one type of performance to an extreme leads to a CPU with worse overall performance, because other important aspects were sacrificed to get one impressive-looking number, for example, the chip's clock rate (see the megahertz myth).

Application performance engineering

Application Performance Engineering (APE) is a specific methodology within performance engineering designed to meet the challenges associated with application performance in increasingly distributed mobile, cloud and terrestrial IT environments. It includes the roles, skills, activities, practices, tools and deliverables applied at every phase of the application lifecycle that ensure an application will be designed, implemented and operationally supported to meet non-functional performance requirements.

Aspects of performance

Computer performance metrics (things to measure) include availability, response time, channel capacity, latency, completion time, service time, bandwidth, throughput, relative efficiency, scalability, performance per watt, compression ratio, instruction path length and speed up. CPU benchmarks are available.

Availability

Availability of a system is typically measured as a factor of its reliability - as reliability increases, so does availability (that is, less downtime). Availability of a system may also be increased by the strategy of focusing on increasing testability and maintainability and not on reliability. Improving maintainability is generally easier than reliability. Maintainability estimates (Repair rates) are also generally more accurate. However, because the uncertainties in the reliability estimates are in most cases very large, it is likely to dominate the availability (prediction uncertainty) problem, even while maintainability levels are very high.

Response time

Response time is the total amount of time it takes to respond to a request for service. In computing, that service can be any unit of work from a simple disk IO to loading a complex web page. The response time is the sum of three numbers:

Service time - How long it takes to do the work requested.
Wait time - How long the request has to wait for requests queued ahead of it before it gets to run.
Transmission time – How long it takes to move the request to the computer doing the work and the response back to the requestor.

Processing speed

Most consumers pick a computer architecture (normally Intel IA-32 architecture) to be able to run a large base of pre-existing, pre-compiled software. Being relatively uninformed on computer benchmarks, some of them pick a particular CPU based on operating frequency (see megahertz myth).

Some system designers building parallel computers pick CPUs based on the speed per dollar.

Channel capacity

Channel capacity is the tightest upper bound on the rate of information that can be reliably transmitted over a communications channel. By the noisy-channel coding theorem, the channel capacity of a given channel is the limiting information rate (in units of information per unit time) that can be achieved with arbitrarily small error probability.

Information theory, developed by Claude E. Shannon during World War II, defines the notion of channel capacity and provides a mathematical model by which one can compute it. The key result states that the capacity of the channel, as defined above, is given by the maximum of the mutual information between the input and output of the channel, where the maximization is with respect to the input distribution.

Latency

Latency is a time delay between the cause and the effect of some physical change in the system being observed. Latency is a result of the limited velocity with which any physical interaction can take place. This velocity is always lower or equal to speed of light. Therefore, every physical system that has non-zero spatial dimensions will experience some sort of latency.

The precise definition of latency depends on the system being observed and the nature of stimulation. In communications, the lower limit of latency is determined by the medium being used for communications. In reliable two-way communication systems, latency limits the maximum rate that information can be transmitted, as there is often a limit on the amount of information that is "in-flight" at any one moment. In the field of human-machine interaction, perceptible latency (delay between what the user commands and when the computer provides the results) has a strong effect on user satisfaction and usability.

Computers run sets of instructions called a process. In operating systems, the execution of the process can be postponed if other processes are also executing. In addition, the operating system can schedule when to perform the action that the process is commanding. For example, suppose a process commands that a computer card's voltage output be set high-low-high-low and so on at a rate of 1000 Hz. The operating system may choose to adjust the scheduling of each transition (high-low or low-high) based on an internal clock. The latency is the delay between the process instruction commanding the transition and the hardware actually transitioning the voltage from high to low or low to high.

System designers building real-time computing systems want to guarantee worst-case response. That is easier to do when the CPU has low interrupt latency and when it has deterministic response.

Bandwidth

In computer networking, bandwidth is a measurement of bit-rate of available or consumed data communication resources, expressed in bits per second or multiples of it (bit/s, kbit/s, Mbit/s, Gbit/s, etc.).

Bandwidth sometimes defines the net bit rate (aka. peak bit rate, information rate, or physical layer useful bit rate), channel capacity, or the maximum throughput of a logical or physical communication path in a digital communication system. For example, bandwidth tests measure the maximum throughput of a computer network. The reason for this usage is that according to Hartley's law, the maximum data rate of a physical communication link is proportional to its bandwidth in hertz, which is sometimes called frequency bandwidth, spectral bandwidth, RF bandwidth, signal bandwidth or analog bandwidth.

Throughput

In general terms, throughput is the rate of production or the rate at which something can be processed.

In communication networks, throughput is essentially synonymous to digital bandwidth consumption. In wireless networks or cellular communication networks, the system spectral efficiency in bit/s/Hz/area unit, bit/s/Hz/site or bit/s/Hz/cell, is the maximum system throughput (aggregate throughput) divided by the analog bandwidth and some measure of the system coverage area.

In integrated circuits, often a block in a data flow diagram has a single input and a single output, and operate on discrete packets of information. Examples of such blocks are FFT modules or binary multipliers. Because the units of throughput are the reciprocal of the unit for propagation delay, which is 'seconds per message' or 'seconds per output', throughput can be used to relate a computational device performing a dedicated function such as an ASIC or embedded processor to a communications channel, simplifying system analysis.

Relative efficiency

Scalability

Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth

Power consumption

The amount of electric power used by the computer (power consumption). This becomes especially important for systems with limited power sources such as solar, batteries, human power.

Performance per watt

System designers building parallel computers, such as Google's hardware, pick CPUs based on their speed per watt of power, because the cost of powering the CPU outweighs the cost of the CPU itself.

For spaceflight computers, the processing speed per watt ratio is a more useful performance criterion than raw processing speed.

Compression ratio

Compression is useful because it helps reduce resource usage, such as data storage space or transmission capacity. Because compressed data must be decompressed to use, this extra processing imposes computational or other costs through decompression; this situation is far from being a free lunch. Data compression is subject to a space–time complexity trade-off.

Size and weight

This is an important performance feature of mobile systems, from the smart phones you keep in your pocket to the portable embedded systems in a spacecraft.

Environmental impact

The effect of a computer or computers on the environment, during manufacturing and recycling as well as during use. Measurements are taken with the objectives of reducing waste, reducing hazardous materials, and minimizing a computer's ecological footprint.

Transistor count

The transistor count is the number of transistors on an integrated circuit (IC). Transistor count is the most common measure of IC complexity.

Benchmarks

Because there are so many programs to test a CPU on all aspects of performance, benchmarks were developed.

The most famous benchmarks are the SPECint and SPECfp benchmarks developed by Standard Performance Evaluation Corporation and the Certification Mark benchmark developed by the Embedded Microprocessor Benchmark Consortium EEMBC.

Software performance testing

In software engineering, performance testing is in general testing performed to determine how a system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate or verify other quality attributes of the system, such as scalability, reliability and resource usage.

Performance testing is a subset of performance engineering, an emerging computer science practice which strives to build performance into the implementation, design and architecture of a system.

Profiling (performance analysis)

In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or frequency and duration of function calls. The most common use of profiling information is to aid program optimization.

Profiling is achieved by instrumenting either the program source code or its binary executable form using a tool called a profiler (or code profiler). A number of different techniques may be used by profilers, such as event-based, statistical, instrumented, and simulation methods.

Performance tuning

Performance tuning is the improvement of system performance. This is typically a computer application, but the same methods can be applied to economic markets, bureaucracies or other complex systems. The motivation for such activity is called a performance problem, which can be real or anticipated. Most systems will respond to increased load with some degree of decreasing performance. A system's ability to accept a higher load is called scalability, and modifying a system to handle a higher load is synonymous to performance tuning.

Systematic tuning follows these steps:

Assess the problem and establish numeric values that categorize acceptable behavior.
Measure the performance of the system before modification.
Identify the part of the system that is critical for improving the performance. This is called the bottleneck.
Modify that part of the system to remove the bottleneck.
Measure the performance of the system after modification.
If the modification makes the performance better, adopt it. If the modification makes the performance worse, put it back to the way it was.

Perceived performance

Perceived performance, in computer engineering, refers to how quickly a software feature appears to perform its task. The concept applies mainly to user acceptance aspects.

The amount of time an application takes to start up, or a file to download, is not made faster by showing a startup screen (see Splash screen) or a file progress dialog box. However, it satisfies some human needs: it appears faster to the user as well as providing a visual cue to let them know the system is handling their request.

In most cases, increasing real performance increases perceived performance, but when real performance cannot be increased due to physical limitations, techniques can be used to increase perceived performance.

Performance Equation

The total amount of time (t) required to execute a particular benchmark program is

$t = \frac{N C}{f}$ , or equivalently

$P = \frac{I f}{N}$

where

$P = \frac{1}{t}$ is "the performance" in terms of time-to-execute
$N$ is the number of instructions actually executed (the instruction path length). The code density of the instruction set strongly affects N. The value of N can either be determined exactly by using an instruction set simulator (if available) or by estimation—itself based partly on estimated or actual frequency distribution of input variables and by examining generated machine code from an HLL compiler. It cannot be determined from the number of lines of HLL source code. N is not affected by other processes running on the same processor. The significant point here is that hardware normally does not keep track of (or at least make easily available) a value of N for executed programs. The value can therefore only be accurately determined by instruction set simulation, which is rarely practiced.
$f$ is the clock frequency in cycles per second.
$C = \frac{1}{I}$ is the average cycles per instruction (CPI) for this benchmark.
$I = \frac{1}{C}$ is the average instructions per cycle (IPC) for this benchmark.

Even on one machine, a different compiler or the same compiler with different compiler optimization switches can change N and CPI—the benchmark executes faster if the new compiler can improve N or C without making the other worse, but often there is a trade-off between them—is it better, for example, to use a few complicated instructions that take a long time to execute, or to use instructions that execute very quickly, although it takes more of them to execute the benchmark?

A CPU designer is often required to implement a particular instruction set, and so cannot change N. Sometimes a designer focuses on improving performance by making significant improvements in f (with techniques such as deeper pipelines and faster caches), while (hopefully) not sacrificing too much C—leading to a speed-demon CPU design. Sometimes a designer focuses on improving performance by making significant improvements in CPI (with techniques such as out-of-order execution, superscalar CPUs, larger caches, caches with improved hit rates, improved branch prediction, speculative execution, etc.), while (hopefully) not sacrificing too much clock frequency—leading to a brainiac CPU design. For a given instruction set (and therefore fixed N) and semiconductor process, the maximum single-thread performance (1/t) requires a balance between brainiac techniques and speedracer techniques.

Search This Blog