A Medley of Potpourri

Thursday, November 2, 2023

History of artificial neural networks

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/History_of_artificial_neural_networks

Linear neural network

The simplest kind of feedforward neural network is a linear network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each node. The mean squared errors between these calculated outputs and a given target values are minimized by creating an adjustment to the weights. This technique has been known for over two centuries as the method of least squares or linear regression. It was used as a means of finding a good rough linear fit to a set of points by Legendre (1805) and Gauss (1795) for the prediction of planetary movement.

Recurrent network architectures

Wilhelm Lenz and Ernst Ising created and analyzed the Ising model (1925) which is essentially a non-learning artificial recurrent neural network (RNN) consisting of neuron-like threshold elements. In 1972, Shun'ichi Amari made this architecture adaptive. His learning RNN was popularised by John Hopfield in 1982.

Perceptrons and other early neural networks

Warren McCulloch and Walter Pitts (1943) also considered a non-learning computational model for neural networks. This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence. This work led to work on nerve networks and their link to finite automata.

In the early 1940s, D. O. Hebb created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Hebbian learning is unsupervised learning. This evolved into models for long-term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. Farley and Clark (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956).

Rosenblatt (1958) created the perceptron, an algorithm for pattern recognition. With mathematical notation, Rosenblatt described circuitry not in the basic perceptron, such as the exclusive-or circuit that could not be processed by neural networks at the time. In 1959, a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells.

Some say that research stagnated following Minsky and Papert (1969), who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks. However, by the time this book came out, methods for training multilayer perceptrons (MLPs) by deep learning were already known.

First deep learning

The first deep learning MLP was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa in 1965, as the Group Method of Data Handling. This method employs incremental layer by layer training based on regression analysis, where useless units in hidden layers are pruned with the help of a validation set.

The first deep learning MLP trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. In computer experiments conducted by Amari's student Saito, a five layer MLP with two modifiable layers learned useful internal representations to classify non-linearily separable pattern classes.

Backpropagation

The backpropagation algorithm is an efficient application of the Leibniz chain rule (1673) to networks of differentiable nodes. It is also known as the reverse mode of automatic differentiation or reverse accumulation, due to Seppo Linnainmaa (1970). The term "back-propagating errors" was introduced in 1962 by Frank Rosenblatt, but he did not have an implementation of this procedure, although Henry J. Kelley had a continuous precursor of backpropagation already in 1960 in the context of control theory. In 1982, Paul Werbos applied backpropagation to MLPs in the way that has become standard. In 1986, David E. Rumelhart et al. published an experimental analysis of the technique.

Self-organizing maps

Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982. SOMs are neurophysiologically inspired artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.

SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.

Support vector machines

Support vector machines, developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Isabelle Guyon et al., 1993, Corinna Cortes, 1995, Vapnik et al., 1997) and simpler methods such as linear classifiers gradually overtook neural networks. However, neural networks transformed domains such as the prediction of protein structures.

Convolutional neural networks (CNNs)

The origin of the CNN architecture is the "neocognitron" introduced by Kunihiko Fukushima in 1980. It was inspired by work of Hubel and Wiesel in the 1950s and 1960s which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.

In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function. The rectifier has become the most popular activation function for CNNs and deep neural networks in general.

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance. It did so by utilizing weight sharing in combination with backpropagation training. Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.

In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.

In 1989, Yann LeCun et al. trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days. Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang, et al. modified their model by removing the last fully connected layer and applied it for medical image object segmentation in 1991 and breast cancer detection in mammograms in 1994.

In 1990 Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling in order to realize a speaker independent isolated word recognition system. In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.Max-pooling is often used in modern CNNs.

LeNet-5, a 7-level CNN by Yann LeCun et al. in 1998, that classifies digits, was applied by several banks to recognize hand-written numbers on checks (British English: cheques) digitized in 32x32 pixel images. The ability to process higher-resolution images requires larger and more layers of CNNs, so this technique is constrained by the availability of computing resources.

In 2010, Backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants. Behnke (2003) relied only on the sign of the gradient (Rprop) on problems such as image reconstruction and face localization. Rprop is a first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992.

In 2011, a deep GPU-based CNN called "DanNet" by Dan Ciresan, Ueli Meier, and Juergen Schmidhuber achieved human-competitive performance for the first time in computer vision contests. Subsequently, a similar GPU-based CNN by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge 2012. A very deep CNN with over 100 layers by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft won the ImageNet 2015 contest.

ANNs were able to guarantee shift invariance to deal with small and large natural objects in large cluttered scenes, only when invariance extended beyond shift, to all ANN-learned concepts, such as location, type (object class label), scale, lighting and others. This was realized in Developmental Networks (DNs) whose embodiments are Where-What Networks, WWN-1 (2008) through WWN-7 (2013).

Artificial curiosity and generative adversarial networks

In 1991, Juergen Schmidhuber published adversarial neural networks that contest with each other in the form of a zero-sum game, where one network's gain is the other network's loss. The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity." Earlier adversarial machine learning systems "neither involved unsupervised neural networks nor were about modeling data nor used gradient descent."

In 2014, this adversarial principle was used in a generative adversarial network (GAN) by Ian Goodfellow et al. Here the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set. This can be used to create realistic deepfakes.

In 1992, Schmidhuber also published another type of gradient-based adversarial neural networks where the goal of the zero-sum game is to create disentangled representations of input patterns. This was called predictability minimization.

Nvidia's StyleGAN (2018) is based on the Progressive GAN by Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Here the GAN generator is grown from small to large scale in a pyramidal fashion. StyleGANs improve consistency between fine and coarse details in the generator network.

Transformers and their variants

Many modern large language models such as ChatGPT, GPT-4, and BERT use a feedforward neural network called Transformer by Ashish Vaswani et. al. in their 2017 paper "Attention Is All You Need." Transformers have increasingly become the model of choice for natural language processing problems, replacing recurrent neural networks (RNNs) such as long short-term memory (LSTM).

Basic ideas for this go back a long way: in 1992, Juergen Schmidhuber published the Transformer with "linearized self-attention" (save for a normalization operator), which is also called the "linear Transformer."He advertised it as an "alternative to RNNs" that can learn "internal spotlights of attention," and experimentally applied it to problems of variable binding. Here a slow feedforward neural network learns by gradient descent to control the fast weights of another neural network through outer products of self-generated activation patterns called "FROM" and "TO" which in Transformer terminology are called "key" and "value" for "self-attention." This fast weight "attention mapping" is applied to queries. The 2017 Transformer combines this with a softmax operator and a projection matrix.

Transformers are also increasingly being used in computer vision.

Deep learning with unsupervised or self-supervised pre-training

In the 1980s, backpropagation did not work well for deep FNNs and RNNs. Here the word "deep" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For an FNN, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For RNNs, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.

To overcome this problem, Juergen Schmidhuber (1992) proposed a self-supervised hierarchy of RNNs pre-trained one level at a time by self-supervised learning. This "neural history compressor" uses predictive coding to learn internal representations at multiple self-organizing time scales. The deep architecture may be used to reproduce the original data from the top level feature activations. The RNN hierarchy can be "collapsed" into a single RNN, by "distilling" a higher level "chunker" network into a lower level "automatizer" network. In 1993, a chunker solved a deep learning task whose CAP depth exceeded 1000. Such history compressors can substantially facilitate downstream supervised deep learning.

Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations. In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.

The vanishing gradient problem and its solutions

Sepp Hochreiter's diploma thesis (1991) was called "one of the most important documents in the history of machine learning" by his supervisor Juergen Schmidhuber. Hochreiter not only tested the neural history compressor, but also identified and analyzed the vanishing gradient problem. He proposed recurrent residual connections to solve this problem. This led to the deep learning method called long short-term memory (LSTM), published in 1997. LSTM recurrent neural networks can learn "very deep learning" tasks with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The "vanilla LSTM" with forget gate was introduced in 1999 by Felix Gers, Schmidhuber and Fred Cummins. LSTM has become the most cited neural network of the 20th century.

In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used LSTM principles to create the Highway network, a feedforward neural network with hundreds of layers, much deeper than previous networks. 7 months later, Kaiming He, Xiangyu Zhang; Shaoqing Ren, and Jian Sun won the ImageNet 2015 competition with an open-gated or gateless Highway network variant called Residual neural network. This has become the most cited neural network of the 21st century.

In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU of Kunihiko Fukushima also helps to overcome the vanishing gradient problem, compared to widely used activation functions prior to 2011.

Hardware-based designs

The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.

Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices for very large scale principal components analyses and convolution may create a new class of neural computing because they are fundamentally analog rather than digital (even though the first implementations may use digital devices). Ciresan and colleagues (2010) in Schmidhuber's group showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.

Contests

Between 2009 and 2012, recurrent neural networks and deep feedforward neural networks developed in Schmidhuber's research group won eight international competitions in pattern recognition and machine learning. For example, the bi-directional and multi-dimensional long short-term memory (LSTM) of Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three languages to be learned.

Ciresan and colleagues won pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition, the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge and others. Their neural networks were the first pattern recognizers to achieve human-competitive/superhuman performance on benchmarks such as traffic sign recognition (IJCNN 2012), or the MNIST handwritten digits problem.

Researchers demonstrated (2010) that deep neural networks interfaced to a hidden Markov model with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search.

GPU-based implementations of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition, the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge, the ImageNet Competition and others.

Deep, highly nonlinear neural architectures similar to the neocognitron and the "standard architecture of vision", inspired by simple and complex cells, were pre-trained with unsupervised methods by Hinton. A team from his lab won a 2012 contest sponsored by Merck to design software to help find molecules that might identify new drugs.

Multilayer perceptron

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Multilayer_perceptron

A multilayer perceptron (MLP) is a misnomer for a modern feedforward artificial neural network, consisting of fully connected neurons with a nonlinear kind of activation function, organized in at least three layers, notable for being able to distinguish data that is not linearly separable. It is a misnomer because the original perceptron used a Heaviside step function, instead of a nonlinear kind of activation function (used by modern networks).

Modern feedforward networks are trained using the backpropagation method and are colloquially referred to as the "vanilla" neural networks.

Timeline

In 1958, a layered network of perceptrons, consisting of an input layer, a hidden layer with randomized weights that did not learn, and an output layer with learning connections, was introduced already by Frank Rosenblatt in his book Perceptron. This extreme learning machine was not yet a deep learning network.

In 1965, the first deep-learning feedforward network, not yet using stochastic gradient descent, was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa, at the time called the Group Method of Data Handling.

In 1967, a deep-learning network, which used stochastic gradient descent for the first time, able to classify non-linearily separable pattern classes, was published by Shun'ichi Amari. Amari's student Saito conducted the computer experiments, using a five-layered feedforward network with two learning layers.

In 1970, modern backpropagation method, an efficient application of a chain-rule-based supervised learning, was for the first time published by the Finnish researcher Seppo Linnainmaa. The term (i.e. "back-propagating errors") itself has been used by Rosenblatt himself, but he did not know how to implement it, although a continuous precursor of backpropagation was already used in the context of control theory in 1960 by Henry J. Kelley. It is known also as a reverse mode of automatic differentiation.

In 1982, backpropagation was applied in the way that has become standard, for the first time by Paul Werbos.

In 1985, an experimental analysis of the technique was conducted by David E. Rumelhart et al. Many improvements to the approach have been made in subsequent decades.

In 1990s, an (much simpler) alternative to using neural networks, although still related support vector machine approach was developed by Vladimir Vapnik and his colleagues. In addition to performing linear classification, they were able to efficiently perform a non-linear classification using what is called the kernel trick, using high-dimensional feature spaces.

In 2003, interest in backpropagation networks returned due to the successes of deep learning being applied to language modelling by Yoshua Bengio with co-authors.

In 2017, modern transformer architectures has been introduced.

In 2021, a very simple NN architecture combining two deep MLPs with skip connections and layer normalizations was designed and called MLP-Mixer; its realizations featuring 19 to 431 millions of parameters were shown to be comparable to vision transformers of similar size on ImageNet and similar image classification tasks.

Mathematical foundations

Activation function

If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a two-layer input-output model. In MLPs some neurons use a nonlinear activation function that was developed to model the frequency of action potentials, or firing, of biological neurons.

The two historically common activation functions are both sigmoids, and are described by

y (v_{i}) = \tanh (v_{i}) and y (v_{i}) = (1 + e^{- v_{i}})^{- 1}

The first is a hyperbolic tangent that ranges from -1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1. Here $y_{i}$ is the output of the $i$ th node (neuron) and $v_{i}$ is the weighted sum of the input connections. Alternative activation functions have been proposed, including the rectifier and softplus functions. More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models).

In recent developments of deep learning the rectified linear unit (ReLU) is more frequently used as one of the possible ways to overcome the numerical problems related to the sigmoids.

Layers

The MLP consists of three or more layers (an input and an output layer with one or more hidden layers) of nonlinearly-activating nodes. Since MLPs are fully connected, each node in one layer connects with a certain weight $w_{i j}$ to every node in the following layer.

Learning

Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron.

We can represent the degree of error in an output node $j$ in the $n$ th data point (training example) by $e_{j} (n) = d_{j} (n) - y_{j} (n)$ , where $d_{j} (n)$ is the desired target value for $n$ th data point at node $j$ , and $y_{j} (n)$ is the value produced by the perceptron at node $j$ when the $n$ th data point is given as an input.

The node weights can then be adjusted based on corrections that minimize the error in the entire output for the $n$ th data point, given by

E (n) = \frac{1}{2} \sum_{output node j} e_{j}^{2} (n)

Using gradient descent, the change in each weight $w_{i j}$ is

Δ w_{j i} (n) = - η \frac{\partial E (n)}{\partial v_{j} (n)} y_{i} (n)

where $y_{i} (n)$ is the output of the previous neuron $i$ , and $η$ is the learning rate, which is selected to ensure that the weights quickly converge to a response, without oscillations. In the previous expression, $\frac{\partial E (n)}{\partial v_{j} (n)}$ denotes the partial derivate of the error $E (n)$ according to the weighted sum $v_{j} (n)$ of the input connections of neuron $i$ .

The derivative to be calculated depends on the induced local field $v_{j}$ , which itself varies. It is easy to prove that for an output node this derivative can be simplified to

- \frac{\partial E (n)}{\partial v_{j} (n)} = e_{j} (n) ϕ^{'} (v_{j} (n))

where $ϕ^{'}$ is the derivative of the activation function described above, which itself does not vary. The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is

- \frac{\partial E (n)}{\partial v_{j} (n)} = ϕ^{'} (v_{j} (n)) \sum_{k} - \frac{\partial E (n)}{\partial v_{k} (n)} w_{k j} (n)

This depends on the change in weights of the $k$ th nodes, which represent the output layer. So to change the hidden layer weights, the output layer weights change according to the derivative of the activation function, and so this algorithm represents a backpropagation of the activation function.

Black box

From Wikipedia, the free encyclopedia

To analyze an open system with a typical "black box approach", only the behavior of the stimulus/response will be accounted for, to infer the (unknown) box. The usual representation of this black box system is a data flow diagram centered in the box.

The opposite of a black box is a system where the inner components or logic are available for inspection, which is most commonly referred to as a white box (sometimes also known as a "clear box" or a "glass box").

History

A black box model can be used to describe the outputs of systems.

The modern meaning of the term "black box" seems to have entered the English language around 1945. In electronic circuit theory the process of network synthesis from transfer functions, which led to electronic circuits being regarded as "black boxes" characterized by their response to signals applied to their ports, can be traced to Wilhelm Cauer who published his ideas in their most developed form in 1941. Although Cauer did not himself use the term, others who followed him certainly did describe the method as black-box analysis. Vitold Belevitch puts the concept of black-boxes even earlier, attributing the explicit use of two-port networks as black boxes to Franz Breisig in 1921 and argues that 2-terminal components were implicitly treated as black-boxes before that.

In cybernetics, a full treatment was given by Ross Ashby in 1956. A black box was described by Norbert Wiener in 1961 as an unknown system that was to be identified using the techniques of system identification. He saw the first step in self-organization as being to be able to copy the output behavior of a black box. Many other engineers, scientists and epistemologists, such as Mario Bunge, used and perfected the black box theory in the 1960s.

System theory

In systems theory, the black box is an abstraction representing a class of concrete open system which can be viewed solely in terms of its stimuli inputs and output reactions:

The constitution and structure of the box are altogether irrelevant to the approach under consideration, which is purely external or phenomenological. In other words, only the behavior of the system will be accounted for.
— Mario Bunge

The understanding of a black box is based on the "explanatory principle", the hypothesis of a causal relation between the input and the output. This principle states that input and output are distinct, that the system has observable (and relatable) inputs and outputs and that the system is black to the observer (non-openable).

Recording of observed states

An observer makes observations over time. All observations of inputs and outputs of a black box can be written in a table, in which, at each of a sequence of times, the states of the box's various parts, input and output, are recorded. Thus, using an example from Ashby, examining a box that has fallen from a flying saucer might lead to this protocol:

Time	States of input and output
11:18	I did nothing—the Box emitted a steady hum at 240 Hz.
11:19	I pushed over the switch marked K: the note rose to 480 Hz and remained steady.
11:20	I accidentally pushed the button marked “!”—the Box increased in temperature by 20 °C.
...	Etc.

Thus, every system, fundamentally, is investigated by the collection of a long protocol, drawn out in time, showing the sequence of input and output states. From this there follows the fundamental deduction that all knowledge obtainable from a Black Box (of given input and output) is such as can be obtained by re-coding the protocol (the observation table); all that, and nothing more.

If the observer also controls input, the investigation turns into an experiment (illustration), and hypotheses about cause and effect can be tested directly.

When the experimenter is also motivated to control the box, there is active feedback in the box/observer relation, promoting what in control theory is called a feed forward architecture.

Modeling

The modeling process is the construction of a predictive mathematical model, using existing historic data (observation table).

Testing the black box model

A developed black box model is a validated model when black-box testing methods ensures that it is, based solely on observable elements.

With back testing, out of time data is always used when testing the black box model. Data has to be written down before it is pulled for black box inputs.

Other theories

The observed hydrograph is a graphic of the response of a watershed (a blackbox) with its runoff (red) to an input of rainfall (blue).

Black box theories are those theories defined only in terms of their function. The term can be applied in any field where some inquiry is made into the relations between aspects of the appearance of a system (exterior of the black box), with no attempt made to explain why those relations should exist (interior of the black box). In this context, Newton's theory of gravitation can be described as a black box theory.

Specifically, the inquiry is focused upon a system that has no immediately apparent characteristics and therefore has only factors for consideration held within itself hidden from immediate observation. The observer is assumed ignorant in the first instance as the majority of available data is held in an inner situation away from facile investigations. The black box element of the definition is shown as being characterised by a system where observable elements enter a perhaps imaginary box with a set of different outputs emerging which are also observable.

Adoption in humanities

In humanities disciplines such as philosophy of mind and behaviorism, one of the uses of black box theory is to describe and understand psychological factors in fields such as marketing when applied to an analysis of consumer behaviour.

Black box theory

Black Box theory is even wider in application than professional studies:

The child who tries to open a door has to manipulate the handle (the input) so as to produce the desired movement at the latch (the output); and he has to learn how to control the one by the other without being able to see the internal mechanism that links them. In our daily lives we are confronted at every turn with systems whose internal mechanisms are not fully open to inspection, and which must be treated by the methods appropriate to the Black Box.
— Ashby

(...) This simple rule proved very effective and is an illustration of how the Black Box principle in cybernetics can be used to control situations that, if gone into deeply, may seem very complex.
A further example of the Black Box principle is the treatment of mental patients. The human brain is certainly a Black Box, and while a great deal of neurological research is going on to understand the mechanism of the brain, progress in treatment is also being made by observing patients' responses to stimuli.
— Duckworth, Gear and Lockett

Applications

When the observer (an agent) can also do some stimulus (input), the relation with the black box is not only an observation, but an experiment.

Computing and mathematics

In computer programming and software engineering, black box testing is used to check that the output of a program is as expected, given certain inputs. The term "black box" is used because the actual program being executed is not examined.
In computing in general, a black box program is one where the user cannot see the inner workings (perhaps because it is a closed source program) or one which has no side effects and the function of which need not be examined, a routine suitable for re-use.
Also in computing, a black box refers to a piece of equipment provided by a vendor for the purpose of using that vendor's product. It is often the case that the vendor maintains and supports this equipment, and the company receiving the black box typically is hands-off.
In mathematical modeling, a limiting case.

Science and technology

In neural networking or heuristic algorithms (computer terms generally used to describe 'learning' computers or 'AI simulations'), a black box is used to describe the constantly changing section of the program environment which cannot easily be tested by the programmers. This is also called a white box in the context that the program code can be seen, but the code is so complex that it is functionally equivalent to a black box.
In physics, a black box is a system whose internal structure is unknown, or need not be considered for a particular purpose.
In cryptography to capture the notion of knowledge obtained by an algorithm through the execution of a cryptographic protocol such as a zero-knowledge proof protocol. If the output of an algorithm when interacting with the protocol matches that of a simulator given some inputs, it only needs to know the inputs.

Other applications

In philosophy and psychology, the school of behaviorism sees the human mind as a black box; see other theories.