A Medley of Potpourri

Thursday, November 2, 2023

Perceptron

From Wikipedia, the free encyclopedia

History

The perceptron was invented in 1943 by Warren McCulloch and Walter Pitts. The first hardware implementation was Mark I Perceptron machine built in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the Information Systems Branch of the United States Office of Naval Research and the Rome Air Development Center. It was first publicly demonstrated on 23 June 1960. The machine was "part of a previously secret four-year NPIC [the US' National Photographic Interpretation Center] effort from 1963 through 1966 to develop this algorithm into a useful tool for photo-interpreters".

Rosenblatt described the details of the perceptron in a 1958 paper. His organization of a perceptron is constructed of three kinds of cells ("units"): AI, AII, R, which stand for "projection", "association" and "response".

Mark I Perceptron machine

The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the "Mark I perceptron", designed for image recognition. The machine is currently in Smithsonian National Museum of American History.

The Mark I Perceptron has 3 layers.

An array of 400 photocells arranged in a 20x20 grid, named "sensory units" (S-units), or "input retina". Each S-unit can connect to up to 40 A-units.
A hidden layer of 512 perceptrons, named "association units" (A-units).
An output layer of 8 perceptrons, named "response units" (R-units).

Rosenblatt called this three-layered perceptron network the alpha-perceptron, to distinguish it from other perceptron models he experimented with.

The S-units are connected to the A-units randomly (according to a table of random numbers) via a plugboard (see photo), to "eliminate any particular intentional bias in the perceptron". The connection weights are fixed, not learned.

The A-units are connected to the R-units, with adjustable weights encoded in potentiometers, and weight updates during learning were performed by electric motors.The hardware details are in an operators' manual.

In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling AI community; based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

Rosenblatt described his experiments with many variants of the Perceptron machine in Principles of Neurodynamics (1962). Among the variants are:

"cross-coupling" (connections between units within the same layer) with possibly closed loops,
"back-coupling" (connections from units in a later layer to units in a previous layer),
four-layer perceptrons where the last two layers have adjustible weights (and thus a proper multilayer perceptron),
analyzing audio (instead of images).

The machine was shipped from Cornell to Smithsonian in 1967, under a government transfer administered by the Office of Naval Research.

Perceptrons (1969)

Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. This caused the field of neural network research to stagnate for many years, before it was recognised that a feedforward neural network with two or more layers (also called a multilayer perceptron) had greater processing power than perceptrons with one layer (also called a single-layer perceptron).

Single-layer perceptrons are only capable of learning linearly separable patterns. For a classification task with some step activation function, a single node will have a single line dividing the data points forming the patterns. More nodes can create more dividing lines, but those lines must somehow be combined to form more complex classifications. A second layer of perceptrons, or even linear nodes, are sufficient to solve many otherwise non-separable problems.

In 1969, a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for these classes of network to learn an XOR function. It is often believed (incorrectly) that they also conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. (See the page on Perceptrons (book) for more information.) Nevertheless, the often-miscited Minsky/Papert text caused a significant decline in interest and funding of neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s. This text was reprinted in 1987 as "Perceptrons - Expanded Edition" where some errors in the original text are shown and corrected.

Subsequent work

The kernel perceptron algorithm was already introduced in 1964 by Aizerman et al. Margin bounds guarantees were given for the Perceptron algorithm in the general non-separable case first by Freund and Schapire (1998), and more recently by Mohri and Rostamizadeh (2013) who extend previous results and give new L1 bounds.

The perceptron is a simplified model of a biological neuron. While the complexity of biological neuron models is often required to fully understand neural behavior, research suggests a perceptron-like linear model can produce some behavior seen in real neurons.

The solution spaces of decision boundaries for all binary functions and learning behaviors are studied in .

Definition

The appropriate weights are applied to the inputs, and the resulting weighted sum passed to a function that produces the output o.

In the modern sense, the perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input $x$ (a real-valued vector) to an output value $f (x)$ (a single binary value):

f (x) = θ (w \cdot x + b)

where $θ$ is the heaviside step-function, $w$ is a vector of real-valued weights, $w \cdot x$ is the dot product $\sum_{i = 1}^{m} w_{i} x_{i}$ , where $m$ is the number of inputs to the perceptron, and $b$ is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value.

Equivalently, since $w \cdot x + b = (w, b) \cdot (x, 1)$ , we can add a coordinate $1$ to each input, and then write it as a linear classifier that passes the origin:

f (x) = θ (w \cdot x)

The value of

f (x)

(0 or 1) is used to perform binary classification on

x

as either a positive or a negative instance. Spatially, the bias shifts the position (though not the orientation) of the planar decision boundary.

In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest feedforward neural network.

Power of representation

Information Theory

From an information theory point of view, a single perceptron with K inputs has a capacity of 2K bits of information. This result is due to Thomas Cover.

Specifically let $T (N, K)$ be the number of ways to linearly separate N points in K dimensions, then

T (N, K) = {\begin{array}{cc} 2^{N} & K \geq N \\ 2 \sum_{k = 0}^{K - 1} (\begin{matrix} N - 1 \\ k \end{matrix}) & K < N \end{array}

When K is large,

T (N, K) / 2^{N}

is very close to one when

N \leq 2 K

, but very close to zero when

N > 2 K

. In words, one perceptron unit can almost certainly memorize a random assignment of binary labels on N points when

N \leq 2 K

, but almost certainly not when

N > 2 K

Boolean function

When operating on only binary inputs, a perceptron is called a linearly separable Boolean function, or threshold Boolean function. The sequence of numbers of threshold Boolean functions on n inputs is OEIS A000609. The value is only known exactly up to $n = 9$ case, but the order of magnitude is known quite exactly: it has upper bound $2^{n^{2} - n \log_{2} n + O (n)}$ and lower bound $2^{n^{2} - n \log_{2} n - O (n)}$ .

Any Boolean linear threshold function can be implemented with only integer weights. Furthermore, the number of bits necessary and sufficient for representing a single integer weight parameter is $Θ (n \ln n)$ .

Universal approximation theorem

A single perceptron can learn to classify any half-space. It cannot solve any linearly nonseparable vectors, such as the Boolean exclusive-or problem (the famous "XOR problem").

A perceptron network with one hidden layer can learn to classify any compact subset arbitrarily closely. Similarly, it can also approximate any compactly-supported continuous function arbitrarily closely. This is essentially a special case of the theorems by George Cybenko and Kurt Hornik.

Conjunctively local perceptron

Perceptrons (Minsky and Papert, 1969) studied the kind of perceptron networks necessary to learn various boolean functions.

Consider a perceptron network with $n$ input units, one hidden layer, and one output, similar to the Mark I Perceptron machine. It computes a boolean function of type $f : 2^{n} \to 2$ . They call a function conjuctively local of order $k$ , iff there exists a perceptron network such that each unit in the hidden layer connects to at most $k$ input units.

Theorem. (Theorem 3.1.1): The parity function is conjuctively local of order $n$ .

Theorem. (Section 5.5): The connectedness function is conjuctively local of order $Ω (n^{1 / 2})$ .

Learning algorithm for a single-layer perceptron

Below is an example of a learning algorithm for a single-layer perceptron with a single output unit. For a single-layer perceptron with multiple output units, since the weights of one output unit are completely separate from all the others', the same algorithm can be run for each output unit.

For multilayer perceptrons, where a hidden layer exists, more sophisticated algorithms such as backpropagation must be used. If the activation function or the underlying process being modeled by the perceptron is nonlinear, alternative learning algorithms such as the delta rule can be used as long as the activation function is differentiable. Nonetheless, the learning algorithm described in the steps below will often work, even for multilayer perceptrons with nonlinear activation functions.

When multiple perceptrons are combined in an artificial neural network, each output neuron operates independently of all the others; thus, learning each output can be considered in isolation.

Definitions

We first define some variables:

$r$ is the learning rate of the perceptron. Learning rate is a positive number usually chosen to be less than 1. The larger the value, the greater the chance for volatility in the weight changes.
$y = f (z)$ denotes the output from the perceptron for an input vector $z$ .
$D = {(x_{1}, d_{1}), \dots, (x_{s}, d_{s})}$ $D=\{(\mathbf {x} _{1},d_{1}),\dots ,(\mathbf {x} _{s},d_{s})\}$ is the training set of $s$ $s$ samples, where:
- $x_{j}$ is the $n$ -dimensional input vector.
- $d_{j}$ is the desired output value of the perceptron for that input.

We show the values of the features as follows:

$x_{j, i}$ is the value of the $i$ th feature of the $j$ th training input vector.
$x_{j, 0} = 1$ .

To represent the weights:

$w_{i}$ is the $i$ th value in the weight vector, to be multiplied by the value of the $i$ th input feature.
Because $x_{j, 0} = 1$ , the $w_{0}$ is effectively a bias that we use instead of the bias constant $b$ .

To show the time-dependence of $w$ , we use:

$w_{i} (t)$ is the weight $i$ at time $t$ .

Steps

Initialize the weights. Weights may be initialized to 0 or to a small random value. In the example below, we use 0.
For each example j in our training set D, perform the following steps over the input $x_{j}$ ${\mathbf {x}}_{j}$ and desired output $d_{j}$ $d_{j}$ :
1. Calculate the actual output:
  $\begin{aligned} y_{j} (t) & = f [w (t) \cdot x_{j}] \\ = f [w_{0} (t) x_{j, 0} + w_{1} (t) x_{j, 1} + w_{2} (t) x_{j, 2} + \dots + w_{n} (t) x_{j, n}] \end{aligned}$
2. Update the weights:
  $w_{i} (t + 1) = w_{i} (t) + r \cdot (d_{j} - y_{j} (t)) x_{j, i}$ , for all features $0 \leq i \leq n$ , $r$ is the learning rate.
For offline learning, the second step may be repeated until the iteration error $\frac{1}{s} \sum_{j = 1}^{s} | d_{j} - y_{j} (t) |$ is less than a user-specified error threshold $γ$ , or a predetermined number of iterations have been completed, where s is again the size of the sample set.

The algorithm updates the weights after every training sample in step 2b.

Convergence of one perceptron on a linearly separable dataset

A single perceptron is a linear classifier, therefore it can only reach a stable state if all input vectors classified correctly if the training set $D$ is not linearly separable, i.e. if the positive examples cannot be separated from the negative examples by a hyperplane. In this case, the algorithm would not converge since there is no solution. Hence, if linear separability of the training set is not known a priori, one of the training variants below should be used. Detailed analysis and extensions to the convergence theorem are in Chapter 11 of Perceptrons (1969).

Linear separability is testable in time $min (O (n^{d / 2}), O (d^{2 n}), O (n^{d - 1} \ln n))$ , where $n$ is the number of data points, and $d$ is the dimension of each point.

If the training set is linearly separable, then the perceptron is guaranteed to converge after making finitely many mistakes. The theorem is proved by Rosenblatt et al.

Perceptron convergence theorem — Given a dataset $D$ , such that $max_{(x, y) \in D} ‖ x ‖_{2} = R$ , and it is linearly separable by some unit vector $w^{*}$ , with margin $γ$ :

γ := min_{(x, y) \in D} y (w^{*} \cdot x)

Then the perceptron 0-1 learning algorithm converges after making at most $(R / γ)^{2}$ mistakes, for any learning rate, and any method of sampling from the dataset.

The following simple proof is due to Novikoff (1962). The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction with which it has a negative dot product, and thus can be bounded above by $O (\sqrt t)$ , where $t$ is the number of changes to the weight vector. However, it can also be bounded below by $O (t)$ because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector.

Proof

Suppose at step $t$ , the perceptron with weight $w_{t}$ makes a mistake on data point $(x, y)$ , then it updates to $w_{t + 1} = w_{t} + r (y - f_{w_{t}} (x)) x$ .

If $y = 0$ , the argument is symmetric, so we omit it.

WLOG, $y = 1$ , then $f_{w_{t}} (x) = 0$ , $f_{w^{*}} (x) = 1$ , and $w_{t + 1} = w_{t} + r x$ .

By assumption, we have separation with margins:

w^{*} \cdot x \geq γ

Thus,

w^{*} \cdot w_{t + 1} - w^{*} \cdot w_{t} = w^{*} \cdot (r x) \geq r γ

Also

‖ w_{t + 1} ‖_{2}^{2} - ‖ w_{t} ‖_{2}^{2} = ‖ w_{t} + r x ‖_{2}^{2} - ‖ w_{t} ‖_{2}^{2} = 2 r (w_{t} \cdot x) + r^{2} ‖ x ‖_{2}^{2}

and since the perceptron made a mistake,

w_{t} \cdot x \leq 0

, and so

‖ w_{t + 1} ‖_{2}^{2} - ‖ w_{t} ‖_{2}^{2} \leq ‖ x ‖_{2}^{2} \leq r^{2} R^{2}

Since we started with $w_{0} = 0$ , after making $N$ mistakes,

‖ w ‖_{2} \leq \sqrt{N r^{2} R^{2}}

but also

‖ w ‖_{2} \geq w \cdot w^{*} \geq N r γ

Combining the two, we have $N \leq (R / γ)^{2}$

While the perceptron algorithm is guaranteed to converge on some solution in the case of a linearly separable training set, it may still pick any solution and problems may admit many solutions of varying quality. The perceptron of optimal stability, nowadays better known as the linear support-vector machine, was designed to solve this problem (Krauth and Mezard, 1987).

Perceptron cycling theorem

When the dataset is not linearly separable, then there is no way for a single perceptron to converge. However, we still have

Perceptron cycling theorem — If the dataset $D$ has only finitely many points, then there exists an upper bound number $M$ , such that for any starting weight vector $w_{0}$ all weight vector $w_{t}$ has norm bounded by $‖ w_{t} ‖ \leq ‖ w_{0} ‖ + M$

This is proved first by Bradley Efron.

Learning a Boolean function

Consider a dataset where the $x$ are from ${- 1, + 1}^{n}$ , that is, the vertices of an n-dimensional hypercube centered at origin, and $y = θ (x_{i})$ . That is, all data points with positive $x_{i}$ have $y = 1$ , and vice versa. By the perceptron convergence theorem, a perceptron would converge after making at most $n$ mistakes.

If we were to write a logical program to perform the same task, each positive example shows that one of the coordinates is the right one, and each negative example shows that its complement is a positive example. By collecting all the known positive examples, we eventually eliminate all but one coordinate, at which point the dataset is learned.

This bound is asymptotically tight in terms of the worst-case. In the worst-case, the first presented example is entirely new, and gives $n$ bits of information, but each subsequent example would differ minimally from previous examples, and gives 1 bit each. After $n + 1$ examples, there are $2 n$ bits of information, which is sufficient for the perceptron (with $2 n$ bits of information).

However, it is not tight in terms of expectation if the examples are presented uniformly at random, since the first would give $n$ bits, the second $n / 2$ bits, and so on, taking $O (\ln n)$ examples in total.

Variants

The pocket algorithm with ratchet (Gallant, 1990) solves the stability problem of perceptron learning by keeping the best solution seen so far "in its pocket". The pocket algorithm then returns the solution in the pocket, rather than the last solution. It can be used also for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications. However, these solutions appear purely stochastically and hence the pocket algorithm neither approaches them gradually in the course of learning, nor are they guaranteed to show up within a given number of learning steps.

The Maxover algorithm (Wendemuth, 1995) is "robust" in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set. In the linearly separable case, it will solve the training problem – if desired, even with optimal stability (maximum margin between the classes). For non-separable data sets, it will return a solution with a small number of misclassifications. In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps. Convergence is to global optimality for separable data sets and to local optimality for non-separable data sets.

The Voted Perceptron (Freund and Schapire, 1999), is a variant using multiple weighted perceptrons. The algorithm starts a new perceptron every time an example is wrongly classified, initializing the weights vector with the final weights of the last perceptron. Each perceptron will also be given another weight corresponding to how many examples do they correctly classify before wrongly classifying one, and at the end the output will be a weighted vote on all perceptrons.

In separable problems, perceptron training can also aim at finding the largest separating margin between the classes. The so-called perceptron of optimal stability can be determined by means of iterative training and optimization schemes, such as the Min-Over algorithm (Krauth and Mezard, 1987) or the AdaTron (Anlauf and Biehl, 1989)). AdaTron uses the fact that the corresponding quadratic optimization problem is convex. The perceptron of optimal stability, together with the kernel trick, are the conceptual foundations of the support-vector machine.

The $α$ -perceptron further used a pre-processing layer of fixed random weights, with thresholded output units. This enabled the perceptron to classify analogue patterns, by projecting them into a binary space. In fact, for a projection space of sufficiently high dimension, patterns can become linearly separable.

Another way to solve nonlinear problems without using multiple layers is to use higher order networks (sigma-pi unit). In this type of network, each element in the input vector is extended with each pairwise combination of multiplied inputs (second order). This can be extended to an n-order network.

It should be kept in mind, however, that the best classifier is not necessarily that which classifies all the training data perfectly. Indeed, if we had the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and the nonlinear solution is overfitted.

Other linear classification algorithms include Winnow, support-vector machine, and logistic regression.

Multiclass perceptron

Like most other techniques for training linear classifiers, the perceptron generalizes naturally to multiclass classification. Here, the input $x$ and the output $y$ are drawn from arbitrary sets. A feature representation function $f (x, y)$ maps each possible input/output pair to a finite-dimensional real-valued feature vector. As before, the feature vector is multiplied by a weight vector $w$ , but now the resulting score is used to choose among many possible outputs:

\hat{y} = {argmax}_{y} f (x, y) \cdot w .

Learning again iterates over the examples, predicting an output for each, leaving the weights unchanged when the predicted output matches the target, and changing them when it does not. The update becomes:

w_{t + 1} = w_{t} + f (x, y) - f (x, \hat{y}) .

This multiclass feedback formulation reduces to the original perceptron when $x$ is a real-valued vector, $y$ is chosen from ${0, 1}$ , and $f (x, y) = y x$ .

For certain problems, input/output representations and features can be chosen so that ${a r g m a x}_{y} f (x, y) \cdot w$ can be found efficiently even though $y$ is chosen from a very large or even infinite set.

Since 2002, perceptron training has become popular in the field of natural language processing for such tasks as part-of-speech tagging and syntactic parsing (Collins, 2002). It has also been applied to large-scale machine learning problems in a distributed computing setting.