A Medley of Potpourri

Thursday, December 26, 2024

Weak supervision

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Weak_supervision

Weak supervision (also known as semi-supervised learning) is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to the large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data (exclusively used in the more expensive and time-consuming supervised learning paradigm), followed by a large amount of unlabeled data (used exclusively in the unsupervised learning paradigm). In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, the learning problem can be seen as an exam, with the labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying low-dimensional manifold where the data reside.

Problem

Figure: Tendency for a task to employ supervised vs. unsupervised methods. The task names intentionally straddle the circle boundaries, showing that the classical division of imaginative tasks (left) employing unsupervised methods is blurred in today's learning schemes.

The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.

Technique

More formally, semi-supervised learning assumes that a set of $l$ independently and identically distributed examples $x_{1}, \dots, x_{l} \in X$ with corresponding labels $y_{1}, \dots, y_{l} \in Y$ and $u$ unlabeled examples $x_{l + 1}, \dots, x_{l + u} \in X$ are processed. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.

Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data $x_{l + 1}, \dots, x_{l + u}$ only. The goal of inductive learning is to infer the correct mapping from $X$ to $Y$ .

It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.

Assumptions

In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions:

Continuity / smoothness assumption

Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so few points are close to each other but in different classes.

Cluster assumption

The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.

Manifold assumption

The data lie approximately on a manifold of much lower dimension than the input space. In this case learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold.

The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds, and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images, respectively.

History

The heuristic approach of self-training (also known as self-learning or self-labeling) is historically the oldest approach to semi-supervised learning, with examples of applications starting in the 1960s.

The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s. Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995.

Methods

Generative models

Generative approaches to statistical learning first seek to estimate $p (x | y)$ , the distribution of data points belonging to each class. The probability $p (y | x)$ that a given point $x$ has label $y$ is then proportional to $p (x | y) p (y)$ by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about $p (x)$ ) or as an extension of unsupervised learning (clustering plus some labels).

Generative models assume that the distributions take some particular form $p (x | y, θ)$ parameterized by the vector $θ$ . If these assumptions are incorrect, the unlabeled data may actually decrease the accuracy of the solution relative to what would have been obtained from labeled data alone. However, if the assumptions are correct, then the unlabeled data necessarily improves performance.

The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models.

The parameterized joint distribution can be written as $p (x, y | θ) = p (y | θ) p (x | y, θ)$ by using the chain rule. Each parameter vector $θ$ is associated with a decision function $f_{θ} (x) = \underset{y}{\operatorname{argmax}} p (y | x, θ)$ . The parameter is then chosen based on fit to both the labeled and unlabeled data, weighted by $λ$ :

$$\underset{θ}{\operatorname{argmax}} \left( \log p (\{x_{i}, y_{i}\}_{i = 1}^{l} | θ) + λ \log p (\{x_{i}\}_{i = l + 1}^{l + u} | θ) \right)$$
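As a minimal sketch of this objective, the following code evaluates the weighted log-likelihood for one-dimensional Gaussian class-conditional distributions. The function names, the layout of `theta` as a mapping from class to (prior, mean, standard deviation), and the toy data are assumptions made for illustration. The sketch only scores candidate parameters; it does not perform the maximization.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def semi_supervised_objective(theta, labeled, unlabeled, lam):
    """Return log p(labeled | theta) + lam * log p(unlabeled | theta).

    theta maps each class y to (prior, mean, std); this layout is an
    assumption of the sketch, not notation from the text.
    """
    # Labeled term: log p(x, y | theta) = log [p(y | theta) p(x | y, theta)]
    labeled_ll = sum(
        math.log(theta[y][0] * gaussian_pdf(x, theta[y][1], theta[y][2]))
        for x, y in labeled
    )
    # Unlabeled term: p(x | theta) sums over the unknown label (a mixture)
    unlabeled_ll = sum(
        math.log(sum(p * gaussian_pdf(x, m, s) for p, m, s in theta.values()))
        for x in unlabeled
    )
    return labeled_ll + lam * unlabeled_ll


# Toy data: two labeled points and six unlabeled points in two groups
labeled = [(-2.1, 0), (1.9, 1)]
unlabeled = [-2.4, -1.8, -2.0, 2.2, 1.7, 2.1]

# Two candidate parameter settings; the first matches the data far better
theta_good = {0: (0.5, -2.0, 0.5), 1: (0.5, 2.0, 0.5)}
theta_bad = {0: (0.5, -0.5, 0.5), 1: (0.5, 0.5, 0.5)}

score_good = semi_supervised_objective(theta_good, labeled, unlabeled, lam=1.0)
score_bad = semi_supervised_objective(theta_bad, labeled, unlabeled, lam=1.0)
print(score_good > score_bad)
```

Setting `lam` to zero reduces the objective to the purely supervised log-likelihood, which makes the role of the weight explicit.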

Low-density separation

Another major class of methods attempts to place boundaries in regions with few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss $(1 - y f (x))_{+}$ for labeled data, a loss function $(1 - | f (x) |)_{+}$ is introduced over the unlabeled data by letting $y = \operatorname{sign} f (x)$ . TSVM then selects $f^{*} (x) = h^{*} (x) + b$ from a reproducing kernel Hilbert space $H$ by minimizing the regularized empirical risk:

$$f^{*} = \underset{f}{\operatorname{argmin}} \left( \sum_{i = 1}^{l} (1 - y_{i} f (x_{i}))_{+} + λ_{1} ‖ h ‖_{H}^{2} + λ_{2} \sum_{i = l + 1}^{l + u} (1 - | f (x_{i}) |)_{+} \right)$$

An exact solution is intractable due to the non-convex term $(1 - | f (x) |)_{+}$ , so research focuses on useful approximations.
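The regularized risk above can be evaluated directly for a candidate function. The following sketch assumes a linear function $f (x) = w \cdot x + b$, so that the norm of $h$ reduces to the squared Euclidean norm of $w$ (a linear-kernel assumption). The function names and toy data are illustrative. The code only computes the objective and does not attempt the non-convex minimization.

```python
def positive_part(z):
    """(z)_+ = max(0, z)."""
    return max(0.0, z)


def tsvm_objective(w, b, labeled, unlabeled, lam1, lam2):
    """Regularized empirical risk of a linear f(x) = w . x + b.

    Assumes a linear kernel, so ||h||_H^2 is the squared norm of w.
    """
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    # Standard hinge loss on the labeled examples
    labeled_loss = sum(positive_part(1.0 - y * f(x)) for x, y in labeled)
    # Norm penalty on the linear part h(x) = w . x
    regularizer = lam1 * sum(wi * wi for wi in w)
    # Loss on unlabeled examples: penalizes points inside the margin
    unlabeled_loss = lam2 * sum(positive_part(1.0 - abs(f(x))) for x in unlabeled)
    return labeled_loss + regularizer + unlabeled_loss


# Two clusters around x1 = -2 and x1 = +2, one labeled point in each
labeled = [((-2.0, 0.0), -1), ((2.0, 0.0), 1)]
unlabeled = [(-2.0, 1.0), (-2.0, -1.0), (2.0, 1.0), (2.0, -1.0)]

# Boundary x1 = 0 runs through the empty gap between the clusters
through_gap = tsvm_objective((1.0, 0.0), 0.0, labeled, unlabeled, 0.1, 1.0)
# Boundary x1 = -2 cuts through the left cluster
through_cluster = tsvm_objective((1.0, 0.0), 2.0, labeled, unlabeled, 0.1, 1.0)
print(through_gap < through_cluster)
```

In this toy setting the boundary through the low-density gap incurs only the norm penalty, whereas the boundary through a cluster is penalized by both the labeled and the unlabeled terms.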

Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).

Laplacian regularization

Laplacian regularization has historically been approached through the graph Laplacian. Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its $k$ nearest neighbors or to examples within some distance $ϵ$ . The weight $W_{i j}$ of an edge between $x_{i}$ and $x_{j}$ is then set to $e^{- ‖ x_{i} - x_{j} ‖^{2} / ϵ^{2}}$ .

Within the framework of manifold regularization, the graph serves as a proxy for the manifold. A term is added to the standard Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes

$$\underset{f \in H}{\operatorname{argmin}} \left( \frac{1}{l} \sum_{i = 1}^{l} V (f (x_{i}), y_{i}) + λ_{A} ‖ f ‖_{H}^{2} + λ_{I} \int_{M} ‖ \nabla_{M} f (x) ‖^{2} d p (x) \right)$$

where $H$ is a reproducing kernel Hilbert space and $M$ is the manifold on which the data lie. The regularization parameters $λ_{A}$ and $λ_{I}$ control smoothness in the ambient and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian $L = D - W$ where $D_{i i} = \sum_{j = 1}^{l + u} W_{i j}$ and $f$ is the vector $[f (x_{1}) \dots f (x_{l + u})]$ , we have

$$f^{T} L f = \frac{1}{2} \sum_{i, j = 1}^{l + u} W_{i j} (f_{i} - f_{j})^{2} \approx \int_{M} ‖ \nabla_{M} f (x) ‖^{2} d p (x)$$

The graph-based approach to Laplacian regularization is related to the finite difference method.

The Laplacian can also be used to extend the supervised learning algorithms regularized least squares and support vector machines (SVM) to their semi-supervised versions, Laplacian regularized least squares and Laplacian SVM.
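The graph construction and the quadratic form $f^{T} L f$ can be sketched in a few lines. For brevity the sketch assumes a fully connected graph with Gaussian weights rather than a $k$-nearest-neighbor or $ϵ$-neighborhood graph. All function names and the toy points are illustrative. The weighted sum of squared differences is taken once per unordered pair of nodes, which equals half of the double sum over all ordered pairs.

```python
import math


def gaussian_weights(points, eps):
    """W_ij = exp(-||x_i - x_j||^2 / eps^2) for i != j, and W_ii = 0."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-sq_dist / eps ** 2)
    return W


def graph_laplacian(W):
    """L = D - W, where D is diagonal with D_ii = sum_j W_ij."""
    n = len(W)
    return [
        [(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(L, f):
    """Compute f^T L f for a list of node values f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))


def pairwise_smoothness(W, f):
    """Sum of W_ij (f_i - f_j)^2, counting each unordered pair once."""
    n = len(f)
    return sum(
        W[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(i + 1, n)
    )


# Two well-separated clusters of three points each
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
W = gaussian_weights(points, eps=1.0)
L = graph_laplacian(W)

f_smooth = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # constant within each cluster
f_rough = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # changes sign inside clusters

print(quadratic_form(L, f_smooth) < quadratic_form(L, f_rough))
```

A function that is constant within each cluster yields a much smaller penalty than one that changes sign between nearby points, which is the behavior the intrinsic regularization term is meant to reward.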

Heuristic approaches

Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples $x_{1}, \dots, x_{l + u}$ may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labeled examples. In this vein, some methods learn a low-dimensional representation using the supervised data and then apply either low-density separation or graph-based methods to the learned representation. Iteratively refining the representation and then performing semi-supervised learning on said representation may further improve performance.

Self-training is a wrapper method for semi-supervised learning. First a supervised learning algorithm is trained based on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for the supervised learning algorithm. Generally only the labels the classifier is most confident in are added at each step. In natural language processing, a common self-training algorithm is the Yarowsky algorithm for problems like word sense disambiguation, accent restoration, and spelling correction.
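The self-training loop described above can be sketched with any base classifier. The following example uses a one-dimensional nearest-centroid classifier, and it measures confidence as the gap between the distances to the two nearest centroids. Both choices are assumptions made to keep the sketch self-contained. The function names and the parameter `per_round` are likewise illustrative.

```python
def fit_centroids(labeled):
    """Nearest-centroid classifier: the mean of the points in each class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return (label, confidence) for x; requires at least two classes.

    Confidence is the gap between the distances to the two nearest
    centroids, a heuristic chosen for this sketch.
    """
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    confidence = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, confidence


def self_train(labeled, unlabeled, per_round=2):
    """Repeatedly add the most confidently predicted points to the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    while pool:
        # Train on the current labeled set only
        centroids = fit_centroids(labeled)
        # Predict every remaining unlabeled point
        predictions = {x: predict_with_confidence(centroids, x) for x in pool}
        # Keep only the predictions the classifier is most confident in
        chosen = sorted(pool, key=lambda x: predictions[x][1], reverse=True)
        for x in chosen[:per_round]:
            labeled.append((x, predictions[x][0]))
            pool.remove(x)
    return fit_centroids(labeled), labeled


seed_labels = [(0.0, "a"), (10.0, "b")]
unlabeled_points = [1.0, 2.0, 3.0, 4.0, 7.0, 8.0, 9.0, 6.0]
final_centroids, final_labeled = self_train(seed_labels, unlabeled_points)
print(final_centroids)
```

Because pseudo-labels are never revised in this sketch, an early mistake would propagate into later rounds, which is why only the most confident predictions are added at each step.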

Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another.

In human cognition

Human responses to formal semi-supervised learning problems have yielded varying conclusions about the degree of influence of the unlabeled data. More natural learning problems may also be viewed as instances of semi-supervised learning. Much of human concept learning involves a small amount of direct instruction (e.g. parental labeling of objects during childhood) combined with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback).

Human infants are sensitive to the structure of unlabeled natural categories such as images of dogs and cats or male and female faces. Infants and children take into account not only unlabeled examples, but the sampling process from which labeled examples arise.

Evolutionary algorithm

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Evolutionary_algorithm

In computational intelligence (CI), an evolutionary algorithm (EA) is a subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place after the repeated application of the above operators.

Evolutionary algorithms often perform well approximating solutions to all types of problems because they ideally do not make any assumption about the underlying fitness landscape. Techniques from evolutionary algorithms applied to the modeling of biological evolution are generally limited to explorations of microevolutionary processes and planning models based upon cellular processes. In most real applications of EAs, computational complexity is a prohibiting factor. In fact, this computational complexity is due to fitness function evaluation. Fitness approximation is one of the solutions to overcome this difficulty. However, seemingly simple EAs can often solve complex problems; therefore, there may be no direct link between algorithm complexity and problem complexity.

Implementation

The following is an example of a generic single-objective genetic algorithm.

Step One: Generate the initial population of individuals randomly. (First generation)

Step Two: Repeat the following regenerational steps until termination (time limit, sufficient fitness achieved, etc.):

Evaluate the fitness of each individual in the population.
Select the individuals for reproduction based on their fitness. (Parents)
Breed new individuals through crossover and mutation operations to give birth to offspring.
Replace the least-fit individuals of the population with new individuals.
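The steps above can be sketched as a small genetic algorithm. The example assumes bit-string individuals, the OneMax fitness function (the number of ones), binary tournament selection, and one-point crossover. None of these choices come from the text; they are picked only to make the sketch concrete. All parameter values are illustrative.

```python
import random


def one_max(bits):
    """Illustrative fitness: the number of ones in the bit string."""
    return sum(bits)


def genetic_algorithm(n_bits=20, pop_size=30, generations=60,
                      mutation_rate=0.05, n_offspring=10, seed=0):
    rng = random.Random(seed)
    # Step One: generate the initial population randomly
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_history = []
    # Step Two: repeat until termination (generation limit or optimum found)
    for generation in range(generations):
        # Evaluate the fitness of each individual (sort best first)
        population.sort(key=one_max, reverse=True)
        best_history.append(one_max(population[0]))
        if best_history[-1] == n_bits:
            break
        offspring = []
        for _ in range(n_offspring):
            # Select parents by binary tournament
            parent1 = max(rng.sample(population, 2), key=one_max)
            parent2 = max(rng.sample(population, 2), key=one_max)
            # Breed: one-point crossover followed by bit-flip mutation
            cut = rng.randint(1, n_bits - 1)
            child = parent1[:cut] + parent2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            offspring.append(child)
        # Replace the least-fit individuals with the new individuals
        population[-n_offspring:] = offspring
    return max(population, key=one_max), best_history


best, history = genetic_algorithm()
print(one_max(best), history[0])
```

Because only the least-fit individuals are replaced, the best individual always survives, so the recorded best fitness never decreases from one generation to the next.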

Types

Similar techniques differ in genetic representation and other implementation details, and the nature of the particular applied problem.

Genetic algorithm – This is the most popular type of EA. One seeks the solution of a problem in the form of strings of numbers (traditionally binary, although the best representations are usually those that reflect something about the problem being solved), by applying operators such as recombination and mutation (sometimes one, sometimes both). This type of EA is often used in optimization problems.
Genetic programming – Here the solutions are in the form of computer programs, and their fitness is determined by their ability to solve a computational problem. There are many variants of genetic programming, including Cartesian genetic programming, gene expression programming, grammatical evolution, linear genetic programming, and multi expression programming.
Evolutionary programming – Similar to genetic programming, but the structure of the program is fixed and its numerical parameters are allowed to evolve.
Evolution strategy – Works with vectors of real numbers as representations of solutions, and typically uses self-adaptive mutation rates. The method is mainly used for numerical optimization, although there are also variants for combinatorial tasks.
Differential evolution – Based on vector differences and is therefore primarily suited for numerical optimization problems.
Coevolutionary algorithm – Similar to genetic algorithms and evolution strategies, but the created solutions are compared on the basis of their outcomes from interactions with other solutions. Solutions can either compete or cooperate during the search process. Coevolutionary algorithms are often used in scenarios where the fitness landscape is dynamic, complex, or involves competitive interactions.
Neuroevolution – Similar to genetic programming but the genomes represent artificial neural networks by describing structure and connection weights. The genome encoding can be direct or indirect.
Learning classifier system – Here the solution is a set of classifiers (rules or conditions). A Michigan-LCS evolves at the level of individual classifiers whereas a Pittsburgh-LCS uses populations of classifier-sets. Initially, classifiers were only binary, but now include real, neural net, or S-expression types. Fitness is typically determined with either a strength or accuracy based reinforcement learning or supervised learning approach.
Quality–Diversity algorithms – QD algorithms simultaneously aim for high-quality and diverse solutions. Unlike traditional optimization algorithms that solely focus on finding the best solution to a problem, QD algorithms explore a wide variety of solutions across a problem space and keep those that are not just high performing, but also diverse and unique.

Theoretical background

The following theoretical principles apply to all or almost all EAs.

No free lunch theorem

The no free lunch theorem of optimization states that all optimization strategies are equally effective when the set of all optimization problems is considered. Under the same condition, no evolutionary algorithm is fundamentally better than another. One algorithm can only outperform another if the set of all problems is restricted, which is exactly what is inevitably done in practice. Therefore, to improve an EA, it must exploit problem knowledge in some form (e.g. by choosing a certain mutation strength or a problem-adapted coding). Thus, if two EAs are compared, this constraint is implied. In addition, an EA can use problem-specific knowledge by, for example, not randomly generating the entire start population, but creating some individuals through heuristics or other procedures. Another possibility to tailor an EA to a given problem domain is to involve suitable heuristics, local search procedures or other problem-related procedures in the process of generating the offspring. This form of extension of an EA is also known as a memetic algorithm. Both extensions play a major role in practical applications, as they can speed up the search process and make it more robust.

Convergence

For EAs in which, in addition to the offspring, at least the best individual of the parent generation is used to form the subsequent generation (so-called elitist EAs), there is a general proof of convergence under the condition that an optimum exists. Without loss of generality, a maximum search is assumed for the proof:

From the property of elitist offspring acceptance and the existence of the optimum it follows that per generation $k$ an improvement of the fitness $F$ of the respective best individual $x^{'}$ will occur with a probability $P > 0$ . Thus:

$$F (x_{1}^{'}) \leq F (x_{2}^{'}) \leq F (x_{3}^{'}) \leq \dots \leq F (x_{k}^{'}) \leq \dots$$

I.e., the fitness values represent a monotonically non-decreasing sequence, which is bounded due to the existence of the optimum. From this follows the convergence of the sequence against the optimum.
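The monotonicity used in the proof can be observed in a very small elitist search. The sketch below assumes a single real-valued individual, Gaussian mutation, and an example fitness function with a known maximum of zero. These are illustrative choices, and the code demonstrates the non-decreasing sequence rather than the proof itself.

```python
import random


def fitness(x):
    """Example objective with a known maximum F(3) = 0 (an assumption)."""
    return -(x - 3.0) ** 2


def elitist_search(generations=200, step=0.5, seed=1):
    rng = random.Random(seed)
    best = rng.uniform(-10.0, 10.0)
    history = [fitness(best)]
    for _ in range(generations):
        # Offspring is a mutated copy of the current best individual
        child = best + rng.gauss(0.0, step)
        # Elitist acceptance: the parent survives unless the child is at least as fit
        if fitness(child) >= fitness(best):
            best = child
        history.append(fitness(best))
    return best, history


best_x, fitness_history = elitist_search()
print(fitness_history[0], fitness_history[-1])
```

The recorded fitness values never decrease and never exceed the optimum, which matches the bounded, monotonically non-decreasing sequence in the argument above.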

Since the proof makes no statement about the speed of convergence, it is of little help in practical applications of EAs. But it does justify the recommendation to use elitist EAs. However, when using the usual panmictic population model, elitist EAs tend to converge prematurely more than non-elitist ones. In a panmictic population model, mate selection (step 2 of the section about implementation) is such that every individual in the entire population is eligible as a mate. In non-panmictic populations, selection is suitably restricted, so that the dispersal speed of better individuals is reduced compared to panmictic ones. Thus, the general risk of premature convergence of elitist EAs can be significantly reduced by suitable population models that restrict mate selection.

Virtual alphabets

With the theory of virtual alphabets, David E. Goldberg showed in 1990 that by using a representation with real numbers, an EA that uses classical recombination operators (e.g. uniform or n-point crossover) cannot reach certain areas of the search space, in contrast to a coding with binary numbers. This results in the recommendation for EAs with real representation to use arithmetic operators for recombination (e.g. arithmetic mean or intermediate recombination). With suitable operators, real-valued representations are more effective than binary ones, contrary to earlier opinion.
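One way to see the limitation is that uniform crossover on real-valued genes only copies values that already occur in the parents, whereas intermediate recombination can produce values in between. The following sketch illustrates this difference only; it is not a reconstruction of Goldberg's analysis. The function names and the two parent vectors are assumptions.

```python
import random


def uniform_crossover(a, b, rng):
    """Each gene is copied unchanged from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]


def intermediate_recombination(a, b, rng):
    """Each gene is a random convex combination of the parents' genes."""
    child = []
    for x, y in zip(a, b):
        t = rng.random()
        child.append(t * x + (1.0 - t) * y)
    return child


rng = random.Random(0)
parent_a, parent_b = [0.0, 0.0], [1.0, 1.0]

uniform_children = [uniform_crossover(parent_a, parent_b, rng) for _ in range(200)]
blended_children = [intermediate_recombination(parent_a, parent_b, rng)
                    for _ in range(200)]

# Collect every gene value that appears in the offspring
uniform_values = {gene for child in uniform_children for gene in child}
blended_values = {gene for child in blended_children for gene in child}

print(sorted(uniform_values))
print(len(blended_values))
```

With uniform crossover the offspring never leave the corner points defined by the parents' gene values, so without mutation the interior of the square spanned by the two parents stays unreachable.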

Comparison to biological processes

A possible limitation of many evolutionary algorithms is their lack of a clear genotype–phenotype distinction. In nature, the fertilized egg cell undergoes a complex process known as embryogenesis to become a mature phenotype. This indirect encoding is believed to make the genetic search more robust (i.e. reduce the probability of fatal mutations), and also may improve the evolvability of the organism. Such indirect (also known as generative or developmental) encodings also enable evolution to exploit the regularity in the environment. Recent work in the field of artificial embryogeny, or artificial developmental systems, seeks to address these concerns. In addition, gene expression programming successfully explores a genotype–phenotype system, where the genotype consists of linear multigenic chromosomes of fixed length and the phenotype consists of multiple expression trees or computer programs of different sizes and shapes.

Comparison to Monte-Carlo methods

Both method classes have in common that their individual search steps are determined by chance. The main difference, however, is that EAs, like many other metaheuristics, learn from past search steps and incorporate this experience into the execution of the next search steps in a method-specific form. With EAs, this is done firstly through the fitness-based selection operators for partner choice and the formation of the next generation, and secondly through the type of search steps: an EA starts from a current solution and changes it, or it mixes the information of two solutions. In contrast, when new solutions are drawn at random in Monte-Carlo methods, there is usually no connection to existing solutions.

If, on the other hand, the search space of a task is such that there is nothing to learn, Monte-Carlo methods are an appropriate tool, as they do not contain any algorithmic overhead that attempts to draw suitable conclusions from the previous search. An example of such tasks is the proverbial search for a needle in a haystack, e.g. in the form of a flat (hyper)plane with a single narrow peak.

Applications

The areas in which evolutionary algorithms are practically used are almost unlimited and range from industry, engineering, complex scheduling, agriculture, robot movement planning and finance to research and art. The application of an evolutionary algorithm requires some rethinking from the inexperienced user, as the approach to a task using an EA is different from conventional exact methods and this is usually not part of the curriculum of engineers or other disciplines. For example, the fitness calculation must not only formulate the goal but also support the evolutionary search process towards it, e.g. by rewarding improvements that do not yet lead to a better evaluation of the original quality criteria. For example, if peak utilisation of resources such as personnel deployment or energy consumption is to be avoided in a scheduling task, it is not sufficient to assess the maximum utilisation. Rather, the number and duration of exceedances of a still acceptable level should also be recorded in order to reward reductions below the actual maximum peak value. There are therefore some publications aimed at beginners that seek to help them avoid beginner's mistakes and lead an application project to success. This includes clarifying the fundamental question of when an EA should be used to solve a problem and when it is better not to.
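The scheduling example above can be made concrete with two fitness functions for a resource load profile. The sketch assumes that a schedule is summarized as a list of load values per time slot, that exceedances are counted as the number of slots above the acceptable level, and that a hypothetical `weight` parameter controls how strongly they are penalized. None of these details come from the text.

```python
def peak_only_fitness(load):
    """Assess only the maximum utilisation (higher fitness is better)."""
    return -max(load)


def shaped_fitness(load, acceptable_level, weight=0.1):
    """Also penalize the number of time slots above the acceptable level.

    The weight is a hypothetical tuning parameter of this sketch.
    """
    slots_exceeding = sum(1 for value in load if value > acceptable_level)
    return -(max(load) + weight * slots_exceeding)


# Two schedules with the same peak; the second exceeds the level less often
schedule_a = [5, 9, 9, 9, 4]
schedule_b = [5, 9, 6, 6, 4]
acceptable = 7

print(peak_only_fitness(schedule_a) == peak_only_fitness(schedule_b))
print(shaped_fitness(schedule_b, acceptable) > shaped_fitness(schedule_a, acceptable))
```

The peak-only measure cannot distinguish the two schedules, so the search receives no signal for the improvement, while the shaped measure rewards the second schedule even though its maximum is unchanged.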

Related techniques and other global search methods

There are some other proven and widely used nature-inspired global search techniques, such as:

Memetic algorithm – A hybrid method, inspired by Richard Dawkins's notion of a meme. It commonly takes the form of a population-based algorithm (frequently an EA) coupled with individual learning procedures capable of performing local refinements. Emphasizes the exploitation of problem-specific knowledge and tries to orchestrate local and global search in a synergistic way.
A cellular evolutionary or memetic algorithm uses a topological neighbourhood relation between the individuals of a population to restrict mate selection, thereby reducing the propagation speed of above-average individuals. The idea is to maintain genotypic diversity in the population over a longer period of time to reduce the risk of premature convergence.
Ant colony optimization is based on the ideas of ant foraging by pheromone communication to form paths. Primarily suited for combinatorial optimization and graph problems.
Particle swarm optimization is based on the ideas of animal flocking behaviour. Also primarily suited for numerical optimization problems.
Gaussian adaptation – Based on information theory. Used for maximization of manufacturing yield, mean fitness or average information. See for instance Entropy in thermodynamics and information theory.

In addition, many new nature-inspired or metaphor-guided algorithms have been proposed since the beginning of this century. For criticism of most publications on these, see the remarks at the end of the introduction to the article on metaheuristics.

Examples

In 2020, Google stated that their AutoML-Zero can successfully rediscover classic algorithms such as the concept of neural networks.

The computer simulations Tierra and Avida attempt to model macroevolutionary dynamics.