Bayesian programming is a formalism and a methodology for specifying
probabilistic models and solving problems when less than the necessary information is available.
Bayes' theorem
is the central concept behind this programming approach; it states
that the probability of an event occurring in the future can be
inferred from past conditions related to the event.
Edwin T. Jaynes
proposed that probability could be considered as an alternative and an
extension of logic for rational reasoning with incomplete and uncertain
information. In his founding book
Probability Theory: The Logic of Science he developed this theory and proposed what he called “the robot,” which was not
a physical device, but an inference engine to automate probabilistic reasoning—a kind of
Prolog for probability instead of logic. Bayesian programming is a formal and concrete implementation of this "robot".
Bayesian programming may also be seen as an algebraic formalism to specify
graphical models such as, for instance,
Bayesian networks,
dynamic Bayesian networks,
Kalman filters or
hidden Markov models. Indeed, Bayesian Programming is more general than
Bayesian networks and has a power of expression equivalent to probabilistic
factor graphs.
Formalism
A Bayesian program is a means of specifying a family of probability distributions.
The constituent elements of a Bayesian program are presented below:
- A program is constructed from a description and a question.
- A description is constructed using some specification ($\pi$)
as given by the programmer and an identification or learning process
for the parameters not completely specified by the specification, using a
data set ($\delta$).
- A specification is constructed from a set of pertinent variables, a decomposition and a set of forms.
- Forms are either parametric forms or questions to other Bayesian programs.
- A question specifies which probability distribution has to be computed.
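As an informal illustration, this structure can be sketched as nested components (a minimal sketch; the Python class and field names below are hypothetical and not part of the formalism):

```python
# Minimal structural sketch of a Bayesian program (hypothetical names):
# a description = specification (variables, decomposition, forms) + data for identification,
# a question = the distribution to be computed.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Specification:
    variables: list[str]          # pertinent variables
    decomposition: list[str]      # factors, e.g. ["P(A)", "P(B | A)", "P(C | B)"]
    forms: dict[str, Callable]    # parametric forms or questions to other programs

@dataclass
class Description:
    specification: Specification
    data: list = field(default_factory=list)   # data set used to learn free parameters

@dataclass
class BayesianProgram:
    description: Description
    question: str                 # e.g. "P(Searched | Known)"
```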
Description
The purpose of a description is to specify an effective method of computing a
joint probability distribution
on a set of variables $\{X_1, X_2, \ldots, X_N\}$ given a set of experimental data $\delta$ and some specification $\pi$. This
joint distribution is denoted as:

$P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)$
To specify preliminary knowledge $\pi$, the programmer must undertake the following:
- Define the set of relevant variables on which the joint distribution is defined.
- Decompose the joint distribution (break it into relevant independent or conditional probabilities).
- Define the forms of each of the distributions (e.g., for each variable, one of a list of probability distributions).
Decomposition
Given a partition of $\{X_1, \ldots, X_N\}$ containing $K$ subsets, $K$ variables $L_1, \ldots, L_K$ are defined, each corresponding to one of these subsets.
Each variable $L_k$ is obtained as the conjunction of the variables belonging to the $k^{th}$ subset. Recursive application of
Bayes' theorem leads to:

$P(X_1 \wedge \cdots \wedge X_N \mid \delta \wedge \pi) = P(L_1 \mid \delta \wedge \pi) \times P(L_2 \mid L_1 \wedge \delta \wedge \pi) \times \cdots \times P(L_K \mid L_{K-1} \wedge \cdots \wedge L_1 \wedge \delta \wedge \pi)$
Conditional independence hypotheses then allow further simplifications. A conditional
independence hypothesis for variable $L_k$ is defined by choosing some variables $X_j$ among the variables appearing in the conjunction $L_{k-1} \wedge \cdots \wedge L_1$, labelling $R_k$ as the
conjunction of these chosen variables and setting:

$P(L_k \mid L_{k-1} \wedge \cdots \wedge L_1 \wedge \delta \wedge \pi) = P(L_k \mid R_k \wedge \delta \wedge \pi)$

We then obtain:

$P(X_1 \wedge \cdots \wedge X_N \mid \delta \wedge \pi) = P(L_1 \mid \delta \wedge \pi) \times P(L_2 \mid R_2 \wedge \delta \wedge \pi) \times \cdots \times P(L_K \mid R_K \wedge \delta \wedge \pi)$
Such a simplification of the joint distribution as a product of simpler distributions is
called a decomposition, derived using the
chain rule.
This ensures that each variable appears at most once on the left of a conditioning
bar, which is the necessary and sufficient condition to write mathematically valid
decompositions.
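As a minimal worked example (the three variables and the independence hypothesis below are illustrative, not from the article), the chain rule followed by a single conditional independence hypothesis gives:

```latex
\begin{align*}
P(A \wedge B \wedge C \mid \delta \wedge \pi)
  &= P(A \mid \delta \wedge \pi)\,P(B \mid A \wedge \delta \wedge \pi)\,P(C \mid B \wedge A \wedge \delta \wedge \pi)
     && \text{chain rule (exact)}\\
  &= P(A \mid \delta \wedge \pi)\,P(B \mid A \wedge \delta \wedge \pi)\,P(C \mid B \wedge \delta \wedge \pi)
     && \text{assuming } C \perp A \mid B \text{, i.e. } R_3 = B
\end{align*}
```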
Forms
Each distribution $P(L_k \mid R_k \wedge \delta \wedge \pi)$ appearing in the product is then associated
with either a parametric form (i.e., a function $f_\mu(L_k)$) or a question to another Bayesian program.
When it is a form $f_\mu(L_k)$, in general, $\mu$ is a vector of parameters that may depend on $R_k$ or $\delta$ or both. Learning
takes place when some of these parameters are computed using the data set $\delta$.
An important feature of Bayesian programming is this capacity to
use questions to other Bayesian programs as components of the definition
of a new Bayesian program.
$P(L_k \mid R_k \wedge \delta \wedge \pi)$ is obtained by some inferences done by another Bayesian program defined by the specification $\pi_1$ and the data $\delta_1$. This is similar to calling a subroutine in classical programming and provides an easy way to build
hierarchical models.
Question
Given a description (i.e., $P(X_1 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)$), a question is obtained by partitioning $\{X_1, \ldots, X_N\}$ into three sets: the searched variables, the known variables and
the free variables.
The three variables $Searched$, $Known$ and $Free$ are defined as the
conjunction of the variables belonging to these sets.
A question is defined as the set of distributions:

$P(Searched \mid Known \wedge \delta \wedge \pi)$

made of as many "instantiated questions" as the cardinal of $Known$, each instantiated question being the distribution:

$P(Searched \mid known \wedge \delta \wedge \pi)$

where $known$ denotes a particular value of the conjunction $Known$.
Inference
Given the joint distribution $P(X_1 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)$, it is always possible to compute any possible question using the following general inference:

$P(Searched \mid known \wedge \delta \wedge \pi) = \sum_{Free} P(Searched \wedge Free \mid known \wedge \delta \wedge \pi) = \frac{\sum_{Free} P(Searched \wedge Free \wedge known \mid \delta \wedge \pi)}{P(known \mid \delta \wedge \pi)} = \frac{\sum_{Free} P(Searched \wedge Free \wedge known \mid \delta \wedge \pi)}{\sum_{Searched \wedge Free} P(Searched \wedge Free \wedge known \mid \delta \wedge \pi)}$

where the first equality results from the marginalization rule, the second
results from
Bayes' theorem
and the third corresponds to a second application of marginalization.
The denominator is a normalization term and can be replaced
by a constant $Z$.
Theoretically, this allows any Bayesian inference problem to be solved. In practice,
however, the cost of computing $P(Searched \mid known \wedge \delta \wedge \pi)$ exhaustively and exactly
is too great in almost all cases.
Replacing the joint distribution by its decomposition we get:

$P(Searched \mid known \wedge \delta \wedge \pi) = \frac{1}{Z} \sum_{Free} \left[ \prod_{k=1}^{K} P(L_k \mid R_k \wedge \delta \wedge \pi) \right]$

which is usually a much simpler expression to compute, as the
dimensionality of the problem is considerably reduced by the
decomposition into a product of lower dimension distributions.
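The following Python sketch (a toy model with arbitrary numerical values, not taken from the article) illustrates this general inference on the decomposition $P(A)\,P(B \mid A)\,P(C \mid B)$, answering the question $P(A \mid C)$ with $B$ as the free variable:

```python
# Minimal sketch: exhaustive inference over a toy decomposition
# P(A ∧ B ∧ C) = P(A) P(B|A) P(C|B), all variables binary.
# The question P(A | C=c) is answered by summing the product of forms
# over the free variable B and normalizing, as in the general formula.

# Hypothetical numerical forms, chosen only for illustration.
P_A = {0: 0.7, 1: 0.3}
P_B_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # keys: (b, a)
P_C_given_B = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}  # keys: (c, b)

def joint(a, b, c):
    """Joint distribution rebuilt from its decomposition."""
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

def infer_A_given_C(c):
    """P(A | C=c): sum over the free variable B, then normalize."""
    unnormalized = {a: sum(joint(a, b, c) for b in (0, 1)) for a in (0, 1)}
    z = sum(unnormalized.values())  # the normalization constant Z
    return {a: p / z for a, p in unnormalized.items()}

print(infer_A_given_C(1))
```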
Example
Bayesian spam detection
The purpose of
Bayesian spam filtering is to eliminate junk e-mails.
The problem is very easy to formulate. E-mails should be
classified
into one of two categories: non-spam or spam. The only available
information to classify the e-mails is their content: a set of words.
Using these words without taking the order into account is commonly
called a
bag of words model.
The classifier should furthermore be able to adapt to its user and to learn
from experience. Starting from an initial standard setting, the classifier should
modify its internal parameters when the user disagrees with its own decision.
It will hence adapt to the user’s criteria to differentiate between non-spam and
spam. It will improve its results as it encounters more and more classified e-mails.
Variables
The variables necessary to write this program are as follows:
- $Spam$: a binary variable, false if the e-mail is not spam and true otherwise.
- $W_0, W_1, \ldots, W_{N-1}$: $N$ binary variables. $W_n$ is true if the $n^{th}$ word of the dictionary is present in the text.
These $N+1$ binary variables sum up all the information
about an e-mail.
Decomposition
Starting from the joint distribution and applying Bayes' theorem recursively, we obtain:

$P(Spam \wedge W_0 \wedge \cdots \wedge W_{N-1}) = P(Spam) \times P(W_0 \mid Spam) \times P(W_1 \mid Spam \wedge W_0) \times \cdots \times P(W_{N-1} \mid Spam \wedge W_0 \wedge \cdots \wedge W_{N-2})$

This is an exact mathematical expression.
It can be drastically simplified by assuming that the probability
of appearance of a word knowing the nature of the text (spam or not) is
independent of the appearance of the other words. This is the
naive Bayes assumption and this makes this spam filter a
naive Bayes model.
For instance, the programmer can assume that:

$P(W_n \mid Spam \wedge W_0 \wedge \cdots \wedge W_{n-1}) = P(W_n \mid Spam)$

to finally obtain:

$P(Spam \wedge W_0 \wedge \cdots \wedge W_{N-1}) = P(Spam) \times \prod_{n=0}^{N-1} P(W_n \mid Spam)$
The assumption is "naive" in the sense that the independence between words is
clearly not completely true. For instance, it completely neglects that
the appearance of pairs of words may be more significant than isolated
appearances. However, the programmer may assume this hypothesis and may
develop the model and the associated inferences to test how reliable and
efficient it is.
Parametric forms
To be able to compute the joint distribution, the programmer must now specify the
distributions appearing in the decomposition:
- $P(Spam)$ is a prior defined, for instance, by a value chosen for $P([Spam = 1])$.
- Each of the $N$ forms $P(W_n \mid Spam)$ may be specified using the Laplace rule of succession (this is a pseudocounts-based smoothing technique to counter the zero-frequency problem of words never seen before):

$P([W_n = 1] \mid [Spam = 0]) = \frac{1 + a_f^n}{2 + a_f}$

$P([W_n = 1] \mid [Spam = 1]) = \frac{1 + a_t^n}{2 + a_t}$

where $a_f^n$ stands for the number of appearances of the $n^{th}$ word in non-spam e-mails and $a_f$ stands for the total number of non-spam e-mails. Similarly, $a_t^n$ stands for the number of appearances of the $n^{th}$ word in spam e-mails and $a_t$ stands for the total number of spam e-mails.
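A minimal Python sketch of these parametric forms (the variable names below, such as spam_word_counts, are illustrative only):

```python
# Minimal sketch: Laplace rule of succession for the word-given-class forms.
# `spam_word_counts[n]` holds a_t^n and `n_spam` holds a_t (names are illustrative).

def p_word_present(word_counts: list[int], n_mails: int, n: int) -> float:
    """P([W_n = 1] | class) = (1 + count) / (2 + total)  (Laplace smoothing)."""
    return (1 + word_counts[n]) / (2 + n_mails)

# Toy counts: the 3rd dictionary word appeared in 4 of 10 spam e-mails.
spam_word_counts, n_spam = [0, 7, 4, 1], 10
print(p_word_present(spam_word_counts, n_spam, 2))  # (1 + 4) / (2 + 10) ≈ 0.4167
```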
Identification
The $N$ forms $P(W_n \mid Spam)$ are not yet completely specified because the parameters $a_f^n$, $a_f$, $a_t^n$ and $a_t$ have no values yet.
The identification of these parameters could be done either by
batch processing a series of classified e-mails or by an incremental
updating of the parameters using the user's classifications of the
e-mails as they arrive.
Both methods could be combined: the system could start with
initial standard values of these parameters issued from a generic
database, then some incremental learning customizes the classifier to
each individual user.
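As a sketch of the incremental alternative (hypothetical names, mirroring the counts and totals above), each newly classified e-mail simply updates the counts that feed the Laplace forms:

```python
# Minimal sketch: incremental identification, updating the counts behind the
# Laplace forms each time the user classifies an e-mail of the corresponding class.

def update_counts(counts, n_mails, words_present):
    """counts[n] plays the role of a_t^n (or a_f^n); n_mails of a_t (or a_f)."""
    for n, present in enumerate(words_present):
        if present:
            counts[n] += 1
    return counts, n_mails + 1

spam_counts, n_spam = [0, 7, 4, 1], 10
spam_counts, n_spam = update_counts(spam_counts, n_spam, [True, False, True, False])
print(spam_counts, n_spam)   # [1, 7, 5, 1] 11
```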
Question
The
question asked to the program is: "what is the probability for a given
text to be spam knowing which words appear and don't appear in this
text?"
It can be formalized by:

$P(Spam \mid w_0 \wedge \cdots \wedge w_{N-1})$

which can be computed as follows:

$P(Spam \mid w_0 \wedge \cdots \wedge w_{N-1}) = \frac{P(Spam) \times \prod_{n=0}^{N-1} P(w_n \mid Spam)}{\sum_{Spam} \left[ P(Spam) \times \prod_{n=0}^{N-1} P(w_n \mid Spam) \right]}$
The denominator is a
normalization constant. It is not necessary to compute it to decide if we are dealing with spam. For instance, an easy trick is to compute the ratio:

$\frac{P([Spam = 1] \mid w_0 \wedge \cdots \wedge w_{N-1})}{P([Spam = 0] \mid w_0 \wedge \cdots \wedge w_{N-1})} = \frac{P([Spam = 1])}{P([Spam = 0])} \times \prod_{n=0}^{N-1} \frac{P(w_n \mid [Spam = 1])}{P(w_n \mid [Spam = 0])}$

This computation is faster and easier because it requires only $2N$ products.
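A minimal Python sketch of this question (toy parameters and a hypothetical function name; sums of logarithms are used instead of the raw product, a common implementation choice to avoid numerical underflow for large $N$):

```python
# Minimal sketch: classify an e-mail by the posterior ratio
# P(Spam=1 | words) / P(Spam=0 | words), computed in log space.
import math

def log_posterior_ratio(words_present: list[bool],
                        p_word_spam: list[float],
                        p_word_ham: list[float],
                        prior_spam: float = 0.5) -> float:
    """Log of the ratio; > 0 means "spam" is more probable than "non-spam"."""
    ratio = math.log(prior_spam) - math.log(1.0 - prior_spam)
    for present, p_s, p_h in zip(words_present, p_word_spam, p_word_ham):
        # P(w_n | Spam): use the "word present" probability or its complement.
        num = p_s if present else 1.0 - p_s
        den = p_h if present else 1.0 - p_h
        ratio += math.log(num) - math.log(den)
    return ratio

# Toy example with a 3-word dictionary.
print(log_posterior_ratio([True, False, True],
                          p_word_spam=[0.8, 0.3, 0.6],
                          p_word_ham=[0.2, 0.4, 0.1]) > 0)
```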
Bayesian program
The Bayesian spam filter program is completely defined by the description above (the variables, the decomposition, the parametric forms and the identification of their parameters) together with the question $P(Spam \mid w_0 \wedge \cdots \wedge w_{N-1})$.
Bayesian filter, Kalman filter and hidden Markov model
Bayesian filters (often called
Recursive Bayesian estimation)
are generic probabilistic models for time evolving processes. Numerous
models are particular instances of this generic approach, for instance:
the
Kalman filter or the
Hidden Markov model (HMM).
Variables
- Variables $S^0, S^1, \ldots, S^T$ are a time series of state variables considered to be on a time horizon ranging from $0$ to $T$.
- Variables $O^0, O^1, \ldots, O^T$ are a time series of observation variables on the same horizon.
Decomposition
The decomposition is based:
- on $P(S^t \mid S^{t-1})$, called the system model, transition model or dynamic model, which formalizes the transition from the state at time $t-1$ to the state at time $t$;
- on $P(O^t \mid S^t)$, called the observation model, which expresses what can be observed at time $t$ when the system is in state $S^t$;
- on an initial state at time $0$: $P(S^0 \wedge O^0)$.
Parametric forms
The
parametrical forms are not constrained and different choices lead to
different well-known models: see Kalman filters and Hidden Markov models
just below.
Question
The typical question for such models is $P(S^{t+k} \mid O^0 \wedge \cdots \wedge O^t)$: what is the probability distribution for the state at time $t + k$, knowing the observations from instant $0$ to $t$?
The most common case is Bayesian filtering where $k = 0$, which searches for the present state, knowing past observations.
However it is also possible to do prediction ($k > 0$), to extrapolate a future state from past observations, or to do smoothing ($k < 0$), to recover a past state from observations made either before or after that instant.
More complicated questions may also be asked as shown below in the HMM section.
Bayesian filters ($P(S^t \mid O^0 \wedge \cdots \wedge O^t)$)
have a very interesting recursive property, which contributes greatly to their attractiveness.
$P(S^t \mid O^0 \wedge \cdots \wedge O^t)$ may be computed simply from $P(S^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1})$ with the following formula:

$P(S^t \mid O^0 \wedge \cdots \wedge O^t) \propto P(O^t \mid S^t) \times \sum_{S^{t-1}} \left[ P(S^t \mid S^{t-1}) \times P(S^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1}) \right]$
Another interesting point of view for this equation is to consider that there are two phases: a
prediction phase and an estimation phase:
- During the prediction phase, the state is predicted using the dynamic model and the estimation of the state at the previous moment:

$P(S^t \mid O^0 \wedge \cdots \wedge O^{t-1}) = \sum_{S^{t-1}} \left[ P(S^t \mid S^{t-1}) \times P(S^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1}) \right]$

- During the estimation phase, the prediction is either confirmed or invalidated using the last observation:

$P(S^t \mid O^0 \wedge \cdots \wedge O^t) \propto P(O^t \mid S^t) \times P(S^t \mid O^0 \wedge \cdots \wedge O^{t-1})$
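A minimal Python sketch of one recursion step (a toy two-state model with arbitrary transition and observation matrices, not from the article), making the two phases explicit:

```python
# Minimal sketch: one step of a discrete recursive Bayesian filter,
# split into the prediction and estimation phases.

transition = [[0.9, 0.1],   # P(S^t | S^{t-1}); row: previous state, column: next state
              [0.2, 0.8]]
observation = [[0.7, 0.3],  # P(O^t | S^t); row: state, column: observed symbol
               [0.1, 0.9]]

def filter_step(belief, obs):
    """belief: P(S^{t-1} | O^0..O^{t-1}); returns P(S^t | O^0..O^t)."""
    # Prediction phase: sum over the previous state through the dynamic model.
    predicted = [sum(belief[s_prev] * transition[s_prev][s] for s_prev in range(2))
                 for s in range(2)]
    # Estimation phase: multiply by P(O^t | S^t) and normalize.
    unnormalized = [observation[s][obs] * predicted[s] for s in range(2)]
    z = sum(unnormalized)
    return [p / z for p in unnormalized]

belief = [0.5, 0.5]           # P(S^0), uniform initial state
for obs in [1, 1, 0]:         # a short toy observation sequence
    belief = filter_step(belief, obs)
print(belief)
```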
Bayesian program
The Bayesian filter is completely defined by this description (the variables $S^0, \ldots, S^T$ and $O^0, \ldots, O^T$, the decomposition and its parametric forms) together with the question $P(S^{t+k} \mid O^0 \wedge \cdots \wedge O^t)$.
Kalman filter
The very well-known
Kalman filters are a special case of Bayesian
filters.
They are defined by the following Bayesian program:
- Variables are continuous.
- The transition model $P(S^t \mid S^{t-1})$ and the observation model $P(O^t \mid S^t)$ are both specified using Gaussian laws with means that are linear functions of the conditioning variables.
With these hypotheses and by using the recursive formula, it is possible to solve
the inference problem analytically to answer the usual
$P(S^t \mid O^0 \wedge \cdots \wedge O^t)$ question.
This leads to an extremely efficient algorithm, which explains the
popularity of Kalman filters and the number of their everyday
applications.
When there are no obvious linear transition and observation models, it is still often
possible, using a first-order Taylor expansion, to treat these models as locally linear.
This generalization is commonly called the
extended Kalman filter.
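A minimal Python sketch of the scalar (one-dimensional) case (illustrative noise values; the parameter names a, q, h, r are hypothetical), showing how the prediction and estimation phases reduce to updates of a Gaussian mean and variance:

```python
# Minimal sketch: a 1-D Kalman filter as a Bayesian filter whose
# prediction and estimation phases keep the belief Gaussian.

def kalman_step(mean, var, obs, a=1.0, q=0.1, h=1.0, r=0.5):
    """One predict/update step for a scalar state.
    a: transition coefficient, q: transition noise variance,
    h: observation coefficient, r: observation noise variance."""
    # Prediction phase: propagate the Gaussian belief through the dynamic model.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Estimation phase: correct the prediction with the last observation.
    k = var_pred * h / (h * h * var_pred + r)      # Kalman gain
    mean_new = mean_pred + k * (obs - h * mean_pred)
    var_new = (1.0 - k * h) * var_pred
    return mean_new, var_new

mean, var = 0.0, 1.0                 # Gaussian prior on S^0
for obs in [0.9, 1.1, 1.0]:          # toy observation sequence
    mean, var = kalman_step(mean, var, obs)
print(mean, var)
```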
Hidden Markov model
Hidden Markov models (HMMs) are another very popular specialization of Bayesian filters.
They are defined by the following Bayesian program:
- Variables are treated as being discrete.
- The transition model $P(S^t \mid S^{t-1})$ and the observation model $P(O^t \mid S^t)$ are both specified using probability matrices.
- The question most frequently asked of HMMs is:

$\max_{S^0 \wedge \cdots \wedge S^{t-1}} \left[ P(S^0 \wedge \cdots \wedge S^{t-1} \mid S^t \wedge O^0 \wedge \cdots \wedge O^t) \right]$

What is the most probable series of states that leads to the present state, knowing the past observations?
This particular question may be answered with a specific and very efficient algorithm
called the
Viterbi algorithm.
The
Baum–Welch algorithm, an expectation–maximization procedure, has been developed
to estimate the parameters of HMMs from observation sequences.
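A minimal Python sketch of the Viterbi algorithm on a toy two-state HMM (arbitrary matrices, not from the article):

```python
# Minimal sketch: the Viterbi algorithm for the
# "most probable series of states" question on a toy two-state HMM.

transition = [[0.9, 0.1],   # P(S^t | S^{t-1})
              [0.2, 0.8]]
observation = [[0.7, 0.3],  # P(O^t | S^t)
               [0.1, 0.9]]
initial = [0.5, 0.5]        # P(S^0)

def viterbi(obs_seq):
    """Return the most probable state sequence for the observations."""
    # delta[s]: probability of the best path ending in state s; psi: back-pointers.
    delta = [initial[s] * observation[s][obs_seq[0]] for s in range(2)]
    psi = []
    for obs in obs_seq[1:]:
        step = [max((delta[sp] * transition[sp][s], sp) for sp in range(2)) for s in range(2)]
        psi.append([sp for (_, sp) in step])
        delta = [p * observation[s][obs] for s, (p, _) in enumerate(step)]
    # Backtrack from the best final state.
    state = max(range(2), key=lambda s: delta[s])
    path = [state]
    for back in reversed(psi):
        state = back[state]
        path.append(state)
    return list(reversed(path))

print(viterbi([1, 1, 0, 0]))
```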
Applications
Academic applications
Since 2000, Bayesian programming has been used to develop both
robotics applications and life sciences models.
Robotics
In robotics, Bayesian programming was applied to
autonomous robotics, robotic
CAD systems,
advanced driver-assistance systems,
robotic arm control,
mobile robotics, human-robot interaction, human-vehicle interaction (Bayesian autonomous driver models),
video game avatar programming and training, and real-time strategy games (AI).
Life sciences
In life sciences, Bayesian programming was used in vision to reconstruct shape from motion, to model visuo-vestibular interaction and to study saccadic eye movements; in speech perception and control to study early speech acquisition and the emergence of articulatory-acoustic systems; and to model handwriting perception and control.
Pattern recognition
Bayesian
program learning has potential applications in voice recognition and
synthesis, image recognition and natural language processing. It employs
the principles of
compositionality (building abstract representations from parts),
causality (building complexity from parts) and
learning to learn (using previously recognized concepts to ease the creation of new concepts).
Possibility theories
The
comparison between probabilistic approaches (not only Bayesian
programming) and possibility theories continues to be debated.
Possibility theories like, for instance,
fuzzy sets,
fuzzy logic and
possibility theory
are alternatives to probability for modeling uncertainty. Proponents of these theories argue that
probability is insufficient or inconvenient to model certain aspects of
incomplete/uncertain knowledge.
The defense of probability is mainly based on
Cox's theorem,
which starts from four postulates concerning rational reasoning in the
presence of uncertainty. It demonstrates that the only mathematical
framework that satisfies these postulates is probability theory. The
argument is that any approach other than probability necessarily
infringes one of these postulates, and that one must then ask what is gained by that infringement.
Probabilistic programming
The purpose of
probabilistic programming is to unify the scope of classical programming languages with probabilistic modeling (especially
Bayesian networks) to deal with uncertainty while profiting from the programming languages' expressiveness to encode complexity.
Extended classical programming languages include logical languages as proposed in
Probabilistic Horn Abduction, Independent Choice Logic, PRISM, and ProbLog, which proposes an extension of Prolog.
There are also extensions of
functional programming languages (essentially
Lisp and
Scheme)
such as IBAL or CHURCH. The underlying programming languages can be
object-oriented as in BLOG and FACTORIE or more standard ones as in CES
and FIGARO.
The purpose of Bayesian programming is different. Jaynes' precept
of "probability as logic" argues that probability is an extension of
and an alternative to logic above which a complete theory of
rationality, computation and programming can be rebuilt. Bayesian programming attempts to replace classical languages with a programming approach based on probability that considers
incompleteness and
uncertainty.
The precise comparison between the
semantics and power of expression of Bayesian and probabilistic programming is an open question.