Variational Bayesian methods are a family of techniques for approximating intractable
integrals arising in
Bayesian inference and
machine learning. They are typically used in complex
statistical models consisting of observed variables (usually termed "data") as well as unknown
parameters and
latent variables, with various sorts of relationships among the three types of
random variables, as might be described by a
graphical model.
As is typical in Bayesian inference, the parameters and latent
variables are grouped together as "unobserved variables". Variational
Bayesian methods are primarily used for two purposes:
- To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
- To derive a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection,
the general idea being that a higher marginal likelihood for a given
model indicates a better fit of the data by that model and hence a
greater probability that the model in question was the one that
generated the data.
In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods, particularly Markov chain Monte Carlo methods such as Gibbs sampling, for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the EM (
expectation-maximization) algorithm from
maximum a posteriori estimation
(MAP estimation) of the single most probable value of each parameter to
fully Bayesian estimation which computes (an approximation to) the
entire
posterior distribution
of the parameters and latent variables. As in EM, it finds a set of
optimal parameter values, and it has the same alternating structure as
does EM, based on a set of interlocked (mutually dependent) equations
that cannot be solved analytically.
For many applications, variational Bayes produces solutions of
comparable accuracy to Gibbs sampling at greater speed. However,
deriving the set of equations used to iteratively update the parameters
often requires a large amount of work compared with deriving the
comparable Gibbs sampling equations. This is the case even for many
models that are conceptually quite simple, as is demonstrated below in
the case of a basic non-hierarchical model with only two parameters and
no latent variables.
In variational inference, the posterior distribution over a set of unobserved variables $\mathbf{Z} = \{Z_1, \dots, Z_n\}$ given some data $\mathbf{X}$ is approximated by a so-called variational distribution $Q(\mathbf{Z})$:

$$P(\mathbf{Z} \mid \mathbf{X}) \approx Q(\mathbf{Z}).$$

The distribution $Q(\mathbf{Z})$ is restricted to belong to a family of distributions of simpler form than $P(\mathbf{Z} \mid \mathbf{X})$, selected with the intention of making $Q(\mathbf{Z})$ similar to the true posterior, $P(\mathbf{Z} \mid \mathbf{X})$. The lack of similarity is measured in terms of a dissimilarity function $d(Q; P)$, and hence inference is performed by selecting the distribution $Q(\mathbf{Z})$ that minimizes $d(Q; P)$.
The most common type of variational Bayes uses the Kullback–Leibler divergence (KL-divergence) of P from Q as the choice of dissimilarity function. This choice makes the minimization tractable. The KL-divergence is defined as

$$D_{\mathrm{KL}}(Q \parallel P) \triangleq \sum_{\mathbf{Z}} Q(\mathbf{Z}) \ln \frac{Q(\mathbf{Z})}{P(\mathbf{Z} \mid \mathbf{X})}.$$
Variational techniques are typically used to form an approximation for the posterior:

$$P(\mathbf{Z} \mid \mathbf{X}) = \frac{P(\mathbf{Z}, \mathbf{X})}{P(\mathbf{X})} = \frac{P(\mathbf{X} \mid \mathbf{Z})\, P(\mathbf{Z})}{\int_{\mathbf{Z}'} P(\mathbf{X} \mid \mathbf{Z}')\, P(\mathbf{Z}')\, d\mathbf{Z}'}.$$

The marginalization over $\mathbf{Z}$ to calculate $P(\mathbf{X})$ in the denominator is typically intractable, for example because the search space of $\mathbf{Z}$ is combinatorially large. Therefore, we seek an approximation, using $Q(\mathbf{Z}) \approx P(\mathbf{Z} \mid \mathbf{X})$.
Given that $P(\mathbf{Z}, \mathbf{X}) = P(\mathbf{Z} \mid \mathbf{X})\, P(\mathbf{X})$, the KL-divergence can be written as

$$D_{\mathrm{KL}}(Q \parallel P) = \ln P(\mathbf{X}) - \sum_{\mathbf{Z}} Q(\mathbf{Z}) \ln \frac{P(\mathbf{Z}, \mathbf{X})}{Q(\mathbf{Z})},$$

or

$$\ln P(\mathbf{X}) = D_{\mathrm{KL}}(Q \parallel P) + \mathcal{L}(Q), \qquad \text{where } \mathcal{L}(Q) \triangleq \sum_{\mathbf{Z}} Q(\mathbf{Z}) \ln \frac{P(\mathbf{Z}, \mathbf{X})}{Q(\mathbf{Z})}.$$

As the log evidence $\ln P(\mathbf{X})$ is fixed with respect to $Q$, maximizing the final term $\mathcal{L}(Q)$ minimizes the KL divergence of $P$ from $Q$. By appropriate choice of $Q$, $\mathcal{L}(Q)$ becomes tractable to compute and to maximize. Hence we have both an analytical approximation $Q$ for the posterior $P(\mathbf{Z} \mid \mathbf{X})$ and a lower bound $\mathcal{L}(Q)$ for the log evidence $\ln P(\mathbf{X})$ (since the KL-divergence is non-negative). The lower bound $\mathcal{L}(Q)$ is known as the (negative) variational free energy, in analogy with thermodynamic free energy, because it can also be expressed as a negative energy $\operatorname{E}_{Q}[\ln P(\mathbf{Z}, \mathbf{X})]$ plus the entropy of $Q$.
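Because every quantity in this decomposition is computable for a small discrete model, the identity $\ln P(\mathbf{X}) = \mathcal{L}(Q) + D_{\mathrm{KL}}(Q \parallel P)$ can be checked numerically. The following is only a minimal sketch; the joint table and the choice of $Q$ are arbitrary illustrative values:

```python
import numpy as np

# Arbitrary joint distribution P(Z, X) over a discrete latent Z (3 values),
# with the data X already fixed, so p_joint[z] = P(Z=z, X=x_obs).
p_joint = np.array([0.10, 0.25, 0.05])

evidence = p_joint.sum()           # P(X) = sum over Z of P(Z, X)
posterior = p_joint / evidence     # P(Z | X)

q = np.array([0.2, 0.5, 0.3])      # some variational distribution Q(Z)

kl = np.sum(q * np.log(q / posterior))    # D_KL(Q || P(. | X))
elbo = np.sum(q * np.log(p_joint / q))    # L(Q) = E_Q[ln P(Z, X) / Q(Z)]

assert np.isclose(np.log(evidence), elbo + kl)   # ln P(X) = L(Q) + KL
print(np.log(evidence), elbo, kl)
```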
By the generalized Pythagorean theorem for Bregman divergences, of which KL-divergence is a special case, it can be shown that:

$$D_{\mathrm{KL}}(Q \parallel P) \ge D_{\mathrm{KL}}(Q \parallel Q^{*}) + D_{\mathrm{KL}}(Q^{*} \parallel P), \qquad \forall\, Q \in \mathcal{C},$$

where $\mathcal{C}$ is a convex set and the equality holds if:

$$Q = Q^{*} \triangleq \arg\min_{Q \in \mathcal{C}} D_{\mathrm{KL}}(Q \parallel P).$$

In this case, the global minimizer $Q^{*}(\mathbf{Z}) = q^{*}(\mathbf{Z}_1)\, q^{*}(\mathbf{Z}_2)$, with $\mathbf{Z} = \{\mathbf{Z}_1, \mathbf{Z}_2\}$, can be found as follows:

$$q^{*}(\mathbf{Z}_1) = \frac{e^{\operatorname{E}_{q^{*}(\mathbf{Z}_2)}[\ln P(\mathbf{Z}, \mathbf{X})]}}{\zeta_1(\mathbf{X})}, \qquad q^{*}(\mathbf{Z}_2) = \frac{e^{\operatorname{E}_{q^{*}(\mathbf{Z}_1)}[\ln P(\mathbf{Z}, \mathbf{X})]}}{\zeta_2(\mathbf{X})},$$

in which the normalizing constant is (shown for the first factor; the second is analogous):

$$\zeta_1(\mathbf{X}) = \int_{\mathbf{Z}_1} e^{\operatorname{E}_{q^{*}(\mathbf{Z}_2)}[\ln P(\mathbf{Z}, \mathbf{X})]}\, d\mathbf{Z}_1.$$

The term $\mathcal{L}(Q)$ is often called the evidence lower bound (ELBO) in practice, since $\mathcal{L}(Q) = \ln P(\mathbf{X}) - D_{\mathrm{KL}}(Q \parallel P) \le \ln P(\mathbf{X})$, as shown above.
By interchanging the roles of $\mathbf{Z}_1$ and $\mathbf{Z}_2$ we can iteratively compute the approximations $q^{*}(\mathbf{Z}_1)$ and $q^{*}(\mathbf{Z}_2)$ of the true model's marginals $P(\mathbf{Z}_1 \mid \mathbf{X})$ and $P(\mathbf{Z}_2 \mid \mathbf{X})$, respectively. Although this iterative scheme is guaranteed to converge monotonically, the converged $Q^{*}$ is only a local minimizer of $D_{\mathrm{KL}}(Q \parallel P)$.
If the constrained space $\mathcal{C}$ is confined to the space of independent (factorized) distributions, i.e.

$$\mathcal{C} = \left\{ Q : Q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i) \right\},$$

the above iterative scheme becomes the so-called mean field approximation, as shown below.
Mean field approximation
The variational distribution $Q(\mathbf{Z})$ is usually assumed to factorize over some partition of the latent variables, i.e. for some partition of the latent variables $\mathbf{Z}$ into $\mathbf{Z}_1, \dots, \mathbf{Z}_M$,

$$Q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i \mid \mathbf{X}).$$

It can be shown using the calculus of variations (hence the name "variational Bayes") that the "best" distribution $q_j^{*}$ for each of the factors $q_j$ (in terms of the distribution minimizing the KL divergence, as described above) can be expressed as:

$$q_j^{*}(\mathbf{Z}_j \mid \mathbf{X}) = \frac{e^{\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]}}{\int e^{\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]}\, d\mathbf{Z}_j},$$

where $\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]$ is the expectation of the logarithm of the joint probability of the data and latent variables, taken over all variables not in the partition. In practice, we usually work in terms of logarithms, i.e.:

$$\ln q_j^{*}(\mathbf{Z}_j \mid \mathbf{X}) = \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}.$$
The constant in the above expression is related to the normalizing constant (the denominator in the expression above for $q_j^{*}$) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g. Gaussian, gamma, etc.).
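The variational step behind this formula can be sketched compactly. Holding every factor except $q_j$ fixed, the bound $\mathcal{L}(Q)$ defined above can be rearranged so that its dependence on $q_j$ alone is a negative KL divergence; this is the standard textbook argument, stated here without the measure-theoretic details:

$$\mathcal{L}(Q) = \int q_j(\mathbf{Z}_j)\, \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]\, d\mathbf{Z}_j - \int q_j(\mathbf{Z}_j) \ln q_j(\mathbf{Z}_j)\, d\mathbf{Z}_j + \text{const} = -D_{\mathrm{KL}}\!\left(q_j \,\|\, \tilde{p}_j\right) + \text{const},$$

where $\tilde{p}_j(\mathbf{Z}_j) \propto e^{\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]}$. Since a KL divergence is minimized (at zero) exactly when its two arguments coincide, $\mathcal{L}(Q)$ is maximized by choosing $q_j^{*} = \tilde{p}_j$, which is precisely the formula given above.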
Using the properties of expectations, the expression $\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]$ can usually be simplified into a function of the fixed hyperparameters of the prior distributions over the latent variables and of expectations (and sometimes higher moments such as the variance) of latent variables not in the current partition (i.e. latent variables not included in $\mathbf{Z}_j$). This creates
circular dependencies
between the parameters of the distributions over variables in one
partition and the expectations of variables in the other partitions.
This naturally suggests an
iterative algorithm, much like EM (the
expectation-maximization
algorithm), in which the expectations (and possibly higher moments) of
the latent variables are initialized in some fashion (perhaps randomly),
and then the parameters of each distribution are computed in turn using
the current values of the expectations, after which the expectation of
the newly computed distribution is set appropriately according to the
computed parameters. An algorithm of this sort is guaranteed to
converge.
In other words, for each of the partitions of variables, by
simplifying the expression for the distribution over the partition's
variables and examining the distribution's functional dependency on the
variables in question, the family of the distribution can usually be
determined (which in turn determines the value of the constant). The
formula for the distribution's parameters will be expressed in terms of
the prior distributions' hyperparameters (which are known constants),
but also in terms of expectations of functions of variables in other
partitions. Usually these expectations can be simplified into functions
of expectations of the variables themselves (i.e. the
means); sometimes expectations of squared variables (which can be related to the
variance of the variables), or expectations of higher powers (i.e. higher
moments)
also appear. In most cases, the other variables' distributions will be
from known families, and the formulas for the relevant expectations can
be looked up. However, those formulas depend on those distributions'
parameters, which depend in turn on the expectations about other
variables. The result is that the formulas for the parameters of each
variable's distributions can be expressed as a series of equations with
mutual,
nonlinear
dependencies among the variables. Usually, it is not possible to solve
this system of equations directly. However, as described above, the
dependencies suggest a simple iterative algorithm, which in most cases
is guaranteed to converge. An example will make this process clearer.
A basic example
Consider a simple non-hierarchical Bayesian model consisting of a set of
i.i.d. observations from a
Gaussian distribution, with unknown
mean and
variance. In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.
For mathematical convenience, in the following example we work in terms of the
precision — i.e. the reciprocal of the variance (or in a multivariate Gaussian, the inverse of the
covariance matrix) — rather than the variance itself. (From a theoretical standpoint, precision and variance are equivalent since there is a
one-to-one correspondence between the two.)
The mathematical model
We place conjugate prior distributions on the unknown mean $\mu$ and precision $\tau$, i.e. the mean follows a Gaussian distribution while the precision follows a gamma distribution. In other words:

$$\mu \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1}),$$
$$\tau \sim \operatorname{Gamma}(a_0, b_0).$$

We are given $N$ data points $\mathbf{X} = \{x_1, \dots, x_N\}$ and our goal is to infer the posterior distribution $q(\mu, \tau) = p(\mu, \tau \mid \mathbf{X})$ of the parameters $\mu$ and $\tau$.

The joint probability
The joint probability of all variables can be rewritten as

$$p(\mathbf{X}, \mu, \tau) = p(\mathbf{X} \mid \mu, \tau)\, p(\mu \mid \tau)\, p(\tau),$$

where the individual factors are

$$p(\mathbf{X} \mid \mu, \tau) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \tau^{-1}),$$
$$p(\mu \mid \tau) = \mathcal{N}\!\left(\mu \mid \mu_0, (\lambda_0 \tau)^{-1}\right),$$
$$p(\tau) = \operatorname{Gamma}(\tau \mid a_0, b_0),$$

where

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)},$$
$$\operatorname{Gamma}(\tau \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a} \tau^{a-1} e^{-b\tau}.$$
Factorized approximation
Assume that $q(\mu, \tau) = q(\mu)\, q(\tau)$, i.e. that the posterior distribution factorizes into independent factors for $\mu$ and $\tau$. This type of assumption underlies the variational Bayesian method. The true posterior distribution does not in fact factor this way (in fact, in this simple case, it is known to be a Gaussian-gamma distribution), and hence the result we obtain will be an approximation.
Derivation of q(μ)
Then

$$\begin{aligned}
\ln q_\mu^{*}(\mu) &= \operatorname{E}_{\tau}\!\left[\ln p(\mathbf{X} \mid \mu, \tau) + \ln p(\mu \mid \tau) + \ln p(\tau)\right] + C \\
&= \operatorname{E}_{\tau}\!\left[\ln p(\mathbf{X} \mid \mu, \tau)\right] + \operatorname{E}_{\tau}\!\left[\ln p(\mu \mid \tau)\right] + C_2 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right\} + C_3.
\end{aligned}$$

In the above derivation, $C$, $C_2$ and $C_3$ refer to values that are constant with respect to $\mu$. Note that the term $\operatorname{E}_{\tau}[\ln p(\tau)]$ is not a function of $\mu$ and will have the same value regardless of the value of $\mu$; hence in the second line it has been absorbed into the constant term, as have all other terms not involving $\mu$ in the third line. The last line is simply a quadratic polynomial in $\mu$. Since this is the logarithm of $q_\mu^{*}(\mu)$, we can see that $q_\mu^{*}(\mu)$ itself is a Gaussian distribution.
With a certain amount of tedious math (expanding the squares inside the braces, separating out and grouping the terms involving $\mu$ and $\mu^2$, and completing the square over $\mu$), we can derive the parameters of the Gaussian distribution:

$$\mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N}, \qquad \lambda_N = (\lambda_0 + N)\, \operatorname{E}_{\tau}[\tau],$$

where $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$. In other words:

$$q_\mu^{*}(\mu) \sim \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}).$$
Derivation of q(τ)
The derivation of $q_\tau^{*}(\tau)$ is similar to the above, although we omit some of the details for the sake of brevity.

$$\begin{aligned}
\ln q_\tau^{*}(\tau) &= \operatorname{E}_{\mu}\!\left[\ln p(\mathbf{X} \mid \mu, \tau) + \ln p(\mu \mid \tau)\right] + \ln p(\tau) + C \\
&= (a_0 - 1) \ln \tau - b_0 \tau + \frac{N+1}{2} \ln \tau - \frac{\tau}{2} \operatorname{E}_{\mu}\!\left[ \sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right] + C_2.
\end{aligned}$$

Exponentiating both sides, we can see that $q_\tau^{*}(\tau)$ is a gamma distribution. Specifically:

$$q_\tau^{*}(\tau) \sim \operatorname{Gamma}(\tau \mid a_N, b_N),$$
$$a_N = a_0 + \frac{N+1}{2},$$
$$b_N = b_0 + \frac{1}{2} \operatorname{E}_{\mu}\!\left[ \sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right].$$
Algorithm for computing the parameters
Let us recap the conclusions from the previous sections:

$$q_\mu^{*}(\mu) \sim \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}), \qquad \mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N}, \qquad \lambda_N = (\lambda_0 + N)\, \operatorname{E}_{\tau}[\tau]$$

and

$$q_\tau^{*}(\tau) \sim \operatorname{Gamma}(\tau \mid a_N, b_N), \qquad a_N = a_0 + \frac{N+1}{2}, \qquad b_N = b_0 + \frac{1}{2} \operatorname{E}_{\mu}\!\left[ \sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right].$$

In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable. We can expand the expectations, using the standard formulas for the moments of the Gaussian and gamma distributions:

$$\operatorname{E}[\tau \mid a_N, b_N] = \frac{a_N}{b_N},$$
$$\operatorname{E}[\mu \mid \mu_N, \lambda_N^{-1}] = \mu_N,$$
$$\operatorname{E}[\mu^2 \mid \mu_N, \lambda_N^{-1}] = \lambda_N^{-1} + \mu_N^2.$$

Applying these formulas to the above equations is trivial in most cases, but the equation for $b_N$ takes more work:

$$\begin{aligned}
b_N &= b_0 + \frac{1}{2} \operatorname{E}_{\mu}\!\left[ \sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right] \\
&= b_0 + \frac{1}{2} \left[ \sum_{n=1}^{N} \left( x_n^2 - 2 x_n \operatorname{E}[\mu] + \operatorname{E}[\mu^2] \right) + \lambda_0 \left( \operatorname{E}[\mu^2] - 2 \mu_0 \operatorname{E}[\mu] + \mu_0^2 \right) \right] \\
&= b_0 + \frac{1}{2} \left[ (\lambda_0 + N) \operatorname{E}[\mu^2] - 2 \left( \lambda_0 \mu_0 + \sum_{n=1}^{N} x_n \right) \operatorname{E}[\mu] + \sum_{n=1}^{N} x_n^2 + \lambda_0 \mu_0^2 \right].
\end{aligned}$$

We can then write the parameter equations as follows, without any expectations:

$$\mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N},$$
$$\lambda_N = (\lambda_0 + N) \frac{a_N}{b_N},$$
$$a_N = a_0 + \frac{N+1}{2},$$
$$b_N = b_0 + \frac{1}{2} \left[ (\lambda_0 + N) \left( \lambda_N^{-1} + \mu_N^2 \right) - 2 \left( \lambda_0 \mu_0 + N \bar{x} \right) \mu_N + \sum_{n=1}^{N} x_n^2 + \lambda_0 \mu_0^2 \right].$$
Note that there are circular dependencies among the formulas for $\lambda_N$ and $b_N$. This naturally suggests an EM-like algorithm:

- Compute $\sum_{n=1}^{N} x_n$ and $\sum_{n=1}^{N} x_n^2$. Use these values to compute $\mu_N$ and $a_N$.
- Initialize $\lambda_N$ to some arbitrary value.
- Use the current value of $\lambda_N$, along with the known values of the other parameters, to compute $b_N$.
- Use the current value of $b_N$, along with the known values of the other parameters, to compute $\lambda_N$.
- Repeat the last two steps until convergence (i.e. until neither value has changed more than some small amount).
We then have values for the hyperparameters of the approximating
distributions of the posterior parameters, which we can use to compute
any properties we want of the posterior — e.g. its mean and variance, a
95% highest-density region (the smallest interval that includes 95% of
the total probability), etc.
It can be shown that this algorithm is guaranteed to converge to a local maximum.
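To make the iteration concrete, here is a minimal sketch in Python. It is only an illustration of the equations above; the function name, synthetic data, initialization, and tolerance are arbitrary choices, not part of any standard library:

```python
import numpy as np

def variational_gaussian(x, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0,
                         tol=1e-10, max_iter=1000):
    """Mean-field VB for a Gaussian with unknown mean and precision.

    Returns the parameters (mu_N, lambda_N, a_N, b_N) of the factors
    q(mu) = N(mu | mu_N, 1/lambda_N) and q(tau) = Gamma(tau | a_N, b_N).
    """
    N = len(x)
    x_sum, x2_sum, x_bar = x.sum(), (x ** 2).sum(), x.mean()

    # These two parameters do not depend on the iteration.
    mu_N = (lambda0 * mu0 + N * x_bar) / (lambda0 + N)
    a_N = a0 + (N + 1) / 2

    lambda_N = 1.0  # arbitrary initialization
    for _ in range(max_iter):
        # Update b_N from the current lambda_N (via E[mu^2] = 1/lambda_N + mu_N^2).
        b_N = b0 + 0.5 * ((lambda0 + N) * (1 / lambda_N + mu_N ** 2)
                          - 2 * (lambda0 * mu0 + x_sum) * mu_N
                          + x2_sum + lambda0 * mu0 ** 2)
        # Update lambda_N from the current b_N (via E[tau] = a_N / b_N).
        lambda_N_new = (lambda0 + N) * a_N / b_N
        if abs(lambda_N_new - lambda_N) < tol:
            lambda_N = lambda_N_new
            break
        lambda_N = lambda_N_new
    return mu_N, lambda_N, a_N, b_N

# Example on synthetic data: true mean 2.0, true precision 4.0.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=1000)
mu_N, lambda_N, a_N, b_N = variational_gaussian(x)
print("E[mu]  =", mu_N)          # approximately 2.0
print("E[tau] =", a_N / b_N)     # approximately 4.0
```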
Note also that the posterior distributions have the same form as the corresponding prior distributions. We did
not
assume this; the only assumption we made was that the distributions
factorize, and the form of the distributions followed naturally. It
turns out (see below) that the fact that the posterior distributions
have the same form as the prior distributions is not a coincidence, but a
general result whenever the prior distributions are members of the
exponential family, which is the case for most of the standard distributions.
Further discussion
Step-by-step recipe
- Describe the network with a graphical model, identifying the observed variables (data) $\mathbf{X}$ and the unobserved variables (parameters $\boldsymbol\Theta$ and latent variables $\mathbf{Z}$) and their conditional probability distributions. Variational Bayes will then construct an approximation to the posterior probability $p(\mathbf{Z}, \boldsymbol\Theta \mid \mathbf{X})$. The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more independent distributions over disjoint subsets of the unobserved variables.
- Partition the unobserved variables into two or more subsets, over
which the independent factors will be derived. There is no universal
procedure for doing this; creating too many subsets yields a poor
approximation, while creating too few makes the entire variational Bayes
procedure intractable. Typically, the first split is to separate the
parameters and latent variables; often, this is enough by itself to
produce a tractable result. Assume that the partitions are called $\mathbf{Z}_1, \dots, \mathbf{Z}_M$.
- For a given partition $\mathbf{Z}_j$, write down the formula for the best approximating distribution using the basic equation $\ln q_j^{*}(\mathbf{Z}_j \mid \mathbf{X}) = \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}$.
- Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables in $\mathbf{Z}_j$ can be ignored; they will be folded into the constant term.
- Simplify the formula and apply the expectation operator, following
the above example. Ideally, this should simplify into expectations of
basic functions of variables not in $\mathbf{Z}_j$ (e.g. first or second raw moments,
expectation of a logarithm, etc.). In order for the variational Bayes
procedure to work well, these expectations should generally be
expressible analytically as functions of the parameters and/or hyperparameters
of the distributions of these variables. In all cases, these
expectation terms are constants with respect to the variables in the
current partition.
- The functional form of the formula with respect to the variables in
the current partition indicates the type of distribution. In
particular, exponentiating the formula generates the probability density function (PDF) of the distribution (or at least, something proportional to it, with unknown normalization constant).
In order for the overall method to be tractable, it should be possible
to recognize the functional form as belonging to a known distribution.
Significant mathematical manipulation may be required to convert the
formula into a form that matches the PDF of a known distribution. When
this can be done, the normalization constant can be reinstated by
definition, and equations for the parameters of the known distribution
can be derived by extracting the appropriate parts of the formula.
- When all expectations can be replaced analytically with functions of
variables not in the current partition, and the PDF put into a form
that allows identification with a known distribution, the result is a
set of equations expressing the values of the optimum parameters as
functions of the parameters of variables in other partitions.
- When this procedure can be applied to all partitions, the result is a
set of mutually linked equations specifying the optimum values of all
parameters.
- An expectation maximization
(EM) type procedure is then applied, picking an initial value for each
parameter and then iterating through a series of steps, where at each
step we cycle through the equations, updating each parameter in turn.
This is guaranteed to converge; a generic skeleton of this loop is sketched below.
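Schematically, the final step of the recipe is just a fixed-point iteration over a set of mutually dependent update equations. The sketch below is illustrative only; the function and its arguments are hypothetical placeholders, not part of any standard library:

```python
from typing import Callable, Dict

def coordinate_ascent(params: Dict[str, float],
                      updates: Dict[str, Callable[[Dict[str, float]], float]],
                      tol: float = 1e-8, max_iter: int = 1000) -> Dict[str, float]:
    """Generic skeleton of the EM-type loop used in variational Bayes.

    `updates` maps each parameter name to a function that recomputes it
    from the current values of all parameters (the mutually dependent
    equations derived in the earlier steps of the recipe).
    """
    for _ in range(max_iter):
        max_change = 0.0
        for name, update in updates.items():
            new_value = update(params)
            max_change = max(max_change, abs(new_value - params[name]))
            params[name] = new_value
        if max_change < tol:  # no parameter moved appreciably
            break
    return params
```

For the basic example above, `updates` would hold the two mutually dependent equations for $b_N$ and $\lambda_N$.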
Most important points
Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:
- The idea of variational Bayes is to construct an analytical approximation to the posterior probability
of the set of unobserved variables (parameters and latent variables),
given the data. This means that the form of the solution is similar to
other Bayesian inference methods, such as Gibbs sampling
— i.e. a distribution that seeks to describe everything that is known
about the variables. As in other Bayesian methods — but unlike e.g. in expectation maximization (EM) or other maximum likelihood methods — both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. as random variables.
Estimates for the variables can then be derived in the standard
Bayesian ways, e.g. calculating the mean of the distribution to get a
single point estimate or deriving a credible interval, highest density region, etc.
- "Analytical approximation" means that a formula can be written down
for the posterior distribution. The formula generally consists of a
product of well-known probability distributions, each of which factorizes over a set of unobserved variables (i.e. it is conditionally independent
of the other variables, given the observed data). This formula is not
the true posterior distribution, but an approximation to it; in
particular, it will generally agree fairly closely in the lowest moments of the unobserved variables, e.g. the mean and variance.
- The result of all of the mathematical manipulations is (1) the
identity of the probability distributions making up the factors, and (2)
mutually dependent formulas for the parameters of these distributions.
The actual values of these parameters are computed numerically, through
an alternating iterative procedure much like EM.
Compared with expectation maximization (EM)
Variational Bayes (VB) is often compared with
expectation maximization
(EM). The actual numerical procedure is quite similar, in that both
are alternating iterative procedures that successively converge on
optimum parameter values. The initial steps to derive the respective
procedures are also vaguely similar, both starting out with formulas for
probability densities and both involving significant amounts of
mathematical manipulations.
However, there are a number of differences. Most important is what is being computed.
- EM computes point estimates of the posterior distribution of those
random variables that can be categorized as "parameters"; it estimates
the actual posterior distributions only of the latent variables
(at least in "soft EM", and often only when the latent variables are
discrete). The point estimates computed are the modes of these parameters; no other information is available.
- VB, on the other hand, computes estimates of the actual posterior
distribution of all variables, both parameters and latent variables.
When point estimates need to be derived, generally the mean
is used rather than the mode, as is normal in Bayesian inference.
Concomitant with this, it should be noted that the parameters computed
in VB do not have the same significance as those in EM. EM
computes optimum values of the parameters of the Bayes network itself.
VB computes optimum values of the parameters of the distributions used
to approximate the parameters and latent variables of the Bayes network.
For example, a typical Gaussian mixture model
will have parameters for the mean and variance of each of the mixture
components. EM would directly estimate optimum values for these
parameters. VB, however, would first fit a distribution to these
parameters — typically in the form of a prior distribution, e.g. a normal-scaled inverse gamma distribution — and would then compute values for the parameters of this prior distribution, i.e. essentially hyperparameters.
In this case, VB would compute optimum estimates of the four
parameters of the normal-scaled inverse gamma distribution that
describes the joint distribution of the mean and variance of the
component.
A more complex example
Bayesian Gaussian mixture model using plate notation.
Smaller squares indicate fixed parameters; larger circles indicate
random variables. Filled-in shapes indicate known values. The
indication [K] means a vector of size K; [D,D] means a matrix of size D×D; K alone means a categorical variable with K outcomes. The squiggly line coming from z ending in a crossbar indicates a switch — the value of this variable selects, for the other incoming variables, which value to use out of the size-K array of possible values.
A Bayesian Gaussian mixture model with $K$ mixture components and $N$ observations can be specified as follows:

$$\boldsymbol\pi \sim \operatorname{SymDir}(K, \alpha_0)$$
$$\mathbf{\Lambda}_{k=1 \dots K} \sim \mathcal{W}(\mathbf{W}_0, \nu_0)$$
$$\boldsymbol\mu_{k=1 \dots K} \sim \mathcal{N}\!\left(\boldsymbol\mu_0, (\beta_0 \mathbf{\Lambda}_k)^{-1}\right)$$
$$\mathbf{z}_{n=1 \dots N} \sim \operatorname{Mult}(1, \boldsymbol\pi)$$
$$\mathbf{x}_{n=1 \dots N} \sim \mathcal{N}\!\left(\boldsymbol\mu_{z_n}, \mathbf{\Lambda}_{z_n}^{-1}\right)$$

Note: $\operatorname{SymDir}()$ is the symmetric Dirichlet distribution of dimension $K$ (the conjugate prior of the categorical or multinomial distribution); $\mathcal{W}()$ is the Wishart distribution (the conjugate prior of the precision matrix of a multivariate Gaussian); $\operatorname{Mult}(1, \cdot)$ is a single-observation multinomial (equivalently, categorical) distribution; and $\mathcal{N}()$ is the (multivariate) Gaussian distribution.
The interpretation of the above variables is as follows:
- $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ is the set of $N$ data points, each of which is a $D$-dimensional vector distributed according to a multivariate Gaussian distribution.
- $\mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$ is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using a "one-of-K" vector representation with components $z_{nk}$ for $k = 1, \dots, K$, as described above.
- $\boldsymbol\pi$ is the vector of mixing proportions for the $K$ mixture components.
- $\boldsymbol\mu_{k=1 \dots K}$ and $\mathbf{\Lambda}_{k=1 \dots K}$ specify the parameters (mean and precision) associated with each mixture component.
The joint probability of all variables can be rewritten as

$$p(\mathbf{X}, \mathbf{Z}, \boldsymbol\pi, \boldsymbol\mu, \mathbf{\Lambda}) = p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol\mu, \mathbf{\Lambda})\, p(\mathbf{Z} \mid \boldsymbol\pi)\, p(\boldsymbol\pi)\, p(\boldsymbol\mu \mid \mathbf{\Lambda})\, p(\mathbf{\Lambda}),$$

where the individual factors are

$$p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol\mu, \mathbf{\Lambda}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol\mu_k, \mathbf{\Lambda}_k^{-1})^{z_{nk}},$$
$$p(\mathbf{Z} \mid \boldsymbol\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}},$$
$$p(\boldsymbol\pi) = \frac{\Gamma(K \alpha_0)}{\Gamma(\alpha_0)^K} \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1},$$
$$p(\boldsymbol\mu \mid \mathbf{\Lambda}) = \prod_{k=1}^{K} \mathcal{N}\!\left(\boldsymbol\mu_k \mid \boldsymbol\mu_0, (\beta_0 \mathbf{\Lambda}_k)^{-1}\right),$$
$$p(\mathbf{\Lambda}) = \prod_{k=1}^{K} \mathcal{W}(\mathbf{\Lambda}_k \mid \mathbf{W}_0, \nu_0),$$

where $\mathcal{N}(\cdot)$ and $\mathcal{W}(\cdot)$ denote the multivariate Gaussian and Wishart densities, respectively, and $D$ is the dimensionality of each data point.
Assume that $q(\mathbf{Z}, \boldsymbol\pi, \boldsymbol\mu, \mathbf{\Lambda}) = q(\mathbf{Z})\, q(\boldsymbol\pi, \boldsymbol\mu, \mathbf{\Lambda})$.

Then

$$\ln q^{*}(\mathbf{Z}) = \operatorname{E}_{\boldsymbol\pi, \boldsymbol\mu, \mathbf{\Lambda}}[\ln p(\mathbf{X}, \mathbf{Z}, \boldsymbol\pi, \boldsymbol\mu, \mathbf{\Lambda})] + C = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + C,$$

where we have defined

$$\ln \rho_{nk} = \operatorname{E}[\ln \pi_k] + \frac{1}{2} \operatorname{E}[\ln |\mathbf{\Lambda}_k|] - \frac{D}{2} \ln 2\pi - \frac{1}{2} \operatorname{E}_{\boldsymbol\mu_k, \mathbf{\Lambda}_k}\!\left[ (\mathbf{x}_n - \boldsymbol\mu_k)^{\mathsf{T}} \mathbf{\Lambda}_k (\mathbf{x}_n - \boldsymbol\mu_k) \right].$$

Exponentiating both sides of the formula for $\ln q^{*}(\mathbf{Z})$ yields

$$q^{*}(\mathbf{Z}) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}.$$

Requiring that this be normalized ends up requiring that the normalized quantities sum to 1 over all values of $k$, yielding

$$q^{*}(\mathbf{Z}) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}, \qquad \text{where } r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}.$$

In other words, $q^{*}(\mathbf{Z})$ is a product of single-observation multinomial distributions, and factors over each individual $\mathbf{z}_n$, which is distributed as a single-observation multinomial distribution with parameters $r_{nk}$ for $k = 1, \dots, K$.

Furthermore, we note that

$$\operatorname{E}[z_{nk}] = r_{nk},$$

which is a standard result for categorical distributions.
Now, considering the factor $q(\boldsymbol\pi, \boldsymbol\mu, \mathbf{\Lambda})$, note that it automatically factors into $q(\boldsymbol\pi) \prod_{k=1}^{K} q(\boldsymbol\mu_k, \mathbf{\Lambda}_k)$ due to the structure of the graphical model defining our Gaussian mixture model, which is specified above.

Then,

$$\ln q^{*}(\boldsymbol\pi) = \ln p(\boldsymbol\pi) + \operatorname{E}_{\mathbf{Z}}[\ln p(\mathbf{Z} \mid \boldsymbol\pi)] + C = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln \pi_k + C.$$

Taking the exponential of both sides, we recognize $q^{*}(\boldsymbol\pi)$ as a Dirichlet distribution

$$q^{*}(\boldsymbol\pi) \sim \operatorname{Dir}(\boldsymbol\alpha),$$

where

$$\alpha_k = \alpha_0 + N_k, \qquad N_k = \sum_{n=1}^{N} r_{nk}.$$
Finally,

$$\ln q^{*}(\boldsymbol\mu_k, \mathbf{\Lambda}_k) = \ln p(\boldsymbol\mu_k \mid \mathbf{\Lambda}_k) + \ln p(\mathbf{\Lambda}_k) + \sum_{n=1}^{N} \operatorname{E}[z_{nk}] \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol\mu_k, \mathbf{\Lambda}_k^{-1}) + C.$$

Grouping and reading off terms involving $\boldsymbol\mu_k$ and $\mathbf{\Lambda}_k$, the result is a Gaussian-Wishart distribution given by

$$q^{*}(\boldsymbol\mu_k, \mathbf{\Lambda}_k) = \mathcal{N}\!\left(\boldsymbol\mu_k \mid \mathbf{m}_k, (\beta_k \mathbf{\Lambda}_k)^{-1}\right) \mathcal{W}(\mathbf{\Lambda}_k \mid \mathbf{W}_k, \nu_k),$$

given the definitions

$$\beta_k = \beta_0 + N_k,$$
$$\mathbf{m}_k = \frac{1}{\beta_k} \left( \beta_0 \boldsymbol\mu_0 + N_k \bar{\mathbf{x}}_k \right),$$
$$\mathbf{W}_k^{-1} = \mathbf{W}_0^{-1} + N_k \mathbf{S}_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{\mathbf{x}}_k - \boldsymbol\mu_0)(\bar{\mathbf{x}}_k - \boldsymbol\mu_0)^{\mathsf{T}},$$
$$\nu_k = \nu_0 + N_k,$$
$$\bar{\mathbf{x}}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \mathbf{x}_n,$$
$$\mathbf{S}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (\mathbf{x}_n - \bar{\mathbf{x}}_k)(\mathbf{x}_n - \bar{\mathbf{x}}_k)^{\mathsf{T}}.$$
Finally, notice that these functions require the values of $r_{nk}$, which make use of $\rho_{nk}$, which is defined in turn based on $\operatorname{E}[\ln \pi_k]$, $\operatorname{E}[\ln |\mathbf{\Lambda}_k|]$, and $\operatorname{E}_{\boldsymbol\mu_k, \mathbf{\Lambda}_k}[(\mathbf{x}_n - \boldsymbol\mu_k)^{\mathsf{T}} \mathbf{\Lambda}_k (\mathbf{x}_n - \boldsymbol\mu_k)]$. Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:

$$\operatorname{E}_{\boldsymbol\mu_k, \mathbf{\Lambda}_k}\!\left[ (\mathbf{x}_n - \boldsymbol\mu_k)^{\mathsf{T}} \mathbf{\Lambda}_k (\mathbf{x}_n - \boldsymbol\mu_k) \right] = D \beta_k^{-1} + \nu_k (\mathbf{x}_n - \mathbf{m}_k)^{\mathsf{T}} \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k),$$
$$\ln \widetilde{\Lambda}_k \triangleq \operatorname{E}[\ln |\mathbf{\Lambda}_k|] = \sum_{i=1}^{D} \psi\!\left( \frac{\nu_k + 1 - i}{2} \right) + D \ln 2 + \ln |\mathbf{W}_k|,$$
$$\ln \widetilde{\pi}_k \triangleq \operatorname{E}[\ln \pi_k] = \psi(\alpha_k) - \psi\!\left( \sum_{j=1}^{K} \alpha_j \right),$$

where $\psi$ is the digamma function. These results lead to

$$r_{nk} \propto \widetilde{\pi}_k\, \widetilde{\Lambda}_k^{1/2} \exp\!\left\{ -\frac{D}{2 \beta_k} - \frac{\nu_k}{2} (\mathbf{x}_n - \mathbf{m}_k)^{\mathsf{T}} \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k) \right\}.$$

These can be converted from proportional to absolute values by normalizing over $k$ so that the corresponding values sum to 1.
Note that:

- The update equations for the parameters $\beta_k$, $\mathbf{m}_k$, $\mathbf{W}_k$ and $\nu_k$ of the variables $\boldsymbol\mu_k$ and $\mathbf{\Lambda}_k$ depend on the statistics $N_k$, $\bar{\mathbf{x}}_k$, and $\mathbf{S}_k$, and these statistics in turn depend on $r_{nk}$.
- The update equations for the parameters $\alpha_{1 \dots K}$ of the variable $\boldsymbol\pi$ depend on the statistic $N_k$, which depends in turn on $r_{nk}$.
- The update equation for $r_{nk}$ has a direct circular dependence on $\beta_k$, $\mathbf{m}_k$, $\mathbf{W}_k$ and $\nu_k$, as well as an indirect circular dependence on $\alpha_k$, $\mathbf{W}_k$ and $\nu_k$ through $\widetilde{\pi}_k$ and $\widetilde{\Lambda}_k$.

This suggests an iterative procedure that alternates between two steps:

- An E-step that computes the value of $r_{nk}$ using the current values of all the other parameters.
- An M-step that uses the new value of $r_{nk}$ to compute new values of all the other parameters.
Note that these steps correspond closely with the standard EM algorithm to derive a maximum likelihood or maximum a posteriori (MAP) solution for the parameters of a Gaussian mixture model. The responsibilities $r_{nk}$ in the E step correspond closely to the posterior probabilities of the latent variables given the data, i.e. $p(z_{nk} = 1 \mid \mathbf{X})$; the computation of the statistics $N_k$, $\bar{\mathbf{x}}_k$, and $\mathbf{S}_k$ corresponds closely to the computation of corresponding "soft-count" statistics over the data; and the use of those statistics to compute new values of the parameters corresponds closely to the use of soft counts to compute new parameter values in normal EM over a Gaussian mixture model.
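The following condensed sketch, written against the update equations above, shows how the E-step and M-step might be implemented. It is a didactic illustration only; the priors, initialization, and number of sweeps are heuristic choices of ours, and no numerical hardening beyond the log-domain stabilization is attempted:

```python
import numpy as np
from scipy.special import digamma

def vb_gmm(X, K, n_iter=100, alpha0=1.0, beta0=1.0, nu0=None, seed=0):
    """Mean-field VB for a Bayesian Gaussian mixture, per the updates above.

    Runs a fixed number of update sweeps and returns the responsibilities r
    and the parameters of q(pi) and q(mu_k, Lambda_k).
    """
    N, D = X.shape
    nu0 = D if nu0 is None else nu0      # prior degrees of freedom (must be > D - 1)
    m0 = X.mean(axis=0)                  # prior mean (data-centered heuristic)
    W0_inv = np.cov(X.T) * D             # prior scale matrix (heuristic)

    rng = np.random.default_rng(seed)
    r = rng.dirichlet(np.ones(K), size=N)   # random initial responsibilities

    for _ in range(n_iter):
        # ---- M-step: update q(pi) and q(mu_k, Lambda_k) from the soft counts.
        Nk = r.sum(axis=0) + 1e-10                        # N_k
        xbar = (r.T @ X) / Nk[:, None]                    # x-bar_k
        alpha = alpha0 + Nk
        beta = beta0 + Nk
        nu = nu0 + Nk
        m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
        W = np.empty((K, D, D))
        for k in range(K):
            diff = X - xbar[k]                            # deviations from x-bar_k
            Sk = (r[:, k, None] * diff).T @ diff / Nk[k]  # S_k
            dm = (xbar[k] - m0)[:, None]
            W_inv = (W0_inv + Nk[k] * Sk
                     + (beta0 * Nk[k] / (beta0 + Nk[k])) * (dm @ dm.T))
            W[k] = np.linalg.inv(W_inv)

        # ---- E-step: recompute the responsibilities r_nk from the expectations.
        ln_pi = digamma(alpha) - digamma(alpha.sum())     # E[ln pi_k]
        ln_rho = np.empty((N, K))
        for k in range(K):
            ln_det = (digamma((nu[k] + 1 - np.arange(1, D + 1)) / 2).sum()
                      + D * np.log(2) + np.linalg.slogdet(W[k])[1])  # E[ln |Lambda_k|]
            diff = X - m[k]
            quad = D / beta[k] + nu[k] * np.einsum('ni,ij,nj->n', diff, W[k], diff)
            ln_rho[:, k] = (ln_pi[k] + 0.5 * ln_det
                            - 0.5 * D * np.log(2 * np.pi) - 0.5 * quad)
        ln_rho -= ln_rho.max(axis=1, keepdims=True)       # stabilize before exp
        r = np.exp(ln_rho)
        r /= r.sum(axis=1, keepdims=True)                 # normalize over k

    return r, alpha, m, beta, W, nu
```

Hard cluster assignments, if needed, can be read off afterwards as `r.argmax(axis=1)`.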
Exponential-family distributions
Note
that in the previous example, once the distribution over unobserved
variables was assumed to factorize into distributions over the
"parameters" and distributions over the "latent data", the derived
"best" distribution for each variable was in the same family as the
corresponding prior distribution over the variable. This is a general
result that holds true for all prior distributions derived from the
exponential family.