Ladder paradox: Difference between revisions

Latest revision as of 15:41, 2 January 2014


Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
To derive a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)

For the first purpose (approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods, particularly Markov chain Monte Carlo methods such as Gibbs sampling, for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate or sample from directly. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior.

Variational Bayes can be seen as an extension of the EM (expectation-maximization) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.

For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to iteratively update the parameters often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.

Mathematical derivation of the mean-field approximation

In variational inference, the posterior distribution over a set of unobserved variables $\mathbf {Z} =\{Z_{1}\dots Z_{n}\}$ given some data $\mathbf {X}$ is approximated by a variational distribution, $Q(\mathbf {Z} )$ :

P(\mathbf {Z} \mid \mathbf {X} )\approx Q(\mathbf {Z} ).

The distribution $Q(\mathbf {Z} )$ is restricted to belong to a family of distributions of simpler form than $P(\mathbf {Z} \mid \mathbf {X} )$ , selected with the intention of making $Q(\mathbf {Z} )$ similar to the true posterior, $P(\mathbf {Z} \mid \mathbf {X} )$ . The lack of similarity is measured in terms of a dissimilarity function $d(Q;P)$ and hence inference is performed by selecting the distribution $Q(\mathbf {Z} )$ that minimizes $d(Q;P)$ .

The most common type of variational Bayes, known as mean-field variational Bayes, uses the Kullback–Leibler divergence (KL-divergence) of P from Q as the choice of dissimilarity function. This choice makes this minimization tractable. The KL-divergence is defined as

D_{\mathrm {KL} }(Q||P)=\sum _{\mathbf {Z} }Q(\mathbf {Z} )\log {\frac {Q(\mathbf {Z} )}{P(\mathbf {Z} \mid \mathbf {X} )}}.

Note that Q and P are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the expectation-maximization algorithm. (Using the KL-divergence in the other way produces the expectation propagation algorithm.)
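The asymmetry of the KL-divergence can be checked numerically. The following is a minimal sketch for discrete distributions; the three-state distributions `p` and `q` are made-up illustrative values, not taken from any model in this article.

```python
import math

def kl_divergence(q, p):
    """D_KL(Q || P) = sum_z Q(z) * log(Q(z) / P(z)) for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)

# Hypothetical three-state "posterior" P and candidate approximation Q.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl_qp = kl_divergence(q, p)  # the reversed form used by variational Bayes
kl_pq = kl_divergence(p, q)  # the other direction

print(kl_qp, kl_pq)
```

Both values are non-negative, but they differ, which is why the direction of the divergence matters when choosing what to minimize.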

The KL-divergence can be written as

D_{\mathrm {KL} }(Q||P)=\sum _{\mathbf {Z} }Q(\mathbf {Z} )\log {\frac {Q(\mathbf {Z} )}{P(\mathbf {Z} ,\mathbf {X} )}}+\log P(\mathbf {X} ),

or

{\begin{aligned}\log P(\mathbf {X} )&=D_{\mathrm {KL} }(Q||P)-\sum _{\mathbf {Z} }Q(\mathbf {Z} )\log {\frac {Q(\mathbf {Z} )}{P(\mathbf {Z} ,\mathbf {X} )}}\\&=D_{\mathrm {KL} }(Q||P)+{\mathcal {L}}(Q).\end{aligned}}

As the log evidence $\log P(\mathbf {X} )$ is fixed with respect to $Q$ , maximizing the final term ${\mathcal {L}}(Q)$ minimizes the KL divergence of $P$ from $Q$ . By appropriate choice of $Q$ , ${\mathcal {L}}(Q)$ becomes tractable to compute and to maximize. Hence we have both an analytical approximation $Q$ for the posterior $P(\mathbf {Z} \mid \mathbf {X} )$ , and a lower bound ${\mathcal {L}}(Q)$ for the evidence $\log P(\mathbf {X} )$ . The lower bound ${\mathcal {L}}(Q)$ is known as the (negative) variational free energy because it can also be expressed as an "energy" $\operatorname {E} _{Q}[\log P(\mathbf {Z} ,\mathbf {X} )]$ plus the entropy of $Q$ .
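The decomposition $\log P(\mathbf {X} )=D_{\mathrm {KL} }(Q||P)+{\mathcal {L}}(Q)$ can be verified on a toy discrete model. This is only a sketch: the values in `joint` (standing for $P(\mathbf {Z} =z,\mathbf {X} =x_{\text{obs}})$ over three latent states) and in `q` are assumptions chosen for illustration.

```python
import math

# Hypothetical joint probabilities P(Z = z, X = x_obs) for three latent states.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # P(X = x_obs)
posterior = [j / evidence for j in joint]   # P(Z | X = x_obs)

# An arbitrary variational distribution Q(Z).
q = [0.2, 0.5, 0.3]

# D_KL(Q || P), measured against the true posterior.
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

# Lower bound L(Q) = -sum_Z Q(Z) log(Q(Z) / P(Z, X)).
elbo = -sum(qz * math.log(qz / jz) for qz, jz in zip(q, joint))

# The same bound written as "energy" E_Q[log P(Z, X)] plus the entropy of Q.
energy = sum(qz * math.log(jz) for qz, jz in zip(q, joint))
entropy = -sum(qz * math.log(qz) for qz in q)

print(math.log(evidence), kl + elbo, energy + entropy)
```

Because the KL-divergence is non-negative, `elbo` never exceeds `math.log(evidence)`, and the two coincide only when `q` equals the posterior.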

In practice

The variational distribution $Q(\mathbf {Z} )$ is usually assumed to factorize over some partition of the latent variables, i.e. for some partition of the latent variables $\mathbf {Z}$ into $\mathbf {Z} _{1}\dots \mathbf {Z} _{M}$ ,

Q(\mathbf {Z} )=\prod _{i=1}^{M}q_{i}(\mathbf {Z} _{i}\mid \mathbf {X} )

It can be shown using the calculus of variations (hence the name "variational Bayes") that the "best" distribution $q_{j}^{*}$ for each of the factors $q_{j}$ (in terms of the distribution minimizing the KL divergence, as described above) can be expressed as:

q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )={\frac {e^{\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]}}{\int e^{\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]}\,d\mathbf {Z} _{j}}}

where $\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]$ is the expectation of the logarithm of the joint probability of the data and latent variables, taken over all variables not in the partition.

In practice, we usually work in terms of logarithms, i.e.:

\ln q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )=\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]+{\text{constant}}

The constant in the above expression is related to the normalizing constant (the denominator in the expression above for $q_{j}^{*}$ ) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g. Gaussian, gamma, etc.).

Using the properties of expectations, the expression $\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]$ can usually be simplified into a function of the fixed hyperparameters of the prior distributions over the latent variables and of expectations (and sometimes higher moments such as the variance) of latent variables not in the current partition (i.e. latent variables not included in $\mathbf {Z} _{j}$ ). This creates circular dependencies between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests an iterative algorithm, much like EM (the expectation-maximization algorithm), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed to converge.^[1] Furthermore, if the distributions in question are part of the exponential family, which is usually the case, convergence will be to a global maximum, since the exponential family is convex.^[2]

In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's functional dependency on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the means); sometimes expectations of squared variables (which can be related to the variance of the variables), or expectations of higher powers (i.e. higher moments) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, nonlinear dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
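As a warm-up before the Gaussian example, the alternating update $\ln q_{j}^{*}=\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]+{\text{constant}}$ can be sketched for two discrete latent variables, where the expectations are finite sums and the constant is restored by normalizing. The joint table below is a hypothetical example, and the helper names `normalize_exp` and `elbo` are introduced here only for illustration.

```python
import math

def normalize_exp(log_weights):
    """Exponentiate and normalize; this reinstates the omitted constant."""
    m = max(log_weights)
    weights = [math.exp(w - m) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def elbo(log_joint, q1, q2):
    """L(Q) = E_Q[ln p(Z, X)] + entropy(Q) for Q(z1, z2) = q1(z1) * q2(z2)."""
    value = 0.0
    for i, a in enumerate(q1):
        for j, b in enumerate(q2):
            if a * b > 0.0:
                value += a * b * (log_joint[i][j] - math.log(a * b))
    return value

# Hypothetical joint p(z1, z2, X = x_obs), z1 in {0, 1}, z2 in {0, 1, 2}.
joint = [[0.20, 0.05, 0.05],
         [0.02, 0.10, 0.08]]
log_joint = [[math.log(v) for v in row] for row in joint]

q1 = [0.5, 0.5]
q2 = [1.0 / 3.0] * 3
history = [elbo(log_joint, q1, q2)]

for _ in range(50):
    # ln q1*(z1) = E_{q2}[ln p(z1, z2, X)] + constant
    q1 = normalize_exp([sum(q2[j] * log_joint[i][j] for j in range(3))
                        for i in range(2)])
    # ln q2*(z2) = E_{q1}[ln p(z1, z2, X)] + constant
    q2 = normalize_exp([sum(q1[i] * log_joint[i][j] for i in range(2))
                        for j in range(3)])
    history.append(elbo(log_joint, q1, q2))

log_evidence = math.log(sum(sum(row) for row in joint))
print(history[0], history[-1], log_evidence)
```

Each sweep can only increase the lower bound, so `history` is non-decreasing and stays below `log_evidence`.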

A basic example

Consider a simple non-hierarchical Bayesian model consisting of a set of i.i.d. observations from a Gaussian distribution, with unknown mean and variance.^[3] In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.

For mathematical convenience, in the following example we work in terms of the precision — i.e. the reciprocal of the variance — rather than the variance itself. (From a theoretical standpoint, precision and variance are equivalent since there is a one-to-one correspondence between the two.)

The mathematical model

We place conjugate prior distributions on the unknown mean and variance, i.e. the mean also follows a Gaussian distribution while the precision follows a gamma distribution. In other words:

{\begin{aligned}\mu &\sim {\mathcal {N}}(\mu _{0},(\lambda _{0}\tau )^{-1})\\\tau &\sim \operatorname {Gamma} (a_{0},b_{0})\\\{x_{1},\dots ,x_{N}\}&\sim {\mathcal {N}}(\mu ,\tau ^{-1})\\N&={\text{number of data points}}\end{aligned}}

We are given $N$ data points $\mathbf {X} =\{x_{1},\dots ,x_{N}\}$ and our goal is to infer the posterior distribution $q(\mu ,\tau )=p(\mu ,\tau \mid x_{1},\ldots ,x_{N})$ of the parameters $\mu$ and $\tau$ .

The hyperparameters $\mu _{0}$ , $\lambda _{0}$ , $a_{0}$ and $b_{0}$ are fixed, given values. They can be set to small positive numbers to give broad prior distributions indicating ignorance about the prior distributions of $\mu$ and $\tau$ .

The joint probability

The joint probability of all variables can be rewritten as

p(\mathbf {X} ,\mu ,\tau )=p(\mathbf {X} \mid \mu ,\tau )p(\mu \mid \tau )p(\tau )

where the individual factors are

{\begin{aligned}p(\mathbf {X} \mid \mu ,\tau )&=\prod _{n=1}^{N}{\mathcal {N}}(x_{n}\mid \mu ,\tau ^{-1})\\p(\mu \mid \tau )&={\mathcal {N}}(\mu \mid \mu _{0},(\lambda _{0}\tau )^{-1})\\p(\tau )&=\operatorname {Gamma} (\tau \mid a_{0},b_{0})\end{aligned}}

where

{\begin{aligned}{\mathcal {N}}(x\mid \mu ,\sigma ^{2})&={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{\frac {-(x-\mu )^{2}}{2\sigma ^{2}}}\\\operatorname {Gamma} (\tau \mid a,b)&={\frac {1}{\Gamma (a)}}b^{a}\tau ^{a-1}e^{-b\tau }\end{aligned}}
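The factorization of the joint probability translates directly into a sum of log-densities. The sketch below assumes the rate parameterization of the gamma distribution shown above; the function names and the data values are hypothetical.

```python
import math

def log_normal_pdf(x, mean, variance):
    """ln N(x | mean, variance)."""
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - (x - mean) ** 2 / (2.0 * variance))

def log_gamma_pdf(tau, a, b):
    """ln Gamma(tau | a, b) with shape a and rate b."""
    return (a * math.log(b) - math.lgamma(a)
            + (a - 1.0) * math.log(tau) - b * tau)

def log_joint(xs, mu, tau, mu0, lambda0, a0, b0):
    """ln p(X, mu, tau) = ln p(X | mu, tau) + ln p(mu | tau) + ln p(tau)."""
    log_likelihood = sum(log_normal_pdf(x, mu, 1.0 / tau) for x in xs)
    log_prior_mu = log_normal_pdf(mu, mu0, 1.0 / (lambda0 * tau))
    log_prior_tau = log_gamma_pdf(tau, a0, b0)
    return log_likelihood + log_prior_mu + log_prior_tau

# Illustrative data and hyperparameters (assumed values).
xs = [1.2, 0.7, 2.1, 1.6]
value = log_joint(xs, mu=1.0, tau=2.0, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0)
print(value)
```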

Factorized approximation

Assume that $q(\mu ,\tau )=q(\mu )q(\tau )$ , i.e. that the posterior distribution factorizes into independent factors for $\mu$ and $\tau$ . This type of assumption underlies the variational Bayesian method. The true posterior distribution does not in fact factor this way (in fact, in this simple case, it is known to be a Gaussian-gamma distribution), and hence the result we obtain will be an approximation.

Derivation of q(μ)

Then

{\begin{aligned}\ln q_{\mu }^{*}(\mu )&=\operatorname {E} _{\tau }\left[\ln p(\mathbf {X} \mid \mu ,\tau )+\ln p(\mu \mid \tau )+\ln p(\tau )\right]+C\\&=\operatorname {E} _{\tau }\left[\ln p(\mathbf {X} \mid \mu ,\tau )\right]+\operatorname {E} _{\tau }\left[\ln p(\mu \mid \tau )\right]+\operatorname {E} _{\tau }\left[\ln p(\tau )\right]+C\\&=\operatorname {E} _{\tau }\left[\ln \prod _{n=1}^{N}{\mathcal {N}}(x_{n}\mid \mu ,\tau ^{-1})\right]+\operatorname {E} _{\tau }\left[\ln {\mathcal {N}}(\mu \mid \mu _{0},(\lambda _{0}\tau )^{-1})\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\ln \prod _{n=1}^{N}{\sqrt {\frac {\tau }{2\pi }}}e^{-{\frac {(x_{n}-\mu )^{2}\tau }{2}}}\right]+\operatorname {E} _{\tau }\left[\ln {\sqrt {\frac {\lambda _{0}\tau }{2\pi }}}e^{-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}}\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}\left({\frac {1}{2}}(\ln \tau -\ln 2\pi )-{\frac {(x_{n}-\mu )^{2}\tau }{2}}\right)\right]+\operatorname {E} _{\tau }\left[{\frac {1}{2}}(\ln \lambda _{0}+\ln \tau -\ln 2\pi )-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}-{\frac {(x_{n}-\mu )^{2}\tau }{2}}\right]+\operatorname {E} _{\tau }\left[-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}\right]+\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}{\frac {1}{2}}(\ln \tau -\ln 2\pi )\right]+\operatorname {E} _{\tau }\left[{\frac {1}{2}}(\ln \lambda _{0}+\ln \tau -\ln 2\pi )\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}-{\frac {(x_{n}-\mu )^{2}\tau }{2}}\right]+\operatorname {E} _{\tau }\left[-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}\right]+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right\}+C_{3}\end{aligned}}

In the above derivation, $C$ , $C_{2}$ and $C_{3}$ refer to values that are constant with respect to $\mu$ . Note that the term $\operatorname {E} _{\tau }[\ln p(\tau )]$ is not a function of $\mu$ and will have the same value regardless of the value of $\mu$ . Hence in line 3 we can absorb it into the constant term at the end. We do the same thing in line 7.

The last line is simply a quadratic polynomial in $\mu$ . Since this is the logarithm of $q_{\mu }^{*}(\mu )$ , we can see that $q_{\mu }^{*}(\mu )$ itself is a Gaussian distribution.

With a certain amount of tedious math (expanding the squares inside of the braces, separating out and grouping the terms involving $\mu$ and $\mu ^{2}$ and completing the square over $\mu$ ), we can derive the parameters of the Gaussian distribution:

{\begin{aligned}\ln q_{\mu }^{*}(\mu )&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\{\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\{\sum _{n=1}^{N}(x_{n}^{2}-2x_{n}\mu +\mu ^{2})+\lambda _{0}(\mu ^{2}-2\mu _{0}\mu +\mu _{0}^{2})\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\{(\sum _{n=1}^{N}x_{n}^{2})-2(\sum _{n=1}^{N}x_{n})\mu +(\sum _{n=1}^{N}\mu ^{2})+\lambda _{0}\mu ^{2}-2\lambda _{0}\mu _{0}\mu +\lambda _{0}\mu _{0}^{2}\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\{(\lambda _{0}+N)\mu ^{2}-2(\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n})\mu +(\textstyle \sum _{n=1}^{N}x_{n}^{2})+\lambda _{0}\mu _{0}^{2}\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\{(\lambda _{0}+N)\mu ^{2}-2(\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n})\mu \}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\mu ^{2}-2{\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}(\lambda _{0}+N)\mu \right\}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu ^{2}-2{\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\mu \right)\right\}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu ^{2}-2{\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\mu +\left({\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}-\left({\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}\right)\right\}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu ^{2}-2{\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\mu +\left({\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}\right)\right\}+C_{5}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu -{\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}\right\}+C_{5}\\&=-{\frac {1}{2}}\left\{(\lambda _{0}+N)\operatorname {E} _{\tau }[\tau ]\left(\mu -{\frac {\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}\right\}+C_{5}\\\end{aligned}}

Note that all of the above steps can be shortened by using the formula for the sum of two quadratics.

In other words:

{\begin{aligned}q_{\mu }^{*}(\mu )&\sim {\mathcal {N}}(\mu \mid \mu _{N},\lambda _{N}^{-1})\\\mu _{N}&={\frac {\lambda _{0}\mu _{0}+N{\bar {x}}}{\lambda _{0}+N}}\\\lambda _{N}&=(\lambda _{0}+N)\operatorname {E} [\tau ]\\{\bar {x}}&={\frac {1}{N}}\sum _{n=1}^{N}x_{n}\end{aligned}}
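The completed square can be checked numerically: the quadratic form from the derivation and the exponent of ${\mathcal {N}}(\mu \mid \mu _{N},\lambda _{N}^{-1})$ should differ only by a quantity that does not depend on $\mu$. The sketch below uses made-up data and an assumed value for $\operatorname {E} [\tau ]$; the function names are hypothetical.

```python
def q_mu_params(xs, mu0, lambda0, expected_tau):
    """Return (mu_N, lambda_N) for the Gaussian factor q*(mu)."""
    n = len(xs)
    x_bar = sum(xs) / n
    mu_n = (lambda0 * mu0 + n * x_bar) / (lambda0 + n)
    lambda_n = (lambda0 + n) * expected_tau
    return mu_n, lambda_n

def log_q_mu_unnormalized(mu, xs, mu0, lambda0, expected_tau):
    """-E[tau]/2 * {sum_n (x_n - mu)^2 + lambda0 * (mu - mu0)^2}."""
    quadratic = sum((x - mu) ** 2 for x in xs) + lambda0 * (mu - mu0) ** 2
    return -0.5 * expected_tau * quadratic

# Illustrative inputs (assumed values).
xs = [1.2, 0.7, 2.1, 1.6]
mu0, lambda0, expected_tau = 0.0, 1.0, 2.0

mu_n, lambda_n = q_mu_params(xs, mu0, lambda0, expected_tau)

# Add back the Gaussian exponent; what remains should be constant in mu.
offsets = [
    log_q_mu_unnormalized(mu, xs, mu0, lambda0, expected_tau)
    + 0.5 * lambda_n * (mu - mu_n) ** 2
    for mu in (-3.0, 0.0, 0.5, 4.0)
]
print(mu_n, lambda_n, offsets)
```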

Derivation of q(τ)

The derivation of $q_{\tau }^{*}(\tau )$ is similar to above, although we omit some of the details for the sake of brevity.

{\begin{aligned}\ln q_{\tau }^{*}(\tau )&=\operatorname {E} _{\mu }[\ln p(\mathbf {X} \mid \mu ,\tau )+\ln p(\mu \mid \tau )]+\ln p(\tau )+{\text{constant}}\\&=(a_{0}-1)\ln \tau -b_{0}\tau +{\frac {1}{2}}\ln \tau +{\frac {N}{2}}\ln \tau -{\frac {\tau }{2}}\operatorname {E} _{\mu }[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}]+{\text{constant}}\end{aligned}}

Exponentiating both sides, we can see that $q_{\tau }^{*}(\tau )$ is a gamma distribution. Specifically:

{\begin{aligned}q_{\tau }^{*}(\tau )&\sim \operatorname {Gamma} (\tau \mid a_{N},b_{N})\\a_{N}&=a_{0}+{\frac {N+1}{2}}\\b_{N}&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right]\end{aligned}}
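The identification of the gamma distribution can be checked in the same spirit: the expression for $\ln q_{\tau }^{*}(\tau )$ and the log-density of $\operatorname {Gamma} (\tau \mid a_{N},b_{N})$ should differ only by a constant in $\tau$. In this sketch the expected quadratic term is replaced by an assumed number, `expected_quadratic`, since its actual value depends on $q_{\mu }^{*}$.

```python
import math

def log_q_tau_unnormalized(tau, n, a0, b0, expected_quadratic):
    """Second line of the derivation of ln q*(tau), without the constant."""
    return ((a0 - 1.0) * math.log(tau) - b0 * tau
            + 0.5 * math.log(tau) + 0.5 * n * math.log(tau)
            - 0.5 * tau * expected_quadratic)

def log_gamma_pdf(tau, a, b):
    """ln Gamma(tau | a, b) with shape a and rate b."""
    return (a * math.log(b) - math.lgamma(a)
            + (a - 1.0) * math.log(tau) - b * tau)

# Illustrative inputs (assumed values).
n, a0, b0 = 4, 1.0, 1.0
expected_quadratic = 3.0  # stands for E_mu[sum (x_n - mu)^2 + lambda0 (mu - mu0)^2]

a_n = a0 + (n + 1) / 2.0
b_n = b0 + 0.5 * expected_quadratic

offsets = [
    log_q_tau_unnormalized(tau, n, a0, b0, expected_quadratic)
    - log_gamma_pdf(tau, a_n, b_n)
    for tau in (0.1, 0.5, 1.0, 3.0)
]
print(a_n, b_n, offsets)
```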

Algorithm for computing the parameters

Let us recap the conclusions from the previous sections:

{\begin{aligned}q_{\mu }^{*}(\mu )&\sim {\mathcal {N}}(\mu \mid \mu _{N},\lambda _{N}^{-1})\\\mu _{N}&={\frac {\lambda _{0}\mu _{0}+N{\bar {x}}}{\lambda _{0}+N}}\\\lambda _{N}&=(\lambda _{0}+N)\operatorname {E} [\tau ]\\{\bar {x}}&={\frac {1}{N}}\sum _{n=1}^{N}x_{n}\end{aligned}}

and

{\begin{aligned}q_{\tau }^{*}(\tau )&\sim \operatorname {Gamma} (\tau \mid a_{N},b_{N})\\a_{N}&=a_{0}+{\frac {N+1}{2}}\\b_{N}&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right]\end{aligned}}

In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable. We can expand the expectations, using the standard formulas for the expectations of moments of the Gaussian and gamma distributions:

{\begin{aligned}\operatorname {E} [\tau \mid a_{N},b_{N}]&={\frac {a_{N}}{b_{N}}}\\\operatorname {E} [\mu \mid \mu _{N},\lambda _{N}^{-1}]&=\mu _{N}\\\operatorname {E} \left[X^{2}\right]&=\operatorname {Var} (X)+(\operatorname {E} [X])^{2}\\\operatorname {E} [\mu ^{2}\mid \mu _{N},\lambda _{N}^{-1}]&=\lambda _{N}^{-1}+\mu _{N}^{2}\end{aligned}}

Applying these formulas to the above equations is trivial in most cases, but the equation for $b_{N}$ takes more work:

{\begin{aligned}b_{N}&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right]\\&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[(\lambda _{0}+N)\mu ^{2}-2(\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n})\mu +(\textstyle \sum _{n=1}^{N}x_{n}^{2})+\lambda _{0}\mu _{0}^{2}\right]\\&=b_{0}+{\frac {1}{2}}\left[(\lambda _{0}+N)\operatorname {E} _{\mu }[\mu ^{2}]-2(\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n})\operatorname {E} _{\mu }[\mu ]+(\textstyle \sum _{n=1}^{N}x_{n}^{2})+\lambda _{0}\mu _{0}^{2}\right]\\&=b_{0}+{\frac {1}{2}}\left[(\lambda _{0}+N)(\lambda _{N}^{-1}+\mu _{N}^{2})-2(\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n})\mu _{N}+(\textstyle \sum _{n=1}^{N}x_{n}^{2})+\lambda _{0}\mu _{0}^{2}\right]\\\end{aligned}}
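As a sanity check on the expansion, the final expression for $b_{N}$ can be compared with a term-by-term evaluation that uses $\operatorname {E} [(c-\mu )^{2}]=(c-\mu _{N})^{2}+\lambda _{N}^{-1}$ for each squared term. The sketch below uses assumed values for the data and for $\mu _{N}$ and $\lambda _{N}$; both function names are hypothetical.

```python
def b_n_closed_form(xs, mu0, lambda0, b0, mu_n, lambda_n):
    """b_N as written in the last line of the expansion."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return b0 + 0.5 * ((lambda0 + n) * (1.0 / lambda_n + mu_n ** 2)
                       - 2.0 * (lambda0 * mu0 + sum_x) * mu_n
                       + sum_x2 + lambda0 * mu0 ** 2)

def b_n_termwise(xs, mu0, lambda0, b0, mu_n, lambda_n):
    """Same quantity, taking the expectation of each squared term separately."""
    variance = 1.0 / lambda_n
    data_term = sum((x - mu_n) ** 2 + variance for x in xs)
    prior_term = lambda0 * ((mu_n - mu0) ** 2 + variance)
    return b0 + 0.5 * (data_term + prior_term)

# Illustrative inputs (assumed values).
xs = [1.2, 0.7, 2.1, 1.6]
mu0, lambda0, b0 = 0.0, 1.0, 1.0
mu_n, lambda_n = 1.12, 6.0

closed = b_n_closed_form(xs, mu0, lambda0, b0, mu_n, lambda_n)
termwise = b_n_termwise(xs, mu0, lambda0, b0, mu_n, lambda_n)
print(closed, termwise)
```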

We can then write the parameter equations as follows, without any expectations:

{\begin{aligned}\mu _{N}&={\frac {\lambda _{0}\mu _{0}+N{\bar {x}}}{\lambda _{0}+N}}\\\lambda _{N}&=(\lambda _{0}+N){\frac {a_{N}}{b_{N}}}\\{\bar {x}}&={\frac {1}{N}}\sum _{n=1}^{N}x_{n}\\a_{N}&=a_{0}+{\frac {N+1}{2}}\\b_{N}&=b_{0}+{\frac {1}{2}}\left[(\lambda _{0}+N)(\lambda _{N}^{-1}+\mu _{N}^{2})-2(\lambda _{0}\mu _{0}+\textstyle \sum _{n=1}^{N}x_{n})\mu _{N}+(\textstyle \sum _{n=1}^{N}x_{n}^{2})+\lambda _{0}\mu _{0}^{2}\right]\end{aligned}}

Note that there are circular dependencies among the formulas for $\mu _{N}$ , $\lambda _{N}$ and $b_{N}$ . This naturally suggests an EM-like algorithm:

Compute $\sum _{n=1}^{N}x_{n}$ and $\sum _{n=1}^{N}x_{n}^{2}.$ Use these values to compute $\mu _{N}$ and $a_{N}.$
Initialize $\lambda _{N}$ to some arbitrary value.
Use the current value of $\lambda _{N},$ along with the known values of the other parameters, to compute $b_{N}$ .
Use the current value of $b_{N},$ along with the known values of the other parameters, to compute $\lambda _{N}$ .
Repeat the last two steps until convergence (i.e. until neither value has changed more than some small amount).
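The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `fit_gaussian_vb`, the starting value `1.0` for $\lambda _{N}$, the tolerance and the example data are all choices made here.

```python
def fit_gaussian_vb(xs, mu0, lambda0, a0, b0, tol=1e-10, max_iter=1000):
    """Iterate the update equations for (mu_N, lambda_N, a_N, b_N)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)

    # Step 1: mu_N and a_N depend only on the data and the hyperparameters.
    mu_n = (lambda0 * mu0 + sum_x) / (lambda0 + n)
    a_n = a0 + (n + 1) / 2.0

    # Step 2: arbitrary positive starting value (an assumption of this sketch).
    lambda_n = 1.0
    b_n = b0

    for _ in range(max_iter):
        # Step 3: update b_N from the current lambda_N.
        new_b_n = b0 + 0.5 * ((lambda0 + n) * (1.0 / lambda_n + mu_n ** 2)
                              - 2.0 * (lambda0 * mu0 + sum_x) * mu_n
                              + sum_x2 + lambda0 * mu0 ** 2)
        # Step 4: update lambda_N from the new b_N.
        new_lambda_n = (lambda0 + n) * a_n / new_b_n

        # Step 5: stop once neither value changes by more than tol.
        converged = (abs(new_b_n - b_n) < tol
                     and abs(new_lambda_n - lambda_n) < tol)
        b_n, lambda_n = new_b_n, new_lambda_n
        if converged:
            break

    return mu_n, lambda_n, a_n, b_n

# Illustrative data and hyperparameters (assumed values).
xs = [1.2, 0.7, 2.1, 1.6]
mu_n, lambda_n, a_n, b_n = fit_gaussian_vb(xs, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0)
print(mu_n, lambda_n, a_n, b_n)
```

For this particular model, substituting $\lambda _{N}=(\lambda _{0}+N)a_{N}/b_{N}$ into the equation for $b_{N}$ gives a linear equation in $b_{N}$, so the fixed point can also be written in closed form; the loop is kept because it mirrors the general procedure.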

We then have values for the hyperparameters of the approximating distributions of the posterior parameters, which we can use to compute any properties we want of the posterior — e.g. its mean and variance, a 95% highest-density region (the smallest interval that includes 95% of the total probability), etc.
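Once the four values are available, such summaries follow from standard formulas. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the numeric inputs are hypothetical stand-ins for converged values, and the function names are introduced here for illustration. For the Gaussian factor, the highest-density region coincides with the central interval because the density is symmetric and unimodal; no interval is computed for the gamma factor, where that shortcut does not apply.

```python
from statistics import NormalDist

def summarize_q_mu(mu_n, lambda_n, mass=0.95):
    """Mean, variance and central interval of N(mu | mu_N, 1 / lambda_N)."""
    sd = lambda_n ** -0.5
    dist = NormalDist(mu_n, sd)
    lower = dist.inv_cdf((1.0 - mass) / 2.0)
    upper = dist.inv_cdf((1.0 + mass) / 2.0)
    return {"mean": mu_n, "variance": 1.0 / lambda_n, "interval": (lower, upper)}

def summarize_q_tau(a_n, b_n):
    """Mean and variance of Gamma(tau | a_N, b_N) with rate b_N."""
    return {"mean": a_n / b_n, "variance": a_n / b_n ** 2}

# Hypothetical converged values.
mu_summary = summarize_q_mu(mu_n=1.12, lambda_n=6.48)
tau_summary = summarize_q_tau(a_n=3.5, b_n=2.7)
print(mu_summary)
print(tau_summary)
```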

It can be shown that this algorithm is guaranteed to converge to a local maximum, and since both posterior distributions are in the exponential family, this local maximum will be a global maximum.

Note also that the posterior distributions have the same form as the corresponding prior distributions. We did not assume this; the only assumption we made was that the distributions factorize, and the form of the distributions followed naturally. It turns out (see below) that the fact that the posterior distributions have the same form as the prior distributions is not a coincidence, but a general result whenever the prior distributions are members of the exponential family, which is the case for most of the standard distributions.

Further discussion

Step-by-step recipe

The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived:

Describe the network with a graphical model, identifying the observed variables (data) $\mathbf {X}$ and unobserved variables (parameters ${\boldsymbol {\Theta }}$ and latent variables $\mathbf {Z}$ ) and their conditional probability distributions. Variational Bayes will then construct an approximation to the posterior probability $p(\mathbf {Z} ,{\boldsymbol {\Theta }}\mid \mathbf {X} )$ . The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more independent distributions over disjoint subsets of the unobserved variables.
Partition the unobserved variables into two or more subsets, over which the independent factors will be derived. There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable. Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result. Assume that the partitions are called $\mathbf {Z} _{1},\ldots ,\mathbf {Z} _{M}$ .
For a given partition $\mathbf {Z} _{j}$ , write down the formula for the best approximating distribution $q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )$ using the basic equation $\ln q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )=\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]+{\text{constant}}$ .
Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables in $\mathbf {Z} _{j}$ can be ignored; they will be folded into the constant term.
Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not in $\mathbf {Z} _{j}$ (e.g. first or second raw moments, expectation of a logarithm, etc.). In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or hyperparameters of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.
The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates the probability density function (PDF) of the distribution (or at least, something proportional to it, with unknown normalization constant). In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.
An expectation-maximization (EM) type procedure is then applied, picking an initial value for each parameter and then iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn. This is guaranteed to converge.

Most important points

Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:

The idea of variational Bayes is to construct an analytical approximation to the posterior probability of the set of unobserved variables (parameters and latent variables), given the data. This means that the form of the solution is similar to other Bayesian inference methods, such as Gibbs sampling — i.e. a distribution that seeks to describe everything that is known about the variables. As in other Bayesian methods — but unlike e.g. in expectation maximization (EM) or other maximum likelihood methods — both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. as random variables. Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a single point estimate or deriving a credible interval, highest density region, etc.
"Analytical approximation" means that a formula can be written down for the posterior distribution. The formula generally consists of a product of well-known probability distributions, each of which factorizes over a set of unobserved variables (i.e. it is conditionally independent of the other variables, given the observed data). This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely in the lowest moments of the unobserved variables, e.g. the mean and variance.
The result of all of the mathematical manipulations is (1) the identity of the probability distributions making up the factors, and (2) mutually dependent formulas for the parameters of these distributions. The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.

Compared with expectation maximization (EM)

Variational Bayes (VB) is often compared with expectation maximization (EM). The actual numerical procedure is quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values. The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulations.

However, there are a number of differences. Most important is what is being computed.

EM computes point estimates for those random variables that can be categorized as "parameters", whereas for the latent variables it computes estimates of the actual posterior distributions (at least in "soft EM", and often only when the latent variables are discrete). The point estimates computed are the modes of these parameters; no other information is available.
VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally the mean is used rather than the mode, as is normal in Bayesian inference. Concomitant with this, it should be noted that the parameters computed in VB do not have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussian mixture model will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters — typically in the form of a prior distribution, e.g. a normal-scaled inverse gamma distribution — and would then compute values for the parameters of this prior distribution, i.e. essentially hyperparameters. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.

A more complex example

Figure: Bayesian Gaussian mixture model using plate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size K; [D,D] means a matrix of size D×D; K alone means a categorical variable with K outcomes. The squiggly line coming from z ending in a crossbar indicates a switch: the value of this variable selects, for the other incoming variables, which value to use out of the size-K array of possible values.

Imagine a Bayesian Gaussian mixture model described as follows:

{\begin{aligned}\mathbf {\pi } &\sim \operatorname {SymDir} (K,\alpha _{0})\\\mathbf {\Lambda } _{i=1\dots K}&\sim {\mathcal {W}}(\mathbf {W} _{0},\nu _{0})\\\mathbf {\mu } _{i=1\dots K}&\sim {\mathcal {N}}(\mathbf {\mu } _{0},(\beta _{0}\mathbf {\Lambda } _{i})^{-1})\\\mathbf {z} _{i=1\dots N}&\sim \operatorname {Mult} (1,\mathbf {\pi } )\\\mathbf {x} _{i=1\dots N}&\sim {\mathcal {N}}(\mathbf {\mu } _{z_{i}},{\mathbf {\Lambda } _{z_{i}}}^{-1})\\K&={\text{number of mixing components}}\\N&={\text{number of data points}}\end{aligned}}

Note:

SymDir() is the symmetric Dirichlet distribution of dimension $K$ , with the hyperparameter for each component set to $\alpha _{0}$ . The Dirichlet distribution is the conjugate prior of the categorical distribution or multinomial distribution.
${\mathcal {W}}()$ is the Wishart distribution, which is the conjugate prior of the precision matrix (inverse covariance matrix) for a multivariate Gaussian distribution.
Mult() is a multinomial distribution over a single observation (equivalent to a categorical distribution). The state space is a "one-of-K" representation, i.e. a $K$ -dimensional vector in which one of the elements is 1 (specifying the identity of the observation) and all other elements are 0.
${\mathcal {N}}()$ is the Gaussian distribution, in this case specifically the multivariate Gaussian distribution.

The interpretation of the above variables is as follows:

$\mathbf {X} =\{\mathbf {x} _{1},\dots ,\mathbf {x} _{N}\}$ is the set of $N$ data points, each of which is a $D$-dimensional vector distributed according to a multivariate Gaussian distribution.
$\mathbf {Z} =\{\mathbf {z} _{1},\dots ,\mathbf {z} _{N}\}$ is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using a "one-of-K" vector representation with components $z_{nk}$ for $k=1\dots K$ , as described above.
$\mathbf {\pi }$ is the mixing proportions for the $K$ mixture components.
$\mathbf {\mu } _{i=1\dots K}$ and $\mathbf {\Lambda } _{i=1\dots K}$ specify the parameters (mean and precision) associated with each mixture component.
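
To make the generative process concrete, the following is a minimal NumPy sketch that draws one synthetic data set from the model by ancestral sampling. The function names `sample_wishart` and `sample_bayesian_gmm`, the default hyperparameter values, and the choices $\mathbf{\mu}_0 = \mathbf{0}$ and $\mathbf{W}_0 = \mathbf{I}$ are illustrative assumptions rather than part of the model definition. The Wishart draw uses the sum-of-outer-products construction, which is valid only for integer $\nu_0 \geq D$.

```python
import numpy as np


def sample_wishart(rng, W, nu):
    """Draw Lambda ~ W(W, nu) for integer nu >= D as a sum of outer products."""
    D = W.shape[0]
    G = rng.multivariate_normal(np.zeros(D), W, size=nu)  # rows g_i ~ N(0, W)
    return G.T @ G                                        # sum_i g_i g_i^T


def sample_bayesian_gmm(N, K, D, alpha0=1.0, beta0=0.1, nu0=None, seed=0):
    """Ancestral sampling from the Bayesian Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    nu0 = D + 2 if nu0 is None else nu0   # illustrative default, must be an int >= D
    mu0, W0 = np.zeros(D), np.eye(D)      # illustrative fixed hyperparameters

    pi = rng.dirichlet(np.full(K, alpha0))                      # pi ~ SymDir(K, alpha0)
    Lam = np.stack([sample_wishart(rng, W0, nu0) for _ in range(K)])
    mu = np.stack([rng.multivariate_normal(mu0, np.linalg.inv(beta0 * L))
                   for L in Lam])                               # mu_k | Lambda_k
    z = rng.choice(K, size=N, p=pi)                             # component index per point
    Z = np.eye(K, dtype=int)[z]                                 # one-of-K encoding
    X = np.stack([rng.multivariate_normal(mu[k], np.linalg.inv(Lam[k]))
                  for k in z])                                  # x_n | z_n
    return X, Z, pi, mu, Lam
```

Calling `sample_bayesian_gmm(N=50, K=3, D=2)` returns the data matrix together with the latent assignments and the sampled parameters, which is convenient for checking an inference routine against known ground truth.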

The joint probability of all variables can be rewritten as

p(\mathbf {X} ,\mathbf {Z} ,\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )=p(\mathbf {X} \mid \mathbf {Z} ,\mathbf {\mu } ,\mathbf {\Lambda } )p(\mathbf {Z} \mid \mathbf {\pi } )p(\mathbf {\pi } )p(\mathbf {\mu } \mid \mathbf {\Lambda } )p(\mathbf {\Lambda } )

where the individual factors are

{\begin{aligned}p(\mathbf {X} \mid \mathbf {Z} ,\mathbf {\mu } ,\mathbf {\Lambda } )&=\prod _{n=1}^{N}\prod _{k=1}^{K}{\mathcal {N}}(\mathbf {x} _{n}\mid \mathbf {\mu } _{k},\mathbf {\Lambda } _{k}^{-1})^{z_{nk}}\\p(\mathbf {Z} \mid \mathbf {\pi } )&=\prod _{n=1}^{N}\prod _{k=1}^{K}\pi _{k}^{z_{nk}}\\p(\mathbf {\pi } )&={\frac {\Gamma (K\alpha _{0})}{\Gamma (\alpha _{0})^{K}}}\prod _{k=1}^{K}\pi _{k}^{\alpha _{0}-1}\\p(\mathbf {\mu } \mid \mathbf {\Lambda } )&=\prod _{k=1}^{K}{\mathcal {N}}(\mathbf {\mu } _{k}\mid \mathbf {\mu } _{0},(\beta _{0}\mathbf {\Lambda } _{k})^{-1})\\p(\mathbf {\Lambda } )&=\prod _{k=1}^{K}{\mathcal {W}}(\mathbf {\Lambda } _{k}\mid \mathbf {W} _{0},\nu _{0})\end{aligned}}

where

{\begin{aligned}{\mathcal {N}}(\mathbf {x} \mid \mathbf {\mu } ,\mathbf {\Sigma } )&={\frac {1}{(2\pi )^{D/2}}}{\frac {1}{|\mathbf {\Sigma } |^{1/2}}}\exp\{-{\frac {1}{2}}(\mathbf {x} -\mathbf {\mu } )^{\rm {T}}\mathbf {\Sigma } ^{-1}(\mathbf {x} -\mathbf {\mu } )\}\\{\mathcal {W}}(\mathbf {\Lambda } \mid \mathbf {W} ,\nu )&=B(\mathbf {W} ,\nu )|\mathbf {\Lambda } |^{(\nu -D-1)/2}\exp \left(-{\frac {1}{2}}\operatorname {Tr} (\mathbf {W} ^{-1}\mathbf {\Lambda } )\right)\\B(\mathbf {W} ,\nu )&=|\mathbf {W} |^{-\nu /2}(2^{\nu D/2}\pi ^{D(D-1)/4}\prod _{i=1}^{D}\Gamma ({\frac {\nu +1-i}{2}}))^{-1}\\D&={\text{dimensionality of each data point}}\end{aligned}}
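
The two densities can be transcribed almost term by term into code. The sketch below evaluates their logarithms with NumPy and the standard library; the function names are hypothetical, and working in log space is an implementation choice made here for numerical stability, not something the formulas require.

```python
import math
import numpy as np


def log_gaussian(x, mu, Sigma):
    """ln N(x | mu, Sigma) for a D-dimensional Gaussian with covariance Sigma."""
    D = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * (D * math.log(2 * math.pi) + logdet + quad)


def log_wishart_B(W, nu):
    """ln B(W, nu), the logarithm of the Wishart normalizing constant."""
    D = W.shape[0]
    _, logdet_W = np.linalg.slogdet(W)
    log_gammas = sum(math.lgamma((nu + 1 - i) / 2) for i in range(1, D + 1))
    return (-0.5 * nu * logdet_W
            - (0.5 * nu * D * math.log(2)
               + 0.25 * D * (D - 1) * math.log(math.pi)
               + log_gammas))


def log_wishart(Lam, W, nu):
    """ln W(Lam | W, nu)."""
    D = W.shape[0]
    _, logdet_L = np.linalg.slogdet(Lam)
    trace_term = np.trace(np.linalg.solve(W, Lam))    # Tr(W^{-1} Lam)
    return log_wishart_B(W, nu) + 0.5 * (nu - D - 1) * logdet_L - 0.5 * trace_term
```

For $D=1$ the Wishart density reduces to a gamma density with shape $\nu/2$ and scale $2W$, which gives a simple way to sanity-check the transcription.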

Assume that $q(\mathbf {Z} ,\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )=q(\mathbf {Z} )q(\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )$ .

Then

{\begin{aligned}\ln q^{*}(\mathbf {Z} )&=\operatorname {E} _{\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } }[\ln p(\mathbf {X} ,\mathbf {Z} ,\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )]+{\text{constant}}\\&=\operatorname {E} _{\mathbf {\pi } }[\ln p(\mathbf {Z} \mid \mathbf {\pi } )]+\operatorname {E} _{\mathbf {\mu } ,\mathbf {\Lambda } }[\ln p(\mathbf {X} \mid \mathbf {Z} ,\mathbf {\mu } ,\mathbf {\Lambda } )]+{\text{constant}}\\&=\sum _{n=1}^{N}\sum _{k=1}^{K}z_{nk}\ln \rho _{nk}+{\text{constant}}\end{aligned}}

where we have defined

\ln \rho _{nk}=\operatorname {E} [\ln \pi _{k}]+{\frac {1}{2}}\operatorname {E} [\ln |\mathbf {\Lambda } _{k}|]-{\frac {D}{2}}\ln(2\pi )-{\frac {1}{2}}\operatorname {E} _{\mathbf {\mu } _{k},\mathbf {\Lambda } _{k}}[(\mathbf {x} _{n}-\mathbf {\mu } _{k})^{\rm {T}}\mathbf {\Lambda } _{k}(\mathbf {x} _{n}-\mathbf {\mu } _{k})]

Exponentiating both sides of the formula for $\ln q^{*}(\mathbf {Z} )$ yields

q^{*}(\mathbf {Z} )\propto \prod _{n=1}^{N}\prod _{k=1}^{K}\rho _{nk}^{z_{nk}}

Requiring that this be normalized ends up requiring that the $\rho _{nk}$ sum to 1 over all values of $k$ , yielding

q^{*}(\mathbf {Z} )=\prod _{n=1}^{N}\prod _{k=1}^{K}r_{nk}^{z_{nk}}

where

r_{nk}={\frac {\rho _{nk}}{\sum _{j=1}^{K}\rho _{nj}}}

In other words, $q^{*}(\mathbf {Z} )$ is a product of single-observation multinomial distributions, and factors over each individual $\mathbf {z} _{n}$ , which is distributed as a single-observation multinomial distribution with parameters $r_{nk}$ for $k=1\dots K$ .
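
In an implementation the quantities $\ln \rho_{nk}$ are usually very negative, so the normalization is best done in log space. The following is a small sketch of that step; the function name and the max-subtraction trick are choices made here for illustration, and the result is mathematically identical to dividing $\rho_{nk}$ by its sum over $k$.

```python
import numpy as np


def normalize_responsibilities(ln_rho):
    """Turn ln rho_nk, shape (N, K), into r_nk by normalizing each row."""
    # Subtracting the row maximum leaves the ratios unchanged but avoids underflow.
    shifted = ln_rho - ln_rho.max(axis=1, keepdims=True)
    rho = np.exp(shifted)
    return rho / rho.sum(axis=1, keepdims=True)


ln_rho = np.array([[-1000.0, -1001.0],
                   [0.0, 0.0]])
r = normalize_responsibilities(ln_rho)   # each row of r sums to 1
```

A naive `np.exp(ln_rho)` on the first row would underflow to zero in both entries and produce a division by zero, which is why the shift is applied first.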

Furthermore, we note that

\operatorname {E} [z_{nk}]=r_{nk}\,

which is a standard result for categorical distributions.

Now, considering the factor $q(\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )$ , note that it automatically factors into $q(\mathbf {\pi } )\prod _{k=1}^{K}q(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})$ due to the structure of the graphical model defining our Gaussian mixture model, which is specified above.

Then,

{\begin{aligned}\ln q^{*}(\mathbf {\pi } )&=\ln p(\mathbf {\pi } )+\operatorname {E} _{\mathbf {Z} }[\ln p(\mathbf {Z} \mid \mathbf {\pi } )]+{\text{constant}}\\&=(\alpha _{0}-1)\sum _{k=1}^{K}\ln \pi _{k}+\sum _{n=1}^{N}\sum _{k=1}^{K}r_{nk}\ln \pi _{k}+{\text{constant}}\end{aligned}}

Taking the exponential of both sides, we recognize $q^{*}(\mathbf {\pi } )$ as a Dirichlet distribution

q^{*}(\mathbf {\pi } )\sim \operatorname {Dir} (\mathbf {\alpha } )\,

where

\alpha _{k}=\alpha _{0}+N_{k}\,

where

N_{k}=\sum _{n=1}^{N}r_{nk}\,
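
As a small worked example of this update, the sketch below computes the soft counts $N_k$ and the Dirichlet parameters $\alpha_k$ from a responsibility matrix. The function name and the toy responsibilities are made up for illustration; the last line uses the standard fact that the mean of a Dirichlet distribution is $\alpha_k / \sum_j \alpha_j$.

```python
import numpy as np


def update_dirichlet(r, alpha0):
    """Return (alpha, N_k) for q*(pi) = Dir(alpha), given r of shape (N, K)."""
    N_k = r.sum(axis=0)          # soft counts: N_k = sum_n r_nk
    alpha = alpha0 + N_k         # alpha_k = alpha_0 + N_k
    return alpha, N_k


# Three data points, two components; the middle point is shared equally.
r = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
alpha, N_k = update_dirichlet(r, alpha0=1.0)
posterior_mean_pi = alpha / alpha.sum()   # E[pi_k] under q*(pi)
```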

Finally

\ln q^{*}(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})=\ln p(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})+\sum _{n=1}^{N}\operatorname {E} [z_{nk}]\ln {\mathcal {N}}(\mathbf {x} _{n}\mid \mathbf {\mu } _{k},\mathbf {\Lambda } _{k}^{-1})+{\text{constant}}

Grouping and reading off terms involving $\mathbf {\mu } _{k}$ and $\mathbf {\Lambda } _{k}$ , the result is a Gaussian-Wishart distribution given by

q^{*}(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})={\mathcal {N}}(\mathbf {\mu } _{k}\mid \mathbf {m} _{k},(\beta _{k}\mathbf {\Lambda } _{k})^{-1}){\mathcal {W}}(\mathbf {\Lambda } _{k}\mid \mathbf {W} _{k},\nu _{k})

given the definitions

{\begin{aligned}\beta _{k}&=\beta _{0}+N_{k}\\\mathbf {m} _{k}&={\frac {1}{\beta _{k}}}(\beta _{0}\mathbf {\mu } _{0}+N_{k}{\bar {\mathbf {x} }}_{k})\\\mathbf {W} _{k}^{-1}&=\mathbf {W} _{0}^{-1}+N_{k}\mathbf {S} _{k}+{\frac {\beta _{0}N_{k}}{\beta _{0}+N_{k}}}({\bar {\mathbf {x} }}_{k}-\mathbf {\mu } _{0})({\bar {\mathbf {x} }}_{k}-\mathbf {\mu } _{0})^{\rm {T}}\\\nu _{k}&=\nu _{0}+N_{k}\\N_{k}&=\sum _{n=1}^{N}r_{nk}\\{\bar {\mathbf {x} }}_{k}&={\frac {1}{N_{k}}}\sum _{n=1}^{N}r_{nk}\mathbf {x} _{n}\\\mathbf {S} _{k}&={\frac {1}{N_{k}}}\sum _{n=1}^{N}r_{nk}(\mathbf {x} _{n}-{\bar {\mathbf {x} }}_{k})(\mathbf {x} _{n}-{\bar {\mathbf {x} }}_{k})^{\rm {T}}\end{aligned}}
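
These definitions vectorize naturally over the $K$ components. The following NumPy sketch computes the statistics and the Gaussian-Wishart hyperparameters in one pass; the function name and argument order are hypothetical, $\mathbf{S}_k$ is computed as the responsibility-weighted scatter matrix, and every $N_k$ is assumed to be strictly positive.

```python
import numpy as np


def update_gaussian_wishart(X, r, mu0, beta0, W0, nu0):
    """Hyperparameters of q*(mu_k, Lambda_k); X is (N, D), r is (N, K), N_k > 0."""
    N_k = r.sum(axis=0)                                     # (K,)
    x_bar = (r.T @ X) / N_k[:, None]                        # (K, D) weighted means
    diff = X[:, None, :] - x_bar[None, :, :]                # (N, K, D)
    # S_k = (1/N_k) sum_n r_nk (x_n - x_bar_k)(x_n - x_bar_k)^T
    S = np.einsum('nk,nki,nkj->kij', r, diff, diff) / N_k[:, None, None]

    beta = beta0 + N_k
    nu = nu0 + N_k
    m = (beta0 * mu0 + N_k[:, None] * x_bar) / beta[:, None]
    d0 = x_bar - mu0                                        # (K, D)
    W_inv = (np.linalg.inv(W0)
             + N_k[:, None, None] * S
             + (beta0 * N_k / (beta0 + N_k))[:, None, None]
             * np.einsum('ki,kj->kij', d0, d0))
    return beta, m, np.linalg.inv(W_inv), nu
```

With hard (0/1) responsibilities the statistics reduce to ordinary per-cluster counts, means and covariances, which makes the function easy to check by hand on a few points.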

Finally, notice that these functions require the values of $r_{nk}$ , which make use of $\rho _{nk}$ , which is defined in turn based on $\operatorname {E} [\ln \pi _{k}]$ , $\operatorname {E} [\ln |\mathbf {\Lambda } _{k}|]$ , and $\operatorname {E} _{\mathbf {\mu } _{k},\mathbf {\Lambda } _{k}}[(\mathbf {x} _{n}-\mathbf {\mu } _{k})^{\rm {T}}\mathbf {\Lambda } _{k}(\mathbf {x} _{n}-\mathbf {\mu } _{k})]$ . Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:

{\begin{aligned}\operatorname {E} _{\mathbf {\mu } _{k},\mathbf {\Lambda } _{k}}[(\mathbf {x} _{n}-\mathbf {\mu } _{k})^{\rm {T}}\mathbf {\Lambda } _{k}(\mathbf {x} _{n}-\mathbf {\mu } _{k})]&=D\beta _{k}^{-1}+\nu _{k}(\mathbf {x} _{n}-\mathbf {m} _{k})^{\rm {T}}\mathbf {W} _{k}(\mathbf {x} _{n}-\mathbf {m} _{k})\\\ln {\tilde {\Lambda }}_{k}&\equiv \operatorname {E} [\ln |\mathbf {\Lambda } _{k}|]=\sum _{i=1}^{D}\psi \left({\frac {\nu _{k}+1-i}{2}}\right)+D\ln 2+\ln |\mathbf {W} _{k}|\\\ln {\tilde {\pi }}_{k}&\equiv \operatorname {E} \left[\ln \pi _{k}\right]=\psi (\alpha _{k})-\psi \left(\sum _{i=1}^{K}\alpha _{i}\right)\end{aligned}}

These results lead to

r_{nk}\propto {\tilde {\pi }}_{k}{\tilde {\Lambda }}_{k}^{1/2}\exp \left\{-{\frac {D}{2\beta _{k}}}-{\frac {\nu _{k}}{2}}(\mathbf {x} _{n}-\mathbf {m} _{k})^{\rm {T}}\mathbf {W} _{k}(\mathbf {x} _{n}-\mathbf {m} _{k})\right\}

These can be converted from proportional to absolute values by normalizing over $k$ so that the corresponding values sum to 1.
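
Putting the three expectations and the normalization together gives the complete update for $r_{nk}$. The sketch below is one possible implementation, assuming SciPy is available for the digamma function $\psi$; the function name and array layout are assumptions made here, and the computation is carried out on $\ln r_{nk}$ to avoid underflow in the exponential.

```python
import numpy as np
from scipy.special import digamma


def e_step(X, alpha, beta, m, W, nu):
    """Responsibilities r_nk. X: (N, D); alpha, beta, nu: (K,); m: (K, D); W: (K, D, D)."""
    N, D = X.shape
    ln_pi = digamma(alpha) - digamma(alpha.sum())             # E[ln pi_k]
    i = np.arange(1, D + 1)
    ln_lam = (digamma((nu[:, None] + 1 - i) / 2).sum(axis=1)
              + D * np.log(2) + np.linalg.slogdet(W)[1])      # E[ln |Lambda_k|]
    diff = X[:, None, :] - m[None, :, :]                      # (N, K, D)
    maha = np.einsum('nki,kij,nkj->nk', diff, W, diff)        # (x-m)^T W (x-m)
    # ln r_nk up to an additive constant that is shared across k
    ln_rho = ln_pi + 0.5 * ln_lam - D / (2 * beta) - 0.5 * nu * maha
    ln_rho -= ln_rho.max(axis=1, keepdims=True)
    r = np.exp(ln_rho)
    return r / r.sum(axis=1, keepdims=True)
```

The constant $-\tfrac{D}{2}\ln(2\pi)$ is omitted because it is the same for every $k$ and cancels in the normalization.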

Note that:

The update equations for the parameters $\beta _{k}$ , $\mathbf {m} _{k}$ , $\mathbf {W} _{k}$ and $\nu _{k}$ of the variables $\mathbf {\mu } _{k}$ and $\mathbf {\Lambda } _{k}$ depend on the statistics $N_{k}$ , ${\bar {\mathbf {x} }}_{k}$ , and $\mathbf {S} _{k}$ , and these statistics in turn depend on $r_{nk}$ .
The update equations for the parameters $\alpha _{1\dots K}$ of the variable $\mathbf {\pi }$ depend on the statistic $N_{k}$ , which depends in turn on $r_{nk}$ .
The update equation for $r_{nk}$ has a direct circular dependence on $\beta _{k}$ , $\mathbf {m} _{k}$ , $\mathbf {W} _{k}$ and $\nu _{k}$ as well as an indirect circular dependence on $\mathbf {W} _{k}$ , $\nu _{k}$ and $\alpha _{1\dots K}$ through ${\tilde {\pi }}_{k}$ and ${\tilde {\Lambda }}_{k}$ .

This suggests an iterative procedure that alternates between two steps:

An E-step that computes the value of $r_{nk}$ using the current values of all the other parameters.
An M-step that uses the new value of $r_{nk}$ to compute new values of all the other parameters.
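
The alternation between the two steps can be sketched as a short loop. The code below is a compact illustration rather than a reference implementation: the function name `vb_gmm`, the random initialization of $r_{nk}$, the choices $\mathbf{\mu}_0 = $ data mean, $\mathbf{W}_0 = \mathbf{I}$ and $\nu_0 = D$, the small constant added to $N_k$, and the stopping rule on the change in $r_{nk}$ are all assumptions made here. It starts from initial responsibilities, so each iteration performs the M-step first and the E-step second.

```python
import numpy as np
from scipy.special import digamma


def vb_gmm(X, K, alpha0=1.0, beta0=1.0, nu0=None, n_iter=100, tol=1e-6, seed=0):
    """Alternate M-step and E-step updates for the Bayesian Gaussian mixture."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    nu0 = float(D) if nu0 is None else nu0          # illustrative prior settings
    mu0, W0_inv = X.mean(axis=0), np.eye(D)
    i = np.arange(1, D + 1)
    r = rng.dirichlet(np.ones(K), size=N)           # random initial responsibilities

    for _ in range(n_iter):
        # M-step: statistics and hyperparameters from the current r_nk
        N_k = r.sum(axis=0) + 1e-10                 # guard against empty components
        x_bar = (r.T @ X) / N_k[:, None]
        diff = X[:, None, :] - x_bar[None, :, :]
        S = np.einsum('nk,nki,nkj->kij', r, diff, diff) / N_k[:, None, None]
        alpha, beta, nu = alpha0 + N_k, beta0 + N_k, nu0 + N_k
        m = (beta0 * mu0 + N_k[:, None] * x_bar) / beta[:, None]
        d0 = x_bar - mu0
        W = np.linalg.inv(W0_inv + N_k[:, None, None] * S
                          + (beta0 * N_k / beta)[:, None, None]
                          * np.einsum('ki,kj->kij', d0, d0))

        # E-step: new r_nk from the current hyperparameters
        ln_pi = digamma(alpha) - digamma(alpha.sum())
        ln_lam = (digamma((nu[:, None] + 1 - i) / 2).sum(axis=1)
                  + D * np.log(2) + np.linalg.slogdet(W)[1])
        diff = X[:, None, :] - m[None, :, :]
        maha = np.einsum('nki,kij,nkj->nk', diff, W, diff)
        ln_rho = ln_pi + 0.5 * ln_lam - D / (2 * beta) - 0.5 * nu * maha
        ln_rho -= ln_rho.max(axis=1, keepdims=True)
        r_new = np.exp(ln_rho)
        r_new /= r_new.sum(axis=1, keepdims=True)

        converged = np.abs(r_new - r).max() < tol
        r = r_new
        if converged:
            break
    return r, alpha, beta, m, W, nu
```

Like EM, this procedure is only guaranteed to reach a local optimum, so in practice it is often run from several random initializations.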

Note that these steps correspond closely with the standard EM algorithm to derive a maximum likelihood or maximum a posteriori (MAP) solution for the parameters of a Gaussian mixture model. The responsibilities $r_{nk}$ in the E step correspond closely to the posterior probabilities of the latent variables given the data, i.e. $p(\mathbf {Z} \mid \mathbf {X} )$ ; the computation of the statistics $N_{k}$ , ${\bar {\mathbf {x} }}_{k}$ , and $\mathbf {S} _{k}$ corresponds closely to the computation of corresponding "soft-count" statistics over the data; and the use of those statistics to compute new values of the parameters corresponds closely to the use of soft counts to compute new parameter values in normal EM over a Gaussian mixture model.

Exponential-family distributions

Note that in the previous example, once the distribution over unobserved variables was assumed to factorize into distributions over the "parameters" and distributions over the "latent data", the derived "best" distribution for each variable was in the same family as the corresponding prior distribution over the variable. This is a general result that holds true for all prior distributions derived from the exponential family.

Notes


References


External links

Variational-Bayes Repository A repository of papers, software, and links related to the use of variational methods for approximate Bayesian learning
The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay provides an introduction to variational methods (p. 422).
Variational Algorithms for Approximate Bayesian Inference, by M. J. Beal includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs.
A Tutorial on Variational Bayes. Fox, C. and Roberts, S. 2012. Artificial Intelligence Review.
High-Level Explanation of Variational Inference by Jason Eisner may be worth reading before a more mathematically detailed treatment.

↑ Christopher Bishop, Pattern Recognition and Machine Learning, 2006
↑ Based on Chapter 10 of Pattern Recognition and Machine Learning by Christopher M. Bishop




@@ Line 1: / Line 1: @@
+{{For|the method of approximation in quantum mechanics|Variational method (quantum mechanics)}}
+'''Variational Bayesian''' methods are a family of techniques for approximating intractable [[integral]]s arising in [[Bayesian inference]] and [[machine learning]].  They are typically used in complex [[statistical model]]s consisting of observed variables (usually termed "data") as well as unknown [[parameter]]s and [[latent variable]]s, with various sorts of relationships among the three types of [[random variable]]s, as might be described by a [[graphical model]].  As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
+#To provide an analytical approximation to the [[posterior probability]] of the unobserved variables, in order to do [[statistical inference]] over these variables.
+#To derive a [[lower bound]] for the [[marginal likelihood]] (sometimes called the "evidence") of the observed data (i.e. the [[marginal probability]] of the data given the model, with marginalization performed over unobserved variables).  This is typically used for performing [[model selection]], the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the [[Bayes factor]] article.)
+In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to [[Monte Carlo sampling]] methods — particularly, [[Markov chain Monte Carlo]] methods such as [[Gibbs sampling]] — for taking a fully Bayesian approach to [[statistical inference]] over complex [[probability distribution|distributions]] that are difficult to directly evaluate or [[sample (statistics)|sample]] from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, Variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.
+Variational Bayes can be seen as an extension of the EM ([[expectation-maximization]]) algorithm from [[maximum a posteriori estimation]] (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire [[posterior distribution]] of the parameters and latent variables.  As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.
+For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed.  However, deriving the set of equations used to iteratively update the parameters often requires a large amount of work compared with deriving the comparable Gibbs sampling equations.  This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
+==Mathematical derivation of the mean-field approximation==
+In [[Calculus of variations|variational]] inference, the posterior distribution over a set of unobserved variables <math>\mathbf{Z} = \{Z_1 \dots Z_n\}</math> given some data <math>\mathbf{X}</math> is approximated
+by a variational distribution, <math>Q(\mathbf{Z})</math>:
+:<math>P(\mathbf{Z}\mid \mathbf{X}) \approx Q(\mathbf{Z}).</math>
+The distribution <math>Q(\mathbf{Z})</math> is restricted to belong to a family of distributions of simpler
+form than <math>P(\mathbf{Z}\mid \mathbf{X})</math>, selected with the intention of making <math>Q(\mathbf{Z})</math> similar to the true posterior, <math>P(\mathbf{Z}\mid \mathbf{X})</math>. The lack of similarity is measured in terms of
+a dissimilarity function <math>d(Q; P)</math> and hence inference is performed by selecting the distribution
+<math>Q(\mathbf{Z})</math> that minimizes <math>d(Q; P)</math>.
+The most common type of variational Bayes, known as ''mean-field variational Bayes'', uses the [[Kullback–Leibler divergence]] (KL-divergence) of ''P'' from ''Q'' as the choice of dissimilarity function.  This choice makes this minimization tractable.  The KL-divergence is defined as
-   <li>[http://yaobuy.com.cn/news/html/?217597.html http://yaobuy.com.cn/news/html/?217597.html]</li>
-   <li>[http://www.bv58.cn/news/html/?11439.html http://www.bv58.cn/news/html/?11439.html]</li>
-   <li>[http://51mb.cn/news/html/?37394.html http://51mb.cn/news/html/?37394.html]</li>
-   <li>[http://www.chantal.cn/bbs/forum.php?mod=viewthread&tid=54843&fromuid=16974 http://www.chantal.cn/bbs/forum.php?mod=viewthread&tid=54843&fromuid=16974]</li>
-   <li>[http://www.hlyjq.cn/forum.php?mod=viewthread&tid=2696236 http://www.hlyjq.cn/forum.php?mod=viewthread&tid=2696236]</li>
- </ul>
-== Ralph Lauren Polo Kopen  goed opgeleide feministische reacti ==
+:<math>D_{\mathrm{KL}}(Q || P) = \sum_\mathbf{Z}  Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}\mid \mathbf{X})}.</math>
+Note that ''Q'' and ''P'' are reversed from what one might expect.  This use of reversed KL-divergence is conceptually similar to the [[expectation-maximization algorithm]]. (Using the KL-divergence in the other way produces the [[expectation propagation]] algorithm.)
+The KL-divergence can be written as
+:<math>D_{\mathrm{KL}}(Q || P) = \sum_\mathbf{Z}  Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})} + \log P(\mathbf{X}),</math>
+or
+: <math>
+\begin{align}
+\log P(\mathbf{X}) & = D_{\mathrm{KL}}(Q||P) - \sum_\mathbf{Z} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})} \\
+& = D_{\mathrm{KL}}(Q||P) + \mathcal{L}(Q).
+\end{align}
+</math>
+As the ''log [[model evidence|evidence]]'' <math>\log P(\mathbf{X})</math> is fixed with respect to <math>Q</math>, maximizing the final term <math>\mathcal{L}(Q)</math> minimizes the KL divergence of <math>P</math> from <math>Q</math>.  By appropriate choice of <math>Q</math>, <math>\mathcal{L}(Q)</math> becomes tractable to compute and to maximize. Hence we have both an analytical approximation <math>Q</math> for the posterior <math>P(\mathbf{Z}\mid \mathbf{X})</math> and, because the KL divergence is never negative, a lower bound <math>\mathcal{L}(Q)</math> for the log evidence <math>\log P(\mathbf{X})</math>. The lower bound <math>\mathcal{L}(Q)</math> is known as the (negative) ''variational [[Thermodynamic free energy|free energy]]'' because it can also be expressed as an "energy" <math>\operatorname{E}_{Q}[\log P(\mathbf{Z},\mathbf{X})]</math> plus the entropy of <math>Q</math>.
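The decomposition of the log evidence into a KL divergence plus the lower bound can be checked numerically on a toy problem. The following is a minimal sketch, assuming a hypothetical discrete latent variable with three states and made-up joint probabilities for one fixed observation; none of these numbers come from the article.

```python
import math

# Hypothetical values of P(Z=z, X=x) for three latent states and one fixed
# observation x (illustrative numbers only).
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # P(X=x)
posterior = [p / evidence for p in joint]    # P(Z | X=x)

# An arbitrary variational distribution Q(Z); any strictly positive
# distribution over the three states would do.
q = [0.2, 0.5, 0.3]

# D_KL(Q || P) = sum_Z Q(Z) log( Q(Z) / P(Z | X) )
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

# L(Q) = - sum_Z Q(Z) log( Q(Z) / P(Z, X) )
lower_bound = -sum(qz * math.log(qz / jz) for qz, jz in zip(q, joint))

# The same bound written as "energy" plus entropy of Q.
energy = sum(qz * math.log(jz) for qz, jz in zip(q, joint))
entropy = -sum(qz * math.log(qz) for qz in q)

log_evidence = math.log(evidence)
```

Here `kl + lower_bound` agrees with `log_evidence` up to rounding, and `lower_bound` equals `energy + entropy`, so the bound is tight only when `q` coincides with `posterior`.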
+==In practice==
+The variational distribution <math>Q(\mathbf{Z})</math> is usually assumed to factorize over some [[partition of a set|partition]] of the latent variables, i.e. for some partition of the latent variables <math>\mathbf{Z}</math> into <math>\mathbf{Z}_1 \dots \mathbf{Z}_M</math>,
+:<math>Q(\mathbf{Z}) = \prod_{i=1}^M q_i(\mathbf{Z}_i\mid \mathbf{X})</math>
+It can be shown using the [[calculus of variations]] (hence the name "variational Bayes") that the "best" distribution <math>q_j^{*}</math> for each of the factors <math>q_j</math> (in terms of the distribution minimizing the KL divergence, as described above) can be expressed as:
+:<math>q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \frac{e^{\operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})]}}{\int e^{\operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})]}\, d\mathbf{Z}_j}</math>
+where <math>\operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})]</math> is the [[expected value|expectation]] of the logarithm of the [[joint probability]] of the data and latent variables, taken over all variables not in the partition.
+In practice, we usually work in terms of logarithms, i.e.:
+:<math>\ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}</math>
+The constant in the above expression is related to the [[normalizing constant]] (the denominator in the expression above for <math>q_j^{*}</math>) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g. [[Gaussian distribution|Gaussian]], [[gamma distribution|gamma]], etc.).
+Using the properties of expectations, the expression <math>\operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})]</math> can usually be simplified into a function of the fixed [[hyperparameter]]s of the [[prior distribution]]s over the latent variables and of expectations (and sometimes higher [[moment (mathematics)|moment]]s such as the [[variance]]) of latent variables not in the current partition (i.e. latent variables not included in <math>\mathbf{Z}_j</math>). This creates circular dependencies between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions.  This naturally suggests an [[iterative]] algorithm, much like EM (the [[expectation-maximization]] algorithm), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed to [[limit of a sequence|converge]].<ref>{{cite book|title=Convex Optimization|first1=Stephen P.|last1=Boyd|first2=Lieven|last2=Vandenberghe|year=2004|publisher=Cambridge University Press|isbn=978-0-521-83378-3|url=http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf|format=pdf|accessdate=October 15, 2011}}</ref> Furthermore, if the distributions in question are part of the [[exponential family]], which is usually the case, convergence will be to a [[global maximum]], since the exponential family is [[convex function|convex]].<ref>Christopher Bishop, ''Pattern Recognition and Machine Learning'', 2006</ref>
+In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's functional dependency on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant).  The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions.  Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the [[mean]]s); sometimes expectations of squared variables (which can be related to the [[variance]] of the variables), or expectations of higher powers (i.e. higher [[moment (mathematics)|moment]]s) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, [[nonlinear]] dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
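The update rule for each factor can be illustrated on a model small enough to enumerate. The following is a minimal sketch, assuming two hypothetical binary latent variables whose joint probabilities with a fixed observation are given by a made-up 2×2 table; it alternates the two updates and records the lower bound after each sweep.

```python
import math

# Hypothetical joint p(z1, z2, X=x) for two binary latent variables
# (rows index z1, columns index z2). Illustrative numbers only.
joint = [[0.30, 0.05],
         [0.10, 0.15]]


def normalize(weights):
    """Rescale positive weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def lower_bound(q1, q2):
    """L(Q) = sum_z Q(z) (ln p(z, x) - ln Q(z)) with Q(z1, z2) = q1(z1) q2(z2)."""
    return sum(
        q1[i] * q2[j] * (math.log(joint[i][j]) - math.log(q1[i] * q2[j]))
        for i in range(2)
        for j in range(2)
    )


q1, q2 = [0.5, 0.5], [0.9, 0.1]    # arbitrary initialization
history = [lower_bound(q1, q2)]
for _ in range(50):
    # ln q1*(z1) = E_{q2}[ln p(z1, z2, x)] + constant
    q1 = normalize([
        math.exp(sum(q2[j] * math.log(joint[i][j]) for j in range(2)))
        for i in range(2)
    ])
    # ln q2*(z2) = E_{q1}[ln p(z1, z2, x)] + constant
    q2 = normalize([
        math.exp(sum(q1[i] * math.log(joint[i][j]) for i in range(2)))
        for j in range(2)
    ])
    history.append(lower_bound(q1, q2))

log_evidence = math.log(sum(sum(row) for row in joint))
```

Normalizing the exponentiated expectation plays the role of reinstating the constant. The values in `history` never decrease and stay below `log_evidence`.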
+==A basic example==
+Consider a simple non-hierarchical Bayesian model consisting of a set of [[independent identically distributed|i.i.d.]] observations from a [[Gaussian distribution]], with unknown [[mean]] and [[variance]].<ref>Based on Chapter 10 of ''Pattern Recognition and Machine Learning'' by [[Christopher M. Bishop]]</ref> In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.
+For mathematical convenience, in the following example we work in terms of the [[precision (statistics)|precision]] — i.e. the reciprocal of the variance — rather than the variance itself.  (From a theoretical standpoint, precision and variance are equivalent since there is a [[one-to-one correspondence]] between the two.)
+===The mathematical model===
+We place [[conjugate prior]] distributions on the unknown mean and precision, i.e. the mean follows a Gaussian distribution (conditional on the precision) while the precision follows a [[gamma distribution]].  In other words:
+:<math>
+\begin{align}
+\mu & \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1}) \\
+\tau & \sim \operatorname{Gamma}(a_0, b_0) \\
+\{x_1, \dots, x_N\} & \sim \mathcal{N}(\mu, \tau^{-1}) \\
+N &= \text{number of data points}
+\end{align}
+</math>
+We are given <math>N</math> data points <math>\mathbf{X} = \{x_1, \dots, x_N\}</math> and our goal is to infer the [[posterior distribution]] <math>q(\mu, \tau)=p(\mu,\tau\mid x_1, \ldots, x_N)</math> of the parameters <math>\mu</math> and <math>\tau</math>.
+The [[hyperparameter]]s <math>\mu_0</math>, <math>\lambda_0</math>, <math>a_0</math> and <math>b_0</math> are fixed, given values.  They can be set to small positive numbers to give broad prior distributions indicating ignorance about the prior distributions of <math>\mu</math> and <math>\tau</math>.
+===The joint probability===
+The [[joint probability]] of all variables can be rewritten as
+:<math>p(\mathbf{X},\mu,\tau) = p(\mathbf{X}\mid \mu,\tau) p(\mu\mid \tau) p(\tau)</math>
+where the individual factors are
+:<math>
+\begin{align}
+p(\mathbf{X}\mid \mu,\tau) & = \prod_{n=1}^N \mathcal{N}(x_n\mid \mu,\tau^{-1}) \\
+p(\mu\mid \tau) & = \mathcal{N}(\mu\mid \mu_0, (\lambda_0 \tau)^{-1}) \\
+p(\tau) & = \operatorname{Gamma}(\tau\mid a_0, b_0)
+\end{align}
+</math>
+where
+:<math>
+\begin{align}
+\mathcal{N}(x\mid \mu,\sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}} \\
+\operatorname{Gamma}(\tau\mid a,b) & = \frac{1}{\Gamma(a)} b^a \tau^{a-1} e^{-b \tau}
+\end{align}
+</math>
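The factorization of the joint probability translates directly into a sum of log densities. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the gamma density uses the shape and rate parametrization given above.

```python
import math


def log_normal(x, mean, variance):
    """Log of the Gaussian density N(x | mean, variance)."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def log_gamma_density(tau, a, b):
    """Log of the gamma density Gamma(tau | a, b) with shape a and rate b."""
    return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(tau) - b * tau


def log_joint(xs, mu, tau, mu0, lambda0, a0, b0):
    """ln p(X, mu, tau) = ln p(X | mu, tau) + ln p(mu | tau) + ln p(tau)."""
    log_likelihood = sum(log_normal(x, mu, 1.0 / tau) for x in xs)
    log_prior_mu = log_normal(mu, mu0, 1.0 / (lambda0 * tau))
    log_prior_tau = log_gamma_density(tau, a0, b0)
    return log_likelihood + log_prior_mu + log_prior_tau
```

Working with logarithms avoids underflow when the number of data points is large, and it matches the form in which the variational updates are derived.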
+===Factorized approximation===
+Assume that <math>q(\mu,\tau) = q(\mu)q(\tau)</math>, i.e. that the posterior distribution factorizes into independent factors for <math>\mu</math> and <math>\tau</math>.  This type of assumption underlies the variational Bayesian method.  The true posterior distribution does not in fact factor this way (in fact, in this simple case, it is known to be a [[Gaussian-gamma distribution]]), and hence the result we obtain will be an approximation.
+===Derivation of q(μ)===
+Then
+:<math>
+\begin{align}
+\ln q_\mu^*(\mu) &= \operatorname{E}_{\tau}\left[\ln p(\mathbf{X}\mid \mu,\tau) + \ln p(\mu\mid \tau) + \ln p(\tau)\right] + C \\
+ &= \operatorname{E}_{\tau}\left[\ln p(\mathbf{X}\mid \mu,\tau)\right] + \operatorname{E}_{\tau}\left[\ln p(\mu\mid \tau)\right] + \operatorname{E}_{\tau}\left[\ln p(\tau)\right] + C \\
+ &= \operatorname{E}_{\tau}\left[\ln \prod_{n=1}^N \mathcal{N}(x_n\mid \mu,\tau^{-1})\right] + \operatorname{E}_{\tau}\left[\ln \mathcal{N}(\mu\mid \mu_0, (\lambda_0 \tau)^{-1})\right] + C_2 \\
+ &= \operatorname{E}_{\tau}\left[\ln \prod_{n=1}^N \sqrt{\frac{\tau}{2\pi}} e^{-\frac{(x_n-\mu)^2\tau}{2}}\right] + \operatorname{E}_{\tau}\left[\ln \sqrt{\frac{\lambda_0 \tau}{2\pi}} e^{-\frac{(\mu-\mu_0)^2\lambda_0 \tau}{2}}\right] + C_2 \\
+ &= \operatorname{E}_{\tau}\left[\sum_{n=1}^N \left(\frac{1}{2}(\ln\tau - \ln 2\pi) - \frac{(x_n-\mu)^2\tau}{2}\right)\right] + \operatorname{E}_{\tau}\left[\frac{1}{2}(\ln \lambda_0 + \ln \tau - \ln 2\pi) - \frac{(\mu-\mu_0)^2\lambda_0 \tau}{2}\right] + C_2 \\
+ &= \operatorname{E}_{\tau}\left[\sum_{n=1}^N -\frac{(x_n-\mu)^2\tau}{2}\right] + \operatorname{E}_{\tau}\left[-\frac{(\mu-\mu_0)^2\lambda_0 \tau}{2}\right] + \operatorname{E}_{\tau}\left[\sum_{n=1}^N \frac{1}{2}(\ln\tau - \ln 2\pi)\right] + \operatorname{E}_{\tau}\left[\frac{1}{2}(\ln \lambda_0 + \ln \tau - \ln 2\pi)\right] + C_2 \\
+ &= \operatorname{E}_{\tau}\left[\sum_{n=1}^N -\frac{(x_n-\mu)^2\tau}{2}\right] + \operatorname{E}_{\tau}\left[-\frac{(\mu-\mu_0)^2\lambda_0 \tau}{2}\right] + C_3 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \right\} + C_3
+\end{align}
+</math>
+In the above derivation, <math>C</math>, <math>C_2</math> and <math>C_3</math> refer to values that are constant with respect to <math>\mu</math>.  Note that the term <math>\operatorname{E}_{\tau}[\ln p(\tau)]</math> is not a function of <math>\mu</math> and will have the same value regardless of the value of <math>\mu</math>.  Hence in line 3 we can absorb it into the constant term at the end.  We do the same thing in line 7.
+The last line is simply a quadratic polynomial in <math>\mu</math>.  Since this is the logarithm of <math>q_\mu^*(\mu)</math>, we can see that <math>q_\mu^*(\mu)</math> itself is a [[Gaussian distribution]].
+With a certain amount of tedious math (expanding the squares inside of the braces, separating out and grouping the terms involving <math>\mu</math> and <math>\mu^2</math> and [[completing the square]] over <math>\mu</math>), we can derive the parameters of the Gaussian distribution:
+:<math>
+\begin{align}
+\ln q_\mu^*(\mu) &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \{ \sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \} + C_3 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \{ \sum_{n=1}^N (x_n^2-2x_n\mu + \mu^2) + \lambda_0(\mu^2-2\mu_0\mu + \mu_0^2) \} + C_3 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \{ (\sum_{n=1}^N x_n^2)-2(\sum_{n=1}^N x_n)\mu + (\sum_{n=1}^N \mu^2) + \lambda_0\mu^2-2\lambda_0\mu_0\mu + \lambda_0\mu_0^2 \} + C_3 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \{ (\lambda_0+N)\mu^2 -2(\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n)\mu + (\textstyle\sum_{n=1}^N x_n^2) + \lambda_0\mu_0^2 \} + C_3 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \{ (\lambda_0+N)\mu^2 -2(\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n)\mu \} + C_4 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\mu^2 -2\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N}(\lambda_0+N) \mu \right\} + C_4 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu^2 -2\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N} \mu\right) \right\} + C_4 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu^2 -2\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N} \mu + \left(\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 - \left(\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N}\right)^2\right) \right\} + C_4 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu^2 -2\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N} \mu + \left(\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 \right) \right\} + C_5 \\
+                    &= - \frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu-\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 \right\} + C_5 \\
+                    &= - \frac{1}{2} \left\{ (\lambda_0+N)\operatorname{E}_{\tau}[\tau] \left(\mu-\frac{\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 \right\} + C_5
+\end{align}
+</math>
+Note that all of the above steps can be shortened by using the formula for the [[Normal distribution#The sum of two quadratics|sum of two quadratics]].
+In other words:
+:<math>
+\begin{align}
+q_\mu^*(\mu) &\sim \mathcal{N}(\mu\mid \mu_N,\lambda_N^{-1}) \\
+\mu_N &= \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N} \\
+\lambda_N &= (\lambda_0 + N) \operatorname{E}[\tau] \\
+\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n
+\end{align}
+</math>
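The result of completing the square can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `q_mu_params` that returns <math>\mu_N</math> and <math>\lambda_N</math> for a given value of <math>\operatorname{E}[\tau]</math>, together with the unnormalized log density from the first line of the derivation.

```python
def q_mu_params(xs, mu0, lambda0, expected_tau):
    """Return (mu_N, lambda_N) of the Gaussian factor q*(mu)."""
    n = len(xs)
    x_bar = sum(xs) / n
    mu_n = (lambda0 * mu0 + n * x_bar) / (lambda0 + n)
    lambda_n = (lambda0 + n) * expected_tau
    return mu_n, lambda_n


def log_q_mu_unnormalized(mu, xs, mu0, lambda0, expected_tau):
    """-E[tau]/2 * { sum_n (x_n - mu)^2 + lambda0 (mu - mu0)^2 }, dropping constants."""
    quadratic = sum((x - mu) ** 2 for x in xs) + lambda0 * (mu - mu0) ** 2
    return -0.5 * expected_tau * quadratic


def log_gaussian_kernel(mu, mu_n, lambda_n):
    """-lambda_N/2 * (mu - mu_N)^2, the completed-square form."""
    return -0.5 * lambda_n * (mu - mu_n) ** 2
```

For any data set, the difference between `log_q_mu_unnormalized` and `log_gaussian_kernel` should not depend on `mu`, which is exactly the statement that the two expressions differ only by a constant.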
+===Derivation of q(τ)===
+The derivation of <math>q_\tau^*(\tau)</math> is similar to above, although we omit some of the details for the sake of brevity.
+:<math>
+\begin{align}
+\ln q_\tau^*(\tau) &= \operatorname{E}_{\mu}[\ln p(\mathbf{X}\mid \mu,\tau) + \ln p(\mu\mid \tau)] + \ln p(\tau) + \text{constant} \\
+                    &= (a_0 - 1) \ln \tau - b_0 \tau + \frac{1}{2} \ln \tau + \frac{N}{2} \ln \tau - \frac{\tau}{2} \operatorname{E}_\mu [\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2] + \text{constant}
+\end{align}
+</math>
+Exponentiating both sides, we can see that <math>q_\tau^*(\tau)</math> is a [[gamma distribution]].  Specifically:
+:<math>
+\begin{align}
+q_\tau^*(\tau) &\sim \operatorname{Gamma}(\tau\mid a_N, b_N) \\
+a_N &= a_0 + \frac{N+1}{2} \\
+b_N &= b_0 + \frac{1}{2} \operatorname{E}_\mu \left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2\right]
+\end{align}
+</math>
+===Algorithm for computing the parameters===
+Let us recap the conclusions from the previous sections:
+:<math>
+\begin{align}
+q_\mu^*(\mu) &\sim \mathcal{N}(\mu\mid\mu_N,\lambda_N^{-1}) \\
+\mu_N &= \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N} \\
+\lambda_N &= (\lambda_0 + N) \operatorname{E}[\tau] \\
+\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n
+\end{align}
+</math>
+and
+:<math>
+\begin{align}
+q_\tau^*(\tau) &\sim \operatorname{Gamma}(\tau\mid a_N, b_N) \\
+a_N &= a_0 + \frac{N+1}{2} \\
+b_N &= b_0 + \frac{1}{2} \operatorname{E}_\mu \left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2\right]
+\end{align}
+</math>
+In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable.  We can expand the expectations, using the standard formulas for the expectations of moments of the Gaussian and gamma distributions:
+:<math>
+\begin{align}
+\operatorname{E}[\tau\mid a_N, b_N] &= \frac{a_N}{b_N} \\
+\operatorname{E}[\mu\mid\mu_N,\lambda_N^{-1}] &= \mu_N \\
+\operatorname{E}\left[X^2 \right] &= \operatorname{Var}(X) + (\operatorname{E}[X])^2 \\
+\operatorname{E}[\mu^2\mid\mu_N,\lambda_N^{-1}] &= \lambda_N^{-1} + \mu_N^2
+\end{align}
+</math>
+Applying these formulas to the above equations is trivial in most cases, but the equation for <math>b_N</math> takes more work:
+:<math>
+\begin{align}
+b_N &= b_0 + \frac{1}{2} \operatorname{E}_\mu \left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2\right] \\
+ &= b_0 + \frac{1}{2} \operatorname{E}_\mu \left[ (\lambda_0+N)\mu^2 -2(\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n)\mu + (\textstyle\sum_{n=1}^N x_n^2) + \lambda_0\mu_0^2 \right] \\
+ &= b_0 + \frac{1}{2} \left[ (\lambda_0+N)\operatorname{E}_\mu[\mu^2] -2(\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n)\operatorname{E}_\mu[\mu] + (\textstyle\sum_{n=1}^N x_n^2) + \lambda_0\mu_0^2 \right] \\
+ &= b_0 + \frac{1}{2} \left[ (\lambda_0+N)(\lambda_N^{-1} + \mu_N^2) -2(\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n)\mu_N + (\textstyle\sum_{n=1}^N x_n^2) + \lambda_0\mu_0^2 \right]
+\end{align}
+</math>
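The expanded expression for <math>b_N</math> can be cross-checked against a term-by-term evaluation of the same expectation, using <math>\operatorname{E}[(c-\mu)^2] = (c-\mu_N)^2 + \lambda_N^{-1}</math> for any constant <math>c</math>. The following is a minimal sketch with hypothetical function names.

```python
def b_n_expanded(xs, mu0, lambda0, b0, mu_n, lambda_n):
    """b_N from the expanded polynomial in E[mu] and E[mu^2]."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    e_mu = mu_n                          # E[mu]
    e_mu2 = 1.0 / lambda_n + mu_n ** 2   # E[mu^2] = Var(mu) + E[mu]^2
    return b0 + 0.5 * (
        (lambda0 + n) * e_mu2
        - 2 * (lambda0 * mu0 + sum_x) * e_mu
        + sum_x2
        + lambda0 * mu0 ** 2
    )


def b_n_termwise(xs, mu0, lambda0, b0, mu_n, lambda_n):
    """b_N from E[(c - mu)^2] = (c - mu_N)^2 + 1/lambda_N applied to each square."""
    variance = 1.0 / lambda_n
    data_term = sum((x - mu_n) ** 2 + variance for x in xs)
    prior_term = lambda0 * ((mu0 - mu_n) ** 2 + variance)
    return b0 + 0.5 * (data_term + prior_term)
```

Both functions should return the same value for any inputs; the term-by-term form is often easier to read, while the expanded form reuses the sufficient statistics <math>\textstyle\sum x_n</math> and <math>\textstyle\sum x_n^2</math>.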
+We can then write the parameter equations as follows, without any expectations:
+:<math>
+\begin{align}
+\mu_N &= \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N} \\
+\lambda_N &= (\lambda_0 + N) \frac{a_N}{b_N} \\
+\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n \\
+a_N &= a_0 + \frac{N+1}{2} \\
+b_N &= b_0 + \frac{1}{2} \left[ (\lambda_0+N)(\lambda_N^{-1} + \mu_N^2) -2(\lambda_0\mu_0 + \textstyle\sum_{n=1}^N x_n)\mu_N + (\textstyle\sum_{n=1}^N x_n^2) + \lambda_0\mu_0^2 \right]
+\end{align}
+</math>
+Note that there are circular dependencies among the formulas for <math>\mu_N</math>, <math>\lambda_N</math> and <math>b_N</math>.  This naturally suggests an [[expectation-maximization algorithm|EM]]-like algorithm:
+#Compute <math>\sum_{n=1}^N x_n</math> and <math>\sum_{n=1}^N x_n^2.</math> Use these values to compute <math>\mu_N</math> and <math>a_N.</math>
+#Initialize <math>\lambda_N</math> to some arbitrary value.
+#Use the current value of <math>\lambda_N,</math> along with the known values of the other parameters, to compute <math>b_N</math>.
+#Use the current value of <math>b_N,</math> along with the known values of the other parameters, to compute <math>\lambda_N</math>.
+#Repeat the last two steps until convergence (i.e. until neither value has changed more than some small amount).
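The steps above can be sketched in a few lines of code. This is a minimal illustration rather than a reference implementation: the function name `fit_gaussian_vb`, the starting value of 1.0 for <math>\lambda_N</math>, the tolerance and the iteration cap are all assumptions made for the example.

```python
def fit_gaussian_vb(xs, mu0, lambda0, a0, b0, tol=1e-10, max_iter=1000):
    """Iterate the update equations for (mu_N, lambda_N, a_N, b_N)."""
    n = len(xs)
    # Step 1: sufficient statistics, then the parameters that need no iteration.
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    mu_n = (lambda0 * mu0 + sum_x) / (lambda0 + n)
    a_n = a0 + (n + 1) / 2

    # Step 2: arbitrary starting value for lambda_N (assumed here to be 1.0).
    lambda_n = 1.0
    b_n = b0
    for _ in range(max_iter):
        # Step 3: update b_N from the current lambda_N.
        new_b_n = b0 + 0.5 * (
            (lambda0 + n) * (1.0 / lambda_n + mu_n ** 2)
            - 2 * (lambda0 * mu0 + sum_x) * mu_n
            + sum_x2
            + lambda0 * mu0 ** 2
        )
        # Step 4: update lambda_N from the new b_N, using E[tau] = a_N / b_N.
        new_lambda_n = (lambda0 + n) * a_n / new_b_n
        # Step 5: stop once neither value changes by more than the tolerance.
        converged = abs(new_b_n - b_n) < tol and abs(new_lambda_n - lambda_n) < tol
        b_n, lambda_n = new_b_n, new_lambda_n
        if converged:
            break
    return mu_n, lambda_n, a_n, b_n


# Example with made-up data and hyperparameters.
mu_n, lambda_n, a_n, b_n = fit_gaussian_vb([1.0, 2.0, 3.0], mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0)
```

The returned values satisfy the update equations for <math>\lambda_N</math> and <math>b_N</math> simultaneously, up to the chosen tolerance.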
+We then have values for the hyperparameters of the approximating distributions of the posterior parameters, which we can use to compute any properties we want of the posterior — e.g. its mean and variance, a 95% highest-density region (the smallest interval that includes 95% of the total probability), etc.
+It can be shown that this algorithm is guaranteed to converge to a local maximum, and since both posterior distributions are in the [[exponential family]], this local maximum will be a [[global maximum]].
+Note also that the posterior distributions have the same form as the corresponding prior distributions.  We did ''not'' assume this; the only assumption we made was that the distributions factorize, and the form of the distributions followed naturally.  It turns out (see below) that the fact that the posterior distributions have the same form as the prior distributions is not a coincidence, but a general result whenever the prior distributions are members of the [[exponential family]], which is the case for most of the standard distributions.
+==Further discussion==
+===Step-by-step recipe===
+The above example shows the method by which the variational-Bayesian approximation to a [[posterior probability]] density in a given [[Bayesian network]] is derived:
+#Describe the network with a [[graphical model]], identifying the observed variables (data) <math>\mathbf{X}</math> and unobserved variables ([[parameter]]s <math>\boldsymbol\Theta</math> and [[latent variable]]s <math>\mathbf{Z}</math>) and their [[conditional probability distribution]]s.  Variational Bayes will then construct an approximation to the posterior probability <math>p(\mathbf{Z},\boldsymbol\Theta\mid\mathbf{X})</math>.  The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more [[statistical independence|independent]] distributions over disjoint subsets of the unobserved variables.
+#Partition the unobserved variables into two or more subsets, over which the independent factors will be derived.  There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable.  Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result.  Assume that the partitions are called <math>\mathbf{Z}_1,\ldots,\mathbf{Z}_M</math>.
+#For a given partition <math>\mathbf{Z}_j</math>, write down the formula for the best approximating distribution <math>q_j^{*}(\mathbf{Z}_j\mid \mathbf{X})</math> using the basic equation <math>\ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}</math> .
+#Fill in the formula for the [[joint probability]] distribution using the graphical model.  Any component conditional distributions that don't involve any of the variables in <math>\mathbf{Z}_j</math> can be ignored; they will be folded into the constant term.
+#Simplify the formula and apply the expectation operator, following the above example.  Ideally, this should simplify into expectations of basic functions of variables not in <math>\mathbf{Z}_j</math> (e.g. first or second raw [[moment (mathematics)|moment]]s, expectation of a logarithm, etc.).  In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or [[hyperparameter]]s of the distributions of these variables.  In all cases, these expectation terms are constants with respect to the variables in the current partition.
+#The functional form of the formula with respect to the variables in the current partition indicates the type of distribution.  In particular, exponentiating the formula generates the [[probability density function]] (PDF) of the distribution (or at least, something proportional to it, with unknown [[normalization constant]]).  In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution.  Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution.  When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
+#When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
+#When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.
+#An [[expectation maximization]] (EM) type procedure is then applied, picking an initial value for each parameter and then iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn.  This is guaranteed to converge.
+===Most important points===
+Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture.  The important things are:
+#The idea of variational Bayes is to construct an analytical approximation to the [[posterior probability]] of the set of unobserved variables (parameters and latent variables), given the data.  This means that the form of the solution is similar to other [[Bayesian inference]] methods, such as [[Gibbs sampling]] — i.e. a distribution that seeks to describe everything that is known about the variables.  As in other Bayesian methods — but unlike e.g. in [[expectation maximization]] (EM) or other [[maximum likelihood]] methods — both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. as [[random variable]]s.  Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a single point estimate or deriving a [[credible interval]], highest density region, etc.
+#"Analytical approximation" means that a formula can be written down for the posterior distribution.  The formula generally consists of a product of well-known probability distributions, each of which ''factorizes'' over a set of unobserved variables (i.e. it is [[conditionally independent]] of the other variables, given the observed data).  This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely in the lowest [[moment (mathematics)|moment]]s of the unobserved variables, e.g. the [[mean]] and [[variance]].
+#The result of all of the mathematical manipulations is (1) the identity of the probability distributions making up the factors, and (2) mutually dependent formulas for the parameters of these distributions.  The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.
+===Compared with expectation maximization (EM)===
+Variational Bayes (VB) is often compared with [[expectation maximization]] (EM).  The actual numerical procedure is quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values.  The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulations.
+However, there are a number of differences.  Most important is ''what'' is being computed.
+*EM computes only point estimates for those random variables that can be categorized as "parameters", but computes estimates of the actual posterior distributions of the latent variables (at least in "soft EM", and often only when the latent variables are discrete).  The point estimates computed are the [[mode (statistics)|mode]]s of the posterior distributions of these parameters; no other information about them is available.
+*VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables.  When point estimates need to be derived, generally the [[mean]] is used rather than the mode, as is normal in Bayesian inference.  Concomitant with this, it should be noted that the parameters computed in VB do ''not'' have the same significance as those in EM.  EM computes optimum values of the parameters of the Bayes network itself.  VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network.  For example, a typical Gaussian [[mixture model]] will have parameters for the mean and variance of each of the mixture components.  EM would directly estimate optimum values for these parameters.  VB, however, would first fit a distribution to these parameters — typically in the form of a [[prior distribution]], e.g. a [[normal-scaled inverse gamma distribution]] — and would then compute values for the parameters of this prior distribution, i.e. essentially [[hyperparameters]].  In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.
+<br style="clear:both" />
+==A more complex example==
+[[File:bayesian-gaussian-mixture-vb.svg|right|300px|thumb|Bayesian Gaussian mixture model using [[plate notation]].  Smaller squares indicate fixed parameters; larger circles indicate random variables.  Filled-in shapes indicate known values.  The indication [K] means a vector of size ''K''; [D,D] means a matrix of size ''D''&times;''D''; K alone means a [[categorical variable]] with ''K'' outcomes.  The squiggly line coming from ''z'' ending in a crossbar indicates a ''switch'' — the value of this variable selects, for the other incoming variables, which value to use out of the size-''K'' array of possible values.]]
+Imagine a Bayesian [[Gaussian mixture model]] described as follows:
+:<math>
+\begin{align}
+\mathbf{\pi} & \sim \operatorname{SymDir}(K, \alpha_0) \\
+\mathbf{\Lambda}_{i=1 \dots K} & \sim \mathcal{W}(\mathbf{W}_0, \nu_0) \\
+\mathbf{\mu}_{i=1 \dots K} & \sim \mathcal{N}(\mathbf{\mu}_0, (\beta_0 \mathbf{\Lambda}_i)^{-1}) \\
+\mathbf{z}[i = 1 \dots N] & \sim \operatorname{Mult}(1, \mathbf{\pi}) \\
+\mathbf{x}_{i=1 \dots N} & \sim \mathcal{N}(\mathbf{\mu}_{z_i}, {\mathbf{\Lambda}_{z_i}}^{-1}) \\
+K &= \text{number of mixing components} \\
+N &= \text{number of data points}
+\end{align}
+</math>
+Note:
+*SymDir() is the symmetric [[Dirichlet distribution]] of dimension <math>K</math>, with the hyperparameter for each component set to <math>\alpha_0</math>.  The Dirichlet distribution is the [[conjugate prior]] of the [[categorical distribution]] or [[multinomial distribution]].
+*<math>\mathcal{W}()</math> is the [[Wishart distribution]], which is the conjugate prior of the [[precision matrix]] (inverse [[covariance matrix]]) for a [[multivariate Gaussian distribution]].
+*Mult() is a [[multinomial distribution]] over a single observation (equivalent to a [[categorical distribution]]).  The state space is a "one-of-K" representation, i.e. a <math>K</math>-dimensional vector in which one of the elements is 1 (specifying the identity of the observation) and all other elements are 0.
+*<math>\mathcal{N}()</math> is the [[Gaussian distribution]], in this case specifically the [[multivariate Gaussian distribution]].
+The interpretation of the above variables is as follows:
+*<math>\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}</math> is the set of <math>N</math> data points, each of which is a <math>D</math>-dimensional vector distributed according to a [[multivariate Gaussian distribution]].
+*<math>\mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}</math> is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using a "one-of-K" vector representation with components <math>z_{nk}</math> for <math>k = 1 \dots K</math>, as described above.
+*<math>\mathbf{\pi}</math> is the mixing proportions for the <math>K</math> mixture components.
+*<math>\mathbf{\mu}_{i=1 \dots K}</math> and <math>\mathbf{\Lambda}_{i=1 \dots K}</math> specify the parameters ([[mean]] and [[precision (statistics)|precision]]) associated with each mixture component.
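+The generative process described above can be simulated directly.  The following is a minimal sketch, assuming one-dimensional data (<math>D = 1</math>), in which case the Wishart distribution <math>\mathcal{W}(W_0, \nu_0)</math> reduces to a [[gamma distribution]] with shape <math>\nu_0/2</math> and scale <math>2W_0</math>.  The function name <code>sample_gmm</code> and the default hyperparameter values are illustrative assumptions, not part of the model definition.

```python
import random

def sample_gmm(N, K, alpha0=1.0, w0=1.0, nu0=3.0, mu0=0.0, beta0=0.1, seed=0):
    """Draw one data set from the Bayesian Gaussian mixture model with D = 1."""
    rng = random.Random(seed)
    # pi ~ SymDir(K, alpha0): normalize K independent Gamma(alpha0, 1) draws.
    g = [rng.gammavariate(alpha0, 1.0) for _ in range(K)]
    pi = [v / sum(g) for v in g]
    # Lambda_k ~ W(w0, nu0); for D = 1 this is Gamma(shape=nu0/2, scale=2*w0).
    lam = [rng.gammavariate(nu0 / 2.0, 2.0 * w0) for _ in range(K)]
    # mu_k ~ N(mu0, (beta0 * Lambda_k)^-1); gauss() expects a standard deviation.
    mu = [rng.gauss(mu0, (beta0 * l) ** -0.5) for l in lam]
    # z_n ~ Mult(1, pi), stored as a component index instead of a one-of-K vector.
    z = rng.choices(range(K), weights=pi, k=N)
    # x_n ~ N(mu_{z_n}, Lambda_{z_n}^-1).
    x = [rng.gauss(mu[k], lam[k] ** -0.5) for k in z]
    return pi, mu, lam, z, x
```

+Storing each <math>\mathbf{z}_n</math> as an index is only a convenience; the one-of-K vector is recovered by setting the element at that index to 1.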
+The joint probability of all variables can be rewritten as
+:<math>p(\mathbf{X},\mathbf{Z},\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda}) = p(\mathbf{X}\mid \mathbf{Z},\mathbf{\mu},\mathbf{\Lambda}) p(\mathbf{Z}\mid \mathbf{\pi}) p(\mathbf{\pi}) p(\mathbf{\mu}\mid \mathbf{\Lambda}) p(\mathbf{\Lambda})</math>
+where the individual factors are
+:<math>
+\begin{align}
+p(\mathbf{X}\mid \mathbf{Z},\mathbf{\mu},\mathbf{\Lambda}) & = \prod_{n=1}^N \prod_{k=1}^K \mathcal{N}(\mathbf{x}_n\mid \mathbf{\mu}_k,\mathbf{\Lambda}_k^{-1})^{z_{nk}} \\
+p(\mathbf{Z}\mid \mathbf{\pi}) & = \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}} \\
+p(\mathbf{\pi}) & = \frac{\Gamma(K\alpha_0)}{\Gamma(\alpha_0)^K} \prod_{k=1}^K \pi_k^{\alpha_0-1} \\
+p(\mathbf{\mu}\mid \mathbf{\Lambda}) & = \prod_{k=1}^K \mathcal{N}(\mathbf{\mu}_k\mid \mathbf{\mu}_0,(\beta_0 \mathbf{\Lambda}_k)^{-1}) \\
+p(\mathbf{\Lambda}) & = \prod_{k=1}^K \mathcal{W}(\mathbf{\Lambda}_k\mid \mathbf{W}_0, \nu_0)
+\end{align}
+</math>
+where
+:<math>
+\begin{align}
+\mathcal{N}(\mathbf{x}\mid \mathbf{\mu},\mathbf{\Sigma}) & = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf{\Sigma}|^{1/2}} \exp \{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^{\rm T} \mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu}) \} \\
+\mathcal{W}(\mathbf{\Lambda}\mid \mathbf{W},\nu) & = B(\mathbf{W},\nu) |\mathbf{\Lambda}|^{(\nu-D-1)/2} \exp \left(-\frac{1}{2} \operatorname{Tr}(\mathbf{W}^{-1}\mathbf{\Lambda}) \right) \\
+B(\mathbf{W},\nu) & = |\mathbf{W}|^{-\nu/2} (2^{\nu D/2} \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma(\frac{\nu + 1 - i}{2}))^{-1} \\
+D & = \text{dimensionality of each data point}
+\end{align}
+</math>
+Assume that <math>q(\mathbf{Z},\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda}) = q(\mathbf{Z})q(\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda})</math>.
+Then
+:<math>
+\begin{align}
+\ln q^*(\mathbf{Z}) &= \operatorname{E}_{\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda}}[\ln p(\mathbf{X},\mathbf{Z},\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda})] + \text{constant} \\
+                    &= \operatorname{E}_{\mathbf{\pi}}[\ln p(\mathbf{Z}\mid \mathbf{\pi})] + \operatorname{E}_{\mathbf{\mu},\mathbf{\Lambda}}[\ln p(\mathbf{X}\mid \mathbf{Z},\mathbf{\mu},\mathbf{\Lambda})] + \text{constant} \\
+                    &= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \ln \rho_{nk} + \text{constant}
+\end{align}
+</math>
+where we have defined
+:<math>\ln \rho_{nk} = \operatorname{E}[\ln \pi_k] + \frac{1}{2} \operatorname{E}[\ln |\mathbf{\Lambda}_k|] - \frac{D}{2} \ln(2\pi) - \frac{1}{2} \operatorname{E}_{\mathbf{\mu}_k,\mathbf{\Lambda}_k} [(\mathbf{x}_n - \mathbf{\mu}_k)^{\rm T} \mathbf{\Lambda}_k (\mathbf{x}_n - \mathbf{\mu}_k)]</math>
+Exponentiating both sides of the formula for <math>\ln q^*(\mathbf{Z})</math> yields
+:<math>q^*(\mathbf{Z}) \propto \prod_{n=1}^N \prod_{k=1}^K \rho_{nk}^{z_{nk}}</math>
+Requiring that this distribution be normalized amounts to dividing each <math>\rho_{nk}</math> by its sum over all values of <math>k</math>, yielding
+:<math>q^*(\mathbf{Z}) = \prod_{n=1}^N \prod_{k=1}^K r_{nk}^{z_{nk}}</math>
+where
+:<math>r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^K \rho_{nj}}</math>
+In other words, <math>q^*(\mathbf{Z})</math> is a product of single-observation [[multinomial distribution]]s, and factors over each individual <math>\mathbf{z}_n</math>, which is distributed as a single-observation multinomial distribution with parameters <math>r_{nk}</math> for <math>k = 1 \dots K</math>.
+Furthermore, we note that
+:<math>\operatorname{E}[z_{nk}] = r_{nk} \, </math>
+which is a standard result for categorical distributions.
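+In practice the normalization is usually carried out on the values of <math>\ln \rho_{nk}</math> rather than on <math>\rho_{nk}</math> itself, since the latter can underflow.  A minimal sketch for a single data point follows; the function name <code>responsibilities</code> is an illustrative assumption.

```python
import math

def responsibilities(log_rho):
    """Normalize one row of ln(rho_nk) values over k to obtain r_nk."""
    top = max(log_rho)                         # subtract the maximum for stability
    weights = [math.exp(v - top) for v in log_rho]
    total = sum(weights)
    return [w / total for w in weights]
```

+Subtracting the maximum does not change the result, because the common factor cancels between the numerator and the denominator of <math>r_{nk}</math>.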
+Now, considering the factor <math>q(\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda})</math>, note that it automatically factors into <math>q(\mathbf{\pi}) \prod_{k=1}^K q(\mathbf{\mu}_k,\mathbf{\Lambda}_k)</math> due to the structure of the graphical model defining our Gaussian mixture model, which is specified above.
+Then,
+:<math>
+\begin{align}
+\ln q^*(\mathbf{\pi}) &= \ln p(\mathbf{\pi}) + \operatorname{E}_{\mathbf{Z}}[\ln p(\mathbf{Z}\mid \mathbf{\pi})] + \text{constant} \\
+                    &= (\alpha_0 - 1) \sum_{k=1}^K \ln \pi_k + \sum_{n=1}^N \sum_{k=1}^K r_{nk} \ln \pi_k + \text{constant}
+\end{align}
+</math>
+Taking the exponential of both sides, we recognize <math>q^*(\mathbf{\pi})</math> as a [[Dirichlet distribution]]
+:<math>q^*(\mathbf{\pi}) \sim \operatorname{Dir}(\mathbf{\alpha}) \, </math>
+where
+:<math>\alpha_k = \alpha_0 + N_k \, </math>
+where
+:<math>N_k = \sum_{n=1}^N r_{nk} \, </math>
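+The update for <math>q^*(\mathbf{\pi})</math> therefore only requires summing the responsibilities column by column.  A minimal sketch, assuming the responsibilities are stored as a list of rows <code>r[n][k]</code> (the function name <code>update_dirichlet</code> is hypothetical):

```python
def update_dirichlet(r, alpha0):
    """Return (alpha, N) with N_k = sum_n r[n][k] and alpha_k = alpha0 + N_k."""
    K = len(r[0])
    N = [sum(row[k] for row in r) for k in range(K)]
    alpha = [alpha0 + N_k for N_k in N]
    return alpha, N
```

+Because each row of responsibilities sums to 1, the updated parameters satisfy <math>\textstyle\sum_k \alpha_k = K\alpha_0 + N</math>, which is a convenient sanity check.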
+Finally
+:<math>\ln q^*(\mathbf{\mu}_k,\mathbf{\Lambda}_k) = \ln p(\mathbf{\mu}_k,\mathbf{\Lambda}_k) + \sum_{n=1}^N \operatorname{E}[z_{nk}] \ln \mathcal{N}(\mathbf{x}_n\mid \mathbf{\mu}_k,\mathbf{\Lambda}_k^{-1}) + \text{constant}</math>
+Grouping and reading off terms involving <math>\mathbf{\mu}_k</math> and <math>\mathbf{\Lambda}_k</math>, the result is a [[Gaussian-Wishart distribution]] given by
+:<math>q^*(\mathbf{\mu}_k,\mathbf{\Lambda}_k) = \mathcal{N}(\mathbf{\mu}_k\mid \mathbf{m}_k,(\beta_k \mathbf{\Lambda}_k)^{-1}) \mathcal{W}(\mathbf{\Lambda}_k\mid \mathbf{W}_k,\nu_k)</math>
+given the definitions
+:<math>
+\begin{align}
+\beta_k             &= \beta_0 + N_k \\
+\mathbf{m}_k        &= \frac{1}{\beta_k} (\beta_0 \mathbf{\mu}_0 + N_k {\bar{\mathbf{x}}}_k) \\
+\mathbf{W}_k^{-1}   &= \mathbf{W}_0^{-1} + N_k \mathbf{S}_k + \frac{\beta_0 N_k}{\beta_0 + N_k} ({\bar{\mathbf{x}}}_k - \mathbf{\mu}_0)({\bar{\mathbf{x}}}_k - \mathbf{\mu}_0)^{\rm T} \\
+\nu_k               &= \nu_0 + N_k \\
+N_k                 &= \sum_{n=1}^N r_{nk} \\
+{\bar{\mathbf{x}}}_k &= \frac{1}{N_k} \sum_{n=1}^N r_{nk} \mathbf{x}_n \\
+\mathbf{S}_k        &= \frac{1}{N_k} \sum_{n=1}^N r_{nk} (\mathbf{x}_n - {\bar{\mathbf{x}}}_k) (\mathbf{x}_n - {\bar{\mathbf{x}}}_k)^{\rm T}
+\end{align}
+</math>
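+These updates can be sketched for a single component as follows, assuming scalar data (<math>D = 1</math>) so that <math>\mathbf{W}_k</math> and <math>\mathbf{S}_k</math> are plain numbers, with each squared deviation in <math>\mathbf{S}_k</math> weighted by the responsibility <math>r_{nk}</math>, and assuming <math>N_k > 0</math>.  The function name <code>update_gaussian_wishart</code> and its argument layout are illustrative assumptions.

```python
def update_gaussian_wishart(x, r_k, mu0, beta0, w0, nu0):
    """Hyperparameter updates for one component k with scalar data (D = 1).

    x   : list of N scalar observations
    r_k : list of N responsibilities r_nk for this component (sum must be > 0)
    """
    N_k = sum(r_k)
    xbar = sum(r * xn for r, xn in zip(r_k, x)) / N_k
    S_k = sum(r * (xn - xbar) ** 2 for r, xn in zip(r_k, x)) / N_k
    beta_k = beta0 + N_k
    m_k = (beta0 * mu0 + N_k * xbar) / beta_k
    # The update is stated for the inverse of W_k, so invert at the end.
    w_inv = 1.0 / w0 + N_k * S_k + beta0 * N_k / (beta0 + N_k) * (xbar - mu0) ** 2
    nu_k = nu0 + N_k
    return beta_k, m_k, 1.0 / w_inv, nu_k
```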
+Finally, notice that these functions require the values of <math>r_{nk}</math>, which make use of <math>\rho_{nk}</math>, which is defined in turn based on <math>\operatorname{E}[\ln \pi_k]</math>, <math>\operatorname{E}[\ln |\mathbf{\Lambda}_k|]</math>, and <math>\operatorname{E}_{\mathbf{\mu}_k,\mathbf{\Lambda}_k} [(\mathbf{x}_n - \mathbf{\mu}_k)^{\rm T} \mathbf{\Lambda}_k (\mathbf{x}_n - \mathbf{\mu}_k)]</math>.  Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:
+:<math>
+\begin{align}
+\operatorname{E}_{\mathbf{\mu}_k,\mathbf{\Lambda}_k}  [(\mathbf{x}_n - \mathbf{\mu}_k)^{\rm T} \mathbf{\Lambda}_k (\mathbf{x}_n - \mathbf{\mu}_k)] & = D\beta_k^{-1} + \nu_k (\mathbf{x}_n - \mathbf{m}_k)^{\rm T} \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k) \\
+\ln {\tilde{\Lambda}}_k &\equiv \operatorname{E}[\ln |\mathbf{\Lambda}_k|] = \sum_{i=1}^D \psi \left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |\mathbf{W}_k| \\
+\ln {\tilde{\pi}}_k &\equiv \operatorname{E}\left[\ln \pi_k\right] = \psi(\alpha_k) - \psi\left(\sum_{i=1}^K \alpha_i\right)
+\end{align}
+</math>
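+Here <math>\psi</math> is the [[digamma function]].  The three expectations can be evaluated as in the following minimal sketch, which again assumes <math>D = 1</math> and uses a simple hand-written approximation of the digamma function (recurrence plus asymptotic series) because the Python standard library does not provide one; all function names are illustrative assumptions.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:                  # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def expected_log_pi(alpha):
    """E[ln pi_k] = psi(alpha_k) - psi(sum_i alpha_i) under Dir(alpha)."""
    total = digamma(sum(alpha))
    return [digamma(a) - total for a in alpha]

def expected_log_det_lambda(w_k, nu_k):
    """E[ln |Lambda_k|] for D = 1, where |W_k| is just the scalar w_k."""
    return digamma(nu_k / 2.0) + math.log(2.0) + math.log(w_k)

def expected_quadratic(x_n, m_k, beta_k, w_k, nu_k):
    """E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)] for D = 1."""
    return 1.0 / beta_k + nu_k * w_k * (x_n - m_k) ** 2
```

+A numerical library routine for the digamma function, where available, would normally be preferred over the approximation used here.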
+These results lead to
+:<math>r_{nk} \propto {\tilde{\pi}}_k {\tilde{\Lambda}}_k^{1/2} \exp \left\{ - \frac{D}{2 \beta_k} - \frac{\nu_k}{2} (\mathbf{x}_n - \mathbf{m}_k)^{\rm T} \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k) \right\}</math>
+These can be converted from proportional to absolute values by normalizing over <math>k</math> so that the corresponding values sum to 1.
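+As a worked restatement, substituting the three expectations into the definition of <math>\ln \rho_{nk}</math> gives the equivalent log-domain form, which is generally more convenient numerically because the exponential can underflow for points far from <math>\mathbf{m}_k</math>:
+:<math>\ln r_{nk} = \ln {\tilde{\pi}}_k + \frac{1}{2} \ln {\tilde{\Lambda}}_k - \frac{D}{2 \beta_k} - \frac{\nu_k}{2} (\mathbf{x}_n - \mathbf{m}_k)^{\rm T} \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k) + \text{constant}</math>
+The term <math>-\tfrac{D}{2} \ln(2\pi)</math> from <math>\ln \rho_{nk}</math> does not depend on <math>k</math>, so it is absorbed into the constant, which is fixed for each <math>n</math> by the requirement that <math>\textstyle\sum_k r_{nk} = 1</math>.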
+Note that:
+#The update equations for the parameters <math>\beta_k</math>, <math>\mathbf{m}_k</math>, <math>\mathbf{W}_k</math> and <math>\nu_k</math> of the variables <math>\mathbf{\mu}_k</math> and <math>\mathbf{\Lambda}_k</math> depend on the statistics <math>N_k</math>, <math>{\bar{\mathbf{x}}}_k</math>, and <math>\mathbf{S}_k</math>, and these statistics in turn depend on <math>r_{nk}</math>.
+#The update equations for the parameters <math>\alpha_{1 \dots K}</math> of the variable <math>\mathbf{\pi}</math> depend on the statistic <math>N_k</math>, which depends in turn on <math>r_{nk}</math>.
+#The update equation for <math>r_{nk}</math> has a direct circular dependence on <math>\beta_k</math>, <math>\mathbf{m}_k</math>, <math>\mathbf{W}_k</math> and <math>\nu_k</math> as well as an indirect circular dependence on <math>\mathbf{W}_k</math>, <math>\nu_k</math> and <math>\alpha_{1 \dots K}</math> through <math>{\tilde{\pi}}_k</math> and <math>{\tilde{\Lambda}}_k</math>.
+This suggests an iterative procedure that alternates between two steps:
+#An E-step that computes the value of <math>r_{nk}</math> using the current values of all the other parameters.
+#An M-step that uses the new value of <math>r_{nk}</math> to compute new values of all the other parameters.
+Note that these steps correspond closely with the standard EM algorithm to derive a [[maximum likelihood]] or [[maximum a posteriori]] (MAP) solution for the parameters of a [[Gaussian mixture model]].  The responsibilities <math>r_{nk}</math> in the E step correspond closely to the [[posterior probability|posterior probabilities]] of the latent variables given the data, i.e. <math>p(\mathbf{Z}\mid \mathbf{X})</math>; the computation of the statistics <math>N_k</math>, <math>{\bar{\mathbf{x}}}_k</math>, and <math>\mathbf{S}_k</math> corresponds closely to the computation of corresponding "soft-count" statistics over the data; and the use of those statistics to compute new values of the parameters corresponds closely to the use of soft counts to compute new parameter values in normal EM over a Gaussian mixture model.
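+The alternating procedure can be sketched end to end as follows.  This is a minimal illustration under stated assumptions, not a reference implementation: it handles only scalar data (<math>D = 1</math>), initializes the responsibilities at random, runs a fixed number of iterations instead of testing for convergence, and uses a hand-written digamma approximation.  The names <code>vb_gmm_1d</code> and <code>digamma</code> and all default values are illustrative.

```python
import math
import random

def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def vb_gmm_1d(x, K, iters=50, alpha0=1.0, mu0=0.0, beta0=1.0, w0=1.0, nu0=1.0, seed=0):
    """Alternate M-step and E-step for scalar data; returns (r, alpha, m, beta, w, nu)."""
    rng = random.Random(seed)
    N = len(x)
    # Initialization (an arbitrary choice): random soft assignments, rows sum to 1.
    r = []
    for _ in range(N):
        row = [rng.random() + 1e-3 for _ in range(K)]
        s = sum(row)
        r.append([v / s for v in row])
    for _ in range(iters):
        # M-step: statistics N_k, xbar_k, S_k, then the hyperparameter updates.
        # The tiny offset guards against division by zero for an empty component.
        Nk = [sum(r[n][k] for n in range(N)) + 1e-10 for k in range(K)]
        xbar = [sum(r[n][k] * x[n] for n in range(N)) / Nk[k] for k in range(K)]
        S = [sum(r[n][k] * (x[n] - xbar[k]) ** 2 for n in range(N)) / Nk[k]
             for k in range(K)]
        alpha = [alpha0 + Nk[k] for k in range(K)]
        beta = [beta0 + Nk[k] for k in range(K)]
        m = [(beta0 * mu0 + Nk[k] * xbar[k]) / beta[k] for k in range(K)]
        w = [1.0 / (1.0 / w0 + Nk[k] * S[k]
                    + beta0 * Nk[k] / (beta0 + Nk[k]) * (xbar[k] - mu0) ** 2)
             for k in range(K)]
        nu = [nu0 + Nk[k] for k in range(K)]
        # E-step: ln rho_nk from the three expectations, then normalize over k.
        # The -D/2 ln(2 pi) term is omitted because it cancels in the normalization.
        psi_sum = digamma(sum(alpha))
        r = []
        for xn in x:
            log_rho = [digamma(alpha[k]) - psi_sum
                       + 0.5 * (digamma(nu[k] / 2.0) + math.log(2.0 * w[k]))
                       - 0.5 * (1.0 / beta[k] + nu[k] * w[k] * (xn - m[k]) ** 2)
                       for k in range(K)]
            top = max(log_rho)
            e = [math.exp(v - top) for v in log_rho]
            s = sum(e)
            r.append([v / s for v in e])
    return r, alpha, m, beta, w, nu
```

+A fuller implementation would typically monitor the variational lower bound to decide when to stop, and would restart from several initializations, since the procedure converges only to a local optimum.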
+==Exponential-family distributions==
+Note that in the previous example, once the distribution over unobserved variables was assumed to factorize into distributions over the "parameters" and distributions over the "latent data", the derived "best" distribution for each variable was in the same family as the corresponding prior distribution over the variable.  This is a general result that holds true for all prior distributions derived from the [[exponential family]].
+==See also==
+* [[Variational message passing]]: a modular algorithm for variational Bayesian inference.
+* [[Expectation-maximization algorithm]]: a related approach which corresponds to a special case of variational Bayesian inference.
+* [[Generalized filtering]]: a variational filtering scheme for nonlinear state space models.
+* [[Calculus of variations]]: the field of mathematical analysis that deals with maximizing or minimizing functionals.
+{{More footnotes|date=September 2010}}
+==Notes==
+{{reflist}}
+==References==
+* {{Cite book
+  | last1 = Bishop   | first1 = Christopher M.
+   | title = Pattern Recognition and Machine Learning
+  | year = 2006
+  | publisher = Springer
+  | ref = CITEREFBishop2006
+  | isbn = 0-387-31073-8
+  }}
+==External links==
+* [http://www.gatsby.ucl.ac.uk/vbayes/ Variational-Bayes Repository] A repository of papers, software, and links related to the use of variational methods for approximate Bayesian learning
+* [http://www.inference.phy.cam.ac.uk/mackay/itila/ The on-line textbook: Information Theory, Inference, and Learning Algorithms], by [[David J.C. MacKay]] provides an introduction to variational methods (p.&nbsp;422).
+* [http://www.cse.buffalo.edu/faculty/mbeal/thesis/index.html Variational Algorithms for Approximate Bayesian Inference], by M. J. Beal includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs.
+* [http://www.robots.ox.ac.uk/~sjrob/Pubs/fox_vbtut.pdf A Tutorial on Variational Bayes]. Fox, C. and Roberts, S. 2012. Artificial Intelligence Review, {{doi|10.1007/s10462-011-9236-8}}.
+* [http://www.cs.jhu.edu/~jason/tutorials/variational.html High-Level Explanation of Variational Inference] by Jason Eisner may be worth reading before a more mathematically detailed treatment.
+{{DEFAULTSORT:Variational Bayesian Methods}}
+[[Category:Bayesian statistics]]