# Full-covariance Gaussian observation model: Variational Bayesian Methods¶

TODO revise this!

## Generative model¶

The diagonal Gaussian observation model generates each data vector $$x_n$$ of length D from a multivariate Gaussian with mean $$\mu_k \in \mathbb{R}^D$$ and a diagonal covariance matrix:

$\begin{split}\begin{array}{c} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nD} \end{array} \sim \mathcal{N} \left( \begin{array}{c c c c c} \mu_{k1} \\ \mu_{k2} \\ \vdots \\ \mu_{kD} \end{array} , \begin{array}{c c c c c} \lambda_{k1}^{-1} \\ & \lambda_{k2}^{-1} \\ & & \ddots \\ & & & & \lambda_{kD}^{-1} \end{array} \right)\end{split}$

### Global Random Variables¶

The global random variables are the cluster-specific means and precisions (inverse variances).

For each cluster k, we have the following global random variables:

$\begin{split}\mu_{k1}, \mu_{k2}, \ldots \mu_{kD} &\qquad \mu_{kd} \in \mathbb{R} \\ \lambda_{k1}, \lambda_{k2}, \ldots \lambda_{kD} &\qquad \lambda_{kd} \in (0, +\infty)\end{split}$

### Local Random Variables¶

Each dataset observation at index n has its own cluster assignment:

$z_n \in \{1, 2, \ldots K \}$

The generative model and approximate posterior for $$z_n$$ is determined by an allocation model. For all computations needed by our current observation model, we’ll assume either a point estimate or an approximate posterior for $$z_n$$ is known.

## Normal Wishart prior¶

Each dimension d has a mean $$\mu_{kd}$$ and variance $$\lambda_{kd}$$ which have a joint univariate Normal-Wishart prior with scalar hyperparameters $$\bar{\nu}, \bar{\beta}_d$$ for the Wishart prior and then $$\bar{m}_d, \bar{\kappa}$$ for the Normal prior:

$\begin{split}\lambda_{kd} &\sim \mathcal{W}_1(\bar{\nu}, \bar{\beta}_d) \\ \mu_{kd} &\sim \mathcal{N}_1(\bar{m}_d, \bar{\kappa}^{-1} \lambda_{kd}^{-1})\end{split}$

These are represented by the following numpy array attributes of the Prior parameter bag:

• nu : float
degrees of freedom
• beta : 1D array, size D
scale parameters that set mean of lambda
• m : 1D array, size D
mean of the parameter mu
• kappa : float
scalar precision of mu

Several keyword arguments can be used to determine the values of the prior hyperparameters when calling bnpy.run

• --nu : float
Sets value of $$\bar{\nu}$$. Defaults to D + 2.
• --kappa : float
Sets value of $$\bar{\kappa}$$. Defaults to ???.
• --ECovMat : str
Determines the expected value of data covariance under the prior. Possible values include ‘eye’ and ‘diagcovdata’. TODO
• --sF : float
These two options set the value of $$\bar{\beta}$$. TODO.
• TODO set m??

## Approximate posterior¶

We assume the following factorized approximate posterior family for variational optimization:

$q(z, \mu, \lambda) = \prod_{n=1}^N q(z_n) \cdot \prod_{k=1}^K (\mu_k, \lambda_k )$

The specific forms of the global and local factors are given below.

### Posterior for local assignments¶

For each observation vector at index n, we assume an independent approximate posterior over the assigned cluster indicator $$z_n \in \{1, 2, \ldots K \}$$.

$\begin{split}q( z ) &= \prod_{n=1}^N q(z_n | \hat{r}_n ) \\ &= \prod_{n=1}^N \mbox{Discrete}( z_n | \hat{r}_{n1}, \hat{r}_{n2}, \ldots \hat{r}_{nK})\end{split}$

Thus, for this observation model the only local variational parameter is the assignment responsibility array $$\hat{r} = \{ \{ \hat{r}_{nk} \}_{k=1}^K \}_{n=1}^N$$.

Inside the LP dict, this is represented by the resp numpy array:

• resp : 2D array, size N x K
Parameters of approximate posterior q(z) over cluster assignments. resp[n,k] = probability observation n is assigned to component k.

Remember, all computations required by our observation model assume that the resp array is given. The actual values of resp are updated by an allocation model.

### Posterior for global parameters¶

The goal of variational optimization is to find the best approximate posterior distribution for the mean and precision parameters of each cluster k:

$\begin{split}q( \mu, \lambda ) &= \prod_{k=1}^K \prod_{d=1}^D q( \mu_{kd}, \lambda_{kd} ) \\ &= \prod_{k=1}^K \prod_{d=1}^D \mathcal{W}_1( \lambda_{kd} | \hat{\nu}_k, \hat{\beta}_{kd} ) \mathcal{N}_1( \mu_{kd} | \hat{m}_{kd}, \hat{\kappa}_k^{-1} \lambda_{kd}^{-1} )\end{split}$

This approximate posterior is represented by the Post attribute of the DiagGaussObsModel. This is a ParamBag with the following attributes:

• K : int
number of active clusters
• nu : 1D array, size K
Defines $$\hat{\nu}_k$$ for each cluster
• beta : 2D array, size K x D
Defines $$\hat{\beta}_{kd}$$ for each cluster and dimension
• m : 2D array, size K x D
Defines $$\hat{m}_{kd}$$ for each cluster and dimension
• kappa : 2D array, size K
Defines $$\hat{\kappa}_{k}$$ for each cluster

### Objective function¶

Variational optimization will find the approximate posterior parameters that maximize the following objective function, given a fixed observed dataset $$x = \{x_1, \ldots x_N \}$$ and fixed prior hyparparameters $$\bar{\nu}, \bar{\beta}, \bar{m}, \bar{\kappa}$$.

$\begin{split}\mathcal{L}^{\smalltext{DiagGauss}}( \hat{\nu}, \hat{\beta}, \hat{m}, \hat{\kappa} ) &= \sum_{k=1}^K \sum_{d=1}^D c^{\smalltext{NW}}_{1,1}( \hat{\nu}_k, \hat{\beta}_{kd}, \hat{m}_{kd}, \hat{\kappa})_k - c^{\smalltext{NW}}_{1,1}( \bar{\nu}, \bar{\beta}_d, \bar{m}_d, \bar{\kappa}) \\ & \quad + \frac{1}{2} \sum_{k=1}^K \sum_{d=1}^D \left( N_k(\hat{r}) + \bar{\nu} - \hat{\nu}_k \right) \E_q[ \log \lambda_{kd} ] \\ & \quad - \frac{1}{2} \sum_{k=1}^K \sum_{d=1}^D \left( N_{k}(\hat{r}) + \bar{\kappa} - \hat{\kappa}_{k} \right) \E_q[ \lambda_{kd} ] \\ & \quad + \sum_{k=1}^K \sum_{d=1}^D \left( S_{kd}^{x}(x, \hat{r}) + \bar{\kappa} \bar{m}_d - \hat{\kappa}_k \hat{m}_{kd} \right) \E_q[ \lambda_{kd} \mu_{kd} ] \\ & \quad - \frac{1}{2} \sum_{k=1}^K \sum_{d=1}^D \left( S_{kd}^{x^2}(x, \hat{r}) + \bar{\beta}_d + \bar{\kappa} \bar{m}_{d}^2 - \hat{\beta}_{kd} - \hat{\kappa}_{k} \hat{m}_{kd}^2 \right) \E_q[ \lambda_{kd} \mu_{kd}^2 ]\end{split}$

This objective function is computed by calling the Python function calc_evidence.

### Sufficient statistics¶

The sufficient statistics of this observation model are functions of the local parameters $$\hat{r}$$ and the observed data $$x$$.

$\begin{split}N_{k}(\hat{r}) &= \sum_{n=1}^N \hat{r}_{nk} \\ S^{x}_{kd}(x, \hat{r}) &= \sum_{n=1}^N \hat{r}_{nk} x_{nd}^2 \\ S^{x^2}_{kd}(x, \hat{r}) &= \sum_{n=1}^N \hat{r}_{nk} x_{nd}^2\end{split}$

These fields are stored within the sufficient statistics parameter bag SS as the following fields:

• SS.N : 1D array, size K
SS.N[k] = $$N_k$$
• SS.x : 2D array, size K x D
SS.x[k,d] = $$S^{x}_{kd}(x, \hat{r})$$
• SS.xx : 2D array, size K x D
SS.xx[k,d] = $$S^{x^2}_{kd}(x, \hat{r})$$

### Cumulant function¶

The cumulant function of the univariate Normal-Wishart is evaluated for each dimension d separately. The function takes 4 scalar input arguments and produces a scalar output.

$c^{\smalltext{NW}}_{1,1}(\nu, \beta_d, m_d, \kappa) &= - \frac{1}{2} \log 2\pi + \frac{1}{2} \log \kappa + \frac{\nu}{2} \log \frac{\beta_d}{2} - \log \Gamma \left( \frac{\nu}{2} \right)$

### Local step update¶

As with all observation models, the local step computes the expected log conditional probability of assigning each observation to each cluster:

$\E[ \log p( x_n | \mu_k, \lambda_k ) ] = - \frac{D}{2} \log 2 \pi + \frac{1}{2} \sum_{d=1}^D \E[ \log \lambda_{kd} ] - \frac{1}{2} \sum_{d=1}^D \E[ \lambda_{kd} (x_{nd} - \mu_{kd})^2 ]$

where the elementary expectations required are:

$\begin{split}\E[ \log \lambda_{kd} ] &= \psi \left( \frac{\hat{\nu}_k}{2} \right) - \log \frac{\hat{\beta}_{kd}}{2} \\ \E_q \left[ \lambda_{kd} (x_{nd} - \mu_{kd})^2 \right] &= \frac{1}{\hat{\kappa}_{k}} + \frac{ \hat{\nu}_k }{ \hat{\beta}_{kd} } (x_{nd} - \hat{m}_{kd})^2\end{split}$

In our implementation, this is done via the function calc_local_params, which computes the following arrays and places them inside the local parameter dict LP.

• E_log_soft_ev : 2D array, N x K
log probability of assigning each observation n to each cluster k

### Global step update¶

The global step update produces an updated approximate posterior over the global random variables. Concretely, this means updated values for each field of the Post ParamBag attribute of the DiagGaussObsModel.

$\begin{split}\hat{\nu}_k &\gets N_k(\hat{r}) + \bar{\nu} \\ \hat{\kappa}_k &\gets N_k(\hat{r}) + \bar{\kappa} \\ \hat{m}_{kd} &\gets \frac{1}{\hat{\kappa}_k} \left( S_k^{x}(x, \hat{r}) + \bar{\kappa} \bar{m}_d \right) \\ \hat{\beta}_{kd} &\gets S_{kd}^{x^2}(x, \hat{r}) + \bar{\beta}_d + \bar{\kappa} \bar{m}_d^2 - \hat{\kappa}_k \hat{m}_{kd}^2\end{split}$

Our implementation performs this update when calling the function update_global_params.

## Initialization¶

Initialization creates valid values of the parameters which define the approximate posterior over the global random variables. Concretely, this means it creates a valid setting of the Post attribute of the DiagGaussObsModel object.

TODO