This article is an extension of “Gaussian Mixture Models and Expectation-Maximization (A full explanation)”. If you didn’t read it, this article might not be very useful.
The goal here is to derive the closed-form expressions necessary for the update of the parameters during the Maximization step of the EM algorithm applied to GMMs. This material was written as a separate article in order not to overload the main one.
Ok so recall that during the M-Step, we want to maximize the following lower bound with respect to Θ :
The lower bound is defined to be a concave function easy to optimize. So we are going perform a direct optimization procedure, that is, finding the parameters for which the partial derivatives are null. Also as we already said in the main article, we have to fulfill two constraints. The first one is that the sum of mixture weights must sum up to one and the second that the covariance matrix must be positive semidefinite.
Now recall that we initially defined the probability model using a mixture of Gaussian densities like so:
This initial definition was given prior to the introduction of the latent variable t that allows us to define the probability of a specific observation belonging to a specific Gaussian. With the latent variable introduced, we now write:
So all in all, we have to resolve the following optimization problem:
Now the second constraint regarding the covariance matrix that must be positive semidefinite is not really a constraint while performing direct optimization. We are not updating the parameters in an iterative fashion. The covariance matrix will be found using a closed-form expression so there is no need to worry about this second constraint.
The first constraint regarding the mixture weights is handled by the introduction of a Lagrangian multiplier. So the optimization objective becomes:
In order to resolve this problem, we have to found the parameters for which the partial derivatives are null:
First, let’s start with the equation with respect to λ:
Ok, so we are back to the definition of the constraint. No surprise here but not very useful.
Mixture weights derivation
Now let’s move on with the equation with respect to α_j. Note that in the following derivations, we use the fact that deriving for α_k with k different than j results in a constant (this basically means that the summation over the K clusters can be ignored):
Now in order to get rid of the λ, we use the definition of the constraint regarding the mixtures weights:
And this gives us the final closed-form expressions for the mixture weights:
This tells us that for each class, after the M-Step, the mixture weight will be the sum of all the individual weights for that class normalized by the sum of all the individual weights for all the classes. And this makes perfect sense, if all the observations put a little weight on a specific class compared to the other classes than this class will have a small overall weight, and vice-versa.
Mean vector derivation
Now let’s move on the derivation of the mean vector μ. This one is a little trickier because, this time, we are computing the partial derivative for vectors and matrices.
Ok so now we have simplified this expression as much as possible by extracting out every constant part with respect to mu. In order to go through this derivation, we are going to use a specific identity that can be found in Chapter 2 of the essential Matrix cookbook. It states that for any symmetric matrix W, any vector x and any scalar s:
The covariance matrix Σ is symmetric so its inverse is also and we can write:
This looks perfectly right. Out of the M-Step, the mean value for a specific Gaussian distribution will be the expected value of the data points with respect to the variational distribution q normalized by the sum of all the weights.
Covariance matrix derivation
Finally, let’s derive the analytical update expression for the covariance matrix Σ:
Once again, the expression is simplified as possible. Now as you can see we have to derive the inverse of the covariance matrix which can be quite tricky. So instead, we are going to make a change of variable in order to derive with respect to the inverse of the matrix and then inverse the final expression. So we state that:
Also, we make use of the following two identities derived from the matrix cookbook:
Finally, we have:
And this looks pretty dam right! The covariance matrix after the M-Step is a reweighted version of the previous covariance matrix.
End of the appendix