Mixing with Julia

We have such a mixture now, such a fusion of different genres. 
Ryszard Kapuscinski

A while ago, I wrote a small app in Julia. In some of my other work, I needed to look at data that could be modeled as being drawn from a mixture of probability distributions. There are many packages available in R and Python for analyzing mixture models. However, since version 1.1 of Julia was released recently, I decided to renew my acquaintance with the language by using it to fit a simple mixture of normal (Gaussian) distributions. Although Julia's Distributions package provides methods for handling mixture models, I wanted to build my own analysis routine to get some practice with Julia.

Finite Mixture Models


A mixture model is a probability distribution that combines a set of component distributions to represent the overall distribution. That is, given a set of data $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, where each $\mathbf{x}_i$ is an m-element vector, we assume that the distribution of the data is given by a finite mixture of K components:

\[p(\mathbf{X} \mid \Theta) = \sum_{k = 1}^{K} \lambda_k\, p_k(\mathbf{X} \mid \mathbf{h}_k, \theta_k)\]

$p_k(\mathbf{X} \mid \mathbf{h}_k, \theta_k)$ are the mixture components, i.e. distributions over $\mathbf{X}$ with parameters $\theta_k$.

$\lambda_k = p(\mathbf{h}_k)$ are the mixture weights. They represent the probability that a data point was generated by component $k$, and $\sum_{k = 1}^{K} \lambda_k = 1$.

$\mathbf{h}_k$ is a binary indicator vector. $h_{ik} = 1$ indicates that $\mathbf{x}_i$ was generated by mixture component $k$. For each data point $i$, only one element of $h_{i \bullet}$ can be non-zero.

$\Theta = \{\lambda_1, \ldots, \lambda_K, \theta_1, \ldots, \theta_K\}$ is the complete set of parameters for the mixture model.

There's a nice description of the basics of mixture models here that the above is based on.
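As an aside, Julia's Distributions package can represent a finite mixture directly. Here's a minimal sketch (separate from the code developed below) with a hypothetical two-component univariate mixture; it is just the density $p(\mathbf{X} \mid \Theta)$ above expressed with the built-in MixtureModel type.

using Distributions

# two hypothetical components p_k with parameters theta_k
components = [Normal(0.0, 1.0), Normal(3.0, 0.5)]
lambda = [0.4, 0.6]                     # mixture weights, sum to 1
mix = MixtureModel(components, lambda)

pdf(mix, 1.0)   # mixture density at x = 1.0
rand(mix, 5)    # draw 5 samples from the mixture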

Using Bayes' rule, we can compute a "weight" $w_{ik}$, the probability that $\mathbf{x}_i$ was drawn from component $k$:

\[w_{ik} = p(h_{ik} = 1 \mid \mathbf{x}_i, \Theta) = \frac{\lambda_k\, p_k(\mathbf{x}_i \mid \mathbf{h}_k, \theta_k)}{\sum_{l = 1}^{K} \lambda_l\, p_l(\mathbf{x}_i \mid \mathbf{h}_l, \theta_l)}; \quad 1 \leqslant i \leqslant N,\; 1 \leqslant k \leqslant K\]

N is the number of data points.
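As a sketch of this calculation in Julia (assuming, for illustration, univariate normal components and hypothetical current parameter estimates mus, sigmas, and lambda), the weights can be computed and row-normalized like this:

using Distributions

# w[i, k] - probability that x[i] was generated by component k,
# given the current (hypothetical) parameter estimates
function responsibilities(x::Vector{Float64},
                          mus::Vector{Float64},
                          sigmas::Vector{Float64},
                          lambda::Vector{Float64})
    N, K = length(x), length(lambda)
    w = zeros(N, K)
    for k in 1:K, i in 1:N
        w[i, k] = lambda[k] * pdf(Normal(mus[k], sigmas[k]), x[i])
    end
    return w ./ sum(w, dims=2)   # normalize each row so it sums to 1
end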

Mixture of Normals


If $p_k(\mathbf{x}_i \mid \mathbf{h}_k, \theta_k) \sim N(\mu_k, \sigma_k^2)$, the model is a mixture of normal (Gaussian) distributions. In this case $\theta_k = \{\mu_k, \sigma_k^2\}$.

Our task is, given a set of data, to estimate $\mu_k$, $\sigma_k^2$, and $\lambda_k$ for each component.

Generating Data


Let's generate some data from a mixture model. We will sample data from a three-component mixture of bivariate normal distributions. We'll use the following values to generate the data.

$$\boldsymbol{\lambda} = \begin{pmatrix} 0.3 & 0.4 & 0.3 \end{pmatrix}$$
$$\boldsymbol{\mu} = \begin{pmatrix} 4.0 & 0.0 \\ 3.0 & 3.0 \\ 1.0 & 1.0 \end{pmatrix}$$
$$\boldsymbol{\sigma} = \left( \begin{pmatrix} 1.0 & 0.2 \\ 0.2 & 1.0 \end{pmatrix},\; \begin{pmatrix} 1.0 & -0.7 \\ -0.7 & 1.0 \end{pmatrix},\; \begin{pmatrix} 0.5 & 0.1 \\ 0.1 & 0.5 \end{pmatrix} \right)$$

To generate the data with Julia, we first choose a component for each of 200 points by sampling integers from the range [1, 3] with probabilities $\lambda$. Then, for each sampled component value k, we draw a point from the normal distribution $N(\mu_k, \sigma_k)$. After that we save the samples to a CSV file and plot them.

Here's the Julia code.

#=
generate_mixture.jl

Generate data from a finite mixture of normals.

Author: Bill Thompson
Copyright: 2019
License: GPL
=#

using Compat, Random, Distributions
using StatsBase
using PyPlot

#=
generateSamples - generate a series of samples from a collection of normals
arguments:
lambda: mixture weights, vector of K components. 
        requires: sum(lambda) = 1
mu - means, vector of normal means K x m, where m is the number of dimensions of space.
sig - variance-covariance matrices, array of variances m x m x K
N - number of samples to draw
returns:
samples - an N x m array of sample points
components - an array of length N, component[i] = k means that samples[i, :] was drawn from
             normal component k. 
=#
function generateSamples(lambda::Array{Float64, 1},
                         mu::Array{Array{Float64, 1}, 1},
                         sig::Array{Float64, 3};
                         N::Int = 200)::Tuple{Array{Float64,2}, Array{Int64,1}}
    m = size(mu[1])[1]   # dimension of sample space
    K = size(lambda)[1]  # number of components
    components = sample(1:K, weights(lambda), N)

    samples = zeros(N, m)
    for i in 1:N
        c = components[i]
        mv = MvNormal(mu[c], sig[:, :, c])
        mu2 = rand(mv)
        samples[i, :] = mu2
    end

    return samples, components
end

#=
saveSamples - save samples to a csv file
arguments:
out_file - the csv file name
samples - an N X m array generated by generateSamples
components - an array of length N, component[i] = k means that samples[i, :] was drawn from
             normal component k returned by generateSamples
=#
function saveSamples(out_file::String,
                     samples::Array{Float64,2},
                     components::Array{Int64,1})
    open(out_file, "w") do f
        for i in 1:size(samples)[1]
            c = components[i]
            write(f, "$c")
            for j in 1:size(samples)[2]
                s = samples[i, j]
                write(f, ",$s")
            end
            write(f, "\n")
        end
    end
end

#=
scatterPlotSamples - 2-D plot of the samples
                     plots the first 2 components of samples to a file
arguments:
plot_file - file name of plot file
samples - an N X m array generated by generateSamples
components - an array of length N, component[i] = k means that samples[i, :] was drawn from
             normal component k returned by generateSamples
K - number of mixture components
showIt - if true, plot is displayed on the screen
=#
function scatterPlotSamples(plot_file::String,
                            samples::Array{Float64,2},
                            components::Array{Int64,1},
                            K::Int = 2,
                            showIt::Bool = false)
    f = figure()
    for i in 1:K
        ndx = findall(x -> x == i, components)
        s = samples[ndx, :]
        X = s[:, 1]
        Y = s[:, 2]
        scatter(X, Y, label="component " * "$i")
    end
    legend(loc="upper right")

    if(showIt)
        plt[:show]() # to show the plot on the screen
    end
    
    savefig(plot_file)
end

#=
main - setup and generate sample
=#
function main()
    out_file = "data.csv"   # csv output file
    plot_file = "data.png"  # plot file

    # parameters
    lambda = [0.3, 0.4, 0.3]
    mu = [[4.0, 0.0],
          [3.0, 3.0],
          [1.0, 1.0]]
    sig = zeros(2, 2, 3)
    sig[:, :, 1] = [[1.0 0.2];
                    [0.2 1.0]]
    sig[:, :, 2] = [[1.0 -0.7];
                    [-0.7 1.0]]
    sig[:, :, 3] = [[0.5 0.1];
                    [0.1 0.5]]
    
    # generate the data and save it
    samples, components = generateSamples(lambda, mu, sig)
    saveSamples(out_file, samples, components)
    scatterPlotSamples(plot_file, samples, components, 3)
end

main()

Notice the function header

function generateSamples(lambda::Array{Float64, 1},
                         mu::Array{Array{Float64, 1}, 1},
                         sig::Array{Float64, 3};
                         N::Int = 200)::Tuple{Array{Float64,2}, Array{Int64,1}}

One of the nice features of Julia is the ability to declare argument and return types for functions. This helps catch errors caused by passing an incompatible type to a function: the call fails immediately with a MethodError rather than somewhere inside the function body. We could have written 

function generateSamples(lambda, mu, sig; N=200)

and allowed the compiler to infer the types.
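For example, here's a small illustration with a hypothetical function (not part of the post's code). The typed version rejects a bad argument immediately at the call site with a MethodError, while the untyped version fails later, somewhere inside the body, with a less direct error.

typed_mean(x::Array{Float64, 1})::Float64 = sum(x) / length(x)
untyped_mean(x) = sum(x) / length(x)

typed_mean([1.0, 2.0, 3.0])     # 2.0
# typed_mean("1, 2, 3")         # MethodError: no method matching typed_mean(::String)
# untyped_mean("1, 2, 3")       # fails inside the body when sum() hits the string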

Here's a plot of the generated output.



Expectation-Maximization


Expectation-Maximization (EM) is a method for obtaining maximum likelihood estimates of parameters. The algorithm alternates between an expectation (E) step, which calculates weights for the expectation of the log-likelihood based on the current estimate of the parameters, and a maximization (M) step, which computes maximum likelihood estimates of the parameters based on the weights found in the E step. The new parameter estimates are then used in the next expectation step.

The method is initialized with a "guess" for the parameters.


E-step

Given an estimate of the parameters, we calculate ${w_{ik}}$. Each row of  ${\mathbf{w}}$ is normalized so the rows sum to 1. That is, each point must have been generated from one of the K components.

Using the weights, we can calculate the expected number of points generated by each component.

\[N_k = \sum_{i = 1}^{N} w_{ik}\]

M-step

Given the estimate of the number of points from each component, we can calculate maximum likelihood estimates of the parameters.

\[\lambda_k = \frac{N_k}{N}\]

\[\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{i = 1}^{N} w_{ik}\,\mathbf{x}_i\]

\[\boldsymbol{\sigma}_k = \frac{1}{N_k}\sum_{i = 1}^{N} w_{ik}\,(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^t\]

Each $\boldsymbol{\mu}_k$ is an m-element vector and each $\boldsymbol{\sigma}_k$ is an m by m matrix. The calculations of $\boldsymbol{\mu}_k$ and $\boldsymbol{\sigma}_k$ are effectively weighted averages.
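Here's a compact sketch of those M-step updates in Julia, using the same notation: X is the N x m data matrix and w is the N x K weight matrix from the E-step. This is a standalone illustration rather than part of the program below, which spells out the same loops.

using LinearAlgebra

function m_step(X::Array{Float64, 2}, w::Array{Float64, 2})
    N, m = size(X)
    K = size(w, 2)
    Nk = vec(sum(w, dims=1))        # expected number of points per component
    lambda = Nk / N                 # lambda_k = N_k / N
    mu = (w' * X) ./ Nk             # K x m matrix of weighted means mu_k
    sigma2 = zeros(m, m, K)
    for k in 1:K
        for i in 1:N
            d = X[i, :] - mu[k, :]
            sigma2[:, :, k] += w[i, k] * (d * d')
        end
        sigma2[:, :, k] /= Nk[k]    # weighted covariance sigma_k
    end
    return lambda, mu, sigma2
end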

E-M iterates between the expectation and maximization steps, updating the parameters at each maximization step and using them for the next expectation step. When to stop? The most common criterion is to exit when the likelihood of the data reaches a plateau. Rather than using the likelihood, which is likely to be very small for large amounts of data, it's common to use the log-likelihood.

$$\log l(\Theta) = \sum_{i = 1}^{N} \log p(\mathbf{x}_i \mid \Theta) = \sum_{i = 1}^{N} \log \sum_{k = 1}^{K} \lambda_k\, p_k(\mathbf{x}_i \mid \mathbf{h}_k, \theta_k)$$

Since the log-likelihood is a complicated function of the parameters, it may have many local optima. Stopping when the log-likelihood plateaus only means we have reached a local maximum. It's common to run the E-M routine multiple times from different random starting points and keep the parameters that give the maximum likelihood.

Here's the Julia code for estimating the parameters of the mixture. It's a straightforward translation of the math shown above.

#=
mix.jl

Expectation-Maximization for estimating parameters of a mixture of normals.

Author: Bill Thompson
Copyright: 2019
License: GPL
=#

using CSV
using Statistics
using LinearAlgebra
using Distributions
using PyPlot

#=
log_score - calculate the log-likelihood of a single data point given a set of parameters
arguments:
x - a data point, an m-element vector
mu - means, matrix of normal means K x m, where m is the number of dimensions of space.
sigma2 - variance-covariance matrices, array of variances m x m x K
lambda - mixture weights, requires: sum(lambda) = 1
K - number of mixture components
returns:
log(s) - the log-likelihood of x
=#
function log_score(x::Array{Float64, 1},
                   mu::Array{Float64, 2},
                   sigma2::Array{Float64, 3},
                   lambda::Array{Float64, 2},
                   K::Int) :: Float64
    s = 0.0
    for k in 1:K
        mv = MvNormal(mu[k, :], sigma2[:, :, k])
        s += lambda[k] * pdf(mv, x)
    end

    return log(s)
end

#=
em - the expectation-maximization routine
arguments:
X - a N x m matrix of data values, where m is the dimension of the data space
K - number of mixture components
max_iter - the maximum allowed number of em iterations
returns:
mu - the estimated means of the normal distributions, a K X m matrix
sigma2 - the estimated variance-covariance matrix for each component
         an m X m X K matrix
lambda - the mixture weights
ll - the log-likelihood of the data given the estimated parameters
=#
function em(X::Array{Float64, 2};
            K::Int = 2,
            max_iter::Int = 100) :: Tuple{Array{Float64, 2},
                                          Array{Float64, 3},
                                          Array{Float64, 2},
                                          Float64} 
    tol = 1.0e-05   # the minimum log-likelihood difference between iterations
    ll_prev = -Inf
    ll = -Inf      # log likelihood
    N, m = size(X) 

    # initialization - random mu, default sigma2
    colsum = zeros(K)
    lambda = rand(K)
    lambda /= sum(lambda)

    mu = Array{Float64}(rand(K) * mean(X, dims=1))
    sigma2 = zeros(m, m, K)
    for k in 1:K
        sigma2[:, :, k] = Matrix{Float64}(I, m, m)
    end

    iter = 0
    while true
        # expectation
        w = zeros(N, K)  # probabilty of each data point being
                         # generated by each component
        for k in 1:K
            mv = MvNormal(mu[k, :], sigma2[:, :, k])
            for i in 1:N
                w[i, k] = pdf(mv, X[i, :]) * lambda[k]
            end
        end
        w = w ./ sum(w, dims = 2)  # normalize

        # maximize
        colsum = sum(w, dims=1)   # expected number data points generated by
                                  # each component
        lambda = colsum / sum(colsum) # expected component weights

        # estimate mu
        for k in 1:K
            s = zeros(m)
            for i in 1:N
                s[:] += w[i, k] * X[i, :]
            end
            mu[k, :] = s / colsum[k]
        end

        # estimate sigma2
        for k in 1:K
            s = zeros(m, m)
            for i in 1:N
                d = reshape(X[i, :] - mu[k, :], m, 1)
                s[:, :] += w[i, k] * (d * d')
            end
            sigma2[:, :, k] = s / colsum[k]
        end

        # calculate average log-likelihood
        ll = 0
        for i in 1:N
            ll += log_score(X[i, :], mu, sigma2, lambda, K)
        end
        ll /= size(X)[1]

        # debugging
#        println(mu)
#        println(sigma2)
#        println(K, " ", lambda)
        println(iter, " ", ll)

        # are we done?
        if abs(ll - ll_prev) < tol
            break
        end
        ll_prev = ll

        iter += 1
        if iter >= max_iter
            break
        end        
    end

    return mu, sigma2, lambda, ll
end

#=
plotContours - plot a contour map of the estimated data log-likelihood 
arguments:
data - a N x m matrix of data values, where m is the dimension of the data space
components - an array of length N, component[i] = k means that samples[i, :] was drawn from
             normal component k. 
mu - the estimated means of the normal distributions, a K X m matrix
sigma2 - the estimated variance-covariance matrix for each component
         an m X m X K matrix
lambda - the mixture weights
K - number of mixture components
plotIt - a boolean, if true display plot
requires: PyPlot
=#
function plotContours(plot_file::String,
                      data::Array{Float64,2},
                      components::Array{Int64,1},
                      mu::Array{Float64, 2},
                      sigma2::Array{Float64, 3},
                      lambda::Array{Float64, 2};
                      K::Int = 2,
                      plotIt = true)
    # scatter plot of the data
    f = figure()
    for i in 1:K
        ndx = findall(x -> x == i, components)
        s = data[ndx, :]
        X = s[:, 1]
        Y = s[:, 2]
        scatter(X, Y, label="component " * "$i")
    end
    legend(loc="upper right")

    # plot log-likelihood contours
    n = 100
    x = range(minimum(data[:, 1]) - 0.5, maximum(data[:, 1]) + 0.5, length=n)
    y = range(minimum(data[:, 2]) - 0.5, maximum(data[:, 2]) + 0.5, length=n)

    xgrid = repeat(x',n,1)
    ygrid = repeat(y,1,n)

    z = zeros(n,n)

    for i in 1:n
        for j in 1:n
            z[i, j] = log_score([x[j], y[i]], mu, sigma2, lambda, K)
        end
    end

    contour(xgrid, ygrid, z, colors="black")

    savefig(plot_file)

    if plotIt
        plt[:show]() # show the plot on the screen
    end
end

#=
main - set up, run e-m, and plot
       run e-m 10 times and display the estimate with
       the maximum log-likelihood
=#
function main()
    plot_file = "em_plot.png" # plot output
    data_file = "data.csv"    # data file
    num_runs = 10
    K = 3                     # number of mixture components

    # read the data
    # format is component, X, Y
    table = CSV.read(data_file, header=0)
    X = convert(Matrix{Float64}, table[:, [2,3]])
    components = convert(Array{Int64}, table[:, 1])

    # setup for e-m
    n, m = size(X)
    mu = Array{Float64}(rand(K) * mean(X, dims=1))
    sigma2 = zeros(m, m, K)
    for k in 1:K
        sigma2[:, :, k] = Matrix{Float64}(I, m, m)
    end
    lambda = zeros(K)
    ll = -Inf
    for i in 1:num_runs   # run 10 times
        println("run: ", i)
        mu_temp, sigma2_temp, lambda_temp, ll_temp = em(X, K=K, max_iter=100)
        if ll_temp > ll
            ll = ll_temp
            mu = copy(mu_temp)
            sigma2 = copy(sigma2_temp)
            lambda = copy(lambda_temp)
        end
        println()
    end

    # display and plot
    println("Mu = ", mu)
    println("\nsigma^2:")
    println(sigma2)
    println("\nlambda = ", lambda)
    println("\nlog likelihood = ", ll)

    plotContours(plot_file, X, components, mu, sigma2, lambda, K=K)
end

@time main()

Using the data plotted above, the program estimated these values for the parameters

$$\boldsymbol{\lambda} = \begin{pmatrix} 0.33 & 0.40 & 0.27 \end{pmatrix}$$
$$\boldsymbol{\mu} = \begin{pmatrix} 3.90 & -0.29 \\ 3.11 & 2.95 \\ 1.14 & 1.22 \end{pmatrix}$$
$$\boldsymbol{\sigma} = \left( \begin{pmatrix} 0.80 & 0.16 \\ 0.16 & 0.67 \end{pmatrix},\; \begin{pmatrix} 1.28 & -0.76 \\ -0.76 & 1.00 \end{pmatrix},\; \begin{pmatrix} 0.63 & 0.38 \\ 0.38 & 0.88 \end{pmatrix} \right)$$
$$ll = -3.45$$

Not too bad. Using the original parameters yields an average log likelihood of $ll = -3.50$. Given the small size of the original sample from the distribution, it's not surprising that EM yields slightly different parameters and a higher log likelihood.

Here's the contour plot of the log likelihood.

The E-M method gives us what are essentially point estimates of the parameters; it doesn't tell us anything about the uncertainty of those estimates. For that, we would need a Bayesian method such as Gibbs sampling. Secondly, we need to provide the E-M method with a good estimate of the number of components. In this simple example, we knew there were three, but in more realistic cases, some trial and error might be involved. Finally, there's nothing in what we have seen that is unique to normal distributions. In our example we knew that the data were drawn from a mixture of normals, but if we have strong reason to believe the data come from some other distribution, we can substitute that distribution and replace the maximization step with the appropriate maximum likelihood estimate.
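For instance, if the data were counts and we suspected a mixture of Poisson distributions (a hypothetical variation, not implemented here), the E-step would use the Poisson pmf in place of the normal pdf, and the M-step MLE for each component's rate is just a weighted mean:

using Distributions

# M-step for a hypothetical Poisson mixture: x is a vector of counts,
# w is the N x K weight matrix from the E-step
function poisson_m_step(x::Vector{Int}, w::Array{Float64, 2})
    Nk = vec(sum(w, dims=1))     # expected points per component
    lambda = Nk / length(x)      # mixture weights, same update as before
    rate = (w' * x) ./ Nk        # weighted-mean MLE of each Poisson rate
    return lambda, rate
end

# the corresponding E-step weight would use lambda[k] * pdf(Poisson(rate[k]), x[i])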

The code and data are available for download from here.
