Algorithm::ExpectationMaximization

A pure-Perl implementation for clustering numerical multi-dimensional data with the Expectation-Maximization algorithm.

Algorithm::ExpectationMaximization Ranking & Summary


  • License: Perl Artistic License
  • Price: FREE
  • Publisher Name: Avinash Kak
  • Publisher web site: http://purdue.edu


Algorithm::ExpectationMaximization Description

Algorithm::ExpectationMaximization is a Perl module for the Expectation-Maximization (EM) method of clustering numerical data that lends itself to modeling as a Gaussian mixture. Since the module is entirely in Perl (in the sense that it is not a Perl wrapper around a C library that actually does the clustering), the code in the module can easily be modified to experiment with several aspects of EM.

Gaussian Mixture Modeling (GMM) is based on the assumption that the data consists of K Gaussian components, each characterized by its own mean vector and its own covariance matrix. Obviously, given observed data for clustering, we do not know which of the K Gaussian components was responsible for any of the data elements. GMM also associates a prior probability with each Gaussian component. In general, these priors will also be unknown. So the problem of clustering consists of estimating the posterior class probability at each data element and also estimating the class priors. Once these posterior class probabilities and the priors are estimated with EM, we can use the naive Bayes' classifier to partition the data into disjoint clusters. Or, for "soft" clustering, we can find all the data elements that belong to a Gaussian component on the basis of the posterior class probabilities at the data elements exceeding a prescribed threshold.

If you do not mind the fact that it is possible for the EM algorithm to occasionally get stuck in a local minimum and, therefore, to produce a wrong answer even when you know the data to be perfectly multimodal Gaussian, EM is probably the most magical approach to clustering multidimensional data. Consider the case of clustering three-dimensional data. Each Gaussian cluster in 3D space is characterized by the following 10 variables: the 6 unique elements of the 3x3 covariance matrix (which must be symmetric positive-definite), the 3 unique elements of the mean, and the prior associated with the Gaussian. Now let's say you expect to see six Gaussians in your data. That means you would want the values for 59 variables (remember the unit-summation constraint on the class priors, which reduces the overall number of variables by one) to be estimated by the algorithm that seeks to discover the clusters in your data. What's amazing is that, despite the large number of variables being simultaneously optimized, the chances are that the EM algorithm will give you a good approximation to the right answer.

Clustering with the EM algorithm consists of iterating through the following two steps:

  1. The Expectation Step: In this step, we estimate the posterior class probabilities at each data element; these constitute our expectations regarding the class labels at each data element. Bayes' Rule is used to express the posterior probabilities in terms of the distribution functions for the individual Gaussians and the priors associated with them.

  2. The Maximization Step: Given the posterior class probabilities returned by the Expectation Step, we now construct the maximum likelihood estimates for the means, the covariances, and the priors associated with the different Gaussians.

Note that the first step requires that we have available to us the class priors and the parameters for the different Gaussians. In other words, in addition to the class priors, we must have the estimates for the means and the covariances of the K different Gaussians in order to calculate the posterior class probabilities in the Expectation Step.
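In standard Gaussian-mixture notation (this formalization is mine, not taken from the module's documentation), one round of the two steps amounts to the following updates, where gamma_ik denotes the posterior probability that data element x_i came from component k:

  \text{E-step:}\quad
  \gamma_{ik} \;=\; \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

  \text{M-step:}\quad
  \pi_k \;=\; \frac{1}{N}\sum_{i=1}^{N} \gamma_{ik}, \qquad
  \mu_k \;=\; \frac{\sum_i \gamma_{ik}\, x_i}{\sum_i \gamma_{ik}}, \qquad
  \Sigma_k \;=\; \frac{\sum_i \gamma_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^{\mathsf T}}
                      {\sum_i \gamma_{ik}}

Each pass of these two updates never decreases the data log-likelihood, which is also why the algorithm can settle into a local optimum, as noted above.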
After the iterations have begun, that is exactly the information given to us by the Maximization Step. But what does one do for the very first invocation of the Expectation Step? That is, when the Expectation Step is invoked for the very first time, what Gaussian distributions should we use for the K different clusters? That takes us to the subject of "seeding" the EM algorithm.

This module provides three different choices for seeding the clusters: (1) random, (2) kmeans, and (3) manual. When random seeding is chosen, the algorithm randomly selects K data elements as cluster seeds. That is, the data vectors associated with these seeds are treated as initial guesses for the means of the Gaussian distributions. The covariances are then set to the values calculated from the entire dataset with respect to the means corresponding to the seeds. With kmeans seeding, on the other hand, the means and the covariances are set to whatever values are returned by the kmeans algorithm. And, when seeding is set to manual, you are allowed to choose K data elements --- by specifying their tag names --- for the seeds. The rest of the EM initialization for the manual mode is the same as for the random mode. The algorithm also allows the initial priors to be specified for the manual mode of seeding.

Much of the code for the kmeans-based seeding of EM was drawn from the Algorithm::KMeans module by me. The code from that module used here corresponds to the case when the cluster_seeding option in the Algorithm::KMeans module is set to smart. The smart option for KMeans consists of subjecting the data to a principal components analysis (PCA) to discover the direction of maximum variance in the data space. The data points are then projected onto this direction and a histogram constructed from the projections. Centers of the K largest peaks in this smoothed histogram are used to seed the KMeans-based clusterer. And, as you would expect, the output of KMeans is then used to seed the EM algorithm.

This module uses two different criteria to measure the quality of the clustering achieved. The first is the Minimum Description Length (MDL) proposed originally by Rissanen (J. Rissanen: "Modeling by Shortest Data Description," Automatica, 1978, and "A Universal Prior for Integers and Estimation by Minimum Description Length," Annals of Statistics, 1983). The MDL criterion is a difference of a log-likelihood term for all of the observed data and a model-complexity penalty term. In general, both the log-likelihood and the model-complexity terms increase as the number of clusters increases. The form of the MDL criterion in this module uses for the penalty term the Bayesian Information Criterion (BIC) of G. Schwarz ("Estimating the Dimension of a Model," The Annals of Statistics, 1978). In general, the smaller the value of the MDL quality measure, the better the clustering of the data.

For our second measure of clustering quality, we use `trace( SW^-1 . SB )', where SW is the within-class scatter matrix, more commonly denoted S_w, and SB is the between-class scatter matrix, more commonly denoted S_b (the underscore means subscript). This measure can be thought of as the normalized average distance between the clusters, the normalization being provided by the average cluster covariance SW^-1. Therefore, the larger the value of this quality measure, the better the separation between the clusters. Since this measure has its roots in the Fisher linear discriminant function, we incorporate the word fisher in the name of the quality measure.
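In symbols (these are the standard scatter-matrix definitions; the module's documentation may weight or normalize them somewhat differently), with \mu the overall data mean and \pi_k the cluster priors:

  S_W \;=\; \sum_{k=1}^{K} \pi_k\, \Sigma_k, \qquad
  S_B \;=\; \sum_{k=1}^{K} \pi_k\, (\mu_k - \mu)(\mu_k - \mu)^{\mathsf T}, \qquad
  \text{fisher} \;=\; \operatorname{trace}\!\bigl( S_W^{-1} S_B \bigr)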
Note that this measure is good only when the clusters are disjoint. When the clusters exhibit significant overlap, the numbers produced by this quality measure tend to be generally meaningless.

Every iterative algorithm requires a stopping criterion. The criterion implemented here is that we stop iterating when there is virtually no change in the values for the class priors as calculated by the Maximization Step.

SYNOPSIS

  use Algorithm::ExpectationMaximization;

  # First name the data file:

  my $datafile = "mydatafile1.dat";

  # Next, set the mask to indicate which columns of the datafile to use for
  # clustering and which column contains a symbolic ID for each data record. For
  # example, if the symbolic name is in the first column, you want the second
  # column to be ignored, and you want the next three columns to be used for
  # 3D clustering:

  my $mask = "N0111";

  # Now construct an instance of the clusterer. The parameter `K' controls the
  # number of clusters. Here is an example call to the constructor for instance
  # creation:

  my $clusterer = Algorithm::ExpectationMaximization->new(
                          datafile          => $datafile,
                          mask              => $mask,
                          K                 => 3,
                          max_em_iterations => 300,
                          seeding           => 'random',
                          terminal_output   => 1,
                          debug             => 0,
                  );

  # Note the choice for `seeding'. The choice `random' means that the clusterer
  # will randomly select `K' data points to serve as initial cluster centers.
  # Other possible choices for the constructor parameter `seeding' are `kmeans'
  # and `manual'. With the `kmeans' option for `seeding', a K-means clustering
  # is used for the cluster seeds and the initial cluster covariances. If you
  # use the `manual' option for seeding, you must also specify the data
  # elements to use for seeding the clusters.

  # Here is an example of a call to the constructor when we choose the `manual'
  # option for seeding the clusters and for specifying the data elements for
  # seeding. The data elements are specified by their tag names. In this case,
  # these names are `a26', `b53', and `c49':

  my $clusterer = Algorithm::ExpectationMaximization->new(
                          datafile          => $datafile,
                          mask              => $mask,
                          class_priors      => [0.6, 0.2, 0.2],   # illustrative values; must sum to 1
                          K                 => 3,
                          max_em_iterations => 300,
                          seeding           => 'manual',
                          seed_tags         => ['a26', 'b53', 'c49'],
                          terminal_output   => 1,
                          debug             => 0,
                  );

  # This example call to the constructor also illustrates how you can inject
  # class priors into the clustering process. The class priors are the prior
  # probabilities of the class distributions in your dataset. As explained
  # later, injecting class priors in the manner shown above makes statistical
  # sense only for the case of manual seeding. When you do inject class priors,
  # the order in which the priors are expressed must correspond to the manually
  # specified seeds for the clusters.

  # After the invocation of the constructor, the following calls are mandatory
  # for reasons that should be obvious from the names of the methods:

  $clusterer->read_data_from_file();
  srand(time);
  $clusterer->seed_the_clusters();
  $clusterer->EM();
  $clusterer->run_bayes_classifier();
  my $clusters = $clusterer->return_disjoint_clusters();

  # where the call to `EM()' is the invocation of the expectation-maximization
  # algorithm. The call to `srand(time)' is to seed the pseudo-random number
  # generator afresh for each run of the cluster seeding procedure. If you want
  # to see repeatable results from one run to another of the algorithm with
  # random seeding, you would obviously not invoke `srand(time)'.

  # The call `run_bayes_classifier()' shown above carries out a disjoint
  # clustering of all the data points using the naive Bayes' classifier. And
  # the call `return_disjoint_clusters()' returns the clusters thus formed to
  # you. Once you have obtained access to the clusters in this manner, you can
  # display them in your terminal window by

  foreach my $index (0..@$clusters-1) {
      print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n";
  }

  # If you would like to also see the clusters purely on the basis of the
  # posterior class probabilities exceeding a threshold, call

  my $theta1 = 0.5;
  my $posterior_prob_clusters =
      $clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);

  # where you can obviously set the threshold $theta1 to any value you wish.
  # Note that now you may end up with clusters that overlap. You can display
  # them in your terminal window in the same manner as shown above for the
  # naive Bayes' clusters.

  # You can write the naive Bayes' clusters out to files, one cluster per file,
  # by calling

  $clusterer->write_naive_bayes_clusters_to_files();

  # The clusters are placed in files with names like

  #     naive_bayes_cluster1.dat
  #     naive_bayes_cluster2.dat
  #     ...

  # In the same manner, you can write out the posterior probability based,
  # possibly overlapping clusters to files by calling:

  $clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);

  # where the threshold $theta1 sets the probability threshold for deciding
  # which data elements to place in a cluster. These clusters are placed in
  # files with names like

  #     posterior_prob_cluster1.dat
  #     posterior_prob_cluster2.dat
  #     ...

  # CLUSTER VISUALIZATION:

  # You must first set the mask for cluster visualization. This mask tells the
  # module which 2D or 3D subspace of the original data space you wish to
  # visualize the clusters in:

  my $visualization_mask = "111";
  $clusterer->visualize_clusters($visualization_mask);
  $clusterer->visualize_distributions($visualization_mask);
  $clusterer->plot_hardcopy_clusters($visualization_mask);
  $clusterer->plot_hardcopy_distributions($visualization_mask);

  # where the last two invocations are for writing out the PNG plots of the
  # visualization displays to disk files. The PNG image of the posterior
  # probability distributions is written out to a file named
  # posterior_prob_plot.png and the PNG image of the disjoint clusters to a
  # file called cluster_plot.png.

  # SYNTHETIC DATA GENERATION:

  # The module has been provided with a class method for generating
  # multivariate data for experimenting with the EM algorithm. The data
  # generation is controlled by the contents of a parameter file that is
  # supplied as an argument to the data generator method. The priors, the
  # means, and the covariance matrices in the parameter file must be according
  # to the syntax shown in the `param1.txt' file in the `examples' directory.
  # It is best to edit a copy of this file for your synthetic data generation
  # needs.

  my $parameter_file = "param1.txt";
  my $out_datafile   = "mydatafile1.dat";
  Algorithm::ExpectationMaximization->cluster_data_generator(
                          input_parameter_file        => $parameter_file,
                          output_datafile             => $out_datafile,
                          total_number_of_data_points => $N );

  # where the value of $N is the total number of data points you would like to
  # see generated for all of the Gaussians. How this total number is divided up
  # amongst the Gaussians is decided by the prior probabilities for the
  # Gaussian components as declared in the input parameter file. The synthetic
  # data may be visualized in a terminal window and the visualization written
  # out as a PNG image to a disk file by

  my $data_visualization_mask = "11";
  $clusterer->visualize_data($data_visualization_mask);
  $clusterer->plot_hardcopy_data($data_visualization_mask);
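Putting the calls above together, here is a hypothetical end-to-end run. It assumes that param1.txt describes two-dimensional Gaussians (so the "11" visualization mask applies), that the generated datafile has the symbolic tag in its first column followed by two numeric columns (hence the "N11" mask), and an arbitrary 1200 data points; adjust those assumptions to your own data:

  use strict;
  use warnings;
  use Algorithm::ExpectationMaximization;

  # Generate synthetic data from the parameter file, then cluster the
  # generated file with EM using random seeding.
  Algorithm::ExpectationMaximization->cluster_data_generator(
                          input_parameter_file        => "param1.txt",
                          output_datafile             => "mydatafile1.dat",
                          total_number_of_data_points => 1200 );

  my $clusterer = Algorithm::ExpectationMaximization->new(
                          datafile          => "mydatafile1.dat",
                          mask              => "N11",   # assumed: tag column + two numeric columns
                          K                 => 3,
                          max_em_iterations => 300,
                          seeding           => 'random',
                          terminal_output   => 1,
                          debug             => 0,
                  );

  $clusterer->read_data_from_file();
  srand(time);                          # omit for repeatable random seeding
  $clusterer->seed_the_clusters();
  $clusterer->EM();
  $clusterer->run_bayes_classifier();

  # Disjoint clusters from the naive Bayes' classification step:
  my $clusters = $clusterer->return_disjoint_clusters();
  foreach my $index (0..@$clusters-1) {
      print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n";
  }

  # Write the clusters to naive_bayes_cluster*.dat files and save a 2D
  # hardcopy plot of the clusters as a PNG image.
  $clusterer->write_naive_bayes_clusters_to_files();
  $clusterer->plot_hardcopy_clusters("11");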

