Class AlgorithmKMeans

  • All Implemented Interfaces:
    java.awt.event.ActionListener, java.awt.event.WindowListener, java.lang.Runnable, java.util.EventListener

    public class AlgorithmKMeans
    extends AlgorithmBase
    This program can be run on data of any dimensionality. However, VOIs will only be placed in images for 2D and 3D data. This program can be used for one dimensional segmenting of black and white images or two dimensional segmenting of color images.

    The data file format is as follows:
      # starts a comment line
      nDims, the number of dimensions
      extents[0] ... extents[nDims-1], the length of each dimension
      scale[0] ... scale[nDims-1], the scale factor of each dimension
      nPoints, the number of points
      For each point, a line of the form: weight pos[0] ... pos[nDims-1]

    The program can be run to find mean centroids using Euclidean squared distances, median centroids using city-block or Manhattan distances, or mean centroids using Mahalanobis squared distances. Using medians with city-block distances is probably best in cases of noise and outlier data points, since a few outliers can greatly affect the mean value. Median components are found separately for each dimension.

    determinant(withinGroupSumOfSquaresMatrix) = det(W) is minimized by minimizing the Mahalanobis squared distance, which equals (distance vector)' * (withinGroupSumOfSquaresMatrix).inverse() * (distance vector), where the distance is between a centroid and a point. However, the withinGroupSumOfSquaresMatrix is not initially known, so the Mahalanobis squared distance cannot be used at the start; initialization is performed with the Euclidean squared distance in order to obtain a withinGroupSumOfSquaresMatrix. Minimizing det(W) has the advantage of being scale independent.

    The Euclidean squared metric is scale-dependent: different solutions may be obtained from the raw data and from the data scaled in some way. The Euclidean squared method does not work well in finding clusters with nonconvex shapes or very different sizes, and it may impose a spherical structure on the observed clusters even when they are of other shapes. For example, the Euclidean squared measure may fail on 2 well separated elliptical clusters. The Euclidean metric also tends to find groups with similar numbers of points. The Mahalanobis metric minimizing det(W) can find ellipsoidal distributions with the same size and orientation.

    With the Euclidean distance, incremental processing, which changes 1 point at a time, follows batch processing, which changes all points at once. The batch algorithm is faster, but the incremental algorithm is more accurate. In an air pollution study with 4 clusters and scaled variance, the incremental pass improved the hierarchical and fast global results and left the max-min and global results unchanged. With city block, the incremental pass decreased the Calinski and Harabasz figure of merit for the air pollution study. With Mahalanobis, the incremental pass was too slow for a large number of points and decreased the Calinski and Harabasz figure of merit for the air pollution study.

    The S metric minimizes S = sum over all clusters of Sk. Let Wk be the within cluster sum of squares of cluster k, and let w1k, w2k, ..., wnDimsk be the eigenvalues, in descending order, from the eigenvalue decomposition of Wk. The clusters must be ellipsoidal shapes of the same size and different orientations. Specify a known ellipsoidal shape that applies to all the clusters in the distribution by giving the ratio of each ellipsoidal axis to the largest ellipsoidal axis, in descending order, so that the first ratio, of the largest axis to itself, is 1.0. Then Sk = w1k/(ellipsoidal axis ratio 1)**2 + w2k/(ellipsoidal axis ratio 2)**2 + ... + wnDimsk/(ellipsoidal axis ratio nDims)**2.
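    As a hedged illustration of the S metric, the sketch below computes Sk for one cluster from the eigenvalues of Wk and the user-supplied axes ratios; the method name and signature are hypothetical, not part of this class.

        // Hypothetical sketch: Sk for one cluster under the S metric.
        // eigenvalues holds the eigenvalues of Wk in descending order;
        // axesRatio holds the ratio of each ellipsoidal axis to the
        // largest axis in descending order, so axesRatio[0] == 1.0.
        private static double computeSk(double[] eigenvalues, double[] axesRatio) {
            double sk = 0.0;
            for (int i = 0; i < eigenvalues.length; i++) {
                sk += eigenvalues[i] / (axesRatio[i] * axesRatio[i]);
            }
            return sk;
        }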
    Minimize S by going through all the points 1 at a time and, for each point, checking which cluster yields the lowest value of S: for each point, test its possible removal from its present cluster and its possible addition to every other cluster. After doing all the points, keep performing more iterations over all the points until no cluster changes occur. Since the withinGroupSumOfSquares of each cluster is not initially known, initialization is performed with the Euclidean metric.

    Minimizing the sum over all clusters k of nk*log(tr(Wk/nk)) finds spherical distributions of different sizes. Minimizing S* = sum over all clusters k of nk*log(Sk/nk) finds ellipsoidal shapes of different sizes and orientations. As with the S metric, the axes ratios of the ellipsoidal axes must be known and specified. S* could not be made to work.

    The dialog checkbox Scale variables to unit variance allows combining variables that use different scales, such as temperature in degrees Fahrenheit and wind speed in miles per hour.

    With a small number of records, it is feasible to use RANDOM_INIT to perform multiple restarts efficiently. With a small sample size, the subsampling for initialization is not effective in the Bradley-Fayyad initialization; the Bradley-Fayyad method is best suited for large-scale data. By initializing a general clustering algorithm near the modes, not only are the true clusters found more often, but the clustering algorithm iterates fewer times before convergence. The k-means algorithm will run differently on each run when random initialization or Bradley-Fayyad initialization is used, since random number generation is used in selection. The k-means algorithm will run the same on each run when hierarchical grouping initialization or maxmin initialization is used. Global k-means and fast global k-means will run the same on each run. Initialization methods are only chosen with k-means; initialization methods are not chosen with global k-means or with fast global k-means.

    In maxmin initialization, first select the 2 points that are farthest apart as seeds. For each additional seed, find the minimum distance from each nonseed point to any of the seed points; the new seed point is the point that yields the maximum of these minimum distances (a sketch of this seeding appears after this description).

    In hierarchical grouping, all n points are initially considered as n groups of 1 member each. The number of groups is decreased by 1 with each iteration, from n to n-1 to n-2, until the desired number of groups has been obtained. At each step, the sum over all groups of the error sum of squares of each group is minimized. The error sum of squares of a group of n members is (sum from i = 1 to n of (xi - xaverage)**2 + (yi - yaverage)**2 + (zi - zaverage)**2), where xaverage, yaverage, and zaverage are taken over all members of 1 group. With n groups, the n*(n-1)/2 possible groupings to go from n to n-1 groups are tried to find the n-1 grouping with the lowest sum over all groups of the error sum of squares.

    In Bradley-Fayyad initialization, k-means is performed on 10 subsamples, each with 1/10 of all the points, chosen at random. The initial centroids are chosen at random, and then 1/10 of the data points are selected at random. When a subsample k-means finishes, if a centroid does not have any members, the initial centroid is reassigned to the point which is farthest from its assigned cluster centroid, and k-means is run again from these new initial centroids.
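    The maxmin seeding described above admits a direct implementation. The following is a minimal sketch, assuming points are stored as points[nPoints][nDims]; the method names are hypothetical, and squared Euclidean distances are used for the comparisons.

        // Hypothetical sketch of maxmin seeding: the first two seeds are
        // the pair of points farthest apart; each additional seed is the
        // point whose minimum distance to the existing seeds is largest.
        private static int[] maxminSeeds(double[][] points, int numberClusters) {
            int n = points.length;
            int[] seeds = new int[numberClusters];
            // Find the two points with the largest pairwise distance.
            double maxDist = -1.0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double d = squaredDistance(points[i], points[j]);
                    if (d > maxDist) {
                        maxDist = d;
                        seeds[0] = i;
                        seeds[1] = j;
                    }
                }
            }
            // minDist[p] = distance from point p to its nearest seed so far.
            double[] minDist = new double[n];
            for (int p = 0; p < n; p++) {
                minDist[p] = Math.min(squaredDistance(points[p], points[seeds[0]]),
                                      squaredDistance(points[p], points[seeds[1]]));
            }
            for (int s = 2; s < numberClusters; s++) {
                int best = -1;
                double bestDist = -1.0;
                for (int p = 0; p < n; p++) {
                    if (minDist[p] > bestDist) {
                        bestDist = minDist[p];
                        best = p;
                    }
                }
                seeds[s] = best;
                for (int p = 0; p < n; p++) {
                    minDist[p] = Math.min(minDist[p], squaredDistance(points[p], points[best]));
                }
            }
            return seeds;
        }

        // Squared Euclidean distance between two points.
        private static double squaredDistance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum;
        }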
    The final centroids cmi from each of the 10 subsample runs are grouped into a common set cm. 10 k-means runs are performed on this common cm centroid set, each starting with a cmi from a subsample run. Let the final resulting centroids from each of these 10 runs be fmi. Then the fmi which has the lowest sum over all groups of the error sum of squares of each group is chosen as the initial starting centroids for the full sample run.

    Global k-means solves a problem with m clusters by sequentially solving all intermediate problems with 1, 2, ..., m-1 clusters. At the start of a run with m clusters, the first m-1 clusters are put at positions corresponding to their final locations in the m-1 cluster problem. For global k-means, at every intermediate stage the one unassigned cluster is started at locations corresponding to every data point in the image, the clustering error is calculated for a full k-means run, and the run with the lowest clustering error is taken as the solution. For fast global k-means, at each intermediate step only 1 full k-means run is performed instead of having the number of k-means runs equal the number of data points in the image. The point which provides the largest guaranteed reduction in the error measure is used as the new cluster starting location. That is, fast global k-means finds the point xn that maximizes bn, the guaranteed reduction in clustering error obtained by inserting a new cluster center at position xn. Let there be N points. Then bn = sum from j = 1 to N of max(d(j,k-1)**2 - ||xn - xj||**2, 0), where d(j,k-1) is the distance between xj and the closest center among the k-1 cluster centers obtained so far. Fast global k-means is not used for the city-block measure. A sketch of this selection step appears after this description.

    From the Celebi reference: Consider a point xi, two cluster centers ca and cb, and a distance metric d. Using the triangle inequality, we have d(ca,cb) <= d(xi,ca) + d(xi,cb). Therefore, if 2*d(xi,ca) <= d(ca,cb), then d(xi,ca) <= d(xi,cb), and d(xi,cb) need not be computed.

    Beale's rule decides that a subdivision into g2 clusters is significantly better than a subdivision into g1 clusters (g2 > g1) if the test statistic F(g1,g2) = ((SD1 - SD2)/SD2)/([(n-g1)/(n-g2)]*((g2/g1)**(2/nDims)) - 1) exceeds the critical value from an F distribution with nDims*(g2 - g1) and nDims*(n - g2) degrees of freedom, where SD1 and SD2 are the sums of squared deviations from the cluster centroids with g1 and g2 clusters respectively (a sketch of evaluating this statistic appears after the references). Experience with Beale's rule suggests that it will only be successful when the clusters are fairly well separated and approximately spherical in shape.

    Marriott proposed finding the ideal number of groups g by finding the minimum of g*g*determinant(withinGroupSumOfSquaresMatrix), where the matrix is nDims by nDims. For unimodal distributions the minimum value is likely to be g = 1. For strongly grouped distributions the minimum will indicate the appropriate value of g. For a uniform distribution the value will be constant over different group numbers. If g*g*determinant(withinGroupSumOfSquaresMatrix)/determinant(totalSumOfSquaresMatrix) > 1 for all numbers of clusters, then the distribution should be regarded as a single group.
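    A minimal sketch of the fast global k-means selection step, assuming the same points[nPoints][nDims] layout and the squaredDistance helper from the maxmin sketch above; dClosestSquared[j] is assumed to hold d(j,k-1)**2 for each point.

        // Hypothetical sketch: choose the next fast global k-means
        // insertion point.  Returns the index n maximizing
        // bn = sum_j max(d(j,k-1)**2 - ||xn - xj||**2, 0).
        private static int bestInsertionPoint(double[][] points, double[] dClosestSquared) {
            int bestIndex = -1;
            double bestBn = -1.0;
            for (int n = 0; n < points.length; n++) {
                double bn = 0.0;
                for (int j = 0; j < points.length; j++) {
                    double reduction = dClosestSquared[j] - squaredDistance(points[n], points[j]);
                    if (reduction > 0.0) {
                        bn += reduction;  // only guaranteed reductions count
                    }
                }
                if (bn > bestBn) {
                    bestBn = bn;
                    bestIndex = n;
                }
            }
            return bestIndex;
        }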
"The global k-means clustering algorithm" by Aristidis Likas, Nikos Vlassis, and Jakob J. Verbeek, Pattern Recognition, Volume 36, 2003, pp. 451-461. 6.) Data Mining Concepts and Techniques by Jiawei Han and Micheline Kamber, Section 8.4 Partitioning Methods, Morgan Kaufmann Publishers, 2001, pp. 348-353. 7.) Cluster Analysis Fourth Edition by Brian S. Everitt, Sabine Landau, and Morven Leese, Arnold Publishers, 2001, Chapter 5, Optimization Clustering Techniques, pp. 90-117. 8.) "Practical Problems in Cluster Analysis" by F.H.C. Marriott, Biometrics, Vol. 27, September, 1971, pp. 501-514. 9.) Multivariate Observations by G. A. F. Seber, John Wiley & Sons, INc., 1984, Section 7.5 Partitioning Methods, pp.379-387. 10.) "Model-Based Gaussian and Non-Gaussian Clustering" by Jeffrey D. Banfield and Adrian E. Raftery, Biometrics, Vol. 49, No. 3., September, 1993, pp. 803-821. 11.) "Introduction to Clustering Large and High-Dimensional Data" by Jacob Kogan, Cambridge University Press, 2007, Chapter 2 Quadratic k-means algorithm, pp. 9-39.
    • Field Detail

      • HIERARCHICAL_GROUPING_INIT

        public static final int HIERARCHICAL_GROUPING_INIT
        See Also:
        Constant Field Values
      • SPHERES_DIFFERENT_SIZES

        public static final int SPHERES_DIFFERENT_SIZES
        See Also:
        Constant Field Values
      • algoSelection

        private int algoSelection
      • scale

        private double[] scale
      • pos

        private double[][] pos
      • groupNum

        private int[] groupNum
      • weight

        private double[] weight
      • centroidPos

        private double[][] centroidPos
      • resultsFileName

        private java.lang.String resultsFileName
      • initSelection

        private int initSelection
      • redBuffer

        private float[] redBuffer
      • greenBuffer

        private float[] greenBuffer
      • blueBuffer

        private float[] blueBuffer
      • scaleMax

        private double scaleMax
      • useColorHistogram

        private boolean useColorHistogram
      • distanceMeasure

        private int distanceMeasure
      • scaleVariablesToUnitVariance

        private boolean scaleVariablesToUnitVariance
      • axesRatio

        private double[] axesRatio
      • bwSegmentedImage

        private boolean bwSegmentedImage
      • doubleBuffer

        private double[] doubleBuffer
      • scale2

        private double[] scale2
      • equalScale

        private boolean equalScale
      • nDims

        private int nDims
      • nPoints

        private int nPoints
      • numberClusters

        private int numberClusters
      • totalWeight

        private double[] totalWeight
      • centroidDistances

        private double[][] centroidDistances
      • groupMean

        private double[][] groupMean
      • groupStdDev

        private double[][] groupStdDev
      • showSegmentedImage

        private boolean showSegmentedImage
      • followBatchWithIncremental

        private boolean followBatchWithIncremental
      • colorSegmentInRGB

        private boolean colorSegmentInRGB
    • Constructor Detail

      • AlgorithmKMeans

        public AlgorithmKMeans(ModelImage image,
                               int algoSelection,
                               int distanceMeasure,
                               double[][] pos,
                               double[] scale,
                               int[] groupNum,
                               double[] weight,
                               double[][] centroidPos,
                               java.lang.String resultsFileName,
                               int initSelection,
                               float[] redBuffer,
                               float[] greenBuffer,
                               float[] blueBuffer,
                               double scaleMax,
                               boolean useColorHistogram,
                               boolean scaleVariablesToUnitVariance,
                               double[] axesRatio,
                               boolean bwSegmentedImage,
                               double[] doubleBuffer,
                               boolean showSegmentedImage,
                               boolean followBatchWithIncremental,
                               boolean colorSegmentInRGB)
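        A hedged usage sketch follows. The argument values are illustrative assumptions: the array layouts pos[nDims][nPoints] and centroidPos[nDims][numberClusters] are inferred from the documented shape of getGroupMean, the selection codes are placeholders for the class constants, and passing null for the image and color buffers (as for point data read from a file) is an assumption. run() is assumed to start the algorithm, since AlgorithmBase implements Runnable.

        // Hypothetical usage sketch; values are assumptions, not defaults.
        int nDims = 2, nPoints = 100, numberClusters = 3;
        double[][] pos = new double[nDims][nPoints];          // point data, filled elsewhere
        double[] scale = {1.0, 1.0};                          // scale factor per dimension
        int[] groupNum = new int[nPoints];                    // receives each point's cluster
        double[] weight = new double[nPoints];
        java.util.Arrays.fill(weight, 1.0);                   // equal point weights
        double[][] centroidPos = new double[nDims][numberClusters];
        int algoSelection = 0;                                // placeholder: use the class constants
        int distanceMeasure = 0;                              // placeholder: use the class constants

        AlgorithmKMeans kMeans = new AlgorithmKMeans(null, algoSelection, distanceMeasure,
                pos, scale, groupNum, weight, centroidPos, "kmeans_results.txt",
                AlgorithmKMeans.HIERARCHICAL_GROUPING_INIT, null, null, null, 255.0,
                false, true, null, false, null, false, false, false);
        kMeans.run();
        double[][] groupMean = kMeans.getGroupMean();         // groupMean[nDims][numberClusters]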
    • Method Detail

      • finalize

        public void finalize()
        Prepares this class for destruction.
        Overrides:
        finalize in class AlgorithmBase
      • EuclideanSquared

        private void EuclideanSquared()
      • cityBlock

        private void cityBlock()
      • MahalanobisSquared

        private void MahalanobisSquared()
      • otherMetrics

        private void otherMetrics()
      • getGroupMean

        public double[][] getGroupMean()
        Returns:
        groupMean[nDims][numberClusters]
      • getGroupStdDev

        public double[][] getGroupStdDev()
        Returns: