java.lang.Object
- java.lang.Thread
- - gov.nih.mipav.model.algorithms.AlgorithmBase
  - - gov.nih.mipav.model.algorithms.AlgorithmKMeans

All Implemented Interfaces:

java.awt.event.ActionListener, java.awt.event.WindowListener, java.lang.Runnable, java.util.EventListener
```
public class AlgorithmKMeans
extends AlgorithmBase
```
This program can be run on data of any dimensionality. However, VOIs will only be placed in images for 2D and 3D data. This program can be used for one dimensional segmenting of black and white images or two dimensional segmenting of color images. The data file format is as follows: # is used to start any comment line number of dimensions nDims extents[0] ... extents[nDims-1], length of each dimension scale[0] ... scale[nDims-1], scale factor of each dimension nPoints, number of points For each point a line of the form weight pos[0] ... pos[nDims-1] This program can be run to find the mean of centroids using Euclidean squared distances, the program can be run to find the median of centroids using city-block or Manhattan distances, or the program can be run to find the mean of centroids using Mahalanobis squared distances. Using medians with city-block distances is probably best in cases of noise and outlier data points since a few outliers can greatly affect the mean value. Median components are found separately for each dimension. The determinant(withinGroupSumOfSquaresMatrix) = det(W) is minimized by minimizing the Mahalanobis squared distance, which equals the (distance vector)' * (withinGroupSumOfSquaresMatrix).inverse() * (distance vector). The distance is between a centroid and a point. However, we don't initially know the withinGroupSumOfSquaresMatrix so we can't initially use the Mahalanobis squared distance. We must initialize with the Euclidean squared distance in order to obtain a withinGroupSumOfSquaresDistance. Minimizing the det(W)has the advantage of being scale independent. The Euclidean squared metric is scale-dependent. Different solutions may be obtained from the raw data and from the data scaled in some way. The Euclidean squared method does not work well in finding clusters with nonconvex shapes or very different sizes. The Euclidean squared metric may impose a spherical structure on the observed clusters even when the observed clusters are of other shapes. For example, the Euclidean squared measure may fail on 2 well separated elliptical clusters. The Euclidean metric tends to find groups with similar numbers of points. The Mahalanobis metric minimizing det(W) can find ellipsoidal distributions with the same size and orientation. With the Euclidean distance incremental changing 1 point at a time follows batch processing which changes all points at once. The batch algorithm is faster but the incremental algorithm is more accurate. With air pollution for 4 clusters scaled variance the incremental improved hierarchical and fast global and left max-min and global unchanged. With city block the incremental decreased the Calinski and Harabasz figure of merit for the air pollution study. With Mahalanobis the incremental was too slow for a large number of points and decreased the Calinski and Harabasz figure of merit for the air pollution study. The S metric minimizes S = Sum over all clusters of Sk. Let Wk be the within cluster sum of squares of cluster k. w1k, w2k, ..., wnDimsk are the eigenvalues in descending order from the eigenvalue decomposition of Wk. The clusters must be ellipsoidal shapes of the same size and different orientations. Specify a known ellipsoidal shape that applies to all the clusters in the distribution. Put the axes ratio of all ellipsoidal axes to the largest ellipsoidal axis in descending order, where the first ratio of the largest axis to itself is 1.0. Sk = w1k/(ellipsoidal axis ratio 1)**2 + w2k/(ellipsoidal axis ratio 2)**2 + ... + wnDimsk/(ellipsoidal axis ratio nDims)**2. Minimize S by going thru all the points 1 at a time and for each point checking to see which cluster yields the lowest value of S. For each point we test its possible removal from its present cluster and its possible addition to every other cluster. After doing all the points, we keep doing more iterations of all the points until no cluster changes occur. Since we don't initially know the withinGroupSumOfSquares for each cluster, initialization is performed with the Euclidean metric. Minimizing the sum over all clusters k of nk*log(tr(Wk/nk)) finds spherical distributions of different sizes. Minimizing S* = sum over all k of nk * log(Sk/nk) finds ellipsoidal shapes of different sizes and orientations. As with the S metric the axes ratios of the ellipsoidal axes must be known and specified. Cannot get S* to work. The dialog checkbox Scale variables to unit variance allow combining variables using different scales such as temperature in degrees Fahrenheit and wind speed in miles per hour. With a small number of records, it is feasible to use RANDOM_INIT to perform multiple restarts efficiently. With a small sample size, the subsampling for initialization is not effective in the Bradley-Fayyad initialization. The Bradley-Fayyad method is best suited for large-scale data. By initializing a general clustering algorithm near the nodes, not only are the true clusters found more often, but it follows that the clustering algorithm will iterate fewer times prior to convergence. The k-means algorithm will run differently with each run when random initialization and Bradley Fayyad initialization are used since random number generation is used in selection. The k-means algorithm will run the same with each run when Hierarchical grouping initialization and maxmin initialization are used. Global k-means and fast global k-means will run the same with each run. Initialization methods are only chosen with k-means; initialization methods are not chosen with global K-means and with fast global k-means. In the maxmin initialization first select the 2 points that are farthest apart as seeds. For each additional seed find the minimum distance from each nonseed point to any of the seed points. The new seed point will be the point that yields the maximum of these minimum distances. In hierarchical grouping all n points are initially considered as n groups of 1 member. The number of groups is decreased by 1 with each iteration from n to n-1 to n-2 until the desired number of groups has been obtained. At each step the sum over all groups of the error sum of squares of each group is minimized. The error sum of squares of a group of n members is (sum from i = 1 to n of (xi - xaverage)**2 + (yi - yaverage)**2 + (zi - zaverage)**2) where xaverage, yaverage, and zaverage are for all members of 1 group. With n groups n*(n-1)/2 possible groupings to go from n to n-1 groups are tried to find the n-1 grouping with the lowest sum over all groups of the error sum of squares. In Bradley-Fayyad we perform k-means on 10 subsamples, each with 1/10 of all the points, chosen at random. The initial centroids are chosen at random and then 1/10 of the data points are selected at random. When a subsample k-means finishes, if a centroid does not have any members, then the initial centroid is reassigned to the point which is farthest from its assigned cluster centroid, and k-means is run again from these new initial centroids. The final centroids cmi from each of the 10 subsample runs are grouped into a common set cm. 10 k-means runs are performed on this common cm centroid set, each starting with a cmi from a subsample run. Let the final resulting centroids from each of these 10 runs be fmi. Then the fmi which has the lowest sum over all groups of the error sum of squares of each group will be chosen as the initial starting centroids for the full sample run. The global k-means problem solves a problem with m clusters by sequentially solving all intermediate problems with 1,2, ..., m-1 clusters. At the start of a run with m clusters the first m-1 clusters are put at positions corresponding to their final locations in the m-1 cluster problem. For global k-means at every intermediate stage the one unassigned cluster is started at locations corresponding to every data point in the image and the clustering error is calculated for a full k-means run. The run with the lowest clustering error is taken as the solution. For fast global k-means at each intermediate step only 1 full k-means run is performed instead of having the number of k-means runs equal the number of data points in the image. The point which provides the largest guaranteed reduction in the error measure is used as the new cluster starting location. In fast global k-means we wish to find the point xn that maximizes bn, the guaranteed reduction in clustering error obtained by inserting a new cluster center at position xn. Let there be N points. Then bn = sum from j = 1 to N of the max(d(j,k-1)**2 - ||xn -xj||**2,0) where d(j,k-1) is the distance between xj and the closest center among the k-1 cluster centers obtained so far. Fast global k-means is not used for the city-block measure. From the Celebi reference: Consider a point xi, two cluster centers ca and cb and a distance metric d. Using the triangle inequality, we have d(ca,cb) <= d(xi,ca) + d(xi,cb). Therefore, if we know that 2d(xi,ca) <= d(ca,cb), we can conclude that d(xi,ca) <= d(xi,cb) without having to calculate d(xi,cb). By doing runs with different number of clusters and seeing which run gives the highest Calinski and Harabasz figure of merit, the ideal number of clusters can be determined. Let B = the between groups sum of squares and W = the within group sum of squares. Then the Calinski and Harabasz figure of merit equals = (B/(number of clusters -1))/(W/(totalWeight - number of clusters), where the weight of each point equals 1 in an unweighted data set or the totalWeight = the number of points in an unweighted data set. The sum of squared deviations from cluster centroids from 2 different runs can be used in an 'F test' proposed by Beale. Let SD denote the sum of the squared deviations from cluster centroids in the sample. Then a divison of n objects into g2 clusters is significantly better than a division into g1 clusters (g2 > g1) if the test statistic F(g1,g2) = ((SD1 - SD2)/SD2)/([(n-g1)/(n-g2)]*((g2/g1)**(2/nDims)) - 1) exceeds the critical value from an F distribution with nDims*(g2 - g1) and nDims*(n - g2) degrees of freedom. Experience with Beale's rule suggest that it will only be successful when the clusters are fairly well separated and approximately spherical in shape. Marriott proposed finding the ideal number of groups g by finding the minimum of g*g*determinant(withinGroupSumOfSquaresMatrix), where the matrix is nDims by nDims. For unimodal distributions the minimum value is likely to be g = 1. For strongly grouped distributions the minimum will indicate the appropriate value of g. For a uniform distribution the value will be constant over different group numbers. If g*g*determinant(withinGroupSumOfSquaresMatrix)/determinant(totalSumOfSquaresMatrix) > 1 for all numbers of clusters, then the distribution should be regarded as a single group. References: 1.) "A systematic evaluation of different methods for initializing the K-means clustering algorithm" by Anna D. Peterson, Arka. P. Ghosh, and Ranjan Maitra, IEEE Transactions on Knowledge and Data Engineering 2.) "Refining Initial Points for K-Means Clustering" by P.S. Bradley and Usama M. Fayyad. 3.) "Hierarchical Grouping to Optimize an Objective Function" by Joe H. Ward, Jr., Journal of the American Statistical Association, Volume 58, Issue 301, March, 1963, pp. 236-244. 4.) "Improving the Performance of K-Means for Color Quantization" by M. Emre Celebi, Image and Vision Computing, Volume 29, No. 4, 2011, pp. 260-271. 5.) "The global k-means clustering algorithm" by Aristidis Likas, Nikos Vlassis, and Jakob J. Verbeek, Pattern Recognition, Volume 36, 2003, pp. 451-461. 6.) Data Mining Concepts and Techniques by Jiawei Han and Micheline Kamber, Section 8.4 Partitioning Methods, Morgan Kaufmann Publishers, 2001, pp. 348-353. 7.) Cluster Analysis Fourth Edition by Brian S. Everitt, Sabine Landau, and Morven Leese, Arnold Publishers, 2001, Chapter 5, Optimization Clustering Techniques, pp. 90-117. 8.) "Practical Problems in Cluster Analysis" by F.H.C. Marriott, Biometrics, Vol. 27, September, 1971, pp. 501-514. 9.) Multivariate Observations by G. A. F. Seber, John Wiley & Sons, INc., 1984, Section 7.5 Partitioning Methods, pp.379-387. 10.) "Model-Based Gaussian and Non-Gaussian Clustering" by Jeffrey D. Banfield and Adrian E. Raftery, Biometrics, Vol. 49, No. 3., September, 1993, pp. 803-821. 11.) "Introduction to Clustering Large and High-Dimensional Data" by Jacob Kogan, Cambridge University Press, 2007, Chapter 2 Quadratic k-means algorithm, pp. 9-39.

Nested Class Summary

Nested Classes
Modifier and Type Class Description

private class AlgorithmKMeans.positionWeightComparator

private class AlgorithmKMeans.positionWeightItem
- Nested classes/interfaces inherited from class java.lang.Thread
  java.lang.Thread.State, java.lang.Thread.UncaughtExceptionHandler

Field Summary

Fields
Modifier and Type	Field	Description
`private int`	`algoSelection`
`private double[]`	`axesRatio`
`private float[]`	`blueBuffer`
`static int`	`BRADLEY_FAYYAD_INIT`
`private boolean`	`bwSegmentedImage`
`private double[][]`	`centroidDistances`
`private double[][]`	`centroidPos`
`static int`	`CITY_BLOCK`
`private boolean`	`colorSegmentInRGB`
`private int`	`distanceMeasure`
`private double[]`	`doubleBuffer`
`private boolean`	`equalScale`
`static int`	`EUCLIDEAN_SQUARED`
`static int`	`FAST_GLOBAL_K_MEANS`
`private boolean`	`followBatchWithIncremental`
`static int`	`GLOBAL_K_MEANS`
`private float[]`	`greenBuffer`
`private double[][]`	`groupMean`
`private int[]`	`groupNum`
`private double[][]`	`groupStdDev`
`static int`	`HIERARCHICAL_GROUPING_INIT`
`private ModelImage`	`image`
`private int`	`initSelection`
`static int`	`K_MEANS`
`static int`	`MAHALANOBIS_SQUARED`
`static int`	`MAXMIN_INIT`
`private int`	`nDims`
`private int`	`nPoints`
`private int`	`numberClusters`
`private double[][]`	`pos`
`static int`	`RANDOM_INIT`
`private float[]`	`redBuffer`
`private java.lang.String`	`resultsFileName`
`static int`	`S_METRIC`
`static int`	`S_STAR_METRIC`
`private double[]`	`scale`
`private double[]`	`scale2`
`private double`	`scaleMax`
`private boolean`	`scaleVariablesToUnitVariance`
`private boolean`	`showSegmentedImage`
`static int`	`SPHERES_DIFFERENT_SIZES`
`private double[]`	`totalWeight`
`private boolean`	`useColorHistogram`
`private double[]`	`weight`

Fields inherited from class gov.nih.mipav.model.algorithms.AlgorithmBase
destFlag, destImage, image25D, mask, maxProgressValue, minProgressValue, multiThreadingEnabled, nthreads, progress, progressModulus, progressStep, runningInSeparateThread, separable, srcImage, threadStopped

Fields inherited from class java.lang.Thread
MAX_PRIORITY, MIN_PRIORITY, NORM_PRIORITY

Constructor Summary

Constructors
Constructor	Description
AlgorithmKMeans(ModelImage image, int algoSelection, int distanceMeasure, double[][] pos, double[] scale, int[] groupNum, double[] weight, double[][] centroidPos, java.lang.String resultsFileName, int initSelection, float[] redBuffer, float[] greenBuffer, float[] blueBuffer, double scaleMax, boolean useColorHistogram, boolean scaleVariablesToUnitVariance, double[] axesRatio, boolean bwSegmentedImage, double[] doubleBuffer, boolean showSegmentedImage, boolean followBatchWithIncremental, boolean colorSegmentInRGB)

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method	Description
`private void`	`cityBlock()`
`private void`	`EuclideanSquared()`
`void`	`finalize()`	Prepares this class for destruction.
`double[][]`	`getGroupMean()`
`double[][]`	`getGroupStdDev()`
`private void`	`MahalanobisSquared()`
`private void`	`otherMetrics()`
`void`	`runAlgorithm()`	Starts the algorithm.

Methods inherited from class gov.nih.mipav.model.algorithms.AlgorithmBase
actionPerformed, addListener, addProgressChangeListener, calculateImageSize, calculatePrincipleAxis, computeElapsedTime, computeElapsedTime, convertIntoFloat, delinkProgressToAlgorithm, delinkProgressToAlgorithmMulti, displayError, errorCleanUp, fireProgressStateChanged, fireProgressStateChanged, fireProgressStateChanged, fireProgressStateChanged, fireProgressStateChanged, generateProgressValues, getDestImage, getElapsedTime, getMask, getMaxProgressValue, getMinProgressValue, getNumberOfThreads, getProgress, getProgressChangeListener, getProgressChangeListeners, getProgressModulus, getProgressStep, getProgressValues, getSrcImage, isCompleted, isImage25D, isMultiThreadingEnabled, isRunningInSeparateThread, isThreadStopped, linkProgressToAlgorithm, linkProgressToAlgorithm, makeProgress, notifyListeners, removeListener, removeProgressChangeListener, run, setCompleted, setImage25D, setMask, setMaxProgressValue, setMinProgressValue, setMultiThreadingEnabled, setNumberOfThreads, setProgress, setProgressModulus, setProgressStep, setProgressValues, setProgressValues, setRunningInSeparateThread, setSrcImage, setStartTime, setThreadStopped, startMethod, windowActivated, windowClosed, windowClosing, windowDeactivated, windowDeiconified, windowIconified, windowOpened

Methods inherited from class java.lang.Thread
activeCount, checkAccess, clone, countStackFrames, currentThread, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, onSpinWait, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, suspend, toString, yield

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Detail

RANDOM_INIT
```
public static final int RANDOM_INIT
```
See Also:

Constant Field Values

BRADLEY_FAYYAD_INIT
```
public static final int BRADLEY_FAYYAD_INIT
```
See Also:

Constant Field Values

HIERARCHICAL_GROUPING_INIT
```
public static final int HIERARCHICAL_GROUPING_INIT
```
See Also:

Constant Field Values

MAXMIN_INIT
```
public static final int MAXMIN_INIT
```
See Also:

Constant Field Values

K_MEANS
```
public static final int K_MEANS
```
See Also:

Constant Field Values

GLOBAL_K_MEANS
```
public static final int GLOBAL_K_MEANS
```
See Also:

Constant Field Values

FAST_GLOBAL_K_MEANS
```
public static final int FAST_GLOBAL_K_MEANS
```
See Also:

Constant Field Values

EUCLIDEAN_SQUARED
```
public static final int EUCLIDEAN_SQUARED
```
See Also:

Constant Field Values

CITY_BLOCK
```
public static final int CITY_BLOCK
```
See Also:

Constant Field Values

MAHALANOBIS_SQUARED
```
public static final int MAHALANOBIS_SQUARED
```
See Also:

Constant Field Values

S_METRIC
```
public static final int S_METRIC
```
See Also:

Constant Field Values

SPHERES_DIFFERENT_SIZES
```
public static final int SPHERES_DIFFERENT_SIZES
```
See Also:

Constant Field Values

S_STAR_METRIC
```
public static final int S_STAR_METRIC
```
See Also:

Constant Field Values

image
```
private ModelImage image
```

algoSelection
```
private int algoSelection
```

scale
```
private double[] scale
```

pos
```
private double[][] pos
```

groupNum
```
private int[] groupNum
```

weight
```
private double[] weight
```

centroidPos
```
private double[][] centroidPos
```

resultsFileName

private java.lang.String resultsFileName

initSelection
```
private int initSelection
```

redBuffer
```
private float[] redBuffer
```

greenBuffer
```
private float[] greenBuffer
```

blueBuffer
```
private float[] blueBuffer
```

scaleMax
```
private double scaleMax
```

useColorHistogram
```
private boolean useColorHistogram
```

distanceMeasure
```
private int distanceMeasure
```

scaleVariablesToUnitVariance

private boolean scaleVariablesToUnitVariance

axesRatio
```
private double[] axesRatio
```

bwSegmentedImage
```
private boolean bwSegmentedImage
```

doubleBuffer
```
private double[] doubleBuffer
```

scale2
```
private double[] scale2
```

equalScale
```
private boolean equalScale
```

nDims
```
private int nDims
```

nPoints
```
private int nPoints
```

numberClusters
```
private int numberClusters
```

totalWeight
```
private double[] totalWeight
```

centroidDistances
```
private double[][] centroidDistances
```

groupMean
```
private double[][] groupMean
```

groupStdDev
```
private double[][] groupStdDev
```

showSegmentedImage
```
private boolean showSegmentedImage
```

followBatchWithIncremental

private boolean followBatchWithIncremental

colorSegmentInRGB
```
private boolean colorSegmentInRGB
```

Constructor Detail

AlgorithmKMeans

public AlgorithmKMeans(ModelImage image,
                       int algoSelection,
                       int distanceMeasure,
                       double[][] pos,
                       double[] scale,
                       int[] groupNum,
                       double[] weight,
                       double[][] centroidPos,
                       java.lang.String resultsFileName,
                       int initSelection,
                       float[] redBuffer,
                       float[] greenBuffer,
                       float[] blueBuffer,
                       double scaleMax,
                       boolean useColorHistogram,
                       boolean scaleVariablesToUnitVariance,
                       double[] axesRatio,
                       boolean bwSegmentedImage,
                       double[] doubleBuffer,
                       boolean showSegmentedImage,
                       boolean followBatchWithIncremental,
                       boolean colorSegmentInRGB)

Method Detail
- finalize
```
public void finalize()
```
  Prepares this class for destruction.
  
  Overrides:
  
  finalize in class AlgorithmBase
- runAlgorithm
```
public void runAlgorithm()
```
  Starts the algorithm.
  
  Specified by:
  
  runAlgorithm in class AlgorithmBase
- EuclideanSquared
```
private void EuclideanSquared()
```
- cityBlock
```
private void cityBlock()
```
- MahalanobisSquared
```
private void MahalanobisSquared()
```
- otherMetrics
```
private void otherMetrics()
```
- getGroupMean
```
public double[][] getGroupMean()
```
  Returns:
  
  groupMean[nDims][numberClusters]
- getGroupStdDev
```
public double[][] getGroupStdDev()
```
  Returns:

Modifier and Type	Class	Description
`private class`	`AlgorithmKMeans.positionWeightComparator`
`private class`	`AlgorithmKMeans.positionWeightItem`

Class AlgorithmKMeans

Nested Class Summary

Nested classes/interfaces inherited from class java.lang.Thread

Field Summary

Fields inherited from class gov.nih.mipav.model.algorithms.AlgorithmBase

Fields inherited from class java.lang.Thread

Constructor Summary

Method Summary

Methods inherited from class gov.nih.mipav.model.algorithms.AlgorithmBase

Methods inherited from class java.lang.Thread

Methods inherited from class java.lang.Object

Field Detail

RANDOM_INIT

BRADLEY_FAYYAD_INIT

HIERARCHICAL_GROUPING_INIT

MAXMIN_INIT

K_MEANS

GLOBAL_K_MEANS

FAST_GLOBAL_K_MEANS

EUCLIDEAN_SQUARED

CITY_BLOCK

MAHALANOBIS_SQUARED

S_METRIC

SPHERES_DIFFERENT_SIZES

S_STAR_METRIC

image

algoSelection

scale

pos

groupNum

weight

centroidPos

resultsFileName

initSelection

redBuffer

greenBuffer

blueBuffer

scaleMax

useColorHistogram

distanceMeasure

scaleVariablesToUnitVariance

axesRatio

bwSegmentedImage

doubleBuffer

scale2

equalScale

nDims

nPoints

numberClusters

totalWeight

centroidDistances

groupMean

groupStdDev

showSegmentedImage

followBatchWithIncremental

colorSegmentInRGB

Constructor Detail

AlgorithmKMeans

Method Detail

finalize

runAlgorithm

EuclideanSquared

cityBlock

MahalanobisSquared

otherMetrics

getGroupMean

getGroupStdDev