Package gov.nih.mipav.model.algorithms
Class AlgorithmKMeans

java.lang.Object
  java.lang.Thread
    gov.nih.mipav.model.algorithms.AlgorithmBase
      gov.nih.mipav.model.algorithms.AlgorithmKMeans

All Implemented Interfaces:
java.awt.event.ActionListener, java.awt.event.WindowListener, java.lang.Runnable, java.util.EventListener

public class AlgorithmKMeans extends AlgorithmBase
This program can be run on data of any dimensionality; however, VOIs are only placed in images for 2D and 3D data. The program can be used for one-dimensional segmenting of black and white images or two-dimensional segmenting of color images.

The data file format is as follows:
    # starts a comment line
    nDims, the number of dimensions
    extents[0] ... extents[nDims-1], the length of each dimension
    scale[0] ... scale[nDims-1], the scale factor of each dimension
    nPoints, the number of points
    then, for each point, a line of the form:
    weight pos[0] ... pos[nDims-1]

The program can be run to find the mean of centroids using Euclidean squared distances, to find the median of centroids using city-block (Manhattan) distances, or to find the mean of centroids using Mahalanobis squared distances. Using medians with city-block distances is probably best in cases of noise and outlier data points, since a few outliers can greatly affect a mean value; median components are found separately for each dimension. (A sketch of the three distance computations appears below.)

det(W), the determinant of the within-group sum of squares matrix, is minimized by minimizing the Mahalanobis squared distance, which equals (distance vector)' * (withinGroupSumOfSquaresMatrix).inverse() * (distance vector), where the distance is between a centroid and a point. However, the withinGroupSumOfSquaresMatrix is not known initially, so the Mahalanobis squared distance cannot be used at the start; initialization is performed with the Euclidean squared distance in order to obtain a first withinGroupSumOfSquaresMatrix. Minimizing det(W) has the advantage of being scale-independent, while the Euclidean squared metric is scale-dependent: different solutions may be obtained from the raw data and from the data scaled in some way. The Euclidean squared method does not work well in finding clusters with nonconvex shapes or very different sizes; it may impose a spherical structure on the observed clusters even when the observed clusters have other shapes, and may fail, for example, on 2 well-separated elliptical clusters. The Euclidean metric also tends to find groups with similar numbers of points. The Mahalanobis metric, by minimizing det(W), can find ellipsoidal distributions with the same size and orientation.

With the Euclidean distance, incremental processing, which changes 1 point at a time, follows batch processing, which changes all points at once. The batch algorithm is faster, but the incremental algorithm is more accurate. In an air pollution study with 4 clusters and scaled variance, the incremental pass improved the hierarchical grouping and fast global results and left the max-min and global results unchanged. With city-block distances, the incremental pass decreased the Calinski and Harabasz figure of merit for the air pollution study. With Mahalanobis distances, the incremental pass was too slow for a large number of points and also decreased the Calinski and Harabasz figure of merit for the air pollution study.
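As an illustration of the three per-point distance computations, here is a minimal Java sketch. It is hypothetical and is not the code of EuclideanSquared(), cityBlock(), or MahalanobisSquared(); the inverse of the within-group sum of squares matrix is assumed to be computed elsewhere.

```java
// Hypothetical illustration of the three distance measures; not the
// actual MIPAV implementation.
public final class DistanceSketch {

    // Euclidean squared: sum over dimensions of (x[d] - c[d])^2.
    static double euclideanSquared(double[] x, double[] c) {
        double sum = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - c[d];
            sum += diff * diff;
        }
        return sum;
    }

    // City-block (Manhattan): sum over dimensions of |x[d] - c[d]|.
    static double cityBlock(double[] x, double[] c) {
        double sum = 0.0;
        for (int d = 0; d < x.length; d++) {
            sum += Math.abs(x[d] - c[d]);
        }
        return sum;
    }

    // Mahalanobis squared: (x - c)' * Winv * (x - c), where Winv is the
    // inverse of the within-group sum of squares matrix.
    static double mahalanobisSquared(double[] x, double[] c, double[][] Winv) {
        int n = x.length;
        double[] diff = new double[n];
        for (int d = 0; d < n; d++) {
            diff[d] = x[d] - c[d];
        }
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                sum += diff[i] * Winv[i][j] * diff[j];
            }
        }
        return sum;
    }
}
```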
The S metric minimizes S = the sum over all clusters of Sk. Let Wk be the within-cluster sum of squares matrix of cluster k, and let w1k, w2k, ..., wnDimsk be the eigenvalues, in descending order, from the eigenvalue decomposition of Wk. The clusters must be ellipsoidal shapes of the same size but different orientations, and a known ellipsoidal shape that applies to all the clusters in the distribution must be specified: the ratios of all ellipsoidal axes to the largest ellipsoidal axis are given in descending order, so that the first ratio, of the largest axis to itself, is 1.0. Then

    Sk = w1k/(ellipsoidal axis ratio 1)**2 + w2k/(ellipsoidal axis ratio 2)**2 + ... + wnDimsk/(ellipsoidal axis ratio nDims)**2.

S is minimized by going through all the points 1 at a time and, for each point, checking which cluster yields the lowest value of S: each point is tested for possible removal from its present cluster and for possible addition to every other cluster. After all the points have been processed, further iterations over all the points are performed until no cluster changes occur. Since the withinGroupSumOfSquares of each cluster is not known initially, initialization is performed with the Euclidean metric.

Minimizing the sum over all clusters k of nk*log(tr(Wk/nk)) finds spherical distributions of different sizes. Minimizing S* = the sum over all k of nk*log(Sk/nk) finds ellipsoidal shapes of different sizes and orientations; as with the S metric, the axes ratios of the ellipsoidal axes must be known and specified. S* could not be made to work.

The dialog checkbox "Scale variables to unit variance" allows combining variables that use different scales, such as temperature in degrees Fahrenheit and wind speed in miles per hour.

With a small number of records, it is feasible to use RANDOM_INIT to perform multiple restarts efficiently. With a small sample size, the subsampling used for initialization in the Bradley-Fayyad method is not effective; that method is best suited for large-scale data. By initializing a general clustering algorithm near the modes, not only are the true clusters found more often, but the clustering algorithm iterates fewer times prior to convergence. The k-means algorithm runs differently on each run when random initialization or Bradley-Fayyad initialization is used, since random number generation is used in the selection; it runs the same on each run when hierarchical grouping initialization or max-min initialization is used. Global k-means and fast global k-means also run the same on each run. Initialization methods are only chosen for k-means; they are not chosen for global k-means or fast global k-means.

In max-min initialization, the 2 points that are farthest apart are first selected as seeds. For each additional seed, the minimum distance from each nonseed point to any of the seed points is found; the new seed is the point that yields the maximum of these minimum distances.

In hierarchical grouping, all n points are initially considered as n groups of 1 member each. The number of groups is decreased by 1 with each iteration, from n to n-1 to n-2, until the desired number of groups has been obtained. At each step, the sum over all groups of the error sum of squares of each group is minimized. The error sum of squares of a group of n members is the sum from i = 1 to n of ((xi - xaverage)**2 + (yi - yaverage)**2 + (zi - zaverage)**2), where xaverage, yaverage, and zaverage are taken over all members of the group. With n groups, the n*(n-1)/2 possible groupings for going from n to n-1 groups are tried to find the n-1 grouping with the lowest sum over all groups of the error sum of squares. Minimal sketches of the Sk computation, the max-min seeding rule, and the error sum of squares follow.
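First, a minimal sketch of the Sk computation for the S metric, assuming the eigenvalues of Wk have already been computed and sorted in descending order; the class and method names are illustrative, not the class's actual code.

```java
// Hypothetical illustration: S metric contribution Sk of one cluster.
// eigenvalues: eigenvalues of Wk in descending order.
// axesRatio:   ratios of each ellipsoidal axis to the largest axis,
//              in descending order, so axesRatio[0] == 1.0.
final class SMetricSketch {
    static double clusterS(double[] eigenvalues, double[] axesRatio) {
        double sk = 0.0;
        for (int d = 0; d < eigenvalues.length; d++) {
            sk += eigenvalues[d] / (axesRatio[d] * axesRatio[d]);
        }
        return sk;
    }
}
```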
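Next, a minimal sketch of the max-min seeding rule, using Euclidean squared distances; again, the names are illustrative only.

```java
// Hypothetical illustration of max-min seeding, after the first two seeds
// (the farthest-apart pair) have been chosen.
final class MaxMinSketch {
    static int nextMaxMinSeed(double[][] pos, boolean[] isSeed) {
        int best = -1;
        double bestMinDist = -1.0;
        for (int i = 0; i < pos.length; i++) {
            if (isSeed[i]) continue;
            // Minimum distance from this nonseed point to any seed.
            double minDist = Double.MAX_VALUE;
            for (int j = 0; j < pos.length; j++) {
                if (!isSeed[j]) continue;
                double dist = 0.0;
                for (int d = 0; d < pos[i].length; d++) {
                    double diff = pos[i][d] - pos[j][d];
                    dist += diff * diff;
                }
                minDist = Math.min(minDist, dist);
            }
            // The new seed maximizes these minimum distances.
            if (minDist > bestMinDist) {
                bestMinDist = minDist;
                best = i;
            }
        }
        return best;
    }
}
```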
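Finally, a minimal sketch of the error sum of squares used by hierarchical grouping, generalized from the (x, y, z) form above to nDims dimensions; illustrative only.

```java
// Hypothetical illustration: error sum of squares of one group, i.e. the
// sum over members of the squared distance to the group mean.
final class WardSketch {
    static double errorSumOfSquares(double[][] members) {
        int n = members.length;
        int nDims = members[0].length;
        double[] mean = new double[nDims];
        for (double[] m : members) {
            for (int d = 0; d < nDims; d++) {
                mean[d] += m[d] / n;
            }
        }
        double ess = 0.0;
        for (double[] m : members) {
            for (int d = 0; d < nDims; d++) {
                double diff = m[d] - mean[d];
                ess += diff * diff;
            }
        }
        return ess;
    }
}
```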
In Bradley-Fayyad initialization, k-means is performed on 10 subsamples, each containing 1/10 of all the points chosen at random. The initial centroids are chosen at random, and then 1/10 of the data points are selected at random. When a subsample k-means run finishes, if a centroid has no members, that initial centroid is reassigned to the point which is farthest from its assigned cluster centroid, and k-means is run again from these new initial centroids. The final centroids cmi from each of the 10 subsample runs are grouped into a common set cm. 10 k-means runs are then performed on this common centroid set cm, each starting with the cmi from one subsample run. Let the final resulting centroids from each of these 10 runs be fmi. The fmi with the lowest sum over all groups of the error sum of squares of each group is chosen as the initial starting centroids for the full-sample run.

Global k-means solves a problem with m clusters by sequentially solving all intermediate problems with 1, 2, ..., m-1 clusters. At the start of a run with m clusters, the first m-1 clusters are put at positions corresponding to their final locations in the (m-1)-cluster problem. At every intermediate stage, the one unassigned cluster is started at locations corresponding to every data point in the image, the clustering error is calculated for a full k-means run from each starting location, and the run with the lowest clustering error is taken as the solution. Fast global k-means instead performs only 1 full k-means run at each intermediate step, rather than one run per data point: the point which provides the largest guaranteed reduction in the error measure is used as the new cluster starting location. Specifically, fast global k-means seeks the point xn that maximizes bn, the guaranteed reduction in clustering error obtained by inserting a new cluster center at position xn. With N points,

    bn = sum from j = 1 to N of max(d(j,k-1)**2 - ||xn - xj||**2, 0),

where d(j,k-1) is the distance between xj and the closest center among the k-1 cluster centers obtained so far. Fast global k-means is not used for the city-block measure.

From the Celebi reference: consider a point xi, two cluster centers ca and cb, and a distance metric d. Using the triangle inequality, d(ca,cb) <= d(xi,ca) + d(xi,cb). Therefore, if it is known that 2d(xi,ca) <= d(ca,cb), it can be concluded that d(xi,ca) <= d(xi,cb) without having to calculate d(xi,cb).

The ideal number of clusters can be determined by doing runs with different numbers of clusters and seeing which run gives the highest Calinski and Harabasz figure of merit. Let B be the between-groups sum of squares and W the within-group sum of squares. Then the Calinski and Harabasz figure of merit equals (B/(number of clusters - 1))/(W/(totalWeight - number of clusters)), where the weight of each point equals 1 in an unweighted data set, so that totalWeight equals the number of points.

The sums of squared deviations from cluster centroids from 2 different runs can be used in an 'F test' proposed by Beale. Let SD denote the sum of the squared deviations from cluster centroids in the sample. Then a division of n objects into g2 clusters is significantly better than a division into g1 clusters (g2 > g1) if the test statistic

    F(g1,g2) = ((SD1 - SD2)/SD2) / ([(n-g1)/(n-g2)]*((g2/g1)**(2/nDims)) - 1)

exceeds the critical value from an F distribution with nDims*(g2 - g1) and nDims*(n - g2) degrees of freedom. Experience with Beale's rule suggests that it will only be successful when the clusters are fairly well separated and approximately spherical in shape.

Marriott proposed finding the ideal number of groups g by finding the minimum of g*g*determinant(withinGroupSumOfSquaresMatrix), where the matrix is nDims by nDims. For unimodal distributions the minimum value is likely to occur at g = 1; for strongly grouped distributions the minimum will indicate the appropriate value of g; for a uniform distribution the value will be constant over different group numbers. If g*g*determinant(withinGroupSumOfSquaresMatrix)/determinant(totalSumOfSquaresMatrix) > 1 for all numbers of clusters, the distribution should be regarded as a single group. Minimal sketches of the bn bound, the Celebi test, the Calinski and Harabasz figure of merit, Beale's statistic, and Marriott's criterion follow.
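A minimal sketch of the bn bound for fast global k-means; dToClosestCenter[j] is assumed (hypothetically) to hold d(j,k-1), the distance from point j to its closest existing center.

```java
// Hypothetical illustration: guaranteed clustering-error reduction from
// inserting a new center at candidate position xn.
final class FastGlobalSketch {
    static double guaranteedReduction(double[] xn, double[][] pos,
                                      double[] dToClosestCenter) {
        double bn = 0.0;
        for (int j = 0; j < pos.length; j++) {
            double dist = 0.0;
            for (int d = 0; d < xn.length; d++) {
                double diff = xn[d] - pos[j][d];
                dist += diff * diff;
            }
            // Only points that would move to the new center contribute.
            bn += Math.max(dToClosestCenter[j] * dToClosestCenter[j] - dist, 0.0);
        }
        return bn;
    }
}
```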
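The Celebi triangle-inequality test reduces to a one-line predicate; dXiCa and dCaCb stand for precomputed d(xi,ca) and d(ca,cb), whatever metric is in use.

```java
// Hypothetical illustration of the Celebi triangle-inequality shortcut.
final class CelebiSketch {
    // If 2*d(xi,ca) <= d(ca,cb), then d(xi,ca) <= d(xi,cb), so the
    // distance from xi to cb never has to be computed.
    static boolean canSkipDistanceToCb(double dXiCa, double dCaCb) {
        return 2.0 * dXiCa <= dCaCb;
    }
}
```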
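The Calinski and Harabasz figure of merit, transcribed directly from the formula above; names are illustrative.

```java
// Hypothetical illustration: Calinski and Harabasz figure of merit.
// B = between-groups sum of squares, W = within-group sum of squares.
final class FigureOfMeritSketch {
    static double calinskiHarabasz(double B, double W,
                                   int numberClusters, double totalWeight) {
        return (B / (numberClusters - 1)) / (W / (totalWeight - numberClusters));
    }
}
```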
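Beale's test statistic, transcribed from the formula above; looking up the F critical value would require an external statistics routine and is omitted here.

```java
// Hypothetical illustration: Beale's F statistic for comparing a division
// into g2 clusters (squared-deviation sum sd2) against g1 clusters (sd1),
// g2 > g1, with n objects in nDims dimensions. Compare against the critical
// value of an F distribution with nDims*(g2 - g1) and nDims*(n - g2)
// degrees of freedom.
final class BealeSketch {
    static double bealeF(double sd1, double sd2, int n,
                         int g1, int g2, int nDims) {
        double denom = ((double) (n - g1) / (n - g2))
                * Math.pow((double) g2 / g1, 2.0 / nDims) - 1.0;
        return ((sd1 - sd2) / sd2) / denom;
    }
}
```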
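Marriott's criterion; the determinant routine below is a generic Gaussian-elimination determinant supplied only to keep the sketch self-contained, not the class's own linear algebra.

```java
// Hypothetical illustration: Marriott's criterion. Choose the g that
// minimizes g*g*det(W(g)), where W(g) is the nDims-by-nDims within-group
// sum of squares matrix of the g-cluster solution.
final class MarriottSketch {
    static double marriott(int g, double[][] withinGroupSumOfSquaresMatrix) {
        return (double) g * g * det(withinGroupSumOfSquaresMatrix);
    }

    // Determinant by Gaussian elimination with partial pivoting.
    static double det(double[][] a) {
        int n = a.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) m[i] = a[i].clone();
        double d = 1.0;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) pivot = r;
            }
            if (m[pivot][col] == 0.0) return 0.0;
            if (pivot != col) {
                double[] tmp = m[pivot]; m[pivot] = m[col]; m[col] = tmp;
                d = -d; // row swap flips the sign of the determinant
            }
            d *= m[col][col];
            for (int r = col + 1; r < n; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c < n; c++) m[r][c] -= f * m[col][c];
            }
        }
        return d;
    }
}
```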
References:
1.) "A Systematic Evaluation of Different Methods for Initializing the K-means Clustering Algorithm" by Anna D. Peterson, Arka P. Ghosh, and Ranjan Maitra, IEEE Transactions on Knowledge and Data Engineering.
2.) "Refining Initial Points for K-Means Clustering" by P.S. Bradley and Usama M. Fayyad.
3.) "Hierarchical Grouping to Optimize an Objective Function" by Joe H. Ward, Jr., Journal of the American Statistical Association, Volume 58, Issue 301, March 1963, pp. 236-244.
4.) "Improving the Performance of K-Means for Color Quantization" by M. Emre Celebi, Image and Vision Computing, Volume 29, No. 4, 2011, pp. 260-271.
5.) "The Global k-means Clustering Algorithm" by Aristidis Likas, Nikos Vlassis, and Jakob J. Verbeek, Pattern Recognition, Volume 36, 2003, pp. 451-461.
6.) Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2001, Section 8.4, Partitioning Methods, pp. 348-353.
7.) Cluster Analysis, Fourth Edition, by Brian S. Everitt, Sabine Landau, and Morven Leese, Arnold Publishers, 2001, Chapter 5, Optimization Clustering Techniques, pp. 90-117.
8.) "Practical Problems in Cluster Analysis" by F.H.C. Marriott, Biometrics, Volume 27, September 1971, pp. 501-514.
9.) Multivariate Observations by G. A. F. Seber, John Wiley & Sons, Inc., 1984, Section 7.5, Partitioning Methods, pp. 379-387.
10.) "Model-Based Gaussian and Non-Gaussian Clustering" by Jeffrey D. Banfield and Adrian E. Raftery, Biometrics, Volume 49, No. 3, September 1993, pp. 803-821.
11.) Introduction to Clustering Large and High-Dimensional Data by Jacob Kogan, Cambridge University Press, 2007, Chapter 2, Quadratic k-means Algorithm, pp. 9-39.
Nested Class Summary
private class AlgorithmKMeans.positionWeightComparator
private class AlgorithmKMeans.positionWeightItem
Field Summary
private int algoSelection
private double[] axesRatio
private float[] blueBuffer
static int BRADLEY_FAYYAD_INIT
private boolean bwSegmentedImage
private double[][] centroidDistances
private double[][] centroidPos
static int CITY_BLOCK
private boolean colorSegmentInRGB
private int distanceMeasure
private double[] doubleBuffer
private boolean equalScale
static int EUCLIDEAN_SQUARED
static int FAST_GLOBAL_K_MEANS
private boolean followBatchWithIncremental
static int GLOBAL_K_MEANS
private float[] greenBuffer
private double[][] groupMean
private int[] groupNum
private double[][] groupStdDev
static int HIERARCHICAL_GROUPING_INIT
private ModelImage image
private int initSelection
static int K_MEANS
static int MAHALANOBIS_SQUARED
static int MAXMIN_INIT
private int nDims
private int nPoints
private int numberClusters
private double[][] pos
static int RANDOM_INIT
private float[] redBuffer
private java.lang.String resultsFileName
static int S_METRIC
static int S_STAR_METRIC
private double[] scale
private double[] scale2
private double scaleMax
private boolean scaleVariablesToUnitVariance
private boolean showSegmentedImage
static int SPHERES_DIFFERENT_SIZES
private double[] totalWeight
private boolean useColorHistogram
private double[] weight

Fields inherited from class gov.nih.mipav.model.algorithms.AlgorithmBase:
destFlag, destImage, image25D, mask, maxProgressValue, minProgressValue, multiThreadingEnabled, nthreads, progress, progressModulus, progressStep, runningInSeparateThread, separable, srcImage, threadStopped
Constructor Summary
AlgorithmKMeans(ModelImage image, int algoSelection, int distanceMeasure, double[][] pos, double[] scale, int[] groupNum, double[] weight, double[][] centroidPos, java.lang.String resultsFileName, int initSelection, float[] redBuffer, float[] greenBuffer, float[] blueBuffer, double scaleMax, boolean useColorHistogram, boolean scaleVariablesToUnitVariance, double[] axesRatio, boolean bwSegmentedImage, double[] doubleBuffer, boolean showSegmentedImage, boolean followBatchWithIncremental, boolean colorSegmentInRGB)
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description private void
cityBlock()
private void
EuclideanSquared()
void
finalize()
Prepares this class for destruction.double[][]
getGroupMean()
double[][]
getGroupStdDev()
private void
MahalanobisSquared()
private void
otherMetrics()
void
runAlgorithm()
Starts the algorithm.-
Methods inherited from class gov.nih.mipav.model.algorithms.AlgorithmBase
actionPerformed, addListener, addProgressChangeListener, calculateImageSize, calculatePrincipleAxis, computeElapsedTime, computeElapsedTime, convertIntoFloat, delinkProgressToAlgorithm, delinkProgressToAlgorithmMulti, displayError, errorCleanUp, fireProgressStateChanged, fireProgressStateChanged, fireProgressStateChanged, fireProgressStateChanged, fireProgressStateChanged, generateProgressValues, getDestImage, getElapsedTime, getMask, getMaxProgressValue, getMinProgressValue, getNumberOfThreads, getProgress, getProgressChangeListener, getProgressChangeListeners, getProgressModulus, getProgressStep, getProgressValues, getSrcImage, isCompleted, isImage25D, isMultiThreadingEnabled, isRunningInSeparateThread, isThreadStopped, linkProgressToAlgorithm, linkProgressToAlgorithm, makeProgress, notifyListeners, removeListener, removeProgressChangeListener, run, setCompleted, setImage25D, setMask, setMaxProgressValue, setMinProgressValue, setMultiThreadingEnabled, setNumberOfThreads, setProgress, setProgressModulus, setProgressStep, setProgressValues, setProgressValues, setRunningInSeparateThread, setSrcImage, setStartTime, setThreadStopped, startMethod, windowActivated, windowClosed, windowClosing, windowDeactivated, windowDeiconified, windowIconified, windowOpened
-
Methods inherited from class java.lang.Thread
activeCount, checkAccess, clone, countStackFrames, currentThread, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, onSpinWait, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, suspend, toString, yield
-
-
-
-
Field Detail
public static final int RANDOM_INIT
public static final int BRADLEY_FAYYAD_INIT
public static final int HIERARCHICAL_GROUPING_INIT
public static final int MAXMIN_INIT
public static final int K_MEANS
public static final int GLOBAL_K_MEANS
public static final int FAST_GLOBAL_K_MEANS
public static final int EUCLIDEAN_SQUARED
public static final int CITY_BLOCK
public static final int MAHALANOBIS_SQUARED
public static final int S_METRIC
public static final int SPHERES_DIFFERENT_SIZES
public static final int S_STAR_METRIC
private ModelImage image
private int algoSelection
private double[] scale
private double[][] pos
private int[] groupNum
private double[] weight
private double[][] centroidPos
private java.lang.String resultsFileName
private int initSelection
private float[] redBuffer
private float[] greenBuffer
private float[] blueBuffer
private double scaleMax
private boolean useColorHistogram
private int distanceMeasure
private boolean scaleVariablesToUnitVariance
private double[] axesRatio
private boolean bwSegmentedImage
private double[] doubleBuffer
private double[] scale2
private boolean equalScale
private int nDims
private int nPoints
private int numberClusters
private double[] totalWeight
private double[][] centroidDistances
private double[][] groupMean
private double[][] groupStdDev
private boolean showSegmentedImage
private boolean followBatchWithIncremental
private boolean colorSegmentInRGB
Constructor Detail
public AlgorithmKMeans(ModelImage image,
                       int algoSelection,
                       int distanceMeasure,
                       double[][] pos,
                       double[] scale,
                       int[] groupNum,
                       double[] weight,
                       double[][] centroidPos,
                       java.lang.String resultsFileName,
                       int initSelection,
                       float[] redBuffer,
                       float[] greenBuffer,
                       float[] blueBuffer,
                       double scaleMax,
                       boolean useColorHistogram,
                       boolean scaleVariablesToUnitVariance,
                       double[] axesRatio,
                       boolean bwSegmentedImage,
                       double[] doubleBuffer,
                       boolean showSegmentedImage,
                       boolean followBatchWithIncremental,
                       boolean colorSegmentInRGB)
Method Detail

public void finalize()
    Prepares this class for destruction.
    Overrides: finalize in class AlgorithmBase

public void runAlgorithm()
    Starts the algorithm.
    Specified by: runAlgorithm in class AlgorithmBase

private void EuclideanSquared()

private void cityBlock()

private void MahalanobisSquared()

private void otherMetrics()

public double[][] getGroupMean()
    Returns: groupMean[nDims][numberClusters]

public double[][] getGroupStdDev()
    Returns: groupStdDev