public class XMeans extends RandomizableClusterer implements TechnicalInformationHandler
@inproceedings{Pelleg2000,
   author = {Dan Pelleg and Andrew W. Moore},
   booktitle = {Seventeenth International Conference on Machine Learning},
   pages = {727-734},
   publisher = {Morgan Kaufmann},
   title = {X-means: Extending K-means with Efficient Estimation of the Number of Clusters},
   year = {2000}
}
Valid options are:
-I <num> maximum number of overall iterations (default 1).
-M <num> maximum number of iterations in the kMeans loop in the Improve-Parameter part (default 1000).
-J <num> maximum number of iterations in the kMeans loop for the split centroids in the Improve-Structure part (default 1000).
-L <num> minimum number of clusters (default 2).
-H <num> maximum number of clusters (default 4).
-B <value> distance value for binary attributes (default 1.0).
-use-kdtree Uses the KDTree internally (default no).
-K <KDTree class specification> Full class name of KDTree class to use, followed by scheme options, e.g., "weka.core.neighboursearch.kdtrees.KDTree -P" (default no KDTree class used).
-C <value> cutoff factor, takes the given percentage of the split centroids if none of the children win (default 0.0).
-D <distance function class specification> Full class name of Distance function class to use, followed by scheme options. (default weka.core.EuclideanDistance).
-N <file name> file to read starting centers from (ARFF format).
-O <file name> file to write centers to (ARFF format).
-U <int> The debug level. (default 0)
-Y <file name> The debug vectors file.
-S <num> Random number seed. (default 10)
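The options above follow a flag-then-value layout, with -use-kdtree as a value-less switch. As a rough illustration of how such an option array can be read, here is a minimal sketch. It is not Weka's parser: `XMeansOptionSketch`, `valueOf`, and `hasSwitch` are hypothetical names, and the fallback values passed in `main` are simply the defaults listed above.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Weka. */
class XMeansOptionSketch {

    /** Returns the value that follows the given flag, or the fallback if the flag is absent. */
    static String valueOf(String[] options, String flag, String fallback) {
        for (int i = 0; i < options.length - 1; i++) {
            if (options[i].equals(flag)) {
                return options[i + 1];
            }
        }
        return fallback;
    }

    /** Returns true if a value-less switch such as -use-kdtree is present. */
    static boolean hasSwitch(String[] options, String flag) {
        return Arrays.asList(options).contains(flag);
    }

    public static void main(String[] args) {
        String[] options = {"-L", "2", "-H", "6", "-use-kdtree", "-S", "42"};
        // Fallbacks mirror the documented defaults: -L 2, -H 4, -I 1, -S 10.
        int minClusters = Integer.parseInt(valueOf(options, "-L", "2"));
        int maxClusters = Integer.parseInt(valueOf(options, "-H", "4"));
        int maxIterations = Integer.parseInt(valueOf(options, "-I", "1"));
        int seed = Integer.parseInt(valueOf(options, "-S", "10"));
        boolean useKDTree = hasSwitch(options, "-use-kdtree");
        System.out.println(minClusters + " " + maxClusters + " " + maxIterations
                + " " + seed + " " + useKDTree);
    }
}
```

In real use the same array would be handed to setOptions, which does its own parsing and validation.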
See Also: RandomizableClusterer, Serialized Form
Modifier and Type | Field and Description |
---|---|
static int | D_CONVCHCLOSER - have a closer look at converging children. |
static int | D_CURR - for current debug. |
static int | D_FOLLOWSPLIT - follows the splitting of the centers. |
static int | D_GENERAL - general debugging. |
static int | D_ITERCOUNT - follow iterations. |
static int | D_KDTREE - check on kdtree. |
static int | D_METH_MISUSE - functions were maybe misused. |
static int | D_PRINTCENTERS - print the centers. |
static int | D_RANDOMVECTOR - check on random vectors. |
protected double | m_Bic - BIC-Score of the current model. |
protected double | m_BinValue - Distance value between true and false of binary attributes and "same" and "different" of nominal attributes (default = 1.0). |
protected Reader | m_CenterInput - input file for the cluster centers. |
protected PrintWriter | m_CenterOutput - output file for the cluster centers. |
protected int[] | m_ClusterAssignments - temporary variable holding cluster assignments while iterating. |
protected Instances | m_ClusterCenters - cluster centers. |
boolean | m_CurrDebugFlag - flag indicating that current debugging is active. |
protected double | m_CutOffFactor - cutoff factor: percentage of splits done in the Improve-Structure part; only relevant if all children lost. |
protected int | m_DebugLevel - level of debug output, 0 is no output. |
protected Instances | m_DebugVectors - all the debug vectors. |
protected File | m_DebugVectorsFile - file name of the input file for the random vectors. |
protected int | m_DebugVectorsIndex - the index for the current debug vector. |
protected Reader | m_DebugVectorsInput - input file for the random vectors (used for debugging). |
protected DistanceFunction | m_DistanceF - the distance function used. |
protected File | m_InputCenterFile - file name of the input file for the cluster centers. |
protected Instances | m_Instances - training instances. |
protected int | m_IterationCount - counts iterations done in main loop. |
protected KDTree | m_KDTree - KDTree class, if KDTrees are used. |
protected int | m_KMeansStopped - counts how often kMeans was stopped by the loop counter. |
protected int | m_MaxIterations - maximum overall iterations. |
protected int | m_MaxKMeans - maximum iterations to perform in the kMeans part; if negative, iterations are not checked. |
protected int | m_MaxKMeansForChildren - same as m_MaxKMeans, but for kMeans of split clusters. |
protected int | m_MaxNumClusters - max number of clusters to generate. |
protected int | m_MinNumClusters - min number of clusters to generate. |
protected double[] | m_Mle - Distortion. |
protected Instances | m_Model - model information, should increase readability. |
protected int | m_NumClusters - The actual number of clusters. |
protected int | m_NumSplits - Number of splits prepared. |
protected int | m_NumSplitsDone - Number of splits accepted (including cutoff factor decisions). |
protected int | m_NumSplitsStillDone - Number of splits accepted just because of cutoff factor. |
protected File | m_OutputCenterFile - file name of the output file for the cluster centers. |
protected ReplaceMissingValues | m_ReplaceMissingFilter - replace missing values in training instances. |
protected boolean | m_UseKDTree - whether to use the KDTree (the KDTree is only initialized to be configurable from the GUI). |
static int | R_HIGH - Index in ranges for HIGH. |
static int | R_LOW - Index in ranges for LOW. |
static int | R_WIDTH - Index in ranges for WIDTH. |
Fields inherited from class RandomizableClusterer: m_Seed, m_SeedDefault
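The inherited seed (default 10, per the -S option) is what makes random choices such as the initial centers repeatable between runs. The sketch below illustrates that idea on a plain double[][] data set. It assumes centers are drawn from existing rows with replacement; `SeededCentersSketch` and `pickCenters` are hypothetical names and are not a reproduction of makeCentersRandomly.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical illustration of seeded center selection; not Weka code. */
class SeededCentersSketch {

    /** Picks numClusters rows of data at random (with replacement) as initial centers. */
    static double[][] pickCenters(double[][] data, int numClusters, long seed) {
        Random random = new Random(seed); // same seed, same sequence of picks
        double[][] centers = new double[numClusters][];
        for (int c = 0; c < numClusters; c++) {
            int row = random.nextInt(data.length);
            centers[c] = data[row].clone(); // copy so later updates leave the data untouched
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] data = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}, {7.0, 8.0}};
        double[][] first = pickCenters(data, 2, 10L);
        double[][] second = pickCenters(data, 2, 10L);
        // Both runs use seed 10, so the chosen centers are identical.
        System.out.println(Arrays.deepEquals(first, second));
    }
}
```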
Constructor and Description |
---|
XMeans() - the default constructor. |
Modifier and Type | Method and Description |
---|---|
protected boolean | assignToCenters(Instances centers, int[][] instOfCent, int[] allInstList, int[] assignments) - Assigns instances to centers. |
protected boolean | assignToCenters(KDTree kdtree, Instances centers, int[][] instOfCent, int[] assignments, int iterationCount) - Assigns instances to centers using the KDTree. |
protected boolean | assignToCenters(KDTree tree, Instances centers, int[][] instOfCent, int[] allInstList, int[] assignments, int iterationCount) - Assigns instances to centers. |
String | binValueTipText() - Returns the tip text for this property. |
void | buildClusterer(Instances data) - Generates the X-Means clusterer. |
protected double | calculateBIC(int[][] instOfCent, Instances centers, double[] mle) - Calculates the BIC for the given set of centers and instances. |
protected double | calculateBIC(int[] instList, Instance center, double mle, Instances model) - Returns the BIC-value for the given center and instances. |
boolean | checkForNominalAttributes(Instances data) - Checks for nominal attributes in the dataset. |
protected void | checkInstances() - Checks the instances. |
int | clusterInstance(Instance instance) - Assigns a given instance to a cluster. |
protected int | clusterProcessedInstance(Instance instance) - Clusters an instance that has been through the filters. |
protected int | clusterProcessedInstance(Instance instance, Instances centers) - Clusters an instance. |
String | cutOffFactorTipText() - Returns the tip text for this property. |
String | debugLevelTipText() - Returns the tip text for this property. |
String | debugVectorsFileTipText() - Returns the tip text for this property. |
String | distanceFTipText() - Returns the tip text for this property. |
protected double[] | distortion(int[][] instOfCent, Instances centers) - Calculates the maximum likelihood estimate for the variance. |
double | getBinValue() - Gets the distance value between true and false of binary attributes. |
Capabilities | getCapabilities() - Returns default capabilities of the clusterer. |
Instances | getClusterCenters() - Returns the centers of the clusters as an Instances object. |
double | getCutOffFactor() - Gets the cutoff factor. |
int | getDebugLevel() - Gets the debug level. |
File | getDebugVectorsFile() - Gets the file name for a file that has the random vectors stored. |
DistanceFunction | getDistanceF() - Gets the distance function. |
protected String | getDistanceFSpec() - Gets the distance function specification string, which contains the class name of the distance function class and any options to it. |
File | getInputCenterFile() - Gets the file to read the list of centers from. |
KDTree | getKDTree() - Gets the KDTree class. |
protected String | getKDTreeSpec() - Gets the KDTree specification string, which contains the class name of the KDTree class and any options to the KDTree. |
int | getMaxIterations() - Gets the maximum number of iterations. |
int | getMaxKMeans() - Gets the maximum number of iterations in KMeans. |
int | getMaxKMeansForChildren() - Gets the maximum number of KMeans iterations performed on the child centers. |
int | getMaxNumClusters() - Gets the maximum number of clusters to generate. |
int | getMinNumClusters() - Gets the minimum number of clusters to generate. |
Instance | getNextDebugVectorsInstance(Instances model) - Reads an instance from the debug vectors file. |
String[] | getOptions() - Gets the current settings of XMeans. |
File | getOutputCenterFile() - Gets the file to write the list of centers to. |
String | getRevision() - Returns the revision string. |
TechnicalInformation | getTechnicalInformation() - Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
boolean | getUseKDTree() - Gets whether the KDTree is used or not. |
String | globalInfo() - Returns a string describing this clusterer. |
protected int[] | initAssignments(int numInstances) - Creates and initializes integer array, used to store assignments. |
protected int[] | initAssignments(int[] ass) - Sets every entry of the int array used to store assignments to -1. |
void | initDebugVectorsInput() - Initialises the debug vector input. |
String | inputCenterFileTipText() - Returns the tip text for this property. |
String | KDTreeTipText() - Returns the tip text for this property. |
Enumeration | listOptions() - Returns an enumeration describing the available options. |
protected double | logLikelihoodEstimate(int numInst, Instance center, double distortion, int numCent) - Calculates the log-likelihood of the data for the given model, taken at the maximum likelihood point. |
static void | main(String[] argv) - Main method for testing this class. |
protected Instances | makeCentersRandomly(Random random0, Instances model, int numClusters) - Generates new centers randomly. |
String | maxIterationsTipText() - Returns the tip text for this property. |
String | maxKMeansForChildrenTipText() - Returns the tip text for this property. |
String | maxKMeansTipText() - Returns the tip text for this property. |
String | maxNumClustersTipText() - Returns the tip text for this property. |
protected double | meanOrMode(Instances instances, int[] instList, int attIndex) - Computes the mean or mode of one attribute on a subset of m_Instances. |
String | minNumClustersTipText() - Returns the tip text for this property. |
protected Instances | newCentersAfterSplit(boolean[] splitWon, Instances splitCenters) - Returns new centers. |
protected Instances | newCentersAfterSplit(double[] pbic, double[] cbic, double cutoffFactor, Instances splitCenters) - Returns new center list. |
protected int | nextAssignedOne(int cent, int lastIndex, int[] assignments) - Searches along the assignment array for the next entry of the center in question. |
int | numberOfClusters() - Returns the number of clusters. |
String | outputCenterFileTipText() - Returns the tip text for this property. |
protected void | PFD_CURR(String output) - Does debug printouts. |
protected void | PFD(int debugLevel, String output) - Does debug printouts. |
protected void | PrCentersFD(int debugLevel) - Prints centers for debug. |
protected boolean | recomputeCenters(Instances centers, int[][] instOfCent, Instances model) - Recomputes the new centers. |
protected void | recomputeCentersFast(Instances centers, int[][] instOfCentIndexes, Instances model) - Recomputes the new centers (2nd version). Same as recomputeCenters, but does not check if center stays the same. |
void | setBinValue(double value) - Sets the distance value between true and false of binary attributes. |
void | setCutOffFactor(double i) - Sets a new cutoff factor. |
void | setDebugLevel(int d) - Sets the debug level. |
void | setDebugVectorsFile(File value) - Sets the file that has the random vectors stored. |
void | setDistanceF(DistanceFunction distanceF) - Sets the distance function. |
void | setInputCenterFile(File value) - Sets the file to read the list of centers from. |
void | setKDTree(KDTree k) - Sets the KDTree class. |
void | setMaxIterations(int i) - Sets the maximum number of iterations to perform. |
void | setMaxKMeans(int i) - Sets the maximum number of iterations to perform in KMeans. |
void | setMaxKMeansForChildren(int i) - Sets the maximum number of KMeans iterations that are performed on the child centers. |
void | setMaxNumClusters(int n) - Sets the maximum number of clusters to generate. |
void | setMinNumClusters(int n) - Sets the minimum number of clusters to generate. |
void | setOptions(String[] options) - Parses a given list of options. |
void | setOutputCenterFile(File value) - Sets the file to write the list of centers to. |
void | setUseKDTree(boolean value) - Sets whether to use the KDTree or not. |
protected Instances | splitCenter(Random random, Instance center, double variance, Instances model) - Splits centers in their region. |
protected Instances | splitCenters(Random random, Instances instances, Instances model) - Splits centers in their region. |
protected boolean | stopIteration(int iterationCount, int max) - Checks whether iterationCount has to be checked and, if so (that is, max is > 0), compares it with max. |
protected boolean | stopKMeansIteration(int iterationCount, int max) - Ensures that the counter does not exceed the maximum iteration value. |
protected boolean | TFD(int debugLevel) - Tests on debug status. |
String | toString() - Returns a string describing this clusterer. |
String | useKDTreeTipText() - Returns the tip text for this property. |
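Methods inherited from class RandomizableClusterer: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClusterer: distributionForInstance, forName, makeCopies, makeCopy, runClusterer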
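The interplay between calculateBIC, newCentersAfterSplit(pbic, cbic, cutoffFactor, splitCenters) and the -C cutoff option can be pictured as a per-parent comparison of BIC scores. The sketch below is only an illustration under stated assumptions: a higher BIC is treated as better, the cutoff factor is treated as a fraction between 0 and 1, and when no child pair wins the fallback accepts the splits that lost by the smallest margin. `SplitDecisionSketch` and `decideSplits` are hypothetical names, and the actual selection rule in XMeans may differ.

```java
/** Hypothetical illustration of a BIC-based split decision; not Weka code. */
class SplitDecisionSketch {

    /**
     * Returns, for each parent center, whether its split should be accepted.
     * pbic[i] is the BIC of parent i, cbic[i] the BIC of its pair of children.
     */
    static boolean[] decideSplits(double[] pbic, double[] cbic, double cutoffFactor) {
        int n = pbic.length;
        boolean[] splitWon = new boolean[n];
        int wins = 0;
        for (int i = 0; i < n; i++) {
            if (cbic[i] > pbic[i]) { // assumption: higher BIC means better model
                splitWon[i] = true;
                wins++;
            }
        }
        if (wins == 0 && cutoffFactor > 0.0) {
            // Assumption: accept a fraction of the splits anyway, best losers first.
            int toTake = Math.min(n, (int) (n * cutoffFactor));
            for (int t = 0; t < toTake; t++) {
                int best = -1;
                for (int i = 0; i < n; i++) {
                    if (!splitWon[i]
                            && (best < 0 || cbic[i] - pbic[i] > cbic[best] - pbic[best])) {
                        best = i;
                    }
                }
                splitWon[best] = true;
            }
        }
        return splitWon;
    }

    public static void main(String[] args) {
        double[] parents = {-10.0, -20.0};
        double[] children = {-12.0, -30.0};
        boolean[] decision = decideSplits(parents, children, 0.5);
        System.out.println(java.util.Arrays.toString(decision));
    }
}
```

With a cutoff factor of 0.0 (the documented default) the fallback branch never runs, so only genuinely winning splits are kept.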
protected Instances m_Instances
protected Instances m_Model
protected ReplaceMissingValues m_ReplaceMissingFilter
protected double m_BinValue
protected double m_Bic
protected double[] m_Mle
protected int m_MaxIterations
protected int m_MaxKMeans
protected int m_MaxKMeansForChildren
protected int m_NumClusters
protected int m_MinNumClusters
protected int m_MaxNumClusters
protected DistanceFunction m_DistanceF
protected Instances m_ClusterCenters
protected File m_InputCenterFile
protected Reader m_DebugVectorsInput
protected int m_DebugVectorsIndex
protected Instances m_DebugVectors
protected File m_DebugVectorsFile
protected Reader m_CenterInput
protected File m_OutputCenterFile
protected PrintWriter m_CenterOutput
protected int[] m_ClusterAssignments
protected double m_CutOffFactor
public static int R_LOW
public static int R_HIGH
public static int R_WIDTH
protected KDTree m_KDTree
protected boolean m_UseKDTree
protected int m_IterationCount
protected int m_KMeansStopped
protected int m_NumSplits
protected int m_NumSplitsDone
protected int m_NumSplitsStillDone
protected int m_DebugLevel
public static int D_PRINTCENTERS
public static int D_FOLLOWSPLIT
public static int D_CONVCHCLOSER
public static int D_RANDOMVECTOR
public static int D_KDTREE
public static int D_ITERCOUNT
public static int D_METH_MISUSE
public static int D_CURR
public static int D_GENERAL
public boolean m_CurrDebugFlag
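The iteration limits held in m_MaxIterations, m_MaxKMeans and m_MaxKMeansForChildren are described as being checked only when the limit is enabled (a negative value disables the check, and stopIteration compares only when max is > 0). A minimal sketch of such a guard follows. `IterationLimitSketch` and `limitReached` are hypothetical names, and treating a limit of exactly 0 as disabled is an assumption taken from the stopIteration description.

```java
/** Hypothetical illustration of an optional iteration limit; not Weka code. */
class IterationLimitSketch {

    /** Returns true only if the limit is enabled (max > 0) and has been reached. */
    static boolean limitReached(int iterationCount, int max) {
        return max > 0 && iterationCount >= max;
    }

    public static void main(String[] args) {
        int max = 3;
        int iterationCount = 0;
        // Loop until the guard reports that the limit has been reached.
        while (!limitReached(iterationCount, max)) {
            iterationCount++;
        }
        System.out.println(iterationCount);
    }
}
```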
public String globalInfo()
public TechnicalInformation getTechnicalInformation()
Specified by: getTechnicalInformation in interface TechnicalInformationHandler
public Capabilities getCapabilities()
Specified by: getCapabilities in interface Clusterer
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClusterer
See Also: Capabilities
public void buildClusterer(Instances data) throws Exception
Specified by: buildClusterer in interface Clusterer
Specified by: buildClusterer in class AbstractClusterer
data - set of instances serving as training data
Exception - if the clusterer has not been generated successfully
public boolean checkForNominalAttributes(Instances data)
data - the data to check
protected int[] initAssignments(int[] ass)
ass - integer array used for storing assignments
protected int[] initAssignments(int numInstances)
numInstances - length of array used for assignments
protected Instances newCentersAfterSplit(double[] pbic, double[] cbic, double cutoffFactor, Instances splitCenters)
pbic - array of the parents' BIC values
cbic - array of the children's BIC values
cutoffFactor - cutoff factor
splitCenters - all children
protected Instances newCentersAfterSplit(boolean[] splitWon, Instances splitCenters)
splitWon - array of booleans indicating whether to take each split
splitCenters - list of split centers
protected boolean stopKMeansIteration(int iterationCount, int max)
iterationCount - current value of counter
max - maximum value for counter
protected boolean stopIteration(int iterationCount, int max)
iterationCount - the current iteration count
max - the maximum number of iterations
protected boolean recomputeCenters(Instances centers, int[][] instOfCent, Instances model)
centers - the input and output centers
instOfCent - the instances to the centers
model - data model information
protected void recomputeCentersFast(Instances centers, int[][] instOfCentIndexes, Instances model)
centers - the input and output centers
instOfCentIndexes - the indexes of the instances to the centers
model - data model information
protected double meanOrMode(Instances instances, int[] instList, int attIndex)
instances - all instances
instList - the indexes of the instances the mean is computed from
attIndex - the index of the attribute
protected boolean assignToCenters(KDTree tree, Instances centers, int[][] instOfCent, int[] allInstList, int[] assignments, int iterationCount) throws Exception
tree - KDTree on all instances
centers - all the input centers
instOfCent - the instances to each center
allInstList - list of all instances
assignments - assignments of instances to centers
iterationCount - the number of iterations
Exception - if something goes wrong
protected boolean assignToCenters(KDTree kdtree, Instances centers, int[][] instOfCent, int[] assignments, int iterationCount) throws Exception
kdtree - KDTree on all instances
centers - all the input centers
instOfCent - the instances to each center
assignments - assignments of instances to centers
iterationCount - the number of iterations
Exception - in case instances are not assigned to a cluster
protected boolean assignToCenters(Instances centers, int[][] instOfCent, int[] allInstList, int[] assignments) throws Exception
centers - all the input centers
instOfCent - the instances to each center
allInstList - list of all indexes
assignments - assignments of instances to centers
Exception - if something goes wrong
protected int nextAssignedOne(int cent, int lastIndex, int[] assignments)
cent - index of the center
lastIndex - index to start searching
assignments - assignments
protected Instances splitCenter(Random random, Instance center, double variance, Instances model) throws Exception
random - random function
center - the center that is split here
variance - variance of the cluster
model - valid data model
Exception - if something in AlgVector goes wrong
protected Instances splitCenters(Random random, Instances instances, Instances model)
random - the random number generator
instances - the instances of the region
model - the model for the centers (should be the same as that of instances)
protected Instances makeCentersRandomly(Random random0, Instances model, int numClusters)
random0 - random number generator
model - data model of the instances
numClusters - number of clusters
protected double calculateBIC(int[] instList, Instance center, double mle, Instances model)
instList - the indices of the instances that belong to the center
center - the center
mle - maximum likelihood
model - the data model
protected double calculateBIC(int[][] instOfCent, Instances centers, double[] mle)
instOfCent - the instances that belong to their respective centers
centers - the centers
mle - maximum likelihood
protected double logLikelihoodEstimate(int numInst, Instance center, double distortion, int numCent)
numInst - number of instances that belong to the center
center - the center
distortion - distortion
numCent - number of centers
protected double[] distortion(int[][] instOfCent, Instances centers)
instOfCent - indices of instances to each center
centers - the centers
protected int clusterProcessedInstance(Instance instance, Instances centers)
instance - the instance to assign a cluster to
centers - the centers to cluster the instance to
protected int clusterProcessedInstance(Instance instance)
instance - the instance to assign a cluster to
public int clusterInstance(Instance instance) throws Exception
Specified by: clusterInstance in interface Clusterer
Overrides: clusterInstance in class AbstractClusterer
instance - the instance to be assigned to a cluster
Exception - if instance could not be clustered successfully
public int numberOfClusters()
Specified by: numberOfClusters in interface Clusterer
Specified by: numberOfClusters in class AbstractClusterer
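Assigning an instance to a cluster, as clusterInstance and clusterProcessedInstance are described as doing, amounts to finding the closest of the numberOfClusters() centers. The sketch below shows that step on plain arrays with squared Euclidean distance. It is an illustration only: `NearestCenterSketch` and `nearestCenter` are hypothetical names, and the sketch skips the missing-value replacement and any attribute normalization that the configured distance function may apply.

```java
/** Hypothetical illustration of nearest-center assignment; not Weka code. */
class NearestCenterSketch {

    /** Returns the index of the center closest to the instance, or -1 if there are no centers. */
    static int nearestCenter(double[] instance, double[][] centers) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double sum = 0.0;
            for (int a = 0; a < instance.length; a++) {
                double diff = instance[a] - centers[c][a];
                sum += diff * diff; // squared distance preserves the ordering
            }
            if (sum < bestDist) {
                bestDist = sum;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        System.out.println(nearestCenter(new double[]{1.0, 1.0}, centers));
        System.out.println(nearestCenter(new double[]{9.0, 8.0}, centers));
    }
}
```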
public Enumeration listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClusterer
public String minNumClustersTipText()
public void setMinNumClusters(int n)
n - the minimum number of clusters to generate
public int getMinNumClusters()
public String maxNumClustersTipText()
public void setMaxNumClusters(int n)
n - the maximum number of clusters to generate
public int getMaxNumClusters()
public String maxIterationsTipText()
public void setMaxIterations(int i) throws Exception
i - the number of iterations
Exception - if i is less than 1
public int getMaxIterations()
public String maxKMeansTipText()
public void setMaxKMeans(int i)
i - the number of iterations
public int getMaxKMeans()
public String maxKMeansForChildrenTipText()
public void setMaxKMeansForChildren(int i)
i - the number of iterations
public int getMaxKMeansForChildren()
public String cutOffFactorTipText()
public void setCutOffFactor(double i)
i - the new cutoff factor
public double getCutOffFactor()
public String binValueTipText()
public double getBinValue()
public void setBinValue(double value)
value - the distance
public String distanceFTipText()
public void setDistanceF(DistanceFunction distanceF)
distanceF - the distance function with all options set
public DistanceFunction getDistanceF()
protected String getDistanceFSpec()
public String debugVectorsFileTipText()
public void setDebugVectorsFile(File value)
value - the file to read the random vectors from
public File getDebugVectorsFile()
public void initDebugVectorsInput() throws Exception
Exception - if there is an error opening the debug input file
public Instance getNextDebugVectorsInstance(Instances model) throws Exception
model - the data model for the instance
Exception - if there are no debug vectors in m_DebugVectors
public String inputCenterFileTipText()
public void setInputCenterFile(File value)
value - the file to read centers from
public File getInputCenterFile()
public String outputCenterFileTipText()
public void setOutputCenterFile(File value)
value - file to write centers to
public File getOutputCenterFile()
public String KDTreeTipText()
public void setKDTree(KDTree k)
k - a KDTree object with all options set
public KDTree getKDTree()
public String useKDTreeTipText()
public void setUseKDTree(boolean value)
value - if true the KDTree is used
public boolean getUseKDTree()
protected String getKDTreeSpec()
public String debugLevelTipText()
public void setDebugLevel(int d)
d - the debug level
public int getDebugLevel()
protected void checkInstances()
public void setOptions(String[] options) throws Exception
-I <num> maximum number of overall iterations (default 1).
-M <num> maximum number of iterations in the kMeans loop in the Improve-Parameter part (default 1000).
-J <num> maximum number of iterations in the kMeans loop for the split centroids in the Improve-Structure part (default 1000).
-L <num> minimum number of clusters (default 2).
-H <num> maximum number of clusters (default 4).
-B <value> distance value for binary attributes (default 1.0).
-use-kdtree Uses the KDTree internally (default no).
-K <KDTree class specification> Full class name of KDTree class to use, followed by scheme options, e.g., "weka.core.neighboursearch.kdtrees.KDTree -P" (default no KDTree class used).
-C <value> cutoff factor, takes the given percentage of the split centroids if none of the children win (default 0.0).
-D <distance function class specification> Full class name of Distance function class to use, followed by scheme options. (default weka.core.EuclideanDistance).
-N <file name> file to read starting centers from (ARFF format).
-O <file name> file to write centers to (ARFF format).
-U <int> The debug level. (default 0)
-Y <file name> The debug vectors file.
-S <num> Random number seed. (default 10)
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClusterer
options - the list of options as an array of strings
Exception - if an option is not supported
public String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClusterer
public String toString()
public Instances getClusterCenters()
protected void PrCentersFD(int debugLevel)
debugLevel - level that gives according messages
protected boolean TFD(int debugLevel)
debugLevel - level that gives according messages
protected void PFD(int debugLevel, String output)
debugLevel - level that gives according messages
output - string that is printed
protected void PFD_CURR(String output)
output - string that is printed
public String getRevision()
Specified by: getRevision in interface RevisionHandler
Overrides: getRevision in class AbstractClusterer
public static void main(String[] argv)
argv - should contain options
Copyright © 2015 University of Waikato, Hamilton, NZ. All rights reserved.