Given a data set S, there are many situations where we would like to partition the data set into subsets (called clusters) such that the data elements in each cluster are more similar to one another than to the data elements in other clusters. Here “similar” can mean many things. In biology it might mean that the organisms are genetically similar. In marketing it might mean that customers are in the same market segment.
Clustering can also be hierarchical, where clustering is done at multiple levels. Here the data set is divided into clusters and these clusters are in turn divided into finer-grained clusters. The biological classification system (kingdom, phylum, class, order, family, genus, species) is an example of hierarchical clustering.
In this chapter we describe a form of prototype clustering, called k-means clustering, in which a prototype member of each cluster (called a centroid) is identified to represent that cluster. The approach we take is that each data element belongs to the cluster whose centroid is nearest to it; i.e. the cluster that minimizes the distance between that data element and the cluster’s centroid.
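The assign-to-nearest-centroid idea above can be sketched in a few lines of Python. This is a minimal illustration of the standard k-means loop (assign, then recompute centroids), not the Real Statistics implementation; the function names and the random choice of initial centroids are assumptions for the sketch.

```python
import random

def dist2(p, q):
    # squared Euclidean distance between two n-tuples
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=100):
    """Minimal k-means sketch: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    # naive initialization: pick k data elements at random as centroids
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters
```

For example, six points forming two tight groups in the plane are separated into two clusters of three points each. A smarter initialization (the k-means++ algorithm listed below) reduces the chance of a poor random start.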
Typically our data elements will be n-tuples. These can be thought of as points in n-space or as n-dimensional vectors. Each dimension can represent some characteristic of the data elements under consideration. Distance will typically be the Euclidean distance (or a weighted version of this distance) as described below. Thus data elements that most share these characteristics will belong to the same cluster.
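The distance measure just described can be written out directly. Below is a small sketch of the Euclidean distance between two n-tuples, with an optional per-dimension weight vector; the function name and the `w` parameter are assumptions for illustration, not part of the text above.

```python
import math

def weighted_euclidean(p, q, w=None):
    """Euclidean distance between two n-tuples p and q.
    If w is given, each dimension's squared difference is scaled
    by the corresponding weight (a hypothetical weighting scheme)."""
    if w is None:
        w = [1.0] * len(p)  # unweighted: plain Euclidean distance
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q)))
```

For instance, `weighted_euclidean((0.0, 0.0), (3.0, 4.0))` gives 5.0, the familiar 3-4-5 triangle.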
We also describe a simpler, one-dimensional clustering algorithm called Jenks Natural Breaks.
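To give a feel for the Jenks objective, here is a brute-force sketch for small one-dimensional inputs: try every way of splitting the sorted values into k contiguous groups and keep the split with the smallest total within-group squared deviation. Real implementations use dynamic programming; this version only illustrates the criterion being minimized.

```python
from itertools import combinations

def jenks_breaks(values, k):
    """Brute-force Jenks Natural Breaks sketch: minimize the total
    sum of squared deviations within k contiguous groups of sorted values."""
    xs = sorted(values)
    n = len(xs)

    def ssd(group):
        # sum of squared deviations from the group mean
        m = sum(group) / len(group)
        return sum((x - m) ** 2 for x in group)

    best, best_groups = float("inf"), None
    # choose k-1 break positions between consecutive sorted values
    for cuts in combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        groups = [xs[bounds[i]:bounds[i + 1]] for i in range(k)]
        total = sum(ssd(g) for g in groups)
        if total < best:
            best, best_groups = total, groups
    return best_groups
```

For example, splitting [12, 1, 11, 2, 10, 3] into two classes yields the natural groups [1, 2, 3] and [10, 11, 12].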
- K-means Clustering Basic Concepts
- Real Statistics Capabilities
- Choosing the initial clusters (k-means++ algorithm)
- Jenks Natural Breaks