Given a data set *S*, there are many situations where we would like to partition the data set into subsets (called** clusters**) where the data elements in each cluster are more similar to other data elements in that cluster and less similar to data elements in other clusters. Here “similar” can mean many things. In biology it might mean that the organisms are genetically similar. In marketing it might mean that customers are in the same market segment.

Clustering can also be hierarchical, where clustering is done at multiple levels. Here the data set is divided into clusters and these clusters are in turn further divided into more finely granular clusters. The biological classification system (kingdoms, phylum, class, order, family, group, genus, species) is an example of **hierarchical clustering**.

In this chapter we will describe a form of **prototype clustering**, called **k-means clustering**, where a prototype member of each cluster is identified (called a **centroid**) which somehow represents that cluster. The approach we take is that each data element belongs to the cluster whose centroid is nearest to it; i.e. which minimizes the distance between that data element and that cluster’s centroid.

Typically our data elements will be *n*-tuples. These can be thought of as points in n-space or as *n* dimensional vectors. Each dimension can represent some characteristic of the data elements under consideration. Distance will typically be the Euclidean distance (or a weighted version of this distance) as described below. Thus data elements which most share these characteristics will belong to the same cluster.

We also describe a simpler, one-dimensional clustering algorithm called Jenks Natural Breaks.

**Topics**:

- K-means clustering Basic Concepts
- Real Statistics Capabilities
- Choosing the initial clusters (k-means++ algorithm)
- Jenks Natural Breaks

Hi Charles

Thanks so much for these resources!

I am using the Excel for Mac 2011 plugin and I see that the Cluster Analysis option doesn’t appear on the dialogue box – the only one that is missing 🙁

Llew,

Yes, Cluster Analysis is not yet in the latest Mac release of the Real Statistics software, although it is in the Windows releases of the software. It will be part of the next Mac release of the software.

Charles

I have Excel 2013 and I installed all versions of real statistics (2003, 2007, 2013). I am on windows and neither of these had the “cluster analysis” option.

First choose the Multivariate Analysis option. Cluster Analysis will appear on the dialog box that is displayed.

Charles

hey I have excel 2016 for mac where can I find the cluster analysis tool?

Sam,

Sorry, but the current Mac version of the software doesn-t include the cluster analysis tool. THis is currently only available on the Windows version.

Charles

Respected Sir,

Can I fit a double term exponential model in excel? I wish to have the equation

Y = a*exp(-bx)+c*exp(-dx).

If not could you please suggest me some free software to fit the model?

Avijit,

You can use Excel’s Solver to find the parameters a, b, c and d which create a fit for your data. The approach is illustrated on the following webpage

Exponential Regression using Solver

Charles

Hi,

How can I find the main variable that that make the difference between the clusters.

i.e., if I’ll ask someone only these question I’ll know to associate him to cluster .

Thank you

Sarah,

Sorry, but I don’t understand your question. Are you asking “once I have built the model, “how can I assign the appropriate cluster to a new value?”

Charles

Hi,

Yes, after I have the clusters based on 25 questions I want to assign new questionnaires to the clusters by using less questions – only the questions that make the main difference between the clusters.

Thanx

Hi Charles,

Is there a problem with my question or with my clarification?

Thank you for your great tool and your help.

Sarah

Sarah,

Cluster analysis can be used to reduce the number of variables, not necessarily by the number of questions. I guess you can use cluster analysis to determine groupings of questions. You can then try to use this information to reduce the number of questions.

Charles

Do you support the Ward clustering method?

Ron,

I believe that Ward clustering is a type of hierarchical clustering. I don’t support this as of yet. I plan to eventually add this type of support, but have no specific date.

Charles