Cluster Analysis

Given a data set S, there are many situations where we would like to partition the data set into subsets (called clusters) where the data elements in each cluster are more similar to other data elements in that cluster and less similar to data elements in other clusters. Here “similar” can mean many things. In biology it might mean that the organisms are genetically similar. In marketing it might mean that customers are in the same market segment.

Clustering can also be hierarchical, where clustering is done at multiple levels. Here the data set is divided into clusters and these clusters are in turn further divided into more finely granular clusters. The biological classification system (kingdoms, phylum, class, order, family, group, genus, species) is an example of hierarchical clustering.

In this chapter we will describe a form of prototype clustering, called k-means clustering, where a prototype member of each cluster is identified (called a centroid) which somehow represents that cluster. The approach we take is that each data element belongs to the cluster whose centroid is nearest to it; i.e. which minimizes the distance between that data element and that cluster’s centroid.

Typically our data elements will be n-tuples. These can be thought of as points in n-space or as n dimensional vectors. Each dimension can represent some characteristic of the data elements under consideration. Distance will typically be the Euclidean distance (or a weighted version of this distance) as described below. Thus data elements which most share these characteristics will belong to the same cluster.

We also describe a simpler, one-dimensional clustering algorithm called Jenks Natural Breaks.


15 Responses to Cluster Analysis

  1. Ron Shani says:

    Do you support the Ward clustering method?

    • Charles says:

      I believe that Ward clustering is a type of hierarchical clustering. I don’t support this as of yet. I plan to eventually add this type of support, but have no specific date.

  2. Sarah says:

    How can I find the main variable that that make the difference between the clusters.
    i.e., if I’ll ask someone only these question I’ll know to associate him to cluster .
    Thank you

    • Charles says:

      Sorry, but I don’t understand your question. Are you asking “once I have built the model, “how can I assign the appropriate cluster to a new value?”

      • Sarah says:

        Yes, after I have the clusters based on 25 questions I want to assign new questionnaires to the clusters by using less questions – only the questions that make the main difference between the clusters.

      • Sarah says:

        Hi Charles,
        Is there a problem with my question or with my clarification?
        Thank you for your great tool and your help.

        • Charles says:

          Cluster analysis can be used to reduce the number of variables, not necessarily by the number of questions. I guess you can use cluster analysis to determine groupings of questions. You can then try to use this information to reduce the number of questions.

  3. Avijit Ghosh says:

    Respected Sir,
    Can I fit a double term exponential model in excel? I wish to have the equation
    Y = a*exp(-bx)+c*exp(-dx).
    If not could you please suggest me some free software to fit the model?

  4. Sam says:

    hey I have excel 2016 for mac where can I find the cluster analysis tool?

    • Charles says:

      Sorry, but the current Mac version of the software doesn-t include the cluster analysis tool. THis is currently only available on the Windows version.

  5. Llew Claasen says:

    Hi Charles

    Thanks so much for these resources!

    I am using the Excel for Mac 2011 plugin and I see that the Cluster Analysis option doesn’t appear on the dialogue box – the only one that is missing 🙁

    • Charles says:

      Yes, Cluster Analysis is not yet in the latest Mac release of the Real Statistics software, although it is in the Windows releases of the software. It will be part of the next Mac release of the software.

      • Hector says:

        I have Excel 2013 and I installed all versions of real statistics (2003, 2007, 2013). I am on windows and neither of these had the “cluster analysis” option.

        • Charles says:

          First choose the Multivariate Analysis option. Cluster Analysis will appear on the dialog box that is displayed.

Leave a Reply

Your email address will not be published. Required fields are marked *