Interpreting Cluster Analysis 
Interpreting Results from Cluster Analysis
By James Kolsky, June 1997
Cluster analysis is a popular classification technique, frequently used in market research, that divides data into groups. The data are arranged in rows (purchase intent scores, for example) and columns (sales concepts, for instance). Rows can then be clustered with respect to columns, or columns with respect to rows. For example, clustering techniques can be used to identify demographic or psychographic characteristics of consumers with similar purchasing histories, or to isolate differences between groups of products. Market researchers can then study the individual clusters of consumers or products in more detail in order to get the most from future marketing strategies. Cluster analysis is most often used when neither the number of groups in the data nor the group to which each observation belongs is known before the analysis. Objects within a cluster should be quite similar to one another, and clusters should generally be distinct, i.e. not overlapping.

Hierarchical methods, in which clusters are defined according to similarity or dissimilarity measures, remain the most popular approach, and user-friendly software makes the analysis accessible to researchers in a wide variety of fields. However, such software packages rarely provide clear guidelines for how the analysis should be performed. It is therefore left to the researcher to determine the method of analysis that will provide the most valid conclusions.

The validity of conclusions drawn from cluster analysis is sometimes questioned, since very different clusters can be formed from the same data depending on how the analysis is performed. This apparent ambiguity results from decisions made by the researcher before the actual analysis. Careful attention to these preanalysis decisions will increase your ability to obtain meaningful results. The first of these issues is which variables should be included in the analysis.
If important variables are ignored, future results will be suspect.

Secondly, a distance measure must be chosen. These distances are used to form the actual clusters: observations separated by small distances are grouped together. Euclidean distance is commonly used and is the default in many software packages. It may be a poor choice, however, if variables are measured on different scales. Variables on larger scales will have stronger effects on the clusters formed. In fact, in this situation the clusters may simply reflect the variables that are measured on similar scales. Appropriate transformations of selected variables, such as standardization, may be needed to eliminate this feature of the data. The city-block distance is another measure that can be chosen. It is considered more robust, or resistant, to the effect of outliers in the data. The final choice of distance will depend not only on the data but also on the objectives of the analysis. For example, should clusters reflect observations that differ greatly on only a single variable, or those that differ moderately on several variables?

Lastly, the algorithm used to form the clusters must be chosen. While there are literally hundreds of such algorithms, selection is effectively limited to the half dozen or so available in whatever software package is being used. Still, the choice among these half dozen algorithms is not always clear, and different choices can again lead to different cluster solutions. The single linkage method is the simplest method of joining clusters and the most commonly used. It is robust to small perturbations in the distances, but a small addition of data can significantly alter the clustering. The distances, as defined above, are computed from each object not yet in a cluster to its nearest cluster member. The object is then assigned to the nearest cluster.
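Returning to the choice of distance measure, the effect of measurement scale can be seen in a few lines of code. The sketch below is an illustration only: it uses plain Python with no libraries, and the respondent data, the function names, and the use of z-score standardization as the "appropriate transformation" are assumptions made for the example, not part of any particular software package.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical respondents: (purchase intent on a 1-5 scale, household income in dollars)
respondents = [
    [1, 40000],
    [5, 41000],
    [1, 46000],
    [5, 90000],
    [3, 65000],
]

# On the raw data, income dominates: respondent 0 looks closer to respondent 1
# (opposite purchase intent, similar income) than to respondent 2.
raw_01 = euclidean(respondents[0], respondents[1])
raw_02 = euclidean(respondents[0], respondents[2])

# After standardizing, both variables carry comparable weight and the order flips.
z = standardize(respondents)
std_01 = euclidean(z[0], z[1])
std_02 = euclidean(z[0], z[2])

print("raw:         ", round(raw_01, 2), round(raw_02, 2))
print("standardized:", round(std_01, 2), round(std_02, 2))

# One large difference versus two moderate ones, measured from the origin:
one_big = (3, 0)
two_moderate = (2, 2)
print("euclidean:", euclidean((0, 0), one_big), euclidean((0, 0), two_moderate))
print("cityblock:", cityblock((0, 0), one_big), cityblock((0, 0), two_moderate))
```

In the last comparison, Euclidean distance treats the observation with two moderate differences as the nearer one, while city-block distance treats the observation with a single large difference as nearer. Which behavior is preferable depends on the objectives of the analysis.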
Other commonly used methods are complete linkage and average linkage. Complete linkage distances are computed between nonclustered objects and the farthest cluster member; average linkage distances are computed as the average of the distances between a nonclustered object and every member of the cluster.

If changes in the distances and algorithms used can change cluster membership, how are meaningful results to be obtained? How much confidence can researchers place in future marketing strategies based on these clustering results?

The first step in determining the validity of the clusters begins well before the analysis stage. Researchers should fully understand both the data and the research objectives. Exploratory data analysis is easily done in most statistical software packages. Boxplots and histograms can be examined to detect outliers or differences in scale among the variables. Crosstabs and scatter plots can be examined to look for strong relationships between variables.

Secondly, again before the analysis, thought should be given to what the conclusions of the study are expected to be. It is rare that those involved with the project have no general idea of what the outcome might look like. Previous analysis results, or simply the intuition of a researcher who has been absorbed in similar data for several months, can provide a feeling for the likely final outcome.

Lastly, it is important to keep in mind that different methods of cluster analysis work better with clusters of certain shapes. For example, methods that work well when groupings within the data tend to be spherical may perform poorly when the groupings tend to be elliptical.

Thus, when performing the analysis, examine several types of distance measures and algorithms. Compare the resulting clusters. Are they showing similar pictures? Differences tend to occur when there are no natural clusters in the data.
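To make the advice about trying several algorithms concrete, the sketch below runs the single, complete, and average linkage rules on the same small data set. It is a minimal, unoptimized illustration in plain Python, not the implementation used by any software package. The function names and the scores are hypothetical, and the sketch assumes the common agglomerative form in which each rule is applied between whole clusters, starting with every observation in its own cluster.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each linkage rule reduces the pairwise distances between two clusters to one number.
LINKAGE = {
    "single": min,                             # nearest pair of members
    "complete": max,                           # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),   # mean over all pairs of members
}

def agglomerate(points, n_clusters, linkage="single", dist=euclidean):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    rule = LINKAGE[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        def gap(pair):
            a, b = pair
            return rule(
                [dist(points[i], points[j]) for i in clusters[a] for j in clusters[b]]
            )
        # combinations() yields a < b, so deleting b leaves index a valid.
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)

# Hypothetical scores for six respondents on a single standardized variable,
# spaced so that each respondent is a little farther from the previous one.
scores = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.6,)]

for name in LINKAGE:
    print(name, agglomerate(scores, 2, linkage=name))
```

With these made-up scores, single and average linkage split off only the last respondent, while complete linkage groups the last two respondents together. The data form a chain with no natural clusters, which is the situation in which the methods tend to disagree. In practice a library routine would normally be used; SciPy, for example, offers `scipy.cluster.hierarchy.linkage` with `method` values that include "single", "complete", and "average".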
In the case of different outcomes, do some of the clusters contradict what researchers expected to see? Do the described relationships make sense? Are they actionable? Answers to the preanalysis issues may indicate which distances and algorithms are more appropriate. Results from several of the distance or algorithm options may be uninformative. For instance, one method may result in a single cluster.

In cases where several different conclusions are obtained and there is no reason to view any as preferable to the others, reexamine the study objectives. Choose the analysis that best answers the objectives. If multiple results answer the questions of interest, keep the solution as simple as possible. In cluster analysis, simplicity usually means a smaller number of clusters in the solution.

Above all, remember that cluster analysis is an exploratory tool, and different algorithms may very well detect different patterns in the data, none of which may be "wrong"; each is simply a different method of "slicing" the data.
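When comparing solutions from several distance measures or algorithms, a simple numeric measure of agreement can supplement a visual comparison. The sketch below uses the Rand index, the share of observation pairs that two solutions treat the same way. This measure is not part of the discussion above; it is offered only as one possible way to ask whether two solutions "show similar pictures", and the cluster labels are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two clusterings agree.

    A pair counts as agreement when both solutions put the two observations
    in the same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical cluster labels for six respondents from two different methods
solution_1 = [0, 0, 0, 0, 0, 1]
solution_2 = [0, 0, 0, 0, 1, 1]

print(round(rand_index(solution_1, solution_2), 3))
```

A value of 1 means the two solutions group the observations identically, regardless of how the clusters happen to be numbered. Lower values indicate that the choice of method is changing cluster membership, which is a signal to return to the study objectives before choosing a solution.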
