Article 2022-09-18 20:30:03
Ricard Santiago

Hierarchical cluster analysis


Cluster analysis lets you segment data into a small number of summary groups. For this reason, it is widely used in marketing to segment customers, and in psychology and psychiatry to group patients by condition, such as paranoia or schizophrenia.


Cluster analysis belongs to the group of structural techniques. As with factor analysis, its objective is to summarize the information present in the data set. It works with nominal, interval and ratio scales. Nominal variables are converted to dummy variables: presence or absence is recorded, normally in binary form as 0 and 1. In any case, this technique requires all variables to be of the same type; they cannot be mixed.
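Dummy coding can be illustrated with a minimal Python sketch. The variable (`surface`) and the function name `dummy_encode` are hypothetical and only serve to show how presence or absence becomes 0 and 1.

```python
def dummy_encode(values):
    """Turn a list of category labels into 0/1 presence columns."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical nominal variable: surface type of five road sections.
surface = ["asphalt", "concrete", "asphalt", "gravel", "concrete"]
columns, rows = dummy_encode(surface)

print(columns)
for row in rows:
    print(row)
```

Each original value becomes one row with a single 1 in the column of its category, so every resulting variable is of the same binary type.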

This technique is widely used in marketing because it creates groups that are internally very homogeneous (customer segments) but very heterogeneous with respect to the other groups.

I will now show a brief example that groups highway sections according to their characteristics using hierarchical clustering.

I removed the columns that were not needed and scaled the data so that all variables carry the same weight and none dominates the others. I computed the distance matrix using the Euclidean distance, which generalizes the calculation of the hypotenuse to more than two dimensions. I also generated two plots: a dendrogram, which shows the records hierarchically and grouped by cluster, and a PCA projection of the same grouping, which shows the different clusters more clearly. The latter is built with furthest-neighbor (complete) linkage.
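The three steps described here (scaling, Euclidean distance matrix, furthest-neighbor grouping) can be sketched in plain Python. This is an illustration under assumptions, not the code behind the plots: the `sections` data and all function names are hypothetical, and scaling uses the sample standard deviation.

```python
import math
from statistics import mean, stdev


def scale_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]  # assumes no constant column
    return [[(x - m) / s for x, m, s in zip(row, centers, spreads)]
            for row in rows]


def euclidean(a, b):
    """Straight-line distance: the hypotenuse in any number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows):
    return [[euclidean(a, b) for b in rows] for a in rows]


def complete_linkage(dist):
    """Agglomerative clustering with furthest-neighbor (complete) linkage.

    Returns the merges in order as (members_a, members_b, height).
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance = largest distance between any two members.
                d = max(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical sections: [length in miles, speed limit in mph].
sections = [[12.0, 62.0], [11.0, 60.0], [6.0, 48.0], [6.5, 50.0], [20.0, 55.0]]
scaled = scale_columns(sections)
merges = complete_linkage(distance_matrix(scaled))
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The list of merges and their heights is the information a dendrogram draws: each merge is a junction, and its height is where the junction sits on the vertical axis.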

As the first dendrogram shows, I opted for four clusters, placing the horizontal cut line at a height of 6, which seems to me the clearest division. The PCA plot confirms that four is the best number of clusters. The between-cluster sum of squares (betweenss) comes out quite high, since the clusters are well separated, which indicates strong heterogeneity between clusters.
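One common way to quantify that separation is the share of total variation that lies between clusters (between sum of squares divided by total sum of squares). The Python sketch below assumes this is the measure meant here; the points, labels and function names are hypothetical.

```python
def centroid(points):
    """Mean of each coordinate across the given points."""
    return [sum(col) / len(points) for col in zip(*points)]


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to a center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def between_ss_ratio(points, labels):
    """between_SS / total_SS, where between_SS = total_SS - within_SS."""
    total = sum_of_squares(points, centroid(points))
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    within = sum(sum_of_squares(g, centroid(g)) for g in groups.values())
    return (total - within) / total


# Hypothetical data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [1, 1, 2, 2]
ratio = between_ss_ratio(points, labels)
print(round(ratio, 3))
```

A ratio close to 1 means almost all the variation is explained by cluster membership, which is what well-separated clusters look like numerically.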

I then recomputed the dendrogram with agnes to obtain the agglomerative coefficient, which measures the amount of clustering structure and therefore how strong the grouping is. As you can see in the image, the result is 0.77, which is quite high. The ideal value is 1, but that is not going to happen in practice.
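To make the agglomerative coefficient concrete: for each observation, take the height at which it is first merged, divide it by the height of the final merge, and average one minus that ratio over all observations. The Python sketch below follows that definition. The data and function name are hypothetical, the `linkage` parameter is an assumption of this sketch, and Ward linkage is left out to keep it short.

```python
import math


def agglomerative_coefficient(points, linkage="average"):
    """Average of 1 - (first-merge height / final-merge height)."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    combine = {
        "single": min,                            # nearest neighbor
        "complete": max,                          # furthest neighbor
        "average": lambda ds: sum(ds) / len(ds),  # mean pairwise distance
    }[linkage]

    clusters = [[i] for i in range(n)]
    first_merge = [None] * n  # height at which each observation first merges
    height = 0.0
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = combine([dist[p][q]
                             for p in clusters[i] for q in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        height, i, j = best
        for side in (clusters[i], clusters[j]):
            if len(side) == 1:  # still a lone observation: record its height
                first_merge[side[0]] = height
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    # After the loop, `height` holds the height of the final merge.
    return sum(1 - h / height for h in first_merge) / n


# Hypothetical data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
for method in ("single", "complete", "average"):
    print(method, round(agglomerative_coefficient(points, method), 3))
```

Observations that join a cluster early, at a small height relative to the final merge, push the coefficient towards 1, which is why a high value indicates clear structure.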

I also checked which linkage method gives the best result with agnes. The methods I chose are average, single, complete and Ward. The scores were:

Therefore, I chose the Ward method because it yields the strongest structure. I then recreated the dendrogram: although the clusters are arranged differently, each one still groups the same data.

Group 1 could be considered the ideal cluster relative to the other three, since it has the lowest accident rate, 2.87. The highways in this group have a length of 12.3 miles and a lot of traffic (56,000), with trucks making up 10% of total traffic. They have 0.1 traffic lights per mile and an average speed limit of 62 miles per hour. They have the widest hard shoulder of the four clusters (10 feet), the most lanes (5.2 on average), the fewest access points per mile (3.94) and the most highway-type interchanges per mile (1.152). The lane width, 12 feet, is standard and similar to the other clusters. In short, these are wide roads with good shoulders, heavy traffic, many lanes and few access points.

Group 2 is where the most accidents occur, with a rate of 8.07 accidents per million vehicle miles. It has the shortest length, 6.41 miles, about half that of group 1. It has 3.2 times less traffic than cluster 1 and 4.25 times more than cluster 4. Trucks make up 6.6% of total traffic, the speed limit is 48 miles per hour and there are 1.37 traffic lights per mile. The hard shoulder is 5 feet wide, half that of cluster 1, and there are 2.8 lanes on average. It has 30.52 access points per mile and 0.28 highway-type interchanges per mile. Finally, the lane width is 12 feet. In general terms, cluster 2 combines the highest accident rate with more stops because of the number of traffic lights, low speed, short sections, few lanes and many access points.

Group 3 has an accident rate of 3.57 and a length of 9.62 miles, with daily traffic of 20,000, 8.8% truck traffic relative to total traffic, 0.66 traffic lights per mile, an average speed limit of 55.58 miles per hour, a hard shoulder width of 7.47 feet, 3.29 lanes, 9.44 access points per mile, 0.24 highway-type interchanges per mile and a lane width of 12.11 feet. In short: a fairly low accident rate, a fairly short road, some traffic lights, the second-highest speed limit (intermediate speed), few access points and wide infrastructure.

Group 4 has an accident rate of 3.16 and a length of 20.43 miles, with daily traffic of 4,833, 10.83% truck traffic relative to total traffic, 0.10 traffic lights per mile, an average speed limit of 54.16 miles per hour, a hard shoulder width of 5.5 feet, 2.16 lanes, 11.78 access points per mile, 0.013 highway-type interchanges per mile and a lane width of 11.67 feet. In short: the second-lowest accident rate (after group 1), a long highway, almost no stops thanks to the low number of traffic lights, intermediate speed, two lanes, about eleven access points per mile and a narrow lane.
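Group descriptions like these come from averaging each variable within each cluster. A minimal Python sketch of that profiling step follows; the rows, labels, variable names and the function `cluster_profiles` are hypothetical, not the article's data set.

```python
from collections import defaultdict
from statistics import mean


def cluster_profiles(rows, labels, names):
    """Mean of every variable within each cluster, keyed by cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {name: round(mean(col), 2)
                for name, col in zip(names, zip(*members))}
        for label, members in sorted(grouped.items())
    }


# Hypothetical sections: [accident rate, speed limit in mph].
names = ["accident_rate", "speed_limit"]
rows = [[2.0, 60.0], [4.0, 64.0], [8.0, 48.0]]
labels = [1, 1, 2]

profiles = cluster_profiles(rows, labels, names)
for label, profile in profiles.items():
    print(label, profile)
```

Comparing the resulting rows side by side is what allows statements such as "the widest hard shoulder of the four clusters".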

This type of analysis is very useful for finding structure in the data for each road and identifying what helps reduce the number of accidents. From the data, the safest option seems to be a medium-to-long road with few stops and access points and a medium-high speed limit. With further analysis and more observations, it would be possible to describe what roads should look like to be safe, which is very useful for anyone planning to build or renovate a road.
