Christian Zuniga, PhD
Friday October 9, 2020
Clustering is the task of grouping similar items together with the objective of understanding the relations among the items. For example, a company may discover its customers may be placed into groups based on common characteristics that go beyond the obvious ones such as age. These could then be offered more useful services. Clustering is a part of the knowledge discovery process that may reveal hidden patterns and new knowledge¹
If there were two characteristics or features, the items could be visualized as shown in Figure 1. As the figure shows, the items in this toy example are not uniformly distributed in the feature space and tend to gather around groups (2 or 4 from visual inspection).
K-means clustering is among the simplest methods and is usually worth trying first. K corresponds to the number of groups which is entered beforehand. Each group is represented by a cluster center and there are K centers. The algorithm proceeds iteratively in two steps.
For N iterations:
1. For each item in the data set, find the closest cluster center
2. Update the cluster centers.
Figure 2 shows the resulting clusters with K=2 and N=10 iterations. The initial cluster centers are chosen at random in the space. The final clusters depend on the initial cluster center location.
Euclidian distance is frequently used to find the closest centers but other options are available. If an item and a cluster center are given by d-dimensional vectors. x and μ respectively, then the distance is:
Distance quantifies the notion of similarity among items. Similar items have smaller distance. Intuitively, items in the same cluster should have smaller distance since they are most similar. The overall objective of K means clustering is to minimize the distances among its members. Using an indicator variable r allows the overall objective to be written as shown below³.
In the first step of K-means, the cluster centers are fixed and the cluster indicator variables are found. The distance of each item to each cluster center is calculated and the item is assigned to the cluster to which it is closest as shown in Figure 3.
In the second step the assignment variables are fixed and the cluster centers are recalculated. Taking the derivative of the objective J with respect to the cluster centers and setting it equal to zero gives the new centers. A new center is the average of the features of its members. A simple Python implementation is given in Appendix A.
Lets try the K-means algorithm on a more interesting example. Lets cluster the companies in the NASDAQ 100⁴. The closing price data can be obtained using Yahoo Finance through the yfinance package⁵. Lets use 3 months of prices and use the log of the returns. The resulting data matrix is 64x101 (there are actually 101 ticker symbols). Appendix B shows one way to download the data.
This price data can be thought of having 64 dimensions. It is usually better to reduce the dimensionality of the data using PCA or other means before clustering. Distances in higher dimensions loose discriminative ability and take longer to calculate. It is also good practice to scale the data. Sci-kit learn is used in this example⁶. Using the first two principal components allows the companies to be described by 2 features instead of 64. In this case 2 features only account for ~30% of the variance but may still be useful for exploratory purposes. Figure 4 shows a plot of these 2 features for all the companies.
From popular news it is expected that companies can be clustered into sectors such as technologies, financials, industrials, and others. This guides the choice for K, the number of clusters. For example Figure 5 shows the result with K=4 clusters. Table 1 shows the company ticker symbol belong to each cluster. The cluster in yellow seems reasonable and corresponds to known technology companies such as Apple, Google, Amazon, and Facebook. The stock price of these companies has grown in recent months so the analysis appears consistent. Companies in the cluster in blue such as GE appear to be those whose stock price has not changed much.
Many other companies may be unfamiliar and will need to be researched further. K means clustering is providing an initial way of organizing the companies, or data in general. K-means could be improved with additional criteria in initializing the clusters and choosing K⁷. However it does have limitations such as assuming the clusters are circular and a hard clustering approach. Other clustering methods may be better depending on the application.
- Zaki and Meira “Data Mining and Analysis: Fundamental Concepts and Algorithms”, Cambridge University Press 2014
- Harrington “Machine Learning in Action” Manning Publications Co. 2012
- Bishop “Pattern Recognition and Machine Learning” Springer 2006