Book summary: Mining of Massive Datasets — chapter 7: Clustering
As a Data Miner/Data Scientist I prefer to look at code and practical problems rather than theory. But I feel it is time to review theoretical knowledge to strengthen my base knowledge. I want to share with you too. So we all benefit from it.
In this series I will walk you through one of very famous books about Data Mining: Mining of Massive Datasets by Standford University.
I will not review all chapters but few chaps which are more useful and digestable in my opinion. Also I will implement some functions/algorithm in this book as well (but no promise).
Hope you find this journey useful and excited as me!
The goal in this chapter is to offer methods for discovering clusters in data. We are particularly interested in situations where the data is very large, and/or where the space either is high-dimensional, or the space is not Euclidean at all. We shall therefore discuss several algorithms that assume the data does not fit in main memory.
Clustering: Clusters are often a useful summary of data that is in the form of points in some space. To cluster points, we need a distance measure on that space. Ideally, points in the same cluster have small distances between them, while points in different clusters have large distance between them.
Clustering Algorithms: Clustering algorithms generally have one of two forms depending on strategy: Hierarchical/Agglomerative and point-assignment clustering.
Hierarchical/Agglomerative clustering algorithms begin with all points in a cluster of their own, and nearby clusters are merged iteratively.
Point-assignment clustering algorithms consider points in turn and assign them to the cluster in which they best fit.
The Curse of Dimensionality: Points in high-dimensional Euclidean spaces often behave unintuitively. Two unexpected properties of these spaces are that random points are almost always at about the same distance, and random vectors are almost always orthogonal.
Centroids and Clustroids: In a Euclidean space, the members of a cluster can be averaged, and this average is called the centroid. In non-Euclidean space, there is no guarantee that points have an “average”, so we are forced to use one of the members of the cluster as a representative or typical element of the cluster. That representative is called the clustroid.
Radius and Diameter: Whether or not the space is Euclidean, we can define the radius of a cluster to be the maximum distance from the centroid or clustroid to any point in that cluster. Alternative definitions, especially of the radius, are also known, for example, average distance from the centroid to the other points.
2 category of clustering
- Hierarchical Clustering: Family of algorithms has many variations, which differ primarily, which two clusters to merge next. We may decide when to stop the merge process in various ways. How it works:
Step 1: Make each data point a single-point cluster → forms N clusters
Step 2: Take the two closest data points and make them one cluster → forms N-1 clusters
Step 3: Take the two closest clusters and make them one cluster → Forms N-2 clusters.
Step 4: Repeat step-3 until you are left with only one cluster.
- Picking Clusters to Merge: One strategy for deciding on the best pair of clusters to merge in a hierarchical clustering is to pick the clusters with the closest centroids or clustroids. Another approach is to pick the pair of clusters with the closet points, one from each cluster. A third approach is to use the average distance between points from the two clusters.
- Stopping the Merger Process: A hierarchical clustering can proceed until there are a fixed number of clusters left. Alternatively, we could merge until it is impossible to find a pair of clusters whose merger is sufficiently compact, e.g., the merged cluster has a radius or diameter below some threshold. Another approach involves merging as long as the resulting cluster has a sufficiently high “density”, which can be defined in various ways, but is the number of points divided by some measure of the size of the cluster, e.g., the radius.
The main output of Hierarchical Clustering is a dendogram, which shows the hierarchical relationship between the clusters.
Dendogram- is a tree-like diagram that records the sequences of mergers or splits
When the space is non-Euclidean, we need to use some distance measure that is computed from points, such as Jaccard, cosine, or edit distance. It is due to the curse of dimensionality.
2. Point-assignment clustering: Maintain a set of clusters then assigns points to each cluster. There are several types of this clustering family as well as their variants.
2.1. K-means Algorithms:
- Assumes a Euclidean space.
Step 1: Initially choose K points that are likely to be in different clusters;
Step 2: Make this point to be centroids of their clusters;
Step 3: FOR each remaning point p DO:
find the main centroid where p is closet;
Add p to the clusters of that centroids;
Adjust the centroid of that cluster to account for p;
Initializing K-Means Algorithms: One way to find k initial centroids is to pick a random point, and then choose k-1 additional points, each as far away as possible from the previously chosen points. An alternative is to start with a small sample of points and use a hierarchical clustering to merge them into k clusters.
Picking K in a K-Means Algorithm: If the number of clusters is unknown, we can use a binary-search technique, trying a k-means clustering with different values of k. We search for the largest value of k for which a decrease below k clusters results in a radically higher average diameter of the clusters. This search can be carried out in a number of clustering operations that is logarithmic in the true value of k.
2.2. The BFR Algorithm: This algorithm is a version of k-means designed to handle data that is too large to fit in main memory. Strong assumption:
- Clusters are normally distributed in each dimensions.
- Axes are fixed — ellipse at an angle are not OK
Representing Clusters in BFR: Points are read from disk one chunk at a time.
Clusters are represented in main memory by the count of the number of points, the vector sum of all the points, and the vector formed by summing the squares of the components of the points in each dimension. Other collection of points, too far from a cluster centroid to be included in a cluster, are represented as “mini-clusters” in the same way as the k clusters, while still other points, which are not near any other point will be represented as themselves and called “retained” points.
BFR overview: Data is read in chunk each time.
The BFR model contains three sets:
- The discard set: contains the main clusters
- The compress set: contains points that are far from clusters in discard but close to each other. These get summarized as clusters.
- The retain set: contains points that are far from clusters in both discard and compress and also far from other points in retain.
Step 1: The initial step of the algorithm is to pick k points and let these be the main clusters in discard.
Step 2: After picking the initial point it is time to go throught the rest of the points
Step 3: For each point added to a cluster, the sum and sum of squares of the cluster are updated. The point is then discarded.
Step 4: For each point added to a cluster, the sum and sum of squares of the cluster are updated.
2.3. The CURE Algorithm -Clustering Using REpresentative
- Euclidean space, but clusters can have any shape.
- It handles data that is too large to fit in main memory.
- Use collection representative points to represent clusters.
4 remote points are moved forward to the centre of clusters. They are representative points for the clusters.
Step 1.1: Pick random sample points fit into main memory.
Step 1.2: Cluster sample points hierarchically to create the initial cluster
Step 1.3: Pick representative points:
For each cluster, pick k representative points (e.g. k=4) as dispersed as possible
Move each representative point a fixed fraction (eg. 20%) toward the centroid of the cluster
Step 2.1: Rescan the whole dataset and visit each point p in dataset
Step 2.2: Place it in the “closest cluster”. Closest in the sense that the minimum distance between point p to all representative points.