Book summary: Mining of Massive Datasets — chapter 6: Frequent Itemsets
As a Data Miner/Data Scientist I prefer to look at code and practical problems rather than theory. But I feel it is time to review theoretical knowledge and want to share with you too. So we all benefit from it.
In this series I will walk you through one of very famous books about Data Mining: Mining of Massive Datasets by Standford University.
I will not review all chapters but few chaps which are more useful and digestable in my opinion. Also I will implement some functions/algorithm in this book as well (but no promise).
Hope you find this journey useful and excited as me!
The chapter is about major families of techniques for characterizing data: the discovery of frequent item-sets. This problem is often viewed as the discovery of "association rules", although the latter is a more complex characterization of data, whose discovery depends fundamentally on the discovery of frequent item-sets.
- Market-Basket Data: This model of data assumes there are two kinds of entities: items and baskets. There is a many-many relationship between items and baskets. Typically, baskets are related to small sets of items, while items may be related to many baskets
- Frequent Itemsets: The support for a set of items is the number of baskets containing all those items. Item-sets with support that is at least some threshold are called frequent item-sets.
support(item-sets) = No of baskets
frequent item-sets = support(item-sets) >threshold
Application: Analysis of true market baskets, related concepts (items ~word & basket ~document), plagiarism (item ~document, basket~sentence), biomarkers (item~biomarker, basket~patient)
2. Association Rules: These are implications that if a basket contains a certain set of items I, then it is likely to contain another particular item j as well. The probability that j is also in a basket containing I is called the confidence of the rule. The interest of the rule is the amount by which the confidence deviates from the fraction of all baskets that contain j.
- The Pair-Counting Bottleneck: To find frequent item-sets, we need to examine all baskets and count the number of occurrences of sets of a certain size. For typical data, with a goal of producing a small number of item-sets that are the most frequent of all, the part that often takes the most main memory is the counting of pairs of items. Thus, methods for finding frequent item-sets typically concentrate on how to minimize the main memory needed to count pairs.
3. The A-Priori Algorithm for Pairs: We can find all frequent pairs by making two passes over the baskets. ON the first pass, we count the items themselves, and then determine which items are frequent. On the second pass, we count only the pairs of items both of which are found frequent on the first pass. Monotonicity justifies our ignoring other pairs.
Drawback: Expensive and space and time.
Pass 1: Item counts
For finding frequent k-tuple: Scan entire data k times.
Pass 2: Return frequent items.
Needs room in main memory to count each candidate k-tuple. Typical k=2 requires the most memory
- Finding Larger Frequent Item-sets: A-Priori and many other algorithms allow us to find frequent item-sets larger than pair, if we make one pas cover the baskets for each size item-set, up to some limit. To find the frequent item-sets of size k, monotonicity lets us restrict our attention to only those item-sets such that all their subsets of size k-1 have already been found frequent.
- The PCY Algorithm: This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass.
During pass 1, maintain a hash table
Keep a count for each bucket into which pairs of items are hashe
Pass 2: Only count pairs that hash to frequent buckets
- The Multistage Algorithm: We can insert additional passes between the first and second pass of the PCY Algorithm to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent items that have hashed to frequent buckets on all previous passes.
- The Multihash Algorithm: We can modify the first pass of the PCY Algorithm to divide available main memory into several hash tables. On the second available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if they hashed to frequent buckets in all hash tables.
- Randomized Algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion. We can then find the frequent item-sets for the sample, and hop that it is a good representation of the data as a whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (item-sets that are frequent in the sample but not the whole) and false negatives (item-sets that are frequent in the whole but not the sample).
Take a random sample of the market baskets
Run A-priori in main memory
Verify the candidate pairs by a second pass
- The SON Algorithm: An improvement on the simple randomized algorithm is to divide the entire file of baskets into segments small enough that all frequent item-sets for the segment can be found in main memory. Candidate item-sets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent item-sets. This algorithm is especially appropriate in a MapReduce setting
- Toivonen's Algorithm: This algorithm starts by finding frequent item-sets in a sample, but with the threshold lowered so there is little chance of missing an item-set that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the item-sets that are frequent in the sample, but also, the negative border — item-sets that have not been found frequent, but all their immediate subsets are. If no member of the negative border is found frequent in the whole then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample.
- Frequent Item-sets in Streams: If we use a decaying window with constant c, then we can start counting an item whenever we see it in a basket. We start counting an item-set if we see it contained within the current basket, and all its immediate proper subsets already are being counted. As the window is decaying, we multiply all counts by 1-c and eliminate those that are less than 1/2.