Monday, January 23, 2017

Rare Pattern Mining


Association rule mining is predominately focuses on finding frequent co-occurring associations among a collection of items.  Sometimes it is referred to as “Market Basket Analysis”, as one of the original motivation of association rule mining was to mine supermarket transactions. The major aim here is to find associations of items that occur together more often than you would expect from a random probability. A  classic example of this application is the famous Beer and Diapers association. :)

Over the years more often people are using association rule mining to find rare patterns. Detecting rare patterns in data is a vital task, with numerous high-impact applications including medical, finance, and security. The main difficulty in finding rare patterns or rare association rule is the mining process can be likened to finding a needle in a haystack [1]. The difficultly of looking for a small needle is multiplied by the fact that the needle is hidden by a large amount of hay strands. 

Rare rules pose difficulties for data mining systems for a variety of reasons. The fundamental problem is the lack of data -- associated with rare cases. Rare rules tend to cover only a few examples or data points. The lack of data makes detecting rare cases even more difficult and, even when rare case are detected, it is hard for us to meaningfully generalize the results.

Overall there are two major categories of the types of rare patterns: basic and extended. In the basic category, we have association rule mining and rare patterns. In the extended category, there is a range of other patterns including subgraph, probabilistic, and sequence patterns. Based on pattern diversity, rare pattern mining can be classified using the following criteria basic and extended patterns.

Some rare patterns may involve sequences and structures. For example, by studying the order in which rare events occur we may be able to find sequential patterns, that is, rare subsequence’s in a sequence of ordered events. For example, when a diagnosing a patient’s condition, the context is reflected in the sequence of conditions that has appeared across a time period. By mining sequential patterns of symptoms we can capture a patient's condition. In this way, a patient's condition can be found at a more abstract level and it gets easier to track and detect changes in the patient's conditions.

In some datasets we need to consider the existential uncertainty of rare itemsets, indicating the probability that an itemset occurs in a transaction, makes traditional techniques inapplicable. For example a medical dataset may contain a table of patient records, each of which contains a set of symptoms and/or illnesses that a patient suffers from. Applying rare association analysis on such a dataset allows us to discover any potential correlations among the symptoms and illnesses for rare diseases. In these cases, symptoms would best be represented by probabilities based on historical data statistics that indicate their presence in the patients' records. These type of data is known as uncertain data.

Until now there are various techniques out there that focuses on finding these rare association rules. I cover a range of these techniques in my survey paper.

[1] Gary M. Weiss. 2004. Mining with rarity: a unifying framework. SIGKDD Exploration Newsletter 6, 1 (2004), 7 – 19.