Tuesday, February 24, 2015

Core Patterns of Colossal Patterns


Professor Jiawei Han at the University of Illinois and several of his colleagues created a data mining method for finding what are called colossal patterns.  "Colossal" in this context basically means really big patterns with lots of different items included.  As an example, think about people working in biostatistics looking for long, interesting sequences of DNA.  The "Pattern-Fusion" method that they created has 2 parts to the algorithm.
  1. Use a regular pattern generation method (e.g. Apriori, FP-Growth, etc.) to create the first couple levels of supported patterns.  A common point for this step to stop is when the algorithm reaches 3-item patterns (e.g. ABC), but this can be selected by the user when the algorithm is run.
  2. Take these medium-sized patterns and fuse them together to build bigger patterns much faster than incrementally searching for patterns that have only 1 more item.
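The fusion step can be sketched as merging pairs of patterns by set union.  Here's a minimal Python illustration; the pattern pool and merge rule are simplified assumptions on my part, not Han's full algorithm, which also samples the pool and verifies that each fused candidate is still supported by the database.

```python
import itertools

def fuse_patterns(pool):
    """Merge every pair of patterns in the pool by set union.

    A simplified sketch of Pattern-Fusion's growth step: the real
    algorithm also samples core patterns and checks that each fused
    candidate is still supported by the database.
    """
    fused = set()
    for p, q in itertools.combinations(pool, 2):
        fused.add(p | q)
    return fused

# Small 3-item patterns from an initial Apriori-style pass (made up)
pool = [frozenset("ABC"), frozenset("ABD"), frozenset("CDE")]
candidates = fuse_patterns(pool)
# Fusing ABC with ABD jumps straight to the 4-item candidate ABCD
```

Notice how one fusion jumps from 3-item patterns to a 4- or 5-item candidate, instead of growing one item at a time.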
The reason this method is so valuable can be seen by looking at how many combinations (patterns) are possible as the number of items increases.  The table below shows the number of combinations possible if you have 5, 10, or 15 items (the columns) and pick 1-15 of them (the rows).
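The counts in that table are just binomial coefficients, so you can regenerate (or verify) them yourself in a few lines of Python:

```python
from math import comb  # comb(n, k) = number of ways to choose k items from n

# Number of k-item patterns possible when you have n total items
for k in range(1, 16):
    row = [comb(n, k) for n in (5, 10, 15)]
    print(f"pick {k:2d}: n=5 -> {row[0]:>4}, n=10 -> {row[1]:>4}, n=15 -> {row[2]:>5}")

# With 15 items there are comb(15, 7) = 6,435 distinct 7-item patterns
```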

If you're really only interested in finding the BIG patterns that the data supports, you don't want to get bogged down looking at 6,435 patterns that have 7 items in them (see the last column above).  Plus, this tabular example is VERY simplified; how many sequences are in a strand of DNA?  Oh yeah! Older algorithms might get bogged down forever in that mess in the middle for that type of problem.


Core Pattern and Pattern Robustness
Now that I've summarized the conceptual process, I'm going to explain core patterns and pattern robustness.  Core patterns are the ones that the algorithm "likes" to use in order to create the BIG patterns it's really looking for.  I got my butt kicked by these 2 concepts during my quiz for Professor Han's Coursera course, so I figured I should probably take the time to really understand them and explain them simply for everybody else.


In Han's textbook, Data Mining: Concepts and Techniques, he says that...
"for a pattern α, an itemset β ⊆ α is said to be a τ-core pattern of α if |Dα|/|Dβ| ≥ τ, 0 < τ ≤ 1, where |Dα| is the number of patterns containing α in database D." emphasis added.

Say what?!? For those of you out there who are not fluent in mathspeak, I will translate.  For a pattern named Beta to be a core pattern of pattern Alpha, it needs to meet a couple of requirements.

Criterion 1
Beta needs to be a subset of Alpha (β ⊆ α).  In Venn diagram world this looks like the diagrams below. The line under that 'c'-looking symbol means that β can actually be as big as α (kind of like a less-than-or-equal-to symbol), so they can be the same set of items and still satisfy this criterion.


Criterion 2

That complicated ratio, |Dα|/|Dβ|, needs to have a value greater than or equal to τ.  What is tau, other than a Greek letter?  It's just a value between 0 and 1 that the user of the algorithm picks so that there is a criterion to rate patterns against.  If the calculated ratio is really high, we're more certain the pattern being considered is a core pattern; if it's really low... not so much.  That explains tau, but now we need to understand the ratio.

|Dα| is defined as "the number of patterns containing α in database D".  Although the symbols might be confusing, this is actually relatively easy to calculate.  Just count up all of the records, transactions, etc. in your database that contain the pattern α.  You need to make sure that you're counting records where α appears as a sub-pattern too.  Say one transaction has ABDFE, and you're looking for ABF (this is α in this example); this transaction counts for the calculation.  The same process is followed for β, but based on the first criterion, β is going to be a smaller subset pattern of α or the same pattern as α.  So in the quick example above, β might be A, AB, AF, BF, or even ABF.

The whole point of this ratio is to give a measure of how important β is in the make-up of the bigger pattern α.  If there are 100 records that contain α and 250 records that contain β, then this core pattern "strength", as I'll call it, would be 100/250 = 0.4 (P.S. don't get confused by the Venn diagrams above here.  Those diagrams are for the items in the patterns themselves, NOT for the frequency of the patterns in the database you're mining).  If tau were set by the user to 0.25, then β would be considered a 0.25-core pattern of α.
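Here's a minimal sketch of that counting in Python.  The tiny transaction list is made up for illustration; only the ABDFE transaction comes from the example above.

```python
def support_count(database, pattern):
    """Count transactions that contain every item in the pattern
    (i.e. transactions where it appears, possibly as a sub-pattern)."""
    return sum(1 for txn in database if pattern <= txn)

def core_strength(database, alpha, beta):
    """|D_alpha| / |D_beta| -- how important beta is to the bigger pattern alpha."""
    return support_count(database, alpha) / support_count(database, beta)

# Hypothetical transactions; note ABDFE contains the sub-pattern ABF
db = [frozenset("ABDFE"), frozenset("ABF"), frozenset("ABC"), frozenset("BF")]
alpha, beta = frozenset("ABF"), frozenset("BF")
print(core_strength(db, alpha, beta))  # |D_ABF| = 2, |D_BF| = 3 -> 0.666...
```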

Example
Now let's use an example to clarify this and show how to calculate core pattern robustness.  Suppose we have a database with some patterns that have the counts shown in the table below.
Let's assume that we set tau equal to 0.5.  We'll need to do the core pattern calculation explained above for all of the subsets of each pattern in this list to see whether they are core patterns.  For the first pattern, ACE, there are 10 instances in the database (this one is simple, but others are a little trickier).  Now we figure out the counts for A, C, E, AC, AE, and CE in the database: A=20, C=40, E=10, AC=20, AE=10, CE=10, and we already know ACE=10.  So, for each of these patterns we can calculate the ratio of counts: ACE/A = 10/20 = 0.5; ACE/C = 10/40 = 0.25; ACE/E = 10/10 = 1; ACE/AC = 0.5; ACE/AE = 1; ACE/CE = 1; and ACE/ACE = 1.  So based on our criterion, tau = 0.5, we have core patterns A, E, AC, AE, CE, and ACE; C got cut out because it only scored 0.25.
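Those calculations are easy to double-check in Python using the counts given above:

```python
from itertools import combinations

# Support counts from the worked example above
counts = {"A": 20, "C": 40, "E": 10, "AC": 20, "AE": 10, "CE": 10, "ACE": 10}

def core_patterns(alpha, counts, tau):
    """Return the subsets beta of alpha with counts[alpha] / counts[beta] >= tau."""
    cores = []
    for size in range(1, len(alpha) + 1):
        for combo in combinations(alpha, size):
            beta = "".join(combo)
            if counts[alpha] / counts[beta] >= tau:
                cores.append(beta)
    return cores

print(core_patterns("ACE", counts, tau=0.5))
# ['A', 'E', 'AC', 'AE', 'CE', 'ACE'] -- C misses the cut at 0.25
```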

BCD on the next line is similar, but for BCD the count is 20 because there are 10 transactions of BCD in the 2nd row and 10 transactions of ABCDF in the 4th row; told you there were trickier ones.  Follow the same process as in the above paragraph for the rest of the calculations.  If you do this, you should come up with a table that looks something like this.

As an exercise, try to figure out which patterns didn't make the cut.  There are three of them, and I already showed you one of them (C).  Obviously, if the value of tau were set higher, we would have cut out a lot more patterns and the table would have a lot fewer entries in the last column.

Core Pattern Robustness
Now we need to explain pattern robustness.  If you have the information above already, pattern robustness is pretty easy to figure out.  In mathspeak, an example of how robustness might be stated goes something like this: "A pattern α is (d, τ)-robust if d is the maximum number of items that can be removed from α for the resulting pattern to remain a τ-core pattern" (Han's textbook again).  All that really means is that you're trying to answer the question, "How many items can I remove from the pattern I'm looking at and still have what's left over qualify as a core pattern?"

For the first pattern in the table above, ACE, we have 3 items: A, C, and E.  If we take away 1 item, say A, we can see that CE is still a core pattern in the table.  If we take away a 2nd item, C, then E is still a core pattern.  If we took away E too, we'd have nothing, and that's no good; by definition the null set, or nothing, can't be a core pattern.  So for ACE we were able to take away 2 items and still have a core pattern.  Translating that example to mathspeak, we would say that pattern ACE is (2, 0.5)-robust.  Using similar logic, we could add another column to our table stating the robustness of each row.  Please double-check me... at the time of this writing it is getting late and my mental powers are waning. :)
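Once you have the set of core patterns, robustness boils down to finding the smallest non-empty subset that is still a core pattern.  A short sketch, reusing the tau = 0.5 core patterns of ACE worked out above:

```python
from itertools import combinations

# tau = 0.5 core patterns of ACE from the worked example above
core = {"A", "E", "AC", "AE", "CE", "ACE"}

def robustness(alpha, core):
    """The d in "(d, tau)-robust": max number of items removable from
    alpha so that what's left is still a core pattern."""
    sizes = [size
             for size in range(1, len(alpha) + 1)
             for combo in combinations(alpha, size)
             if "".join(combo) in core]
    return len(alpha) - min(sizes)

print(robustness("ACE", core))  # 2 -> ACE is (2, 0.5)-robust
```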

I hope this helps people out there understand something that took me quite a while to understand from just a couple of sentences in a textbook.