Tuesday, February 17, 2015

Closed Itemsets

When we are looking for patterns in data sets, often the amount of patterns that can be found can be huge.  To demonstrate this I'll use a simple set of examples.  Lets say that you have one transaction with items A and B (I know this wouldn't give you much information, but it proves my point).  In this transaction there are 3 patterns/combinations: A, B and AB.  Now let's say that the transaction has A, B, and C in it.  This transaction allows for 7 patterns: A, B, C, AB, AC, BC, and ABC.  If it had A, B, C, D, then there would be 15 patterns: A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD. You can see that the number of patterns that can be created grows MUCH faster than the number of items in the transaction. This growth pattern is based on the mathematical concept of combinations (wikipedia).

Also, any data set that anybody would actually care about would have more than one transaction, probably thousands, if not millions or billions.  This explosion of complexity that comes from all of the possible patterns that can be found in data is the motivation for ways of compressing the possible patterns/combinations.  That is what closed itemsets are all about.


Now I'm going to have to get a little technical here, but I'm going to translate for you too so don't worry.  The definition of a closed pattern goes like this, "A pattern (itemset) X is closed if X is frequent, and there exists no super-pattern Y  X, with the same support as X" (J. Han, Pattern Discovery in Data Mining, Coursera Lecture Material) 

Let's break this definition down.  "itemset" is just a group of items.  It could be one item (A), 2 items (AB), 3-items (ABC)... k-items. People sometimes refer to itemset size by telling you what k equals.  k is just a variable that represents the size/complexity of the pattern you're looking at. By, "frequent", the definition is saying that this itemset, X,  you're looking at meets the minimum support requirements, so it might be interesting.  The whole bit in the definition about "super-pattern Y  X" Basically means that all of X is contained in Y.  In Venn diagram world, if Y is a super-pattern of X then it looks like this

supper-pattern Venn diagram

In our example above, an example of this "super-pattern" notion would be like saying that ABC is a super-pattern of AB.  The notation Y  X is a fancy mathematical way of saying the same thing. The way I think about a closed pattern is that it's the biggest, most complex, pattern I can find at various levels of support.  It isn't exactly like that, but it will become clearer with an example.

Let's suppose we have a simple set of transactions like this
Simple transaction table

If you are looking for patterns that have support greater than or equal to 1, there would be a lot of combinations possible for just 2 transactions.  Based on the work we already did above, there are 15 combinations for the first transaction.  Transaction 20 also has the 15 combinations, but there is some overlap in possible combinations in transaction 10 and 20.  To visualize this, lets list them out in a table.

Combination Table

 If you look closely at the combination table above, you can find some repeats if you compare transaction 10 and 20.  To make this easier to see, I've rearranged the entries a little bit here.

Expanded Combination Table

The green highlighted combinations are found in both transactions.  Based on this, there is support of 1 for every combination in the expanded combination table, except for C, D, and CD which have support of 2.  This is where closed itemsets come in handy.  If you remember the Apriori Principle, if a more complex combination meets the minimum support requirements, then the subset, or simpler patterns, will also meet those requirements.  In this example, if I make the statement that I know that CD has support of 2, then I automatically know that C, and D also have support of at least 2.  In this case C and D have a support of exactly 2 so CD is one of my closed patterns in this data.  Let's revisit the definition of a closed pattern to make sure that's true.  There are 2 requirement for a pattern to be a closed pattern
  1. It's frequent; yep CD meets that criteria
  2. No super-pattern with the same support as CD;  There are bigger patterns in the data, but none of them have support of 2 so we're good here too
What's so awesome about this is that I can say I have closed pattern CD with support of 2, and I automatically know that C and D also have support of 2 because of the apriori principle.  We can use this one close pattern to compress three patterns without losing any information!

Now let's look for some other closed patterns in the data, so you can get the hang of it.  The way I explained closed data sets before was that they are normally the most complicated patterns that are frequent.  In our table above we've got ABCD and CDEF to work with that seem to fit that criteria.  So which one is a closed pattern? Both actually.  Let's look at the 2 rules for ABCD

  1. It's frequent; in our example we only need frequency of 1, so we're good here
  2. No super-pattern with the same support as ABCD;  CDEF has the same support as ABCD, but CDEF is not a supper pattern of ABCD.  
The same logic is applied to CDEF.  Going back to Venn diagram world the whole set of closed patterns in this data looks like this, where S is the support for each closed pattern.
Closed pattern Venn diagram

If you use the Apriori principle, and these 3 closed patterns, you can reconstruct the entire data table we started with without losing any information.  Pretty awesome right?


There is another way to compress pattern data called max-patterns.  They are almost exactly the same as closed patterns, except that we don't care about whether other patterns have the same support.  A max-pattern still has to satisfy the minimum support threshold thought.  The easiest way to explain this is by using the example above for closed patterns.  In that example we have closed patterns ABCD, CD, and CDEF. CD is a distinct closed pattern because it has a support of 2 and the others only have support of 1.  This is despite that fact that it is actually a subset, or part of, ABCD and CDEF.  If you're looking for max-patterns you don't care about this distinction so you only get ABCD and CDEF as max-patterns; that's because CD is part of the larger patterns ABCD and/or CDEF.  A max-pattern will give you more pattern compression, but you can see that even in this very simple example, you're losing some of the information that was originally contained in the raw data because now we don't know how much support CD, C, or D had.