Tuesday, March 3, 2015

Null-Invariant Measures of Interestingness

In 2 of my previous posts about measures of interestingness (Lift and Chi-Square), I mentioned I would be writing a post about null-invariant measures of interestingness.  I'm a man of my word, so here you go.  The problem I presented in those posts can be summarized with these 2 tables.

If you click back to those previous posts (links above), you'll see that the lift for the first table was 0.714 and the lift for the 2nd table would be calculated as 0.929.  The chi-square value for the first table was 0.586 and the value for the 2nd is calculated as 0.022.  For both of these measures, we get different numbers when the size of the bottom right cell (the "null" count) is changed.  The table below shows this information in a slightly different view.

Since we're pattern mining, and we're really only interested in the interaction between the two items (apples and oranges in this example), we'd love to have a way to measure interestingness that doesn't include that "null" count in the calculation at all.  Data scientists call this type of interestingness measure "null-invariant".  Let's see what they've come up with to help us out here.

Based on a paper entitled "Selecting the Right Interestingness Measure for Association Patterns" written by Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava in 2002, there are 3 measures of interestingness that have this "null-invariant" property we're looking for: Confidence, Cosine, and Jaccard.  The Confidence measure actually has at least 3 variations that I'm aware of, so we'll cover 5 different null-invariant measures in total here.

If you've read my post about support and confidence, you may remember that near the end I mentioned that confidence gives different values based on the direction it's calculated in (e.g. the confidence for Apples → Oranges may not be the same as the confidence for Oranges → Apples).  If that is the case, which one should you pick?  With so many potential patterns that can be found in data already, data scientists would rather combine these 2 options into 1 answer instead of having to duplicate all of their results to show both sides of the same coin all the time.  To do this, they have taken the original confidence equation, P(B|A) (or P(A|B), depending on which direction you look at it), and created the following alternate forms

At this point, it is my duty as your guide to understanding data mining in a simple way to calm you down if the stuff in the definition column gave you heart palpitations.  All three of these are actually VERY easy to understand if you already understand confidence.  I'm going to work from the bottom of this list up.  Max Confidence basically says: just pick the bigger of the 2 confidence values.  Kulczynski (pronounced kul-ZIN-skee) takes the approach of averaging the 2 together instead.  For All Confidence, it may not be apparent, but you're actually taking the minimum value of confidence.  Remember that confidence is just the count (or support) of records that have both items (AB) divided by the support of the first item in a statement like A → B.  If you divide sup(AB) by the maximum of sup(A) and sup(B), you're dividing by the larger of the two possible denominators, which gives you the smaller (minimum) of the two confidence values.  So, like I said, it might look scary, but it's simple.
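The three variants are easy to express in code.  Here's a minimal Python sketch, using hypothetical counts I made up for illustration (n_ab is the number of records with both items, n_a and n_b the counts for each item on its own):

```python
def all_confidence(n_ab, n_a, n_b):
    # sup(AB) / max(sup(A), sup(B)): dividing by the larger support
    # yields the smaller (minimum) of the two confidence values
    return n_ab / max(n_a, n_b)

def max_confidence(n_ab, n_a, n_b):
    # the larger of P(B|A) and P(A|B)
    return max(n_ab / n_a, n_ab / n_b)

def kulczynski(n_ab, n_a, n_b):
    # the average of P(B|A) and P(A|B)
    return (n_ab / n_a + n_ab / n_b) / 2

# Hypothetical counts: 1000 records with item A, 750 with item B, 500 with both
print(all_confidence(500, 1000, 750))   # 0.5
print(max_confidence(500, 1000, 750))   # ~0.667
print(kulczynski(500, 1000, 750))       # ~0.583
```

Notice that the "null" count never appears as an input to any of the three functions, which is the whole point.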

Another good interestingness measure is the cosine function.  The way I think about the cosine function is that it is essentially the lift function in disguise, but null-invariant.  Let me show you what I mean.  Here are the equations for both:

You can see that the only real difference between the 2 is that Cosine has a square root in the denominator.  There's a good reason for this.  In my post about lift, I mentioned that to calculate lift correctly, you have to make sure you are using the support as a percent of the total records in the database.  If you don't do this, the lift value is wrong.  The fact that you have to divide by the grand total of all the records is also why the lift measure is susceptible to changes in the number of null records.  Cosine overcomes this with a trick I was taught in intermediate school.

When I first saw the lift equation written out this way, I really wanted to get the "grand total" values to cancel out because it would be so much cleaner.  Another way of writing (or thinking about) lift is (AB/G)/((A*B)/G^2), where AB is the count of A & B together, A is the count of A, B is the count of B, and G represents the grand total.  Using fraction math, you can cancel out one of the grand totals, but not the 2nd one, leaving you with (AB)/(A*B/G), so one G is still stuck in there.  Cosine takes care of this problem.  Because the denominator sits under a square root, the grand total cancels out of the function entirely, making the measure null-invariant.  It looks like this: (AB/G)/sqrt((A*B)/G^2) = (AB/G)/(sqrt(A*B)/G), which simplifies to AB/sqrt(A*B).
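To see the cancellation in action, here's a small Python sketch comparing the two (the counts are hypothetical ones I picked for illustration; note that lift needs the grand total G while cosine doesn't):

```python
import math

def lift(n_ab, n_a, n_b, g):
    # (AB/G) / ((A/G) * (B/G)): the grand total G never fully cancels
    return (n_ab / g) / ((n_a / g) * (n_b / g))

def cosine(n_ab, n_a, n_b):
    # AB / sqrt(A * B): the grand total has cancelled out completely
    return n_ab / math.sqrt(n_a * n_b)

# Hypothetical counts: 1000 records with A, 750 with B, 500 with both.
# Padding the database with 10,000 extra "null" records changes lift...
print(lift(500, 1000, 750, 2000))    # ~1.33
print(lift(500, 1000, 750, 12000))   # ~8.0
# ...but cosine doesn't even take the grand total as an input
print(cosine(500, 1000, 750))        # ~0.577
```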

The Jaccard function is defined as |A∩B|/|A∪B|.  To explain what that means, let's go to Venn diagram world.
As a reminder, that upside-down U symbol (∩) is an intersection and is represented by the orange area in the diagram above.  The right-side-up U symbol (∪) is a union and is represented by all of the area that is yellow, orange, or red.  The '|' symbols in the formula are just notation indicating that we want the count of items in the set, instead of the set itself.  Now that that's fresh in our minds again, the numerator of the Jaccard is the orange area, and the denominator is the combined yellow-orange-red area, NOT double counting the orange area where the circles overlap.  With that understanding, you can see why Jaccard is null-invariant: it doesn't use any data from the blank space (the null records) at all.  I could make the white space in the Venn diagram above HUGE and the Jaccard would still have the same value.
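Here's the same idea in a short Python sketch, using hypothetical record IDs I made up; Python's set operators make the intersection-over-union definition nearly literal:

```python
def jaccard(n_ab, n_a, n_b):
    # |A ∩ B| / |A ∪ B|; the union count is n_a + n_b - n_ab so the
    # overlap isn't double counted, and null records never appear
    return n_ab / (n_a + n_b - n_ab)

# Hypothetical record IDs for baskets containing each item
a = {1, 2, 3, 4}   # records with apples
b = {3, 4, 5}      # records with oranges
print(len(a & b) / len(a | b))              # 0.4
print(jaccard(len(a & b), len(a), len(b)))  # 0.4
```

Adding any number of records that contain neither item changes neither set, so the value can't move.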

Lastly, let's take a look at our apple and orange example one more time with all of these different measures of interestingness.

First of all, in the 2nd row, you can see that none of the measures we've added to this table are affected by the different "null" number.  I also added a couple of rows to demonstrate how the different measures change with different values in the table.  In row 3, you can see what a positive correlation looks like, instead of a negative one.  In row 4, you can see what no correlation looks like.  It's interesting to note that for Jaccard, no correlation is represented by a value of 1/3.  Then in rows 5 and 6, I show the extremes of these null-invariant measures: all of them are bound between 0 and 1, where 0 means STRONG negative correlation and 1 means STRONG positive correlation.
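One way to see why the 2nd row comes out identical across all five measures: none of them even accepts the null count as an input.  A quick Python sketch gathering them up (the counts are hypothetical, not the post's actual tables):

```python
import math

def null_invariant_measures(n_ab, n_a, n_b):
    # all five measures from a 2x2 table's three non-null cells;
    # the "null" cell simply isn't an argument
    return {
        "all_confidence": n_ab / max(n_a, n_b),
        "max_confidence": max(n_ab / n_a, n_ab / n_b),
        "kulczynski": (n_ab / n_a + n_ab / n_b) / 2,
        "cosine": n_ab / math.sqrt(n_a * n_b),
        "jaccard": n_ab / (n_a + n_b - n_ab),
    }

# Whether the table has 100 or 1,000,000 null records, the result is the same
print(null_invariant_measures(500, 1000, 750))
```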

Row 7 is there as a cautionary tale.  Based on an astute comment on this post, I revised the table above so that you can compare rows 1, 2, and 7.  They all have the same values except for the "null" number.  If the "null" number gets too big, like in row 7, then someone who has bought an orange is far more likely than the general population to buy an apple, yet the null-invariant measures report the same values as in rows 1 and 2.  This is not immediately obvious, and it can be a stumbling block when using null-invariant measures of interestingness.

I think that sums it up.  I hope that helps!