Saturday, February 28, 2015

Chi-Squared

Another measure of "interestingness" that can be used in data mining is the chi-square test. This is a test I have a fair amount of experience with from my Six Sigma and manufacturing background. I'll use a simple example from my past to walk you through how it is calculated first, then give some warnings about ways it can be misused.

Suppose you have a factory that makes widgets and have 2 machines in the manufacturing process that perform the same step, say they're both plastic injection molding machines. When parts come out you have an inspector who classifies the parts as good or bad. Over time, you collect all of this data and come up with a table like the one below. In this example, machine 1 made 400 parts (375 good, 25 bad) and machine 2 made 257 parts (230 good, 27 bad), for 657 parts in total.

It looks like they produced roughly the same quantity of bad parts, but machine 2 also made fewer parts overall. What chi-squared will do is help us determine if that difference is (statistically) significant. The first thing we need to do is determine what we would expect if the 2 machines had the same quality. To get this expectation we use the values in the row totals and column totals. If there were no real difference between the 2, then we would expect a good ratio of 605/657 = 92.08% for both machines. To get the bad ratio (defect rate) we just take 1 minus the good ratio and we get 7.92%. Now we just need to account for the fact that the 2 machines produced different quantities. With these expected good/bad ratios we can calculate how many good parts we expect if we produce 400 or 257 parts. We would expect machine 1 to produce 400*92.08% = 368.34 good parts and 400*7.92% = 31.66 bad parts. We would expect machine 2 to make 257*92.08% = 236.66 good parts and 257*7.92% = 20.34 bad parts. This gives us an expected table that looks like this.

The formula to calculate any expected value in the table above is [Row Total]*[Column Total]/[Grand Total].  If you look back to the logic in the paragraph above, you'll see that is exactly what we did.
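Here is a rough Python sketch of that formula, in case it helps to see it run. The observed counts are the ones implied by the totals quoted in this post (375 good out of 400 for machine 1, 605 good out of 657 overall), and the function name `expected_counts` is just something I made up for illustration.

```python
# Observed counts implied by the totals quoted in the post.
# Rows: machine 1, machine 2. Columns: good, bad.
observed = [
    [375, 25],
    [230, 27],
]

def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for name, row in zip(["machine 1", "machine 2"], expected):
    print(name, [round(value, 2) for value in row])
```

The rounded values should match the 368.34 / 31.66 and 236.66 / 20.34 figures worked out by hand above.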

Now that we have expected values, we can calculate the chi-square statistic. This is done by taking each cell (e.g. good parts from machine 1) and calculating (observed value - expected value)^2/(expected value). For the upper left cell this would be (375-368.34)^2/368.34 = 0.1204. After doing this for each cell we get a table that looks like this.
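The per-cell calculation can be sketched in Python like this. Again, the observed counts are the ones implied by the totals in this post, and `chi_square_cells` is a hypothetical helper name, not something from a library.

```python
# Rows: machine 1, machine 2. Columns: good, bad.
observed = [
    [375, 25],
    [230, 27],
]

def chi_square_cells(table):
    """Return (observed - expected)^2 / expected for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    cells = []
    for i, row in enumerate(table):
        cell_row = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            cell_row.append((obs - exp) ** 2 / exp)
        cells.append(cell_row)
    return cells

cells = chi_square_cells(observed)
statistic = sum(sum(row) for row in cells)

for row in cells:
    print([round(value, 4) for value in row])
print("chi-square:", round(statistic, 2))
```

The four contributions come out to roughly 0.12, 1.40, 0.19 and 2.18. Notice that the bad-part cells contribute the most, because their expected counts are small.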

Then you just have to add these up to get your chi-square statistic; in this example it's 3.89. A chi-square statistic doesn't mean much unless you know how many degrees of freedom you have. Don't know what degrees of freedom means? That's OK. For now, all you need to know is that for a chi-square test it is equal to (# of rows - 1)*(# of columns - 1). We just have a 2x2 table here so we get (2-1)*(2-1) = 1 degree of freedom. The 3.89 chi-squared value and the 1 degree of freedom are used to look up a p-value. I think of a p-value as a probability value. When you're looking at a p-value, you are looking at the probability of seeing a difference at least this big purely by chance if nothing interesting is going on in your data. So, if you want to find something interesting, you're hoping for REALLY small p-values. In order to get the p-value I almost always use MS Excel. The Excel function would look like this: "=CHIDIST(3.89,1)". For our problem, we get a p-value of 0.0486. This can be interpreted as follows: if nothing interesting were happening here, we would see a difference this large only about 4.86% of the time. A common threshold that statisticians use for this p-value is 5%. Since our 4.86% is less than 5%, we would say that this difference is statistically significant.
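If you don't have Excel handy, here is a minimal Python sketch of the same lookup. It leans on a shortcut that only works for 1 degree of freedom (a chi-square variable with 1 degree of freedom is the square of a standard normal), so treat `chi_square_p_value_1df` as an illustrative helper of my own, not a general-purpose function. For other degrees of freedom you would want a statistics library, for example `scipy.stats.chi2.sf(statistic, dof)`.

```python
import math

def chi_square_p_value_1df(statistic):
    """Right-tail p-value for a chi-square statistic with 1 degree of freedom.

    With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    This identity does NOT hold for other degrees of freedom.
    """
    return math.erfc(math.sqrt(statistic / 2))

rows, cols = 2, 2
dof = (rows - 1) * (cols - 1)  # (2-1)*(2-1) = 1

p_value = chi_square_p_value_1df(3.89)
print("degrees of freedom:", dof)
print("p-value:", round(p_value, 4))
print("significant at 5%:", p_value < 0.05)
```

This should agree with the Excel result to about four decimal places.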

Now that we know the mechanics of how to calculate it, let's talk briefly about the intuition behind the numbers. If 2 features (e.g. good/bad and machine 1/2) have nothing interesting going on, what is the chi-square value going to be? To start, you would expect the values in the expectations table to be exactly the same as the data you started with. Once you know that, you also know that all of the values in the chi-square table will be zero; (observed-expected)^2/expected will give you 0 divided by something, which is zero. The chi-square value can never be negative because the numerator is a squared value and the denominator is an expected count, which is positive. The least interesting thing possible is 0, and the most interesting thing would be...some ridiculously large number (theoretically infinity), but in real life you don't ever get infinity as the answer.
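To see the "nothing interesting" case in action, here is a small sketch. The counts are made up for illustration (they are not from my factory example): both machines have exactly a 10% defect rate, so the expected table equals the observed table.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp
    return statistic

# HYPOTHETICAL counts: both machines have a 10% defect rate.
# Rows: machine 1, machine 2. Columns: good, bad.
same_quality = [
    [360, 40],
    [180, 20],
]

boring = chi_square(same_quality)
print("chi-square when nothing is going on:", boring)
```

Every term in the sum is a square divided by a positive expected count, so the result can never drop below zero no matter what counts you feed in.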

Now, after all that explanation, I've got to let you down a little bit, because chi-square isn't very good for data mining applications, for the same reason the lift measure has problems. If we use the same example that I used in my lift post, we have transactions for apples and oranges like this:


If I follow the instructions above to calculate chi-square for the first table, I get 0.586. If I do the same thing for the 2nd table, I get 714,283.7. Hmmm. In theory the interaction between apple and orange purchases is the same in both tables, but the chi-square statistic gets very confused by the double null transactions (the transactions that contain neither apples nor oranges). That is why null-invariant measures for interestingness are so important when mining data patterns (another post to come soon explaining these measures).
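Here is a small sketch of that effect. To be clear, the counts below are made-up illustration values, not the tables from my lift post, so the numbers won't match the 0.586 and 714,283.7 above. The only thing that changes between the two tables is the number of transactions with neither apples nor oranges.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp
    return statistic

# HYPOTHETICAL counts for illustration only.
# Rows: apples yes / no. Columns: oranges yes / no.
# The bottom-right cell is the double null count (neither item bought).
few_nulls = [
    [100, 1000],
    [1000, 10000],
]
many_nulls = [
    [100, 1000],
    [1000, 100000],
]

chi_few = chi_square(few_nulls)
chi_many = chi_square(many_nulls)
print("few double nulls: ", round(chi_few, 3))
print("many double nulls:", round(chi_many, 3))
```

The apple and orange counts are identical in both tables, yet the statistic jumps from zero to several hundred just because more people bought neither fruit.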