Suppose you have a factory that makes widgets, and 2 machines in the manufacturing process perform the same step; say they're both plastic injection molding machines. When parts come out, an inspector classifies them as good or bad. Over time, you collect all of this data and come up with a table like the one below.
The formula to calculate any expected value in the table above is [Row Total]*[Column Total]/[Grand Total]. If you look back to the logic in the paragraph above, you'll see that is exactly what we did.
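As a rough sketch of that formula in Python: the function name `expected_counts` and the counts below are made up for illustration and are not the table from this post.

```python
def expected_counts(observed):
    """Expected count for every cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts, NOT the table from this post.
# Rows: machine 1, machine 2. Columns: good, bad.
observed = [[90, 10],
            [80, 20]]

expected = expected_counts(observed)
print(expected)  # [[85.0, 15.0], [85.0, 15.0]]
```

Both machines get the same expected counts here because, in this made-up table, each machine produced the same number of parts.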
Now that we have expected values, we can calculate the chi-square statistic. This is done by taking each cell (e.g. good parts from machine 1) and calculating (observed value - expected value)^2/(expected value). For the upper left cell this would be (375-368.34)^2/368.34 = 0.1203. After doing this for each cell we get a table that looks like this.
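The per-cell calculation is small enough to sketch in a couple of lines of Python. The helper name `cell_contribution` is my own. Plugging in the rounded 368.34 gives roughly 0.1204 rather than 0.1203; the tiny gap presumably comes from rounding in the expected value.

```python
def cell_contribution(observed, expected):
    """One cell's share of the chi-square statistic."""
    return (observed - expected) ** 2 / expected


# Upper left cell: good parts from machine 1.
upper_left = cell_contribution(375, 368.34)
print(round(upper_left, 4))
```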
Then you just add these up to get your chi-square statistic; in this example it's 3.89. A chi-square statistic doesn't mean much unless you know how many degrees of freedom you have. Don't know what degrees of freedom means? That's OK. For now, all you need to know is that for a chi-square test it is equal to (# of rows - 1)*(# of columns - 1). We just have a 2x2 table here, so we get (2-1)*(2-1) = 1 degree of freedom.
The 3.89 chi-square value and the 1 degree of freedom are used to look up a p-value. I think of a p-value as a probability value. When you're looking at a p-value, you are looking at the probability of getting a chi-square value at least this large if nothing interesting were going on in your data (here, if good/bad had nothing to do with which machine made the part). So, if you want to find something interesting, you're hoping for REALLY small p-values. To get the p-value I almost always use MS Excel. The Excel function would look like this: "=CHIDIST(3.89,1)". For our problem, we get a p-value of 0.0486. This means that if nothing interesting were happening, we would see a difference this big only about 4.86% of the time. A common threshold that statisticians use for this p-value is 5%. Since our 4.86% is less than 5%, we would say that this difference is statistically significant.
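If you don't have Excel handy, here is a minimal Python sketch of the same lookup. It leans on a shortcut that only holds for 1 degree of freedom, where the upper-tail chi-square probability equals erfc(sqrt(x/2)); for any other number of degrees of freedom you would want a statistics library instead. The function names are my own.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """(# of rows - 1) * (# of columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def p_value_one_dof(chi_square):
    """Upper-tail p-value, valid ONLY for 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


dof = degrees_of_freedom(2, 2)   # 2x2 table
p = p_value_one_dof(3.89)
print(dof, round(p, 4))
print("significant at 5%" if p < 0.05 else "not significant at 5%")
```

This should land on about the same 0.0486 that CHIDIST reports.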
Now that we know the mechanics of how to calculate it, let's talk briefly about the intuition behind the numbers. If 2 features (e.g. good/bad and machine 1/2) have nothing interesting going on, what is the chi-square value going to be? To start, you would expect the values in the expectations table to be exactly the same as the data you started with. Once you know that, you also know that all of the values in the chi-square table will be zero; (observed-expected)^2/expected will give you 0 divided by something, which is zero. The chi-square value can never be negative because the numerator is a squared value and the denominator is an expected count, which is always positive. So the least interesting value possible is 0, and the most interesting would be...some ridiculously large number (theoretically infinity), but in real life you don't ever get infinity as the answer.
Now, after all that explanation, I've got to let you down a little bit, because chi-square isn't very good for data mining applications, for the same reason that the lift measure has problems. If we use the same example that I used in my lift post, we have transactions for apples and oranges like this
If I follow the instructions above to calculate chi-square for the first table, I get 0.586. If I do the same thing for the second table, I get 714,283.7...hmmm. In theory the interaction between apple and orange purchases is the same in both tables, but the chi-square statistic gets very confused by the double null transactions (the ones with neither apples nor oranges). That is why null-invariant measures for interestingness are so important when mining data patterns (another post to come soon explaining these measures).
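To see the effect for yourself, here is a small Python sketch with made-up counts (these are NOT the tables from this post or the lift post). The two tables are identical except for the number of transactions containing neither fruit, yet the statistic jumps from 0 to several hundred.

```python
def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp
    return stat


# Hypothetical transaction counts.
# Rows: apples bought / not bought. Columns: oranges bought / not bought.
few_nulls = [[100, 1000],
             [1000, 10_000]]
many_nulls = [[100, 1000],
              [1000, 100_000]]   # only the "neither" cell changed

print(chi_square(few_nulls))
print(chi_square(many_nulls))
```

Nothing about the apple and orange purchases themselves changed between the two tables; only the pile of unrelated transactions grew.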