Saturday, March 28, 2015

Simple Text Retrieval Vector Space Model Explanation

This post is going to be the beginning of my coverage of the content in the Text Retrieval and Search Engines course taught by Chengxiang Zhai at the University of Illinois through Coursera.  I'm going to review the basic vector space model for text retrieval, and hint at some of the evolutions of this model and why they're important.

Vector Space Model
I'm not sure how many of you out there took linear algebra courses, or know much about vectors, but let's discuss this briefly, otherwise you'll be completely lost.  Vector space doesn't look like outer space, it looks more like this if you look at a simple 2-dimensional vector.

The vector here starts at the origin of the graph and then extends out to the point (1,1).  In text retrieval, the 2 axes of the graph might represent 2 search terms you're looking for (think googling "hard drive").  The vector on the graph would represent the query itself.  Here, the fact that the vector points to (1,1) means that the query contains 1 instance of the term "hard" and 1 instance of the term "drive", and this could also be represented as {hard, drive}.  On the same graph, we can add documents that are in our database (which could also be the world wide web).  There are a couple of ways to do this, but we'll start simple.  The easiest way is to search through the document(s) and determine whether the search terms are present.  If a term is present, assign it a 1; if not, assign it a 0.  This is called the bit vector approach.  Assume that the following represents 2 documents with only the relevant text included.

   D1 = {'s hard to determine....}
   D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}

Using the 1 and 0 rule explained above, I would assign D1 the vector (1,0) and D2 the vector (1,1).  In this version of text retrieval these vectors are called bit vectors.  Plotting these on the vector space graph we get this.

'q' here represents our "query".  You can see that D2 matches the query perfectly using this system.  D1 seems to be going in a different direction.  Data scientists use this "closeness" of the vectors to measure similarity between the search terms and the documents reviewed.  To do this, they use the dot product.  A dot product is just a fancy way of saying you multiply the corresponding terms together and then add up the products.  If you're familiar with MS Excel, this is like using the sumproduct() function.  For our example above, the dot product between the query and D1 is 1x1+1x0=1.  For D2 we get 1x1+1x1=2.  The dot product value can be used to rank the documents.  "Closeness" is ranked like a basketball score, the higher the better, so you can see that D2 ranks higher than D1.
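If you'd rather see this as code, here's a tiny Python sketch of the bit vector scoring above (the post doesn't include code, so the variable names are mine):

```python
# Query {hard, drive} and the two documents as bit vectors
# over the dimensions (hard, drive).
q = (1, 1)
d1 = (1, 0)  # D1 contains "hard" but not "drive"
d2 = (1, 1)  # D2 contains both terms

def dot(a, b):
    """Dot product: multiply matching terms and sum (like Excel's SUMPRODUCT)."""
    return sum(x * y for x, y in zip(a, b))

print(dot(q, d1))  # 1 -> D1 scores lower
print(dot(q, d2))  # 2 -> D2 ranks higher
```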

Now I want to step back and explain the dot product a little more so you can get some intuition behind why the dot product works well for measuring the similarity of vectors.  If you look up the definition of a dot product on Wikipedia you'll find that the dot product (A·B) can also be calculated this way: A·B = ||A|| ||B|| cosθ.  The ||A|| represents the length of the vector A.  If you remember back to trigonometry, the length of the hypotenuse of a right triangle is calculated like this

If you use this thinking and apply it to vector space, you can see that for the query and the documents we have right angle triangles, and we can calculate the lengths of the vectors using this method.  When we do this we get ||D1||=1, and ||D2||=||q||=1.414..., where ||q|| represents the length of the query vector.  Just because the lengths are equal doesn't mean that the vectors are the same.  That is why the dot product definition includes the cosθ term.  If the value of theta (θ) is close to 0, then cosine is close to 1; as theta gets closer to 90 degrees, cosine gets closer to 0.  So, if you have two vectors that are pointed in almost the same direction, the cosine between them will be ~1, which allows the full ||A|| ||B|| to contribute to the relevance score.  If the two vectors are 90 degrees apart, ||A|| ||B|| gets multiplied by 0, essentially eliminating the score.
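Here's a quick Python sketch of those length and cosine calculations (the helper names are mine):

```python
import math

q  = (1, 1)   # query {hard, drive}
d1 = (1, 0)
d2 = (1, 1)

def length(v):
    # Pythagorean theorem: square root of the sum of squares
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (length(a) * length(b))

print(length(d1))     # 1.0
print(length(q))      # ~1.414
print(cosine(q, d2))  # ~1.0   -> same direction
print(cosine(q, d1))  # ~0.707 -> 45 degrees apart
```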

Now that we have that simple understanding of vectors and dot products, let's look at this in 3 dimensions.  In the diagram below we have a 3 word query that creates 3 dimensions to compare documents against.  Here we're assuming the query is {hard, drive, test}.

If we look at several documents, we might have documents that contain the following text:
D1 = {'s hard to determine...}
D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}
D3 = {...before I bought my new car, I took it out for a test drive...}
D4 = { part of the factory acceptance, every unit gets a hard drive test...}
D5 = {...I think I failed, that was a hard test...a standardized test is design to be hard for...}

When I put these documents in the 3 dimensional vector space diagram it looks like this

You can see the angles between the query vector and the document vectors and get a feel for which documents match closely.  The actual dot product calculations for these query/document comparisons turn out like this using the binary bit vector approach described earlier.

Based on the dot product values, it's obvious that D4 is the best match.  You can also see this in the last column that gives you the angle between the vectors.  To calculate this angle I had to calculate the length of the vectors. This is VERY similar to doing it in 2 dimensions.  3D length is calculated like this
A, B, and C are perpendicular lengths in the vector space (distances along the hard, drive, and test dimensions).  You just square them all, sum, and take the square root.  This same principle applies to higher dimensions (e.g. the length of a 4-dimensional vector is the square root of the sum of the 4 squared terms/dimensions).
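Putting the 3-dimensional example into Python (the bit vectors below are read off the five documents above; the helper names are mine):

```python
import math

# Bit vectors over the dimensions (hard, drive, test),
# built from the five documents above.
q = (1, 1, 1)
docs = {
    "D1": (1, 0, 0),
    "D2": (1, 1, 0),
    "D3": (0, 1, 1),
    "D4": (1, 1, 1),
    "D5": (1, 0, 1),
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def length(v):
    return math.sqrt(sum(x * x for x in v))

def angle_deg(a, b):
    c = dot(a, b) / (length(a) * length(b))
    c = max(-1.0, min(1.0, c))  # clamp to acos's domain (floating-point safety)
    return math.degrees(math.acos(c))

for name, d in docs.items():
    print(name, dot(q, d), round(angle_deg(q, d), 1))
# D4 gets the highest dot product (3) and the smallest angle (0 degrees)
```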

Term Frequency
The bit vector (1's and 0's) approach is a very simplistic model, however, and there are times when it doesn't give you very good results.  Let's do the same problem over again with what is called the term frequency approach.  This method counts the frequency of each of the terms found in the query, instead of just determining whether the term exists in the document or not.  If we apply this rule to the problem above, the vector space looks like this

And the overall dot product calculation table looks like this

In bit vector mode, D2, D3, and D5 were all considered equally relevant.  Using term frequency we are able to say that D5 is a better match than D2, and D2 is a better match than D3.  The problem, though, is that now D5 looks like it's a better match than D4, which doesn't make sense.  One method to deal with this problem is to discount the score impact of terms that occur frequently across many documents.  In our case, if we discounted "hard" or "test", we might find that D4 comes out on top again.  This method is called inverse document frequency and I'll write another post about that later.  When I do, the link to it will be HERE.
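Here's the term frequency version in Python.  The counts below are my reading of the document snippets above (the exact tokenization is a judgment call):

```python
# Term-frequency vectors over (hard, drive, test): how many times
# each query term appears in each document snippet.
q = (1, 1, 1)
docs = {
    "D2": (1, 2, 0),  # "drive" appears twice in D2
    "D3": (0, 1, 1),
    "D4": (1, 1, 1),
    "D5": (2, 0, 2),  # "hard" and "test" each appear twice in D5
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

scores = {name: dot(q, v) for name, v in docs.items()}
print(scores)  # D5 now scores 4, beating D4's 3 -- the problem described above
```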

As always, I hope this post helps somebody out there.

Saturday, March 21, 2015

Graph Pattern Mining (gSpan) - Introduction

For those of you out there that follow this blog fairly regularly, you may have noticed that there has been a lull in the number of posts recently.  This is not because I have not been working on my next post; it is because it has taken me quite some time to understand the topic described in this post.  That being said, I'm assuming (yes, I know what happens when people assume) that for many of you out there, this topic would drain several precious hours of your life to understand as well.  So, I'm going to share what I've learned about the gSpan algorithm.  In this post I will lay the foundation by describing how data scientists create some sort of order out of all of the different possible graph configurations that can exist.  In my gSpan algorithm post, I'll describe how the information presented in this post is used to find frequent graph patterns.  The information in this post is based on "gSpan: Graph-Based Substructure Pattern Mining" by Xifeng Yan and Jiawei Han, September 3, 2002.

Graphs and why they're tricky to pattern mine
First of all, let's start SUPER simple.  What do people mean when they talk about finding patterns in graphs?  Let's just say that they aren't talking about something like this:

They are actually talking about some sort of visual representation of something, like this:

I have no idea what the graph above represents...I totally made it up.  But, it could look like a chemical composition diagram to some of you out there, or maybe a social network diagram, a family history tree...whatever.  Essentially, graphs are systems that have nodes (the A, B and C in the picture above) and connections between them (solid, dashed and double lines above).  In graph mining terminology they call a node a vertex and a connection an edge.  If you ask me why, I have no idea, but that random knowledge might help you if you ever decide to pick up a technical paper on the subject.

Now, if you have followed some of my previous posts on pattern mining (Apriori Basket Analysis, or Generalized Sequential Pattern Mining) you might look at that graph above and ask yourself, where do I start with that thing?  This is a major question that has vexed many mathematicians and data scientists.  Let's just start by stating the fact that, in general, when we're dealing with graphs, we don't really care about the orientation of the graph, or how it "looks" per se; we really only care about what vertices are connected to each other and how.  As an example, here's that first graph again, with some other equivalent graphs.

You'll notice the term "Isomorphism" in that picture.  That's just a fancy way of saying those graphs are essentially the same as graph 1 if you only consider the vertices and edges, but rearranged to look different.  Since what we really want is to be able to look for sub-graphs (or small subsets of the graphs in our database) and see if they are frequent, we would love to be able to create some sort of order for the graphs so we could kind of treat them like a sequential pattern.  In sequential pattern mining we represent patterns something like this: <A,(B,C),D,E,(E,F)>.  There's a chronological order there: first A, then B and C, then D, then E, then E and F.  But with a graph, where do you start??? For example, I can number the vertices of graph 1 several different ways

First of all, make a note that the numbering starts with the number 0; this is referred to as the root.  Two of the variations above start at the C at the top (but is that really the "top"?).  One starts at the A in the lower left corner, and the last one starts at the B.  The order that the nodes are numbered can create a LOT of numbered graphs, but they're all different versions of the same thing.  What a data miner really needs is a way to create some sort of sequential order so that we always number graph nodes/vertices the same way, and therefore don't waste a whole bunch of time dealing with graphs that might be numbered differently but represent the same thing we already have.  Since I'm focusing on the gSpan algorithm, I'm going to describe how DFS codes solve this problem.

DFS in DFS code stands for depth first search.  When we get into the gSpan algorithm, the reason for this will become more apparent, but for now just go with it.  A DFS code is just a way of documenting the vertices and edges of a graph in tabular form, but with special rules.  We'll represent each edge by defining 5 things.  To do this, first we need to create codes for the features in our graphs.

With this coding system, we can turn the green edge in the 1st graph above into (0,1,C,c,B).  The 0 is for the starting vertex; the 1 is for the ending vertex; 'C' is for the starting vertex label/type; 'c' is for the edge label/type; and 'B' is for the ending vertex label/type.  The reason for having a code for both the vertex number and the vertex label is that if you're analysing a graph for frequent patterns, the vertex number doesn't give you much information about patterns, but we need it to keep track of where we are in the graph.  If you finish coding all of the edges in that first graph you might come up with something that looks like this.
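In code, a DFS code is nothing fancier than an ordered list of 5-tuples.  Here's the green edge and the first few edges of Example 1 as given in this post, written out in Python:

```python
# One edge of a DFS code as a 5-tuple:
# (start vertex number, end vertex number, start label, edge label, end label)
green_edge = (0, 1, "C", "c", "B")

# A whole DFS code is just an ordered list of such edges,
# e.g. the start of Example 1 from this post:
example_1 = [
    (0, 1, "C", "c", "B"),
    (1, 2, "B", "a", "A"),
    (2, 0, "A", "b", "C"),
]
print(example_1[0])  # (0, 1, 'C', 'c', 'B')
```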

Since you don't have any rules at this point to guide you, you could have just as easily come up with DFS codes that look like any of the DFS codes below...or more

All of these examples start from 0 in the lower left corner, but are very different.  It becomes obvious that we need to come up with a standardized way to order our graphs for several reasons: (1) so that we don't drive ourselves crazy comparing codes like the ones above only to realize they're the same graph, (2) to make it easier to find a subgraph we might think is frequent in many different graphs that will likely have very different configurations.  One of the things required to solve these problems is a set of rules called "Neighborhood Restriction" that limit the ways that we can create these lists of connections in our graphs.  These rules govern the order that you add edges to the list when you're creating codes for a graph.  The simplified English version of the rules listed in the technical paper goes something like this.

  • If the first vertex of the current edge is less than the 2nd vertex of the current edge (forward edge)
    • If the first vertex of the next edge is less than the 2nd vertex of the next edge (forward edge)
      • If the first vertex of the next edge is less than or equal to the 2nd vertex of the current edge
      • AND If the 2nd vertex of the next edge is equal to the 2nd vertex of the current edge plus one this is an acceptable next edge
      • Otherwise the next edge being considered isn't valid
    • Otherwise (next edge is a backward edge)
      • If the first vertex of the next edge is equal to the 2nd vertex of the current edge
      • AND If the 2nd vertex of the next edge is less than the 1st vertex of the current edge this is an acceptable next edge
      • Otherwise the next edge being considered isn't valid
  • Otherwise (the current edge is a backward edge)
    • If the first vertex of the next edge is less than the 2nd vertex of the next edge (forward edge)
      • If the first vertex of the next edge is less than or equal to the 1st vertex of the current edge
      • AND If the 2nd vertex of the next edge is equal to the 1st vertex of the current edge plus one this is an acceptable next edge
      • Otherwise the next edge being considered isn't valid
    • Otherwise (next edge is a backward edge)
      • If the first vertex of the next edge is equal to the 1st vertex of the current edge
      • AND If the 2nd vertex of the current edge is less than the 2nd vertex of the next edge this is an acceptable next edge
      • Otherwise the next edge being considered isn't valid
I know that's hard to follow, so I'll walk you through one and give you the answers to a couple more of these.  It just so happens that Example 1 above meets all of these criteria.  So, for (0,1,C,c,B) to (1,2,B,a,A), we see that 0,1 is a forward edge, and 1,2 is a forward edge.  So we need to make sure that the 1 in (1,2,B,a,A) is less than or equal to the 1 in (0,1,C,c,B)...check; AND, we need to make sure that the 2 in (1,2,B,a,A) is equal to the 1 in (0,1,C,c,B) plus 1; 2 = 1+1...check.  So the step from edge 0 to edge 1 is valid.

Let's do another one.  In example 1 still, (1,2,B,a,A) is a forward edge because 1 is less than 2 and (2,0,A,b,C) is a backward edge because 2 is greater than 0.  So we check if the 2 in (2,0,A,b,C) is equal to the 2 in (1,2,B,a,A)...check; AND we check if the 0 in (2,0,A,b,C) is less than the 1 in (1,2,B,a,A)...check.
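The neighborhood restriction rules only look at the vertex numbers, so they can be sketched as a short Python function (the function names are mine):

```python
def is_forward(edge):
    i, j = edge
    return i < j

def valid_next(current, nxt):
    """The 'Neighborhood Restriction' rules, using only the
    vertex numbers (i, j) of the current edge and the next edge."""
    i1, j1 = current
    i2, j2 = nxt
    if is_forward(current):
        if is_forward(nxt):                  # forward -> forward
            return i2 <= j1 and j2 == j1 + 1
        return i2 == j1 and j2 < i1          # forward -> backward
    if is_forward(nxt):                      # backward -> forward
        return i2 <= i1 and j2 == i1 + 1
    return i2 == i1 and j1 < j2              # backward -> backward

# The two steps walked through above:
print(valid_next((0, 1), (1, 2)))  # True (forward -> forward)
print(valid_next((1, 2), (2, 0)))  # True (forward -> backward)
```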

If you keep going through Example 1, and the rest of the examples above, you'll see how hard it might be to create this type of order at random.

I created example 1 to fit the rules on purpose.  Examples 2, 3 and 4 were created at random.  I was surprised to see how many edge steps I accidentally got right for those 3.

If you take some time and are careful, you can create several different DFS codes that meet the neighborhood restriction rules but differ from example 1.  Below are 3 examples (1, 5, and 6) that meet the rules.  To get the 2 new examples, I started at different locations in the graph.  The easiest way to generate these patterns (and the method that is actually used in gSpan) is the following: first you take a step forward, then make any backward connections possible, connecting from smallest to largest (e.g. 3,0 comes before 3,1 if applicable).  Then you take another step forward (e.g. 3,4) and repeat until you're done with the graph.

So, it appears we've gotten closer to figuring out the 1 way we should represent a graph, but we're not quite there.  To fix this problem the creators of the gSpan algorithm have an ordering system that will allow us to figure out if example 1 is "less than" example 5, or if example 5 is "less than" example 6.  This ordering system is great because once we have it, we just define THE representation of the graph (the one we'll be using in the gSpan algorithm) as the minimum DFS code possible.  This order is called DFS Lexicographic Order.  You may remember from other posts that lexicographic is just a fancy way of saying the order goes from smallest to largest (e.g. 1,2,3 or A,B,C...).  To do this we also make use of the fact that we have given codes to our vertex types and edge types.  Remember how the blue circles in our examples are 'A', and the dotted green lines are 'c'?  'A' is a vertex label and 'c' is an edge label.  Here are the rules that govern this order and will help us get our "minimum" DFS code; I'm going to write them in a very simplified pseudo-code format.

Let X be one version of the graph's DFS code and Y be another version for the same graph
Assume that X > Y to start
Start by comparing the first edge (edge 0) of both DFS codes
Start a loop for comparisons going through all the edges
   Check each of the following rules; if one applies, set X < Y and exit the loop

  1. Is X a backward edge and Y a forward edge?
  2. Is X a backward edge, Y a backward edge, and the 2nd vertex of X < 2nd vertex of Y?
  3. Is X a backward edge, Y a backward edge, 2nd vertex of X = 2nd vertex of Y, and the edge label of X is less than the edge label of Y?
  4. Is X a forward edge, Y a forward edge, and the 1st vertex of Y < the 1st vertex of X?
  5. Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, and the label for the first vertex of X is less than the label for the first vertex of Y?
  6. Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, the first vertex label of X = the first vertex label of Y, and the edge label for X is less than the edge label for Y?
  7. Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, the first vertex label of X = the first vertex label of Y, the edge label for X = the edge label for Y, and the 2nd vertex label of X < the 2nd vertex label of Y?

   If you've made it to this section of code, you haven't proven that X is less than Y yet
   If there's another edge left in the graph to check, increment up one edge (e.g. go from edge 0 to 1)
End Loop

Again, like a lot of things in data mining, this might look complex, but when you see it in a couple of examples it is pretty easy to follow.  If we apply this to Example 1, 5, and 6 above you'll see what I mean.

Let's let example 1 be DFS code X and example 5 be DFS code Y in the code above.  For the first edge, they are both forward edges with 0,1.  So, rules 1 through 4 in the code don't apply.  When we look at rule 5, we see that the 'C' in (0,1,C,c,B) is "greater than" the 'A' in (0,1,A,a,B) so rule 5 doesn't apply either.  Because the first vertex label 'C' and 'A' aren't equal, rules 6 & 7 don't apply.  So, comparing the first edge (edge 0) didn't give us the answer.  We move on to the next edge (edge 1).  For X we have (1,2,B,a,A) and for Y we have (1,2,B,c,C).  The first two items match and they're forward edges so we jump all the way down to rule 5.  We see that the first vertex labels match too so we go to rule 6.  When we look at rule 6 we see that the edge 'a' is "less than" the edge 'c' so we get to say that X<Y, or Example 1 is "less than" Example 5.  We don't have to check the rest of the edges because we break out of the loop once we find just one example of one of the 7 rules being met.

Let's compare example 5 (X) and example 6 (Y).  Starting at edge 0, we see that they're both forward edges with (0,1,...), so rules 1-4 don't apply.  When we look at rule 5 we see that the 'A' from (0,1,A,a,B) is "less than" the 'B' from (0,1,B,c,C) which satisfies rule 5.  This makes example 5 "less than" example 6; the loop breaks and we have our answer.

So, for our 3 examples, we know that example 1 < example 5 and example 5 < example 6.  Just like other places where inequalities are used, this information tells us that example 1 < example 6, or to be complete example 1 < example 5 < example 6.
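The comparison loop above can be transcribed into Python almost rule for rule.  Note this is the simplified, one-sided check described in this post (it only tests whether X < Y), and the edge lists below are just the first edges of the examples as given above:

```python
def forward(e):
    return e[0] < e[1]

def x_less_at_edge(x, y):
    """The seven rules above: True if this edge pair proves X < Y."""
    xi, xj, xli, xle, xlj = x
    yi, yj, yli, yle, ylj = y
    if not forward(x) and forward(y):                             # rule 1
        return True
    if not forward(x) and not forward(y):
        if xj < yj:                                               # rule 2
            return True
        if xj == yj and xle < yle:                                # rule 3
            return True
    if forward(x) and forward(y):
        if yi < xi:                                               # rule 4
            return True
        if xi == yi and xli < yli:                                # rule 5
            return True
        if xi == yi and xli == yli and xle < yle:                 # rule 6
            return True
        if xi == yi and xli == yli and xle == yle and xlj < ylj:  # rule 7
            return True
    return False

def x_less_than_y(code_x, code_y):
    # Walk the edges in order; stop as soon as one rule fires.
    for ex, ey in zip(code_x, code_y):
        if x_less_at_edge(ex, ey):
            return True
    return False

# First edges of Examples 1, 5, and 6 as given above:
ex1 = [(0, 1, "C", "c", "B"), (1, 2, "B", "a", "A")]
ex5 = [(0, 1, "A", "a", "B"), (1, 2, "B", "c", "C")]
ex6 = [(0, 1, "B", "c", "C")]

print(x_less_than_y(ex1, ex5))  # True, decided at edge 1 by rule 6
print(x_less_than_y(ex5, ex6))  # True, decided at edge 0 by rule 5
```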

With the information and understanding we've covered in this post we are armed to tackle the gSpan algorithm itself.  When I complete that blog post, I'll add a link to it right HERE.  Hope this helps somebody out there!

Tuesday, March 10, 2015

Generalized Sequential Pattern (GSP) Mining

This is going to be my first post about sequential data pattern mining.  I'm starting this post by explaining the concept of sequential pattern mining in general, then I'll explain how the generalized sequential pattern (GSP) algorithm works along with its similarities to the Apriori method.

Sequential Pattern Mining
If you're a store owner and you want to learn about your customers' buying behaviour, you may not only be interested in what they buy together during one shopping trip.  You might also want to know about patterns in their purchasing behaviour over time.  "If a customer purchases baby lotion, then a new-born blanket, what are they likely to buy next?"  With information like this, a store owner can create clever marketing strategies to increase sales and/or profit.

To get this data, the Apriori algorithm has to be modified to account for this time delay between transactions.  For this portion of the discussion I'll be referencing a technical paper by Rakesh Agrawal and Ramakrishnan Srikant entitled "Mining Sequential Patterns".  Say you start with a database full of transactions that looks something like this.

Suppose that the transaction data represents the day of the month that the item was purchased.  As humans, we can see some repeat customers, and make some guesses about some of the repeat patterns that might exist, but we want an automated approach.  An easy way to find sequential patterns is to utilize a modified version of the process developed for the Apriori Algorithm, which works on single transactions.  To do this we just sort the table first by customer ID, and then by transaction time stamp.  We'll get something that looks like this.

You'll also notice an extra column at the right that takes the sorted information and turns it into customer sequences.  The reason for this is that we can utilize this format to data mine sequential patterns much like patterns within transactions are mined (e.g. basket analysis).  How people format the items in the sequence varies a little bit among the couple of technical papers I've read, but let me explain how I've formatted it above.  Every sequence of transactions starts with a '<' and ends with a '>'.  That's all those symbols mean.  The next thing you need to know is that any time 2 or more items (letters in my example) are surrounded by parentheses (e.g. "(AB)", which might be an apple and an orange), those items were purchased at the same time.  If an item is not surrounded by parentheses, then the customer didn't buy anything else with it.  So if we had a sequence like <A(BC)DE(FG)>, that would mean there were 5 different transactions: the customer first bought A, 2nd B and C together, 3rd D, 4th E, and 5th F and G together.  Also, I'm not sure if you noticed, but I always have the items within parentheses ordered alphabetically.  It is common to sort the items bought together from least to greatest (the fancy technical term for this is lexicographic order).  With that understanding, now we can start talking about how the GSP Algorithm works.
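If you wanted to represent these customer sequences in code, a list of sets is a natural fit.  Here's a Python sketch (the helper name is mine):

```python
# A customer sequence like <A(BC)DE(FG)> as a list of transactions,
# each transaction being a set of items bought together.
sequence = [{"A"}, {"B", "C"}, {"D"}, {"E"}, {"F", "G"}]

def fmt(seq):
    """Print a sequence back in the post's notation, sorting the items
    within each transaction alphabetically (lexicographic order)."""
    parts = []
    for txn in seq:
        items = "".join(sorted(txn))
        parts.append(f"({items})" if len(txn) > 1 else items)
    return "<" + "".join(parts) + ">"

print(fmt(sequence))  # <A(BC)DE(FG)>
```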

GSP Algorithm
The Generalized Sequence Pattern algorithm was created from a simpler algorithm for mining sequences, but it has some extra bells and whistles added so it can be more flexible for different situations.  To explain the process, I'm going to start with the basics, then add the bells and whistles at the end.  The beginning of the GSP algorithm is basically the same as the Apriori Algorithm, because for 1 item, there really is no order.  So, we find all of the items in our customer ID sequence database that meet the minimum support threshold.  We'll use the data above and say that the minimum support is 2 in this case.  If that's true we get the table below.

There are a couple of tricky things here.  The more obvious one is that item E did not have enough support to meet the threshold, so it gets removed.  Based on the Apriori principle, we are able to conclude that because E is not frequent, any sequence with E in it would not be frequent either.  So, we don't need to keep track of it anymore.

The next tricky thing is that the support for B is listed at 5, not 6, even though the 3rd customer bought item B twice.  This practice is also borrowed from the Apriori algorithm.  In the Apriori algorithm, if a customer buys 2 candy bars at once, then we only count 1 candy bar when calculating the support, because we count transactions.  The same applies here, except instead of counting how many transactions contain an item or itemset, we are counting how many customers have an item/sequence.
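Here's that counting rule as a Python sketch.  The customer sequences below are made up for illustration (the post's table isn't reproduced here); notice how a customer's repeated B still counts only once:

```python
# Support counts customers, not purchases: a customer who bought B twice
# still contributes 1 to B's support.
customer_sequences = {
    1: [{"A"}, {"B"}, {"F", "G"}],
    2: [{"B"}, {"D"}],
    3: [{"A", "B"}, {"B"}, {"F"}],  # B appears twice but counts once
}

def support(item, sequences):
    return sum(
        1 for seq in sequences.values()
        if any(item in txn for txn in seq)
    )

print(support("B", customer_sequences))  # 3 -- each customer counted once
```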

Once we have these items we start the iterative process of generating larger patterns from these items and checking if they have support.  This is where things get really interesting.  In Apriori, the total possible combinations would be AB, AC, AD, AF, AG, BC, BD, BF, BG, CD, CF, CG, DF, DG, FG.  When generating sequences, that is not nearly sufficient because the order matters, AND we can have sequences where single items repeat (e.g. I bought A, then the next day I bought A again).  So we have to create tables that look like this
Because order matters, there are a lot more options.  Oh, and in case you forgot, there is also the possibility of 2+ items being bought in the same transaction
The diagonal values are blank here because they're already covered in the first 2-item table.  Just to make sure you understand, (AA) isn't there because if A and A were bought in the same transaction it would only be counted once, so we never have two of the same item together within one set of parentheses.  The lower left is also blank, but this is because of the convention I already described of always listing items purchased together in ascending order.

So, now that we understand all of that, we can check our 51 possible 2-item sequences against the database to see if they meet the support threshold.  I'll walk you through the first row of 2-item sequences to show you how this is done (see the table below).

In the first column, we see that there are no customers that ever bought item A on more than 1 occasion, so support there is 0.  For <AB>, you can see that customers 1 and 5 bought those items in that sequence.  For <AC>, notice that customer 4 bought (AB) together then C, but we get to pick out just A from that combined transaction to get our <AC> sequence. Notice that the items don't have to be right next to each other.  It's OK to have other items in between.  <AD> follows similar logic.  For <AF> we don't count customers 3 or 4 because although they have A and F, they aren't in the right order. To be counted here, they need to be A, then F.  When we look for the support of <FA> those will get counted.  There is one last rule you need to know for counting support of sequences.  If you are looking for support of a sequence like <FG>, then customer 1 doesn't count.  It only counts if F is bought before G, NOT at the same time.  If you work through the rest of the 2-item sequences that weren't purchased together (i.e. not ones like (AB)), you get support counts that look like this
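The counting rules in that paragraph can be captured in one containment function.  This is my sketch, not code from the paper:

```python
def contains(pattern, seq):
    """True if `seq` (a customer's list of transactions) supports the
    sequential `pattern`: each pattern element must be a subset of a
    later, distinct transaction.  Gaps between matches are fine, but
    the items of one pattern element must share a single transaction."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1  # next element must match a strictly later transaction
    return True

cust1 = [{"A"}, {"B"}, {"F", "G"}]
cust4 = [{"A", "B"}, {"C"}]

print(contains([{"A"}, {"C"}], cust4))  # True: pick A out of (AB), then C
print(contains([{"F"}, {"G"}], cust1))  # False: F and G were bought together
print(contains([{"F", "G"}], cust1))    # True: (FG) as one transaction
```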
You see that I have struck out any 2-item sequences that do not meet the support threshold of 2.  Now let's take a look at the 2-item sets that could have been purchased together (e.g. (AB)).  This one is a little bit easier, because there are only 5 times in our database when 2 items were purchased at the same time.  They were (FG), (AB), (AB), (BC) and (DE).  Only (AB) shows up twice, so we keep it and get rid of the rest.  Now we have these 2-item sequences left to work with:

Now we start the next iteration looking for 3-item sequences.  This is where I'm going to start referring to "Mining Sequential Patterns: Generalizations and Performance Improvements" by Ramakrishnan Srikant and Rakesh Agrawal.  In that paper they describe a process for generating candidate sequences that I think is pretty cool, now that I understand it.  You take all of the remaining sequences from the last step (AB, AC, AD, AF, AG, BC...), then you remove the first item from the sequence.  For AB, we remove A and we're left with B.  Do this for all of the sequences (see 2nd column in table below).  Then you do the same thing, but remove the last item (3rd column in table below).  This is pretty straightforward except when you're dealing with a sequence that begins or ends with a multiple item purchase like (AB).  When that happens, you have to remove each of the possible items one at a time and treat the results separately.  So if you're removing the "first" item from (AB), you remove A and get B left over, and then you remove B and get A left over.  Hopefully I haven't lost you in that explanation.  If I did, I think reviewing this table will demonstrate what I'm talking about.
Now that we have this information (if we were programming this we wouldn't actually create a table; it's just to make it easier to see and understand for this blog post), we combine the sequences together where their "-1st" and "-Last" columns match.  So if we're starting from the top, we see that the 2-item sequence AB matches up with BC, BD, BF, BG and (AB) to create ABC, ABD, ABF, ABG and A(AB).  The A(AB) one is tricky because you have to remind yourself that the order within the parentheses is just convention; it could easily be written A(BA), which makes it easier to see that the B's match up.  Working through the rest of the table this way (being careful not to create any duplicate candidates) we populate the "3-seq after join" column in the table below
The next step is to prune these 3-item sequences.  Many of them contain 2-item subsequences that weren't supported in the previous round.  A good example of this is the candidate sequence AFA.  AF and FA are supported 2-item sequences, but AA is not a supported 2-item sequence, so this one gets pruned out.  This pruning is based on the same Apriori principle that we have used in prior posts.  It helps save computing time because we don't want to comb through the database to find the support for a sequence that we know won't meet the support threshold.  Now that we have this slimmed-down list of 3-item sequences, we scan the database again to determine support just as it was described above for 2-item sequences.  When that is completed we end up with the 3-item sequences in the last column of the table above.
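Here's a Python sketch of the join-and-prune step using a toy set of frequent 2-item sequences (single items only, to keep it short; multi-item transactions like (AB) would need extra bookkeeping):

```python
from itertools import product

# Toy frequent 2-sequences of single items; ("A", "B") means A then B.
freq_2 = {("A", "B"), ("B", "C"), ("A", "C"), ("A", "F"), ("F", "A")}

def join(freq):
    """Join X and Y when X minus its first item equals Y minus its last."""
    return {x + y[-1:] for x, y in product(freq, repeat=2) if x[1:] == y[:-1]}

def prune(candidates, freq):
    """Keep a candidate only if every 2-subsequence of it is frequent
    (the Apriori principle)."""
    return {
        c for c in candidates
        if all(c[:i] + c[i + 1:] in freq for i in range(len(c)))
    }

cands = join(freq_2)
print(sorted(cands))                 # includes ('A', 'F', 'A') from AF + FA
print(sorted(prune(cands, freq_2)))  # [('A', 'B', 'C')] -- AFA is gone: AA isn't frequent
```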

At this point it is just "lather, rinse, repeat" as my shampoo bottle suggests.  Take that list of 3-item sequences and generate 4-item sequences using the same method.  We remove the first and last items and put them in the "-1st" and "-Last" columns below.

Then we look for sequences whose "-1st" value matches the "-Last" value of another sequence. In this case we only find matches for "BF" and "BG".  You should prune candidates whose 3-item subsequences aren't supported/frequent, but in this case they're fine.  Then we check the database for support to finalize the list.  If you do that, you end up with information similar to the table below.

You can see that, at this point, this example gets pretty simple.  If you try the iteration again, the trick of removing the first and last items yields no matches, and you're done.  After that you would just output all of the supported/frequent sequences to the user, which in this case would be the following (as long as I don't make a copy-and-paste mistake :/ ).

Bells and Whistles
I promised to describe the bells and whistles for this process as well.  Here's a list of extras that can be added:

  1. Taxonomies/Hierarchies: Harry Potter and the Sorcerer's Stone is part of a series by J.K. Rowling, which is part of the children's fantasy genre.  Taxonomies like this can be incorporated so that more general sequences can be found even if the more specific ones aren't supported.  The way the algorithm handles these taxonomies is to add the higher-level items (e.g. J.K. Rowling is a higher level than Harry Potter) into the transactions themselves.  For example, if you had a sequence like <A(1, B)A>, then you would translate it into a sequence like <(A, Letter, Symbol)(1, Number, B, Letter, Symbol)(A, Letter, Symbol)> if you had a taxonomy like the one here:
  2. Window Size:  If you've had a customer for years, maybe it isn't that interesting if they support a particular pattern because they have so much buying behaviour on record that they might just support it by chance.  Window sizes limit the span of time when transactions are deemed to support a sequence.  (e.g. I only want buying patterns for pregnant women during a 9 month window)
  3. Max-gap:  This parameter is defined by the user to filter out large gaps in data sequences.  Let's say a customer buys an iPod; then, even though they are within the specified window size, there is a large gap between that purchase and a new set of headphones.  To a business owner, this gap might make them "uninteresting" as a customer to market to, so the GSP algorithm can filter this out as it is running and checking for support.
  4. Min-Gap: Think of this as answering the question "how much time would I let pass between transactions before I would consider 2 purchases separate elements in my sequences?"  An example of this would be me at Home Depot.  I go to start a project on a Friday evening and buy what I think I need.  It isn't until Saturday about noon that I realize I had no idea what I REALLY needed and go back for more stuff.  Some store owners may want to treat these as one "transaction" or event.  If the min-gap was set to 24 hours in this case, all of my purchases would be grouped together to look something like this <(Friday night stuff, Saturday noon stuff)>.
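To make the min-gap idea concrete, here's a small sketch of how purchases might get merged into a single element.  The function name, the hour-offset timestamps, and the Home Depot shopping items are all made up for illustration; a real implementation would apply this logic while checking candidate support.

```python
def merge_by_min_gap(purchases, min_gap_hours=24):
    """Group (time, item) purchases: consecutive purchases closer together
    than min_gap_hours are merged into one element of the sequence."""
    purchases = sorted(purchases)
    sequence, element = [], [purchases[0][1]]
    last_time = purchases[0][0]
    for t, item in purchases[1:]:
        if t - last_time < min_gap_hours:
            element.append(item)            # still the same "transaction"
        else:
            sequence.append(tuple(element))  # close out the previous element
            element = [item]
        last_time = t
    sequence.append(tuple(element))
    return sequence

# Friday evening (hour 0) and Saturday noon (hour 18) fall inside the
# 24-hour min-gap, so they collapse into one element; paint bought a week
# later (hour 180) starts a new element.
trips = [(0, 'lumber'), (0, 'nails'), (18, 'screws'), (180, 'paint')]
grouped = merge_by_min_gap(trips)
```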
I admit that there are more details to how you might actually implement these bells and whistles if you had to code them.  For those who are interested, let me know if you want another post on that.  Otherwise, I refer you to "Mining Sequential Patterns: Generalizations and Performance Improvements" by Ramakrishnan Srikant and Rakesh Agrawal.

For those of you with the perseverance to get to the end of this lengthy post I congratulate you!  As always, I hope this was helpful for somebody out there.

Tuesday, March 3, 2015

Null-Invariant Measures of Interestingness

In 2 of my previous posts about measures of interestingness (Lift and Chi-Square) I mentioned I would be writing a post about null-invariant measures of interestingness.  I'm a man of my word so here you go.  The problem I presented in those posts can be summarized with these 2 tables.

If you click back to those previous posts (links above) you'll see that the lift for the first table was 0.714 and the lift for the 2nd table would be calculated as 0.929.  The chi-square value for the first table was 0.586 and the value for the 2nd is calculated as 0.022.  For both of these measures we get different numbers when the size of the bottom-right cell (the "null" count) is changed.  The table below shows this information in a slightly different view.

Since we're pattern mining, and we're really only interested in the interaction between the two items (apples and oranges in this example), we'd love to have a way to measure interestingness that doesn't include that "null" count in the calculation at all.  Data scientists call this type of an interestingness measure "null-invariant".  Let's see what they've come up with to help us out here.

Based on a paper entitled "Selecting the Right Interestingness Measure for Association Patterns", written by Pang-Ning Tan, Vipin Kumar and Jaideep Srivastava in 2002, there are 3 measures of interestingness that have this "null-invariant" property we're looking for: Confidence, Cosine, and Jaccard.  The Confidence measure actually has at least 3 variations that I'm aware of, so we'll cover 5 different null-invariant measures in total here.

If you've read my post about support and confidence, you may remember that near the end I mentioned that confidence gives different values based on the way it's calculated (e.g. the confidence for Apples → Oranges may not be the same as the confidence for Oranges → Apples).  If that is the case, which one should you pick?  With so many potential patterns that can be found in data already, it is preferable for data scientists to find a way to combine these 2 options into 1 answer instead of having to duplicate all of their results to show both sides of the same coin all the time.  To do this, they have taken the original confidence equation P(A|B), or P(B|A) depending on how you look at it, and created the following alternate forms.

At this point it is my duty, as your guide to understanding data mining in a simple way, to calm you down if the stuff in the definition column gave you heart palpitations.  All three of these are actually VERY easy to understand if you already understand confidence.  I'm going to work from the bottom of this list up.  Max Confidence basically says: I'm just going to pick the bigger of the 2 confidence values.  Kulczynski (pronounced Kəl-zin'-skee) takes the approach of averaging the 2 together instead.  For All Confidence, it may not be apparent, but you're actually taking the minimum value of confidence.  Confidence is just the count (or support) of records that have both items (AB) divided by the support of the first item in a statement like A → B.  So if you divide the support of AB by the maximum of the supports of A and B, you're really just finding the minimum of the 2 confidence values.  So, like I said, it might look scary, but it's simple.
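Written out as code, the three confidence variants look like this.  The support counts at the bottom are hypothetical, just for illustration; the thing to notice is that the "null" count never appears in any of the three functions.

```python
def all_confidence(sup_ab, sup_a, sup_b):
    # dividing by the LARGER of sup(A) and sup(B) gives the SMALLER confidence
    return sup_ab / max(sup_a, sup_b)

def max_confidence(sup_ab, sup_a, sup_b):
    # just pick the bigger of the two confidence values
    return max(sup_ab / sup_a, sup_ab / sup_b)

def kulczynski(sup_ab, sup_a, sup_b):
    # average the two confidence values instead
    return (sup_ab / sup_a + sup_ab / sup_b) / 2

# Hypothetical counts: 1000 apple buyers, 100 orange buyers, 50 bought both.
a, b, ab = 1000, 100, 50
```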

Another good interestingness measure is the cosine function.  The way that I think about the cosine function is that it is essentially the lift function in disguise, but null-invariant.  Let me show you what I mean.  Here are the equations for both.

You can see that the only real difference between the 2 is that Cosine has a square root in the denominator.  There's a good reason for this.  In my post about lift, I mentioned that to calculate lift correctly, you have to make sure you are using the support as a percent of the total records in the database.  If you don't do this, the lift value is wrong.  The fact that you have to divide by the grand total of all the records is also why the lift measure is susceptible to changes in the number of null records.  Cosine overcomes this with a trick I was taught in intermediate school.

When I first saw the lift equation written out this way, I really wanted to get the "grand total" values to cancel out because it would be so much cleaner.  Another way of writing (or thinking about) lift is something like this: (AB/G)/((A*B)/G^2), where AB is the count of A & B together, A is the count of A, B is the count of B, and G represents the grand total.  Using fraction math, you can cancel out one of the grand totals, but not the 2nd one, leaving you with (AB)/(A*B/G).  Cosine takes care of this problem.  When we take the square root of the denominator, we can cancel the grand total out of the function entirely, making the measure null-invariant.  It looks like this: (AB/G)/sqrt((A*B)/G^2), which simplifies to AB/sqrt(A*B).
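Here's that algebra as a quick sanity check in Python, using made-up counts (1000 A's, 100 B's, 50 together).  The point is that lift moves when null records inflate the grand total, while cosine doesn't depend on the grand total at all.

```python
from math import sqrt

def lift(ab, a, b, g):
    # (AB/G) / ((A/G) * (B/G)) -- the grand total G does NOT cancel out
    return (ab / g) / ((a / g) * (b / g))

def cosine(ab, a, b):
    # AB / sqrt(A * B) -- G has cancelled out entirely
    return ab / sqrt(a * b)

ab, a, b = 50, 1000, 100
# Adding null records changes the grand total, which changes lift...
lift_small = lift(ab, a, b, 2_000)     # modest database
lift_big = lift(ab, a, b, 100_000)     # same counts, tons of null records
# ...but cosine gives one value no matter how many null records exist.
cos = cosine(ab, a, b)
```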

The Jaccard function is defined as |A ∩ B|/|A ∪ B|.  To explain what that means, let's go to Venn diagram world.
As a reminder, the upside-down U symbol (∩) is an intersection, represented by the orange area in the diagram above.  The right-side-up U symbol (∪) is a union, represented by all of the area that is yellow, orange, or red.  The '|' symbols in the formula just indicate that we want the count of items in a set, instead of the set itself.  Now that that is fresh in our minds again: the numerator of the Jaccard is the orange area, and the denominator is the yellow-orange-red area, NOT double-counting the orange area despite the perceived overlap.  With that understanding, you can see why Jaccard is null-invariant; it doesn't use any data from the blank space (the null records) at all.  I could make the white space in the Venn diagram above HUGE and the Jaccard would still have the same value.
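In code, working with sets makes the null-invariance obvious; the customer IDs here are invented for illustration.

```python
def jaccard(a_buyers, b_buyers):
    # |A intersect B| / |A union B| -- customers who bought neither item
    # (the white space in the Venn diagram) never show up in either count
    return len(a_buyers & b_buyers) / len(a_buyers | b_buyers)

apple_buyers = {1, 2, 3, 4}
orange_buyers = {3, 4, 5}
score = jaccard(apple_buyers, orange_buyers)   # 2 in common, 5 in total
# A million extra customers who bought neither fruit would change nothing.
```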

Lastly, let's take a look at our apple and orange example one more time with all of these different measures of interestingness.

First of all, in the 2nd row, you can see that none of the measures we've added to this table are affected by the different "null" number.  I also added a couple of rows to demonstrate how the different measures change with different values in the table.  In row 3 you can see what a positive correlation looks like, instead of a negative one.  In row 4, you can see what no correlation looks like.  It's interesting to note that for Jaccard, no correlation is represented by a value of 1/3.  Then in rows 5 and 6 I showed the extremes of these null-invariant measures: all of them are bounded between 0 and 1, where 0 means STRONG negative correlation and 1 means STRONG positive correlation.

Row 7 is there as a tale of caution.  Based on an astute comment on this post, I revised the table above so that you can compare rows 1, 2, and 7.  They all have the same values except for the "null" number.  If the "null" number gets big enough, like in row 7, then someone who has bought an orange is far more likely than the general population to buy an apple, yet the null-invariant measures stay the same.  This is not immediately obvious, and it can be a stumbling block when using null-invariant measures of interestingness.

I think that sums it up.  I hope that helps!