Simple Data Mining

How to Mine Frequent Patterns in Graphs with gSpan including a Walk-thru Example

2015-04-25T16:54:00.001-04:00

In this blog post, I'm going to explain how the gSpan algorithm works for mining frequent patterns in graphs. If you haven't already read my introductory post on this subject, please click HERE to read the post that lays the foundation for what will be described in this post.

Getting Started
To explain how this algorithm works, we're going to work through a simplified data set so that we can see each step and understand what is going on. Suppose that you have 5 graphs that look like this

Don't get confused by the Greek characters here. I used random symbols to show that this method can be used for any type of graph (chemicals, internet web diagrams, etc.), even something completely made up like what I did above. The first thing we need to do with this graph data set is count how many vertices and edges we have of each type. I count 7 alpha (α) vertices, 8 beta (β) vertices, and 14 lambda (λ) vertices. There are 15 edges that are solid blue, 13 that are double red, and 8 that are dotted green. Now that we have these counts, we need to order the original labels for vertices and edges into a code that will help us prioritize our search later. To do this we create a code that starts with the vertices and edges with the highest counts and then descends like this

We could have used numbers instead of letters here, it doesn't really matter as long as you can keep track of the symbols' order based on frequency. If you relabel the graphs with this order you get something like this
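As a minimal sketch of the relabeling step, the counts quoted above can be sorted in descending order and mapped to letters. The function name `frequency_codes` and the spelled-out label names are my own placeholders, not part of gSpan itself.

```python
def frequency_codes(counts, alphabet):
    """Map each original label to a code symbol, most frequent label first."""
    ordered = sorted(counts, key=lambda label: -counts[label])
    return {label: alphabet[rank] for rank, label in enumerate(ordered)}

# Counts taken from the text above.
vertex_counts = {"alpha": 7, "beta": 8, "lambda": 14}
edge_counts = {"solid blue": 15, "double red": 13, "dotted green": 8}

vertex_codes = frequency_codes(vertex_counts, "ABC")
edge_codes = frequency_codes(edge_counts, "abc")

print(vertex_codes)
print(edge_codes)
```

With these counts the most frequent vertex label (lambda) becomes A and the most frequent edge type (solid blue) becomes a.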

You can see that the vertex labels now have A, B, C and each edge is labelled a, b, or c to match the new label legend we created.

Finding Frequent Edges
For this example we will assume that we have a minimum support frequency of 3 (i.e. 3 of the 5 graphs have to have the pattern for us to consider it important). We start by looking at individual edges (e.g. A,b,A) and see if they meet this minimum support threshold. I'll start by looking at the first graph at the top. There is an edge that is (B,a,C) and I need to count how many of the graphs have an edge like this. I can find it in the 1st graph, 2nd graph, 3rd graph, and 4th graph so it has support of 4; we'll keep that one. Also notice that I had the option of writing that edge as (C,a,B) as well, but chose to write it with B as the starting vertex, because of the order we established above. We move on a little and check (A,a,C) and find that it can only be found in the first graph; it gets dropped. We check (A,c,C) and find it in the first 4 graphs as well; keep it. If we continue going through all of the graphs being careful not to repeat any edge definitions we'll get a list of frequent single edges that looks like this

You can see that we only have 3 edges that are frequent (this will simplify our lives in this example greatly). We need to sort the edges that are still frequent in DFS lexicographic order (See prior blog post for explanation of this). After sorting we end up with this list of edges to consider going forward

Pay attention to the fact that I added the "0,1" to the beginning of these edge definitions. These are the starting and ending points of the edges. We'll "grow" the rest of the frequent sub-graphs from these edges in a way similar to the method described in my prior blog post on this subject.
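The single-edge counting described above can be sketched in a few lines. The toy graphs below are hypothetical (they are NOT the five graphs from the figures), and each graph is reduced to a plain list of (vertex label, edge label, vertex label) triples, which is enough for this first pass.

```python
def canonical_edge(label_u, edge_label, label_v):
    """Orient an undirected edge so the smaller vertex label comes first,
    so (C,a,B) and (B,a,C) are counted as the same edge type."""
    if label_u <= label_v:
        return (label_u, edge_label, label_v)
    return (label_v, edge_label, label_u)

def frequent_edges(graphs, min_support):
    """Count how many graphs contain each edge type at least once."""
    support = {}
    for graph in graphs:
        # A set, so an edge type repeated inside one graph still counts once.
        seen = {canonical_edge(*edge) for edge in graph}
        for edge in seen:
            support[edge] = support.get(edge, 0) + 1
    return {edge: s for edge, s in support.items() if s >= min_support}

# Hypothetical toy data set of five small graphs.
toy_graphs = [
    [("A", "b", "B"), ("C", "a", "B"), ("A", "c", "C"), ("A", "a", "C")],
    [("B", "b", "A"), ("B", "a", "C"), ("C", "c", "A")],
    [("A", "b", "B"), ("B", "a", "C")],
    [("A", "c", "C")],
    [("C", "b", "C")],
]

result = frequent_edges(toy_graphs, min_support=3)
for edge in sorted(result):
    print(edge, result[edge])
```

In this toy data only three edge types reach the threshold of 3, and sorting the label triples lists them as (A,b,B), (A,c,C), (B,a,C).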

gSpan starts with the "minimum" edge, in this case (0,1,A,b,B), and then tries to grow that edge as far as it can while looking for frequent sub-graphs. Once we can't find any larger frequent sub-graphs that include that first edge, then we move on to the 2nd edge in the list above (0,1,A,c,C) and grow it looking for frequent sub-graphs. One of the advantages of this is that we won't have to consider sub-graphs that include the first edge (0,1,A,b,B) once we've moved on to our 2nd edge because all of the sub-graphs with the first edge will have already been explored. If you didn't follow that logic, it's OK; you'll see what I mean as we continue the example.

You might notice that the last graph in our set of graphs doesn't have any edges that are frequent. When we're starting with edge (A,b,B) we keep track of which graphs have that edge present. In our example that's graphs 1, 2, 3, & 4. Now that we have that "recorded" we start the growth from (0,1,A,b,B) by generating all of the potential children off of the 'B' vertex.

Growing from the 1st Edge
Before we grow our first edge, you need to understand how gSpan prioritizes growth. Let's assume that we have a generic subgraph that looks like this already (don't worry about how we got it yet, it will become clear in a second)

I have removed the colors and labels from the diagram above to explain how gSpan grows without confusing you with labels. The numbers represent the order the vertices were added. If you are at this point in the algorithm, you will try to grow this graph in the following order of priority.

So the first priority is always to link the current end vertex back to the first vertex (the one labelled 0). If that's not possible, we see if we can link it back to the 2nd vertex (the one labelled 1). If we had a bigger sub-graph, we would continue trying to link to the 3rd, 4th, etc. if it's possible. If none of those are possible then our next (3rd) priority is to grow from the vertex we're at (labelled 4 above), and grow the chain longer. If that doesn't work, we go back to the most recent fork, take another step back and then try to grow from there (see how 5 is growing from vertex labelled 1 in the 4th priority). If that doesn't work, then we take another step up the chain and try to grow from there (5th example above). The 4th and 5th examples above also generalize for longer chains. If there were several vertices above the closest fork in the chain, you would progressively try to grow off of each of those vertices until you get back to the root (vertex labelled '0').
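Here is a rough sketch of that priority order in code, ignoring labels just like the diagram does. It assumes the sub-graph is stored as a list of (from, to) vertex-number pairs in the order the edges were added, and it follows the chain from the root to the newest vertex (the gSpan paper calls this the rightmost path). The function names are mine, chosen for illustration.

```python
def rightmost_path(edges):
    """Vertex numbers from the root (0) down to the most recently added vertex."""
    parent = {}
    for i, j in edges:
        if j > i:                 # forward edge: j was a brand new vertex
            parent[j] = i
    last = max(parent) if parent else 0
    path = [last]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def growth_candidates(edges):
    """List the (from, to) growth attempts in priority order."""
    path = rightmost_path(edges)
    last = path[-1]
    existing = {frozenset(edge) for edge in edges}
    candidates = []
    # First: link the newest vertex back to earlier vertices, root first.
    for v in path[:-1]:
        if frozenset((last, v)) not in existing:
            candidates.append((last, v))
    # Then: add a new vertex, from the newest vertex first and then
    # stepping back up the chain toward the root.
    new_vertex = last + 1
    for v in reversed(path):
        candidates.append((v, new_vertex))
    return candidates

print(growth_candidates([(0, 1)]))
print(growth_candidates([(0, 1), (1, 2)]))
print(growth_candidates([(0, 1), (1, 2), (2, 0)]))
```

For a single edge the only attempts are growing from vertex 1 and then from the root; for a two-edge chain the first attempt is closing the loop back to vertex 0.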

With that understanding in mind, we can finally start growing from our first edge. Our first edge is (0,1,A,b,B). If we look at the priority explanation above, we can tell that the 1st and 2nd growth priorities are not really options, so we take the 3rd priority and try to grow from vertex 1, or the B vertex in our example. Since we only have 2 frequent edges that have 'B' vertices in them our two options are as follows:

Option 1 Option 2
(0,1,A,b,B) (0,1,A,b,B)
(1,2,B,b,A) (1,2,B,a,C)

Now we are going to check for support of these 2 options. We only need to look in the graphs that we determined contain the frequent edges as described earlier. When we look for Option 1, we find it in the following places in the graphs we're still considering:

Notice that we can find the first option in 3 different places in the first graph, but we can't find it in any of the other graphs, so the support is only 1. That's a dead end, so we check option 2

You can see that we can find this pattern in multiple places in all of the 4 graphs we have left (I didn't show all of the places it is found in the first graph). The support for this sub-graph is 4 so it's a keeper. Also, all 4 graphs we looked at have this sub-graph present so we keep those too. That ends that round of edge growth so we take option 2 from above and try to grow from it. Remembering the priorities of growth, we know we need to try to connect the last vertex (2 or 'C' in our example) to the root (0 or 'A' in our example) if we can. If we look at the list of frequent edges to choose from, we see an edge that is (A,c,C) which could be written as (2,0,C,c,A). This would make the connection we need. None of the other frequent edges meets this criterion, so we check to see if (0,1,A,b,B); (1,2,B,a,C); (2,0,C,c,A) is frequent

This time the sub-graph occurs twice in each of the 1st, 2nd and 4th graphs, and only once in the 3rd graph. Regardless, the support count for this sub-graph is still 4 so we've got a keeper again. Now we have a graph that looks something like this.

Since we have already connected the last vertex ('C') to the root ('A') and to the vertex after it ('B'), our next priority is to grow from the last vertex we created. The frequent edges we can try to add onto 'C' are (B,a,C) and (A,c,C). However, since we're working from 'C' they would be better written this way: (2,3,C,a,B) and (2,3,C,c,A). We'll start by checking if adding (2,3,C,a,B) works because it is the lowest in terms of DFS lexicographic order. Here's what we find.

As it turns out, this 4 edge pattern can only be found in 2 of the 4 graphs we have left so that doesn't work. Now we'll check if (2,3,C,c,A) works out.

Nope, that's only in the first graph so it's not frequent either. Since neither of these worked out, we need to go to our next priority (4th) which is growing from the vertex a step up from the last attempted branch. When you look at the sub-graph we've grown so far you can tell we need to try to grow from 'B' now. That being said, we take a look at our options for frequent edges that contain 'B' vertices and we see that we should try to add (B,a,C) or (B,b,A). Here's what trying to add (B,a,C) looks like in this scenario

Let's see if we can find that in any of the graphs

As it turns out, we CAN find this sub-graph in 3 of the 4 graphs we have left so that's a keeper. At this point in time I want you to notice that I am going to ignore the other edge that could grow from 'B' until later. This is because gSpan is a depth first algorithm, so once you get success you try to go deeper. When we hit a dead end, we will come back to all of these straggling options to see if they actually work. If you're familiar with computer programming, this type of algorithm is implemented with recursion. Our next step based on our priority rules is going to be to try to link our vertex 3 ('C') to the root, or the 'A'. Again the only frequent edge we can choose from to do this is (C,c,A). This option would look like this

At this point, remember that the last sub-graph that we found that was frequent could only be found in the first 3 graphs in our set, so now we only have to look for this new graph in those three graphs

This only exists in one of our graphs, so we go to our next growth priority. Since there is already a link between our 'C' at vertex 3 and the 'B' at vertex 1, the 2nd priority isn't an option, so we try to grow/extend from the 'C' at vertex 3 again. We can either add (C,a,B) or (C,c,A). Adding (C,a,B) would look like this

If we look in the 3 graphs we have left we find this

That's a dead end because we only have 1 graph with this 5 edge sub-graph. We back up and try adding (C,c,A) instead. I'm not going to do the visuals for this, but it can't be found in any of the graphs we have left either. Since we hit a dead end again here, we go to our 5th priority growth option: growing from the root again. If we try to grow from the root 'A' we can either use (A,b,B) or (A,c,C). Starting with (A,b,B) we would get a sub-graph that looks like this

This can't be found in any of our graphs, so we try adding (A,c,C) as our last ditch effort

This can't be found either, so we have reached the end of this depth first search. Now we need to go back and take care of the straggler options that we didn't pursue because we were going for depth first. The last straggling option we left was when we were considering adding (B,b,A) to vertex 1.

Remember, at that point in time, we were still looking at 4 of the 5 original graphs so when we look for this sub-graph in those 4 graphs this is what we find.

We only find it in one graph, so that turns into a dead end too.

This Section Until "2nd Edge (A,c,C)" is a Correction (Thanks to Elife Öztürk)
If we look back further in our algorithm we find that we also haven't checked all of the growth priorities off of the first frequent edge (0,1,A,b,B). Our first growth came from using the 3rd priority growth strategy, but we haven't checked the 4th/5th priority growth strategies there yet. If we do, we can see that it might be possible to make a link like (0,1,A,b,B) and (0,2,A,c,C). Let's see if we can find that in the graphs

We can see that in all four of our remaining graphs so we'll keep this sub-graph.

Our first priority for growth on this sub-graph would be to connect back to the root, but we don't have any double bonds (sorry for the chemistry reference) in our example so that won't work. The 2nd priority is also impossible, so let's check the 3rd. Can we grow an edge off of the C node? Our 2 options would be (2,3,C,a,B) and (2,3,C,c,A). Adding (2,3,C,a,B) looks like this.

When we search through the graphs, we find this in 2 places

Notice that I'm not counting the instances in the 2nd and 3rd graphs. This is because the only way you can find this subgraph in those graphs is to wrap the pattern around so that the B at node 1 is the same vertex as the B at node 3. The reason why we number the nodes is to make sure they're unique, plus we already found the sub-graph that creates that ABC triangle previously...thus the beauty of the algorithm :). Since there are only 2 graphs with this sub-graph, the minimum support of 3 is not met.
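The support checks used throughout this walk-through can be sketched with a small backtracking search. This is a simplified illustration, not how a tuned gSpan implementation does it: the pattern is a list of (i, j, label_i, edge_label, label_j) tuples like the ones in this post, each graph is a pair of dictionaries (vertex labels and edge labels), and the three toy graphs are hypothetical. Note how each numbered pattern node must land on a different graph vertex, and how a graph counts once no matter how many places the pattern is found in it.

```python
def occurs_in(code, vertex_labels, edge_labels):
    """True if the pattern can be found in the graph using distinct vertices."""
    adjacency = {}
    for (u, v), label in edge_labels.items():
        adjacency.setdefault(u, []).append((v, label))
        adjacency.setdefault(v, []).append((u, label))

    def extend(k, mapping):
        if k == len(code):
            return True
        i, j, label_i, label_e, label_j = code[k]
        if i not in mapping:
            # Only happens for the very first edge: try every start vertex.
            return any(extend(k, {i: g})
                       for g, label in vertex_labels.items() if label == label_i)
        for g, label in adjacency.get(mapping[i], []):
            if label != label_e or vertex_labels[g] != label_j:
                continue
            if j in mapping:
                # Backward edge: must close onto the vertex already used for j.
                if mapping[j] == g and extend(k + 1, mapping):
                    return True
            elif g not in mapping.values():
                # Forward edge: a new pattern node needs an unused graph vertex.
                if extend(k + 1, {**mapping, j: g}):
                    return True
        return False

    return extend(0, {})

def support(code, graphs):
    """Number of graphs (not number of occurrences) containing the pattern."""
    return sum(occurs_in(code, v, e) for v, e in graphs)

# Hypothetical toy graphs: a triangle, a two-edge path, and a single edge.
toy_graphs = [
    ({1: "A", 2: "B", 3: "C"}, {(1, 2): "b", (2, 3): "a", (3, 1): "c"}),
    ({1: "A", 2: "B", 3: "C"}, {(1, 2): "b", (2, 3): "a"}),
    ({1: "A", 2: "B"}, {(1, 2): "b"}),
]

chain = [(0, 1, "A", "b", "B"), (1, 2, "B", "a", "C")]
triangle = chain + [(2, 0, "C", "c", "A")]
wrapped = [(0, 1, "A", "b", "B"), (1, 2, "B", "b", "A")]

print(support(chain, toy_graphs))
print(support(triangle, toy_graphs))
print(support(wrapped, toy_graphs))
```

The `wrapped` pattern has no support here because the only way to match it would be to reuse the same A vertex for nodes 0 and 2, which the uniqueness rule forbids.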

So now let's check for the other growth option (2,3,C,c,A)

That can only be found in one of the graphs

So now we've exhausted our growth options off of the first edge (A,b,B).

2nd Edge (A,c,C)
Before you go crazy thinking that the other 2 edges we need to check will take just as long, let me tell you that we have already done WAY more than half of the work on this problem. Now that we have exhaustively checked all of the graphs for frequent sub-graphs that contain the edge (A,b,B), we can trim down our search space significantly. What I mean is we can delete edges in graphs that contain (A,b,B) and only look for frequent patterns in what is left. If we do this, we get 4 graphs (remember we already got rid of graph 5 because it contained no frequent edges) that look like this.
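The trimming step can be sketched as a simple filter. In this illustration each edge is stored as (vertex id, vertex id, label, edge label, label), the example graph is hypothetical, and `still_to_mine` is my own name for the set of frequent edges that have not been fully explored yet.

```python
def canonical(label_u, edge_label, label_v):
    """Write an edge type with the smaller vertex label first."""
    return (min(label_u, label_v), edge_label, max(label_u, label_v))

def prune(edges, still_to_mine):
    """Drop edges that are already fully explored or were never frequent."""
    return [e for e in edges if canonical(*e[2:]) in still_to_mine]

# Hypothetical graph with four edges.
graph = [
    (1, 2, "A", "b", "B"),   # (A,b,B) is finished, so it is removed
    (2, 3, "B", "a", "C"),
    (3, 1, "C", "c", "A"),
    (3, 4, "C", "a", "A"),   # (A,a,C) was never frequent, so it is removed
]

still_to_mine = {("A", "c", "C"), ("B", "a", "C")}
pruned = prune(graph, still_to_mine)
print(pruned)
```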

These graphs are much simpler. I also wanted to point out that I removed any edges from these graphs that were not frequent. We could (should) have done this earlier when we were doing our search on the first edge, but I didn't want to complicate things at that time. Now, much like last time, we start with edge (0,1,A,c,C) and try to grow from 'C'. We can either use (1,2,C,a,B) or (1,2,C,c,A). If we add (1,2,C,a,B) we can find that sub-graph in the following places

All 4 graphs have this sub-graph, so it's a keeper and we'll go with it (remember we haven't checked adding (1,2,C,c,A) yet). The next step is to try to grow from the last vertex back to the root. If we did this we would have to use the edge (B,b,A), but that edge has been removed from all of our graphs because we have already mined all of the frequent sub-graphs with (A,b,B) in them. So, our only other growth option is to try to extend the chain from 'B'. The only option left to grow from 'B' is (B,a,C) because (B,b,A) has already been checked. So we get a sub-graph like this.

This can be found in the following graphs

That's 3 out of the 4 graphs, which meets our support requirement so we'll keep it. Our first growth priority from here is to link node 3 ('C') back to the root. That would be adding a (3,0,C,_,A) link. The only link possible for this in our data set is (3,0,C,c,A) which would give us this subgraph to look for

The only place this subgraph is found is in our 2nd example:

So, that's a dead end. If we back up, our next growth priority would be to connect the C at node 3 back to the C at node 1. A quick look at the graphs we have tells me that's not going to work. The next priority after that is to add another node after node 3. We could look for a subgraph with C,a,B added to the end that looks like this.

However, that subgraph is only found here:

So, we back up again and try our 4th priority, which is growing a link off of the B at node 2. A quick look at the graphs above shows that won't work. Lastly, we'll try to grow further up the graph, like a new node coming off the C at node 1 or a new node off the A at node 0. There are a couple of options for growth off of node 1, but I can tell they're only supported by our first graph. There are no additional options for growth from the root node 'A', so we're done exploring this graph branch.

Now we need to step back to when we had the graph (0,1,A,c,C), (1,2,C,a,B) and try to grow again from there. The last priority for growth for this subgraph is to grow from the root. The only way we can grow from the root is to add (A,c,C), which would look like this.

Checking the four graphs we have left, we only find this sub-graph in our 2nd graph

Since this is only found in one place, it's a dead end and we need to backtrack to other options we didn't pursue earlier. We only had one of those when we were adding our 2nd edge to (A,c,C). So when we go back and try to add (1,2,C,c,A) to (0,1,A,c,C) we look at our graphs and find that it is only present in 2 of our graphs

As it turns out these are all of our options starting with edge (A,c,C). See, I told you the 2nd edge would go faster than the first.

3rd Edge (B,a,C)
Now we have to check the last of our frequent edges to see if there is any way that we can grow a frequent sub-graph with just this edge. Let's start by removing all of the (A,c,C) edges from our graphs to see what we're left with.

I could go through all of the same procedure I have been using for these remaining graphs, but it gets so simple at this point that there really isn't a reason to. If you're using a program running the gSpan algorithm, it will be disciplined and do the whole search though. From what is left, we can see that (0,1,B,a,C); (0,2,B,a,C) is frequent, but anything larger doesn't work out. We hit our last dead end and now we can report all of the sub-graphs that we found that were frequent.

I have them organized by the edge they were grown from. The last thing we have to do is translate our DFS codes back to the original symbols. This is easy enough and the user would get an output like this.

That's all there is to it. I know that this post got a little long, but for me, being able to see the whole problem worked out helps me learn a TON. I hope it helps somebody else out there too. I did gloss over some details of the programming, etc. but this is essentially what's happening in gSpan. Let me know if you have any additional questions.

Simple PageRank Algorithm Description

2015-04-18T14:51:00.000-04:00

This blog post will give a simple explanation of the original PageRank algorithm. This is the algorithm that differentiated Google from the web search incumbents in the late 1990s (e.g. Yahoo!, Altavista, etc.) and helped to make it what it is today. I will be using material I learned in ChengXiang Zhai's "Text Retrieval and Search Engines" Coursera course, as well as the original technical paper on PageRank by Sergey Brin and Larry Page entitled "The PageRank Citation Ranking: Bringing Order to the Web", 1998.

For this blog post I'm going to use a VERY simplified example of the web so that we can work through it and understand what is happening. The example below has only 6 web pages, but has the complexity we need for our discussion.

We'll consider each of the items in the network (e.g. A, B, C) to be webpages. PageRank can actually be used for other things like social networks, etc., but we'll stick with the web page version for this post. Each black arrow represents a link from where the arrow starts to where the arrow ends. For example, webpage B has a link to webpage A, and that's it. Webpage A has links to webpages B and D. These links, and the somewhat hidden information that they contain, are the most important part of the PageRank algorithm.

Before PageRank, there were actually lots of people that tried to use the information from these links to help with search. These mostly focused on the count of how many links a webpage had pointing to it. If we use this type of thinking, page A seems to be the most important with 4 links to it, C and D come in 2nd place with 2, B comes in 3rd with 1, and E and F come in dead last with 0. This may seem to make sense, but there is something interesting that this method misses. Since page A has so many links, it's pretty obvious it is important, but since it's important, the links that page A has to other pages are more important than other links in the network. PageRank takes this into account.

Imagine you are web-surfing in the diagram above. Say you pick a web page at random and then read it, find a link on that webpage and click it, read that webpage, find a link on that webpage and click it...lather, rinse, repeat...until you get bored, or stuck. At this point you decide to not follow a link from the page you're on, but to just pick a webpage at random from the web. Once you're on this new random webpage you start reading and clicking on links all over again. If you did all of this, this is essentially what the PageRank algorithm does. It follows this type of random web-surfing and then calculates the probability of arriving at all of the pages in the web.

Now that I've explained that process conceptually, let's get into the math that matches this type of random web-surfer behaviour. If we start by creating a grid/table where we put a 1 for links from one webpage to another, we'd get something like this

This is helpful because now we can do some analysis on this table, but it's not really what we need. If you're at a webpage, we want to be able to determine how likely you are to pick one of the links on the webpage. If there is only one link on the webpage, this is easy; 100%. If there are 2 or more, the PageRank algorithm just divides this probability up evenly. If there are 2 links, each link has a 1 out of 2 chance of getting clicked (as a side note, in my opinion it would be more interesting if these probabilities were altered based on where the links were on the screen, or some other characteristic like font size/color). If we slightly alter our table above to give these probabilities, instead of just the links, it will look like this

This matrix is called the transition matrix (later we'll refer to this matrix as M). As mentioned above when describing the random web-surfer, there are 2 parts to this algorithm: clicking links and jumping to pages without links. The transition matrix is useful for the link clicking part of the calculation. Let's say that we start at time t=0 by randomly selecting one of the webpages to start from (each page has a 1 out of 6 chance of being the page we start on) and we want to figure out the probability that our next clicked link will take us to webpage A. To do this we would calculate this

For our example this equals 1/2, 50%, or 0.5 (however you want to write it). The method can be generalized to all of the webpages after we pick a starting webpage. The generalized version of this equation looks like this
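Written out from the symbol definitions given in the next paragraph (the original equation image is not reproduced here, so treat this as a reconstruction), the click-only update is:

```latex
p_{t+1}(d_j) = \sum_{i=1}^{N} M_{ij} \, p_t(d_i)
```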

Don't freak out, I'm going to explain what all the symbols mean. The part on the left of the equal sign is just saying that you're going to be calculating the probability that the next click (t+1) is going to land on page d_j. 'j' is a subscript that helps you keep track of what page you're calculating the probability for. In our transition matrix above, if j=3, then d_j = C. The right side is a summation over i, starting at 1 and going to N, where N is the number of pages in your web (in our case N=6). M_ij is the element from the transition matrix in row i and column j. p_t(d_i) is the probability at the current time that you have arrived at page d_i. Since we're still on our first step (t=0), we set p_t(d_i) equal to 1/6 in our example. Now we can calculate the probability of being on all of the pages in our web (not just A) at t=1. The result looks like this

You can see that because there are no webpages that point to pages E and F, we have 0 probability of being on those pages after we've clicked just once. The PageRank algorithm gets to rankings of all of the pages after many repetitions of what we just did. I'll walk you through another step (t=1 to t=2) so you get the feel of it. Let's calculate the probability that we're on page A at t=2.
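The click-only steps can be sketched in plain Python. One loud assumption: the link structure in `LINKS` is my reconstruction of the diagram. The post states A's links, B's link, and the F to C link; the remaining links are inferred so that the numbers quoted in this post (0.5 for page A at t=1, and the later tables) come out right.

```python
PAGES = "ABCDEF"
# Assumed link structure, reconstructed from the numbers in the post.
LINKS = {"A": "BD", "B": "A", "C": "A", "D": "AC", "E": "AD", "F": "C"}

def transition_matrix(links, pages):
    """M[i][j] = probability of clicking from page i to page j."""
    return [[(1 / len(links[src]) if dst in links[src] else 0.0) for dst in pages]
            for src in pages]

def click_step(M, p):
    """One click: p_next[j] = sum over i of M[i][j] * p[i]."""
    n = len(p)
    return [sum(M[i][j] * p[i] for i in range(n)) for j in range(n)]

M = transition_matrix(LINKS, PAGES)
p0 = [1 / 6] * 6          # pick the starting page uniformly at random
p1 = click_step(M, p0)    # probabilities after one click
p2 = click_step(M, p1)    # probabilities after two clicks

for page, prob in zip(PAGES, p1):
    print(page, round(prob, 4))
```

Pages E and F get probability 0 after one click because no row of the matrix has a nonzero entry in their columns.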

This is very similar to the step from t=0 to t=1. The first value in each multiple is from the transition matrix (M) and the 2nd value comes from the probability of being at each page at time t=1. Once you get the hang of this, it can be repeated quickly and converges to a final answer very quickly. Here's how our web diagram converges.

When we get to about 21 cycles, we see that nothing is really changing and now we have values that can tell us how important each of the webpages is. Based on this example, page A is most important, then pages B and D tie for 2nd. This is not what we got when we just counted how many links each website had. Page B only has 1 link, but it is linked by page A which is the most important, so that link counts for more. Remember though, that this isn't the whole PageRank algorithm. We have to add in the random jumping to a different page now. To do this we have to set a probability that, at any given time, we'll "get bored" and jump to another page. Let's say that we have a 15% chance of getting bored on the page we happen to be on. We'll call this value alpha (α=15%)
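Reconstructed from the description in the next paragraph (the original equation image is not reproduced here), the update with random jumping reads:

```latex
p_{t+1}(d_j) = (1 - \alpha) \sum_{i=1}^{N} M_{ij} \, p_t(d_i)
             + \alpha \sum_{i=1}^{N} \frac{1}{N} \, p_t(d_i)
```

Because the probabilities p_t(d_i) sum to 1, the second sum collapses to α(1/N).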

Here you can see that the first term in the probability is basically the same, but that it gets multiplied by (1-α). This is because we need to discount this probability by the amount that we get bored and randomly pick a page. The 2nd term is adding in the probability of randomly jumping to another page. The 2nd term could be simpler; it could just be α*(1/N). It is written the way it is because this equation can be turned into a matrix math problem, which I'll show you later. For now just go with it.

If we start over at t=0, using this new form of the equation, we can calculate a couple examples and show you how this equation converges as well. If we want to calculate the probability of being at webpage A at t=1 we would do the following:

You can see that in the first term in the first line, I just used the value we calculated in our prior example without jumping. I also expanded out the second summation so that it is easier to see. The next 2 lines substitute the actual values into the equation and simplify to an answer. If you follow this calculation for the other webpages you get this.

I'll walk you through one more step and then show you the whole table and how it converges to a final answer. If we want to calculate the probability of being at webpage A at time t=2 we get this

The first thing I want you to notice is that I couldn't just use the value from the old method for the first summation. This is because the probability of being at webpage A at t=1 has changed. So, we get to do all of the math this time. Remember that the M terms here come from the transition matrix. Because we're calculating the probability of being at webpage A, we take the values from column A of the transition matrix. The rest is basically the same as the last example. We can do this for all of the webpages (I did it in Excel for this post) and get the table below
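For anyone who would rather script this than use Excel, here is a minimal sketch of the click-and-jump update. As before, the link structure in `LINKS` is an assumption: it is my reconstruction of the diagram, inferred from the numbers quoted in this post.

```python
PAGES = "ABCDEF"
# Assumed link structure, reconstructed from the numbers in the post.
LINKS = {"A": "BD", "B": "A", "C": "A", "D": "AC", "E": "AD", "F": "C"}
ALPHA = 0.15   # chance of getting bored and jumping to a random page

def transition_matrix(links, pages):
    """M[i][j] = probability of clicking from page i to page j."""
    return [[(1 / len(links[src]) if dst in links[src] else 0.0) for dst in pages]
            for src in pages]

def surf_step(M, p, alpha):
    """One step of clicking (weight 1 - alpha) plus random jumping (weight alpha)."""
    n = len(p)
    jump = alpha * sum(p) / n      # alpha * sum over i of (1/N) * p[i]
    return [(1 - alpha) * sum(M[i][j] * p[i] for i in range(n)) + jump
            for j in range(n)]

M = transition_matrix(LINKS, PAGES)
p0 = [1 / 6] * 6
p1 = surf_step(M, p0, ALPHA)
p2 = surf_step(M, p1, ALPHA)

for page, a, b in zip(PAGES, p1, p2):
    print(page, round(a, 6), round(b, 6))
```

Repeating `surf_step` in a loop fills in the rest of the convergence table one row at a time.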

You can see that this table converges where values really aren't changing any more by the time it gets to t=17. The decision of when to stop iterating is based on a user defined variable. In the original PageRank white paper, the authors suggest that you loop through 't' until a value, delta (δ) gets smaller than this user defined value epsilon (Ɛ). To calculate delta you need to know this.

The first line shows how to calculate delta, but if you're anything like me, you may have never been exposed to an "L1 Norm" before. Each of the R terms inside the double bars is a vector containing the probabilities we've calculated for all of the webpages at the given time. So R1 = {0.45, 0.095833, 0.2375, 0.1666, 0.025, 0.025} using the clicking and jumping values from our most recent example. R2 would be {0.389792, 0.21625, 0.117083, 0.226875, 0.025, 0.025}. To calculate our delta value we first have to subtract each term in R1 from R2, then we will take the L1 norm of the result. R2 - R1 = {-0.06021, 0.120417, -0.12042, 0.060208, 0, 0}. Now we need to calculate the L1. This is shown in the last equation in the list above. Basically, you just take the absolute value of every term in the vector (R2-R1) and add them all up. For our example from t=1 to t=2, we get delta equal to 0.36125. If I add this delta value to the last column of the convergence table you can see how this value gets smaller as we iterate.
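The delta calculation is short enough to check in code. The vectors are the R1 and R2 values quoted above; I wrote the fourth entry of R1 as 0.166667 rather than the truncated 0.1666 so the rounding matches. The function name `l1_delta` is just a placeholder.

```python
def l1_delta(r_new, r_old):
    """L1 norm of the difference: sum of absolute element-wise changes."""
    return sum(abs(a - b) for a, b in zip(r_new, r_old))

R1 = [0.45, 0.095833, 0.2375, 0.166667, 0.025, 0.025]
R2 = [0.389792, 0.21625, 0.117083, 0.226875, 0.025, 0.025]

delta = l1_delta(R2, R1)
print(round(delta, 5))
```

In a full loop you would keep iterating while delta is larger than the epsilon you chose.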

I promised earlier I'd show you how the equations to calculate page rank can be turned into a matrix equation. If you're familiar with linear algebra, this will be interesting to you. The other MAJOR advantage of having equations in matrix form is that the processing time in a computer is MUCH faster for matrix math; this is because ridiculously smart people have optimized matrix math routines in most math libraries so that they get the right answers while minimizing processing time. We want to be able to stand on the shoulders of these giants. So, here's how the equations can get morphed into matrix equations

If you know how to use matrices in Matlab, Excel, or a programming language, this last equation can be used to efficiently calculate each iteration.
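Since the equation images are not reproduced here, this is a reconstruction of the matrix form that is consistent with the element-wise update described earlier. The symbol U for the N-by-N matrix whose entries are all 1/N is my own notation, and p_t is the column vector of page probabilities at time t.

```latex
\vec{p}_{t+1} = (1 - \alpha) \, M^{T} \vec{p}_t + \alpha \, U^{T} \vec{p}_t
              = \left[ (1 - \alpha) M + \alpha U \right]^{T} \vec{p}_t ,
\qquad U_{ij} = \frac{1}{N}
```

The transpose appears because row i of M holds the links leaving page i, while the update for page j needs column j.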

Before I end this post, I need to explain one more thing. If our web diagram happened to look like the diagram below we would have a problem

The only thing that I changed is that the arrow that was going from F to C, is now going from C to F. The reason why this is a problem is that F isn't pointing to anything. If we randomly select F as our starting point, how can we click on a link to move forward??? We can't. A simple way to solve this problem is for there to be an exception in the PageRank algorithm. If the page has no links to other pages, we don't allow the option to click on a link, we force the algorithm to randomly jump to another page. In essence, in this special case, we set alpha equal to 1. This keeps the algorithm from getting hung up in a dead end.
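A minimal sketch of that exception is below. The link structure is again my reconstruction of the diagram (an assumption), with the F to C arrow flipped to C to F so that page F has no outgoing links.

```python
PAGES = "ABCDEF"
# Assumed link structure with the flipped arrow: F now links to nothing.
LINKS = {"A": "BD", "B": "A", "C": "AF", "D": "AC", "E": "AD", "F": ""}

def surf_step(links, pages, p, alpha):
    """One click-and-jump step that forces a jump from pages with no links."""
    n = len(pages)
    nxt = dict.fromkeys(pages, 0.0)
    for src in pages:
        # Special case: a dead-end page behaves as if alpha were 1.
        a = alpha if links[src] else 1.0
        for dst in links[src]:
            nxt[dst] += (1 - a) * p[src] / len(links[src])
        for dst in pages:
            nxt[dst] += a * p[src] / n
    return nxt

p = dict.fromkeys(PAGES, 1 / 6)
for _ in range(50):
    p = surf_step(LINKS, PAGES, p, alpha=0.15)

for page in PAGES:
    print(page, round(p[page], 4))
```

With the forced jump in place the probabilities still add up to 1 after every step, so nothing leaks out through page F.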

Well, I think that wraps it up. Let me know if you have any additional questions about PageRank that I can answer for you. As always, I hope this helps somebody out there!

Probabilistic Retrieval Model: Basics, Query Likelihood and Smoothing

2015-04-11T11:32:00.002-04:00

This post discusses a different way (compared to the vector space model) to rank documents when performing text retrieval. This is called the probabilistic retrieval model. It bases its formulas off of probability theory instead of the rules of thumb that were created for the vector space model through trial and error. I'll explain the basics of the probability model, explain some limitations and derive a "smoothing" implementation (Jelinek-Mercer), and then give an example of how it all works.

If it has been a while since you've taken statistics, or never have, I'm hoping to make this first section easy for you to follow. When it comes to retrieving the right document during a search we can think of the best documents as the ones that have the highest probability of being relevant to the searcher. The trick comes when we have to define that probability. The mathspeak version of this probability definition is written as "p(R=1|d,q)". Translated into English that reads "the probability that R=1 (i.e. the document is relevant) given that we have document d and query q". If we can efficiently calculate this probability, we can rank the documents in our search by their probability of being relevant.

The problem with the probability p(R=1|d,q) is that it is actually very hard, if not impossible, to calculate from the information we have when a user submits a query. So, the data mining community has developed a substitute probability to calculate that gives us basically the same effect. What they use is the probability that the query entered by the user was randomly generated from a relevant document. That probably didn't make sense yet, so let me explain what a unigram is and then give you an example.

When you analyze text, you have to decide how the text will be divided up for analysis. If the user enters a query for "hard drive" are you going to treat "hard" and "drive" as 2 separate variables? Or, are you going to treat them as one? How do you know which one you should use? For the simplest models we treat each word as a different variable and assume that each word has nothing to do with the other words. This assumption is called statistical independence. It basically means that you're assuming that there is no correlation between the words in the query. This is obviously NOT true, but as it turns out, accepting this assumption actually gives pretty good results anyway so we'll go with it. So each word in the query gets a fancy name called a unigram (basically means 1 word). If you bunched words into groups of 2 they would be 2-grams, etc.
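Here's a minimal sketch of what splitting a query into unigrams and 2-grams could look like. The function name `ngrams` and the choice to split on whitespace and use overlapping word pairs are my assumptions for illustration, not something the models below require.

```python
def ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


query = "hard drive test"
unigrams = ngrams(query, 1)  # each word on its own
bigrams = ngrams(query, 2)   # overlapping pairs of neighbouring words
print(unigrams)
print(bigrams)
```

The rest of this post only uses the unigram case, where every word is treated as its own independent variable.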

Query Likelihood Example
Now it's time for that example I promised. Suppose you entered the query {hard drive test} and you're looking at document D4 = {...as part of the factory acceptance, every unit gets a hard drive test...}. This document contains each of the query words once. Think of the document D4 as a bag of marbles where each document word (unigram) is a different marble in the bag.

Now randomly take a marble out. What's the probability that the marble is a particular query word? It should be the number of times the word is in the document divided by the number of words in the document. If D4 is only 20 words long, then the probability of pulling out the word "hard" is 1/20. The probability of pulling out "drive" and "test" is also 1/20 for each term. If you're familiar with statistics, this is random sampling WITH replacement (put the marble back in the bag after you pick one), because the probability of pulling out "drive" isn't 1/19 after I've picked my first word. Now that we understand that, we can predict the probability of generating the query with the document we're looking at; it's just the probability of pulling out each word multiplied by each other, or (1/20) x (1/20) x (1/20) = 1/8000. Using this methodology, you can look at all of the documents in the collection and do the same calculation; the highest ranked document will be the one that spits out the highest probability. If you want the mathspeak version of this probability it looks like this, "p(q|d,R=1)". This basically assumes that all of the documents are relevant (we know they're not) and gives the probability that they generated your query.
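The marble calculation can be sketched in a few lines of Python. The function name `query_likelihood` and the dictionary of counts are hypothetical helpers; the 20-word length for D4 is the same assumption made in the paragraph above.

```python
from fractions import Fraction


def query_likelihood(query_words, doc_counts, doc_length):
    """Multiply p(word | doc) = count / length over every query word."""
    probability = Fraction(1)
    for word in query_words:
        probability *= Fraction(doc_counts.get(word, 0), doc_length)
    return probability


# D4 from the example: each query word appears once, and we assume 20 words in total.
d4_counts = {"hard": 1, "drive": 1, "test": 1}
p_d4 = query_likelihood(["hard", "drive", "test"], d4_counts, 20)
print(p_d4)  # 1/8000
```

Using `Fraction` keeps the arithmetic exact so the result matches the hand calculation; a real system would work with floats (and logarithms) instead.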

Now that we have that understanding, what happens when a method like the one above runs into a document that doesn't contain one of the query words...think about it...the probability calculated will be 0. That's because if 1 query word is missing from the document, then one of the terms that get multiplied together will be something like (0/20) which equals 0. Multiplying anything by 0 gives you 0 so this seems to penalize the document WAY too much. What if this document had the phrase "hard drive examination" instead of "hard drive test"? Would we really want to rank that document at the very bottom...I think not! There's a way to fix this problem, but I'm going to have to explain some formulas before I do that.

Query Likelihood Formulas
Some words are very rare in documents. If you happen to be searching for a rare term, then the probability of finding this term will be very small. This small probability will multiply with other fractions and then it's pretty likely that you'll end up with tiny values for your query probability. In a computer, when variable values get too close to 0 there is a growing risk that round-off error in the computer will start to become significant and distort your results. To avoid this, the magical properties of the logarithm come to save the day. If you look at the chart below for the log(x) you will see that as x increases, the log(x) also increases (by the way I assume log base 10 here and through the rest of this post). It's not a linear increase, but when it comes to ranking things, that's OK. If X is greater than Y, then log(X) is greater than log(Y). Also notice that the chart spans values where X is 0 to 1, like probabilities are required to do.

The other thing that is SO cool about the logarithm is that log(A*B) = log(A) + log(B). If we apply this rule to the probabilities we're multiplying together then for the example above (1/20) x (1/20) x (1/20) becomes log(1/20) + log(1/20) + log(1/20). It doesn't look very different here, but this minimizes the round-off problem in a computer. In mathspeak, if we have a lot of terms that get multiplied together we use a large pi symbol. The way we calculated the probability of a query being generated from a document above would look like this
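Written out from the description that follows (a reconstruction, with c(w_i, d) standing for the count of query word w_i in document d and |d| for the number of words in the document), the formula is:

```latex
f(q,d) = \prod_{i=1}^{n} \frac{c(w_i, d)}{|d|} = \prod_{i=1}^{n} p(w_i \mid d)
```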

f(q,d) is just a ranking function that depends on the query, q, and the document, d. The fraction to the right is the count of a word (this is a word in the query) in a document divided by the number of words in the document. This fraction can also be written as p(wi|d), or probability of a word given document, d. That BIG pi symbol means that you're going to multiply all of the fractions to the right together from the first word (i=1) in the query to the last word in the query (there are n words in the query). If we take the logarithm of this formula, then we get this
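Reconstructed from the walk-through in the next paragraph (so treat the exact layout as an assumption), the two equivalent log forms are below, where V is the set of all words in the collection and c(w,q) is the count of word w in the query:

```latex
\begin{aligned}
f(q,d) &= \sum_{i=1}^{n} \log \frac{c(w_i, d)}{|d|} \\
       &= \sum_{w \in V} c(w, q) \, \log \frac{c(w, d)}{|d|}
\end{aligned}
```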

Both of these formulas are equivalent, but before your head explodes, calm down and we'll walk through each one slowly. For both of them, we still have f(q,d) as the ranking function. We don't have log(f(q,d)) because there's no reason to do this since we have shown that the order is maintained when we take the logarithm. In the top equation, we have the same fraction on the right, but we take the logarithm of this fraction. Instead of multiplying all of these fractions from i=1 to n, we add them all up. That's what the BIG sigma symbol means. Now the difference between the first line and the 2nd one is the subscript on the sigma symbol and the term c(w,q). The subscript on the sigma symbol means that we are going to sum over all of the words in the vocabulary (i.e. all of the words in the collection of documents). The reason we can do this without messing everything up is that we are multiplying each term by c(w,q) which is the count of the word in the query. So when 'w' in the summation equals "hard" then c(w,q)=1 and in effect we're saying this one counts/matters. If the 'w' in the summation equals "banana", that's not part of our query, so c(w,q)=0 and we add nothing to our ranking function. At this point in time, you may be thinking, why would anybody add that complexity to the equation? We'll see why this makes some of the notation easier to understand in upcoming sections.
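To see why the logarithm helps inside a computer, here's a small sketch that scores the same word probabilities both ways. The function names and the 200-word "rare" query are made up for illustration; it shows underflow, which is the extreme version of the round-off problem described above.

```python
import math


def product_score(word_probs):
    """Multiply the per-word probabilities together."""
    score = 1.0
    for p in word_probs:
        score *= p
    return score


def log_score(word_probs):
    """Add up the base-10 logs of the per-word probabilities instead."""
    return sum(math.log10(p) for p in word_probs)


example = [1 / 20, 1 / 20, 1 / 20]
print(product_score(example), log_score(example))

# A made-up long query of rare words: 200 terms with probability 1e-5 each.
rare = [1e-5] * 200
print(product_score(rare))  # underflows to 0.0 in double precision
print(log_score(rare))      # about -1000, still perfectly usable for ranking
```

Raising 10 to the log score gives back the product whenever the product is representable, so the two versions rank documents in the same order.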

Language Model Smoothing
If we plot the probability of a word being selected in a document using the query likelihood model on one axis and the word number on a 2nd axis, we might get something that looks like this

As described earlier, if the word doesn't exist in the document, then there is 0 probability that it can be picked. This is probably not desirable because there might be a document that matches very closely, but does not contain one of the words in the query. What would be better is if we could adjust the curve a little to look more like the green line below

Notice that near the end of the curve there are non-zero values for p(w|d). These non-zero values will help solve the problem of query terms that don't show up in a document. Since this green curve represents the average probabilities for all the words in the document collection, we wouldn't want to just use this curve for p(w|d). If we did, all of the documents would have the same score and the calculations would be pointless. What we really want is a way to kind of take a weighted average between the actual document we're ranking and the whole collection of documents. You can imagine that if we did this that the bumpy blue curve would become much smoother, thus the name "smoothing" for this approach. One method of doing this is called the Jelinek-Mercer(JM) model.

To get us to the JM model we've got to go through a derivation first. I've tried to make this derivation as visual as possible. If you don't really care about how the equation is derived, you can just skip down a little bit. But, if you're feeling adventurous, here's what I've come up with to explain it.
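As a hedged reconstruction that follows the line-by-line description below (alpha is the generic weighting constant, p_seen is the probability used for words that appear in the document, p(w|C) is the probability of a word in the whole collection, and V is the vocabulary), the derivation can be written as:

```latex
\begin{aligned}
f(q,d) &= \sum_{w \in V} c(w,q) \log p(w \mid d) \\
&= \sum_{w \in V,\, c(w,d)>0} c(w,q) \log p_{seen}(w \mid d)
 + \sum_{w \in V,\, c(w,d)=0} c(w,q) \log \alpha\, p(w \mid C) \\
&= \sum_{w \in V,\, c(w,d)>0} c(w,q) \log p_{seen}(w \mid d)
 + \sum_{w \in V} c(w,q) \log \alpha\, p(w \mid C)
 - \sum_{w \in V,\, c(w,d)>0} c(w,q) \log \alpha\, p(w \mid C) \\
&= \sum_{w \in V,\, c(w,d)>0} c(w,q) \log p_{seen}(w \mid d)
 + \sum_{w \in V} c(w,q) \left[ \log \alpha + \log p(w \mid C) \right]
 - \sum_{w \in V,\, c(w,d)>0} c(w,q) \left[ \log \alpha + \log p(w \mid C) \right] \\
&= \sum_{w \in V,\, c(w,d)>0} c(w,q) \log \frac{p_{seen}(w \mid d)}{\alpha\, p(w \mid C)}
 + |q| \log \alpha
 + \sum_{w \in V} c(w,q) \log p(w \mid C)
\end{aligned}
```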

The top equation is the one we have already explained. The next step down splits this equation into 2 parts that represent a weighting for the probability of words found in the document and a weighting for the probability of words found in the rest of the collection. You'll notice that the probability of words in the document turned into a weird looking Pseen. I'm going to ask you to just ignore this for now, we'll explain this more later. On the 3rd line we split the 2nd term on the 2nd line because the sum over all of the terms not found in the document is the same as taking all the words in the collection and subtracting the words found in the document. On the 4th line we use one of the properties of logarithms to split the alpha term out from the p(w|C) term. Once we've done all of this we combine and reorganize all the terms in the last line. The first term in the last line takes advantage of the fact that log(a)-log(b)=log(a/b). For the 2nd term, since alpha is a constant we can simplify the notation where |q| is the count of words in the query. The last term on the last line is just added to the end from the 4th line.

Now that we have this equation on the last line above, we can adapt it to the JM method of smoothing. To do this we need to define this term circled in red

The JM method starts by defining how it wants to perform the weighting between the probabilities from the documents and the collections. Essentially it is giving a new definition for p(w|d), or the probability of a word given a document. Here's how this is defined in the JM method
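Going by the explanation in the next paragraph (a reconstruction, using c(w,d) for the count of the word in the document, |d| for the document length and p(w|C) for the collection probability), the definition is:

```latex
p(w \mid d) = (1 - \lambda)\, \frac{c(w, d)}{|d|} + \lambda\, p(w \mid C)
```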

Let's start by saying that lambda(λ) is a user selected variable that ranges between 0 and 1 (it's a different way of defining alpha in earlier equations). The first term is the weighted probability of a word based on data from the document, and the 2nd term is the weighted probability of a word based on the collection of documents. In the 2nd half of the first term, where we have that fraction, we're just taking the count of the word in the document divided by the count of words in the document. You can see that if we set lambda to a large value close to 1, we will basically just be taking the probability of a word based on the document collection. With this one definition, we can now derive the rest of the JM ranking function.

If you're one of those people that don't care about derivations (it's OK, I used to be one of them) just look at the last equation line and use it. If not, here's a quick explanation. The first equation row just used the definition for Pseen to simplify the fraction a little bit. Notice that lambda gets substituted for alpha. Up until now we have been using alpha as our generic term that defines the weighting between the document and collection. Since the JM method defines this as lambda, we just swap alpha out for lambda. After obtaining a simplified fraction for that weird Pseen fraction term, we substitute it into the ranking function in the 2nd line. We also get to completely remove the last 2 terms of this ranking function because they're actually constant. The last term is a probability of a word in the collection and that will be the same for any document in the collection. The 2nd to last term is based on the number of words in the query, which is the same for every document we're trying to rank as well. So, there's no need to calculate these values if we're only interested in ranking documents. They get the X, and we end up with a simpler equation in the last line. The sum in this final equation is over all the words in the query and document.
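For reference, substituting the JM definition of p(w|d) into the Pseen fraction and dropping the two constant terms should give a final ranking function of this form (my reconstruction from the steps described above, not a copy of the original figure):

```latex
f(q,d) = \sum_{w \in q,\; w \in d} c(w, q)\,
  \log\!\left( 1 + \frac{1 - \lambda}{\lambda} \cdot \frac{c(w, d)}{|d|\; p(w \mid C)} \right)
```

The "1 +" comes from dividing (1 - λ) c(w,d)/|d| + λ p(w|C) by λ p(w|C).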

Now that we have this equation, let's finally do a simple example with it to show how to use it. Let's say that we have the same query and documents used in the post about the vector space model:

q = {hard drive test}
D1 = {...it's hard to determine...}; |D1|=365
D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}; |D2|=50
D3 = {...before I bought my new car, I took it out for a test drive...}; |D3|=75
D4 = {...as part of the factory acceptance, every unit gets a hard drive test...}; |D4|=50
D5 = {...that was a hard test...a standardized test is design to be hard for...}; |D5|=230

To do all of the calculations we need to know the probability of finding the query words in the collection of documents. We take the count of the query words in all the documents and divide them by the total number of words in our collection. For example, the word "hard" shows up 5 times in our collection of documents, and there are 365+50+75+50+230=770 words in the collection. So the probability of "hard" in the collection is 5/770 = 0.006494 = p("hard"|collection). If we do the same for drive and test we get p("drive"|collection) = 0.005195 and p("test"|collection) = 0.005195.

The only thing left before we rank some documents is picking a value for lambda. Let's just say that we use 0.5 for this example. For the first document we would get the following.

Notice that I only included 3 terms here because the other terms have values of c(w,q) that are equal to zero. This is because there are only three query terms. All of the other words in the collection aren't in the query so their value is zero. This is actually a big weakness for the JM method. It doesn't actually solve the problem where there are terms missing from our query. Instead it smooths out the probability of the terms that are in our query with the probability from the collection. To solve the problem where we don't include a term in our query, you have to use another method like Dirichlet Prior or BM25. These are examples of other methods that the data mining community has created. If we continue our example using the JM method we get the following ranking values for the documents in our collection:
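If you'd like to reproduce this kind of calculation yourself, here's a minimal sketch of the rank-equivalent JM score. The function name `jm_score` is hypothetical, the word counts are what I read off the document snippets above, and I assume log base 10 and lambda = 0.5; the numbers it prints are not guaranteed to match the original table digit for digit.

```python
import math


def jm_score(query_counts, doc_counts, doc_length, collection_prob, lam=0.5):
    """Rank-equivalent Jelinek-Mercer score (log base 10, as in the post)."""
    score = 0.0
    for word, count_in_query in query_counts.items():
        count_in_doc = doc_counts.get(word, 0)
        if count_in_doc == 0:
            continue  # words missing from the document add nothing to the sum
        ratio = count_in_doc / (doc_length * collection_prob[word])
        score += count_in_query * math.log10(1 + (1 - lam) / lam * ratio)
    return score


query = {"hard": 1, "drive": 1, "test": 1}
# Query-word counts read off the snippets above, plus the stated lengths.
docs = {
    "D1": ({"hard": 1}, 365),
    "D2": ({"hard": 1, "drive": 2}, 50),
    "D3": ({"drive": 1, "test": 1}, 75),
    "D4": ({"hard": 1, "drive": 1, "test": 1}, 50),
    "D5": ({"hard": 2, "test": 2}, 230),
}
collection_prob = {"hard": 5 / 770, "drive": 4 / 770, "test": 4 / 770}

scores = {name: jm_score(query, counts, length, collection_prob)
          for name, (counts, length) in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 4))
```

Notice that a document missing a query word (D1, for example) still gets a usable score instead of being zeroed out, which is the whole point of smoothing.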

We can sort these scores from largest to smallest and output the documents to the user. That's basically how it all works. I think that wraps it up. If you have any other specific questions about this method, please say so in the comments and I'll see what I can do to augment this explanation to cover it. As always I hope this helped somebody out there!

Precision and Recall...and Other Variations

2015-04-04T18:40:00.000-04:00

When evaluating the effectiveness of a search algorithm, or a classification problem, one of the problems we have is how to measure this "effectiveness". This blog post will cover 2 basic measures for this effectiveness, precision and recall, along with some variations that stem from them.

Let's say that you are using Google and you want to find some information about how to fix your kitchen sink when the garbage disposal stops working. You might go to www.google.com and type "fix garbage disposal in sink" into the search field. When I did this I got this

The actual results in the image above don't matter much to our current discussion. The point is, if you're anything like me, sometimes you do a search like this and ~50% of the results are not what you're looking for. You end up looking through the first page or so, hoping to find what you're actually looking for, and then maybe give up, try different search terms, try again... To avoid that, we need a way to measure our search algorithms' effectiveness and report them so they can be improved. To do this, let's say that you perform a search and based on your judgement, this is how you would assess the results returned

Each result in this table is like another result in the ranked list that Google returns when you do a search. We'll use this data to explain precision and recall.

Precision
Think of precision as answering the question, "how many of the results that I got are actually relevant?" This question is answered as a percent. So in the example above, we say that we have 4 relevant results out of 10, or 40% (0.4). When it comes to search algorithm performance, a standard rule of thumb is to just use the first 10 results to measure this. We'll be using this throughout this blog post.

Recall
Think of recall as answering the question, "How many of all of the relevant results did the search show me?" I haven't told you how many total relevant results there are for the example above yet though so we need more information. Let's say that in the whole database (could be the web), there were actually only 5 relevant documents (webpages, whatever,...). Once we know that we would be able to say that the recall in this example was 4 out of 5, or 80% (0.8).
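Both measures are one-liners in code. In this sketch the list of judgements is hypothetical: it has 4 relevant results out of 10 like the example, but the exact positions of the relevant results are my assumption.

```python
def precision(relevance):
    """Fraction of the returned results that were judged relevant."""
    return sum(relevance) / len(relevance)


def recall(relevance, total_relevant):
    """Fraction of all relevant documents that made it into the results."""
    return sum(relevance) / total_relevant


# Hypothetical judgements for the top 10 results: 1 = relevant, 0 = not.
judgements = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
p = precision(judgements)
r = recall(judgements, total_relevant=5)
print(p, r)  # 0.4 0.8
```

The only difference between the two is the denominator: results returned for precision, relevant documents that exist for recall.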

Which one to use?
So would you rather use precision or recall? Think about it. Which question are you trying to answer? Do you want all of the results to be relevant, or do you want to make sure you return as many of the relevant results as possible? They're both desirable, but a little different.

For most Google searches I perform, precision seems to be more important because I really only need to find 1 or 2 websites that answer my question; having 100 websites that answer my question doesn't add much value.

If we're trying to do an exhaustive search for all of the documents that cover a certain topic (e.g. a patent search, or information search for a PhD dissertation), then recall seems to be what we care about.

In most applications there is a desirable balance between precision and recall. The easiest way to calculate a balance would be to just take the average. What happens when we do this? In the example above we would get 0.60 which seems to make sense. What if we have a different example where there are 2 relevant documents, but the results look like this

In this case the precision is 2/10, or 0.2. The recall is 2/2, or 1.0. If we take the simple average of precision and recall, we get 0.6 again. This doesn't make sense because even though the precision was much worse, the recall of 100% balanced it out. What we really want (instead of using a simple mean) is a way to get an average that is only high when both are high, and can get low when either precision or recall get low. One special way of doing this is called F1. The F1 formula looks like this
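In standard notation, with P for precision and R for recall, F1 is the harmonic mean of the two:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```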

This formula is pretty handy. Since precision and recall are always going to be values between 0 and 1, the numerator will always be between 0 and 2 (2 times 0, or 2 times 1). The denominator will also always be between 0 and 2 (0+0, or 1+1). The numerator will also never be larger than the denominator (they are only equal when precision and recall are both 1); this gives us F1 values that are always between 0 and 1. I want to show you some examples so you get a feeling for how this works. The table below shows a wide array of F1 values based on different values of precision and recall

I used some conditional formatting in Excel to make this table a little more visual. Red values are lower and green values higher. You can see that there is strong preference for instances when both precision and recall are high, and a penalty when either one of them is too low.
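Here's a short sketch that compares the simple mean with F1 for the two examples from earlier in this post. The function name `f1` is mine, and returning 0 when both inputs are 0 is a common convention I'm assuming to avoid dividing by zero.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for p, r in [(0.4, 0.8), (0.2, 1.0)]:
    simple_mean = (p + r) / 2
    print(f"P={p} R={r} mean={simple_mean:.2f} F1={f1(p, r):.3f}")
```

Both examples have the same simple mean of 0.6, but F1 drops noticeably for the second one because its precision is so low.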

F-Measure (general)
As mentioned above there may be times when you want your search engine to favour precision over recall, or vice versa. To deal with this problem there is a general form of the F1 formula that allows the user to adjust a value, Beta (β), in order to give more weight to precision or recall. This general form looks like this
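In its usual textbook form, with P for precision and R for recall, the general F-measure is:

```latex
F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}
```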

If you set beta equal to 1, then you'll find that this formula simplifies to the F1 formula I already showed you earlier. It may not be intuitive for people out there how you should change the value of beta to favour precision or recall, so I thought that I would create a small chart that would give you some examples of how the F value changes with different values of beta.
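To get a feel for beta without a chart, here's a small sketch using the standard F-beta formula. The function name `f_beta` and the two precision/recall pairs are illustrative assumptions, not values taken from the chart.

```python
def f_beta(precision, recall, beta):
    """General F-measure; beta > 1 leans toward recall, beta < 1 toward precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Illustrative values only: a high-recall result set and a high-precision one.
for beta in (0.1, 1, 10):
    print(beta, round(f_beta(0.2, 1.0, beta), 3), round(f_beta(1.0, 0.2, beta), 3))
```

With a small beta the score sits close to the precision value, and with a large beta it sits close to the recall value, which is why swapping precision and recall gives mirrored behaviour as beta changes.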

First thing you need to notice is that the horizontal axis on this chart is in a logarithmic scale. This should help you understand that if you really want to change the way the F-score favours precision or recall you need to think orders of magnitude, not small changes. I created the red and blue lines in the chart to show that switching the values of precision and recall doesn't create symmetric lines. That's OK though because when you use an F score, you're really only trying to figure out if the results are better than another; relative comparison.

Let's use this F-score chart to think through a theoretical example. Say that the chart above represents the results from one query for 4 different versions of a search engine I created. If that were true, I could then use the chart above to estimate a value of beta that seems to match the way I want the results to come out. Suppose that I think that the results from the purple version of the search engine are the best for what I'm looking for, then the green one, then the blue one, then the red one. In that scenario, I would need to pick a value of beta somewhere between 0.4 and 1 (remember it's a log scale so that cross-over point between the blue and green line is difficult to estimate visually).

Average Precision
Another way to try to combine the effects of precision and recall is Average Precision. To visualize this approach we'll plot what is called the precision recall curve. To create this curve, we have to figure out what the precision and recall is as you progressively work down the results from the query. If we do this for the 2 examples I used at the beginning of this post you'll get something like this.

It's important to see that the precision is calculated as you go. So for the first line in the first example, the precision is 1, because it's the first result and it's relevant. The second line has a precision of 0.5 because we still have the relevant result from the first line, but the 2nd one isn't relevant and we've only looked at 2 results (1/2). When you construct a precision recall curve, you only care about the points where you find a relevant result (green dots below). The curves for the 2 examples above look like this.

When you look at these 2 curves you can see that the results returned aren't perfect. If they were, there would be a string of green dots at precision 1.0 along the top, and then there would be a drop off at the end once all of the relevant results were returned. To compare against this ideal state, we can do something analogous to taking the area under the curves to measure how well the search performed. The formula for this is the sum of the precision at all of the green dots divided by the total number of relevant documents/results that could have been found (NOT the number of results returned in the query). For the first example that's (1 + 0.5 + 0.6 + 0.5) / 5 = 0.52. If you work through the 2nd example on your own you'll see that the average precision is 0.156 (much worse, and that shows in the curve). This 2nd example is so much worse because although the search returned all of the relevant results they showed up at the bottom of the list. This average precision metric does a great job of penalizing ranked lists if the relevant documents aren't near the top of the list.
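The average precision calculation can be sketched like this. The two relevance lists are assumptions I backed out of the numbers quoted above (they reproduce 0.52 and 0.156), and `average_precision` is a hypothetical helper name.

```python
def average_precision(relevance, total_relevant):
    """Sum precision at each relevant hit, divided by ALL relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this green dot
    return precision_sum / total_relevant


# Relevance lists assumed from the numbers quoted in the text (1 = relevant).
example_1 = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
example_2 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
ap_1 = average_precision(example_1, total_relevant=5)
ap_2 = average_precision(example_2, total_relevant=2)
print(round(ap_1, 3), round(ap_2, 3))  # 0.52 0.156
```

Dividing by the total number of relevant documents, rather than by the number of hits, is what penalizes a search that never finds some of them.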

These are the methods and variants of calculating precision and recall that I learned from professor ChengXiang Zhai's "Text Retrieval and Search Engines" Coursera course. There are other ways of measuring the effectiveness of a search algorithm that were presented in that course as well. Perhaps I'll write a post about that some other time in the future.

As always, I hope this helped somebody out there understand these topics more clearly!

Simple Text Retrieval Vector Space Model Explanation

2015-03-28T11:08:00.002-04:00

This post is going to be the beginning of my coverage of the content in the Text Retrieval and Search Engines course taught by ChengXiang Zhai at the University of Illinois through Coursera. I'm going to review the basic vector space model for text retrieval, and hint at some of the evolutions of this model and why they're important.

Vector Space Model
I'm not sure how many of you out there took linear algebra courses, or know much about vectors, but let's discuss this briefly, otherwise you'll be completely lost. Vector space doesn't look like outer space, it looks more like this if you look at a simple 2-dimensional vector.

The vector here starts at the origin of the graph and then extends out to the point (1,1). In text retrieval, the 2 axes of the graph might represent 2 search terms you're looking for (think googling "hard drive"). The vector on the graph would represent the query itself. Here, the fact that the vector points to (1,1) means that the query contains 1 instance of the term "hard" and one instance of the term "drive", and this could also be represented as {hard, drive}. On the same graph, we can add documents that are in our database (could also be the world wide web). There are a couple of ways to do this, but we'll start simple. The easiest way to do this is just search through the document(s) and determine if the search terms are present in the document. If they are, assign that term a 1, if not assign it a 0. This is called the bit vector approach. Assume that the following represents 2 documents with only the relevant text included.

D1 = {...it's hard to determine....}
D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}

Using the 1 and 0 rule explained above, I would assign D1 the vector (1,0) and D2 the vector (1,1). In this version of text retrieval these vectors are called bit vectors. Plotting these on the vector space graph we get this.

'q' here represents our "query". You can see that D2 matches the query perfectly using this system. D1 seems to be going in a different direction. Data scientists use this "closeness" of the vectors to measure similarity between the search terms and the documents reviewed. To do this, they use the dot product. A dot product is just a fancy way of saying you multiply the matching terms together and then add up the products. If you're familiar with MS Excel, this is like using the sumproduct() function. For our example above, the dot product between the query and D1 is 1x1+1x0=1. For D2 we get 1x1+1x1=2. The dot product value can be used to rank the documents. "Closeness" is ranked like a basketball score, the higher the better, so you can see that D2 ranks higher than D1.
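In code, the dot product from this example is a one-line function. The name `dot` and the tuple layout (hard, drive) are just my choices for the sketch.

```python
def dot(a, b):
    """Multiply matching components and add up the products."""
    return sum(x * y for x, y in zip(a, b))


q = (1, 1)   # (hard, drive)
d1 = (1, 0)
d2 = (1, 1)
print(dot(q, d1), dot(q, d2))  # 1 2
```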

Now I want to step back and explain the dot product a little more so you can get some intuition behind why the dot product works well for measuring the similarity of vectors. If you look up the definition of a dot product on Wikipedia you'll find that the dot product (A⋅B) can also be calculated this way A⋅B = ||A|| ||B|| cosθ. The ||A|| represents the length of the vector A. If you remember back to trigonometry, the length of the hypotenuse of a right triangle is calculated like this

If you use this thinking and apply it to vector space, you can see that for the query and the document we have right angle triangles and we can calculate the length of the vectors using this method. When we do this we get ||D1||=1, and ||D2||=||q||=1.414... where ||q|| represents the length of the query vector. Just because the lengths are equal doesn't mean that the vectors are the same. That is why the dot product definition includes the cosθ term. If the value of theta (θ) is close to 0, then cosine is close to 1; as theta gets closer to 90 degrees, cosine gets closer to 0. So, if you have two vectors that are pointed in almost the same direction, then the cosine between them will be ~1. This allows the ||A|| ||B|| to add to the relevance score. If the two vectors are 90 degrees apart from each other the ||A|| ||B|| will get multiplied by something close to 0, essentially eliminating the effect.

Now that we have that simple understanding of vectors and dot products let's look at this in 3 dimensions. In the diagram below we have a 3 word query that creates 3 dimensions to compare documents against. Here we're assuming the query is {hard, drive, test}.

If we look at several documents, we might have documents that contain the following text:
D1 = {...it's hard to determine...}
D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}
D3 = {...before I bought my new car, I took it out for a test drive...}
D4 = {...as part of the factory acceptance, every unit gets a hard drive test...}
D5 = {...I think I failed, that was a hard test...a standardized test is design to be hard for...}

When I put these documents in the 3 dimensional vector space diagram it looks like this

You can see the angles between the query vector and the document vectors and get a feel for which documents match closely. The actual dot product calculations for these query/document comparisons turn out like this using the binary bit vector approach described earlier.

Based on the dot product values, it's obvious that D4 is the best match. You can also see this in the last column that gives you the angle between the vectors. To calculate this angle I had to calculate the length of the vectors. This is VERY similar to doing it in 2 dimensions. 3D length is calculated like this

A, B, and C are perpendicular lengths in the vector space (distance along hard, test, and drive dimensions). You just square them all, add them up, and take the square root of the sum. This same principle applies to higher dimensions (e.g. the length of a 4 dimensional vector is the square root of the sum of the 4 squared terms/dimensions).
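Here's a rough sketch of how the dot products and angles for the 3-dimensional bit vector example could be computed. The bit vectors are my reading of the five document snippets, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def length(v):
    """Square every component, add them up, take the square root."""
    return math.sqrt(dot(v, v))


def angle_degrees(a, b):
    cos_theta = dot(a, b) / (length(a) * length(b))
    # Clamp to [-1, 1] so floating-point round-off can't break acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


query = (1, 1, 1)  # axes: (hard, drive, test)
bit_vectors = {
    "D1": (1, 0, 0),
    "D2": (1, 1, 0),
    "D3": (0, 1, 1),
    "D4": (1, 1, 1),
    "D5": (1, 0, 1),
}
for name, vec in bit_vectors.items():
    print(name, dot(query, vec), round(angle_degrees(query, vec), 1))
```

The `length` function works for any number of dimensions, so the same code would handle a 4-word query without changes.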

Term Frequency
The bit vector (1's and 0's) approach is a very simplistic model, however, and there are times when this approach doesn't give you very good results. Let's do the same problem over again with what is called the term frequency approach. This method counts the frequency of each of the terms found in the query, instead of just determining if the term exists in the document or not. If we apply this rule to the problem above, the vector space looks like this

And the overall dot product calculation table looks like this

In bit vector mode, D2, D3, and D5 were all considered equally relevant. Using term frequency we are able to say that D5 is a better match than D2, and D2 is a better match than D3. The problem though is that now D5 looks like it's a better match than D4, which doesn't make sense. One method to deal with this problem is to discount the score impact of terms that occur frequently in documents. In our case, if we discounted "hard", or "test", we might find that D4 comes out on top again. This method is called inverse document frequency and I'll write another post about that later. When I do, the link to it will be HERE.
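A minimal sketch of the term frequency approach, working directly from the snippets above. The regular expression used to split words is my assumption about tokenization, and the function names are hypothetical.

```python
import re
from collections import Counter


def term_frequency_vector(text, terms):
    """Count how many times each query term occurs in the text."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    return tuple(counts[term] for term in terms)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


terms = ("hard", "drive", "test")
query_vector = (1, 1, 1)
snippets = {
    "D1": "...it's hard to determine...",
    "D2": "...this hard drive has 100GB of memory...make sure the drive is fully installed...",
    "D3": "...before I bought my new car, I took it out for a test drive...",
    "D4": "...as part of the factory acceptance, every unit gets a hard drive test...",
    "D5": "...I think I failed, that was a hard test...a standardized test is design to be hard for...",
}
tf_scores = {name: dot(query_vector, term_frequency_vector(text, terms))
             for name, text in snippets.items()}
print(tf_scores)
```

With these counts D5 scores highest because "hard" and "test" each appear twice, even though it never mentions "drive"; that is the weakness inverse document frequency is meant to address.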

As always, I hope this post helps somebody out there.

Graph Pattern Mining (gSpan) - Introduction

2015-03-21T15:51:00.000-04:00

For those of you out there that follow this blog fairly regularly, you may have noticed that there has been a lull in the number of posts recently. This is not because I have not been working on my next post, it is because it has taken me quite some time to understand the topic described in this post. That being said, I'm assuming (yes I know what happens when people assume) that for many of you out there, this topic would drain several precious hours of your life to understand as well. So, I'm going to share what I've learned about the gSpan algorithm. In this post I will lay the foundation by describing how data scientists create some sort of order out of all of the different possible graph configurations that can exist. In my gSpan algorithm post, I'll describe how the information presented in this post is used to find frequent graph patterns. The information in this post is based on "gSpan: Graph-Based Substructure Pattern Mining" by Xifeng Yan and Jiawei Han, September 3, 2002.

Graphs and why they're tricky to pattern mine
First of all, let's start SUPER simple. What do people mean when they talk about finding patterns in graphs? Let's just say that they aren't talking about something like this:

They are actually talking about some sort of visual representation of something, like this:

I have no idea what the graph above represents...I totally made it up. But, it could look like a chemical composition diagram to some of you out there, or maybe a social network diagram, a family history tree...whatever. Essentially, graphs are systems that have nodes (the A, B and C in the picture above) and connections between them (solid, dashed and double lines above). In graph mining terminology they call a node a vertex and a connection an edge. If you ask me why, I have no idea, but that random knowledge might help you if you ever decide to pick up a technical paper on the subject.

Now, if you have followed some of my previous posts on pattern mining (Apriori Basket Analysis, or Generalized Sequential Pattern Mining) you might look at that graph above and ask yourself, where do I start with that thing? This is a major question that has vexed many mathematicians and data scientists. Let's just start by stating the fact that, in general, when we're dealing with graphs, we don't really care about the orientation of the graph, or how it "looks" per se; we really only care about what vertices are connected to each other and how. As an example here's that first graph again, with some other equivalent graphs.

You'll notice the term "Isomorphism" in that picture. That's just a fancy way of saying those graphs are essentially the same as graph 1 if you only consider the vertices and edges, but rearranged to look different. Since what we really want, is to be able to look for sub-graphs (or small subsets of the graphs in our database) and see if they are frequent, we would love to be able to create some sort of order for the graphs so we could kind of treat them like a sequential pattern. In sequential pattern mining we represent patterns something like this <A,(B,C)D,E,(E,F)>. There's a chronological order there: first A, then B and C, then D, then E, then E and F. But with a graph, where do you start??? For example, I can number the vertices of graph 1 several different ways

First of all, make a note that the numbering starts with the number 0; this is referred to as the root. Two of the variations above start at the C at the top (but is that really the "top"?). One starts at the A in the lower left corner, and the last one starts at the B. The order that the nodes are numbered can create a LOT of numbered graphs, but they're all different versions of the same thing. What a data miner really needs is a way to create some sort of sequential order so that we always number graph nodes/vertices the same way and therefore don't waste a whole bunch of time dealing with graphs that might be numbered differently, but represent the same thing we already have. Since I'm focusing on the gSpan algorithm, I'm going to describe how DFS codes solve this problem.

DFS in DFS code stands for depth first search. When we get into the gSpan algorithm, the reason for this will become more apparent, but for now just go with it. A DFS code is just a way of documenting the vertices and edges of a graph in tabular form, but with special rules. We'll represent each edge by defining 5 things. To do this, first we need to create codes for the features in our graphs.

With this coding system, we can turn the green edge in the 1st graph above into (0,1,C,c,B). The 0 is for the starting vertex; the 1 is for the ending vertex; 'C' is for the starting vertex label/type; 'c' is for the edge label/type; and 'B' is for the ending vertex label/type. The reason for having a code for both the vertex number and the vertex label is that if you're analysing a graph for frequent patterns, the vertex number doesn't give you much information about patterns, but we need it to keep track of where we are in the graph. If you finish coding all of the edges in that first graph you might come up with something that looks like this.

Since you don't have any rules at this point to guide you, you could have just as easily come up with DFS codes that look like any of the DFS codes below...or more

All of these examples start from 0 in the lower left corner, but are very different. It becomes obvious that we need to come up with a standardized way to order our graphs for several reasons: (1) so that we don't drive ourselves crazy comparing codes like the ones above only to realize they're the same graph, and (2) to make it easier to find a subgraph we might think is frequent in many different graphs that will likely have very different configurations. One of the things required to solve these problems is a set of rules called "Neighborhood Restriction" that limits the ways that we can create these lists of connections in our graphs. These rules govern the order that you add edges to the list when you're creating codes for a graph. The simplified English version of the rules listed in the technical paper goes something like this.

If the first vertex of the current edge is less than the 2nd vertex of the current edge (the current edge is a forward edge)

    If the first vertex of the next edge is less than the 2nd vertex of the next edge (the next edge is a forward edge)

        If the first vertex of the next edge is less than or equal to the 2nd vertex of the current edge
        AND the 2nd vertex of the next edge is equal to the 2nd vertex of the current edge plus one, this is an acceptable next edge
        Otherwise the next edge being considered isn't valid

    Otherwise (the next edge is a backward edge)

        If the first vertex of the next edge is equal to the 2nd vertex of the current edge
        AND the 2nd vertex of the next edge is less than the 1st vertex of the current edge, this is an acceptable next edge
        Otherwise the next edge being considered isn't valid

Otherwise (the current edge is a backward edge)

    If the first vertex of the next edge is less than the 2nd vertex of the next edge (the next edge is a forward edge)

        If the first vertex of the next edge is less than or equal to the 1st vertex of the current edge
        AND the 2nd vertex of the next edge is equal to the 1st vertex of the current edge plus one, this is an acceptable next edge
        Otherwise the next edge being considered isn't valid

    Otherwise (the next edge is a backward edge)

        If the first vertex of the next edge is equal to the 1st vertex of the current edge
        AND the 2nd vertex of the current edge is less than the 2nd vertex of the next edge, this is an acceptable next edge
        Otherwise the next edge being considered isn't valid
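The four cases above translate almost line for line into code. Below is a minimal Python sketch, assuming each edge is stored as a 5-item tuple (start vertex, end vertex, start label, edge label, end label); the function names are my own, and the sample list only contains the first three edges of Example 1 that are spelled out in this post.

```python
def is_forward(edge):
    """An edge (i, j, ...) is forward when i < j, and backward otherwise."""
    return edge[0] < edge[1]

def valid_next_edge(current, nxt):
    """Check the neighborhood restriction between two consecutive edges."""
    ci, cj = current[0], current[1]
    ni, nj = nxt[0], nxt[1]
    if is_forward(current):
        if is_forward(nxt):
            return ni <= cj and nj == cj + 1
        return ni == cj and nj < ci
    # The current edge is a backward edge.
    if is_forward(nxt):
        return ni <= ci and nj == ci + 1
    return ni == ci and cj < nj

def follows_rules(code):
    """True when every consecutive pair of edges in the code is an acceptable step."""
    return all(valid_next_edge(a, b) for a, b in zip(code, code[1:]))

# First three edges of Example 1, as written out in this post.
example_1_start = [(0, 1, "C", "c", "B"), (1, 2, "B", "a", "A"), (2, 0, "A", "b", "C")]
print(follows_rules(example_1_start))
```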

I know that's hard to follow, so I'll walk you through one and give you the answers to a couple more of these. It just so happens that Example 1 above meets all of these criteria. So, for (0,1,C,c,B) to (1,2,B,a,A) we see that 0,1 is a forward edge, and 1,2 is a forward edge. So we need to make sure that the 1 in (1,2,B,a,A) is less than or equal to the 1 in (0,1,C,c,B)...check; AND, we need to make sure that the 2 in (1,2,B,a,A) is equal to the 1 in (0,1,C,c,B) +1; 2 = 1+1...check. So the step from edge 0 to edge 1 is valid.

Let's do another one. In example 1 still, (1,2,B,a,A) is a forward edge because 1 is less than 2 and (2,0,A,b,C) is a backward edge because 2 is greater than 0. So we check if the 2 in (2,0,A,b,C) is equal to the 2 in (1,2,B,a,A)...check; AND we check if the 0 in (2,0,A,b,C) is less than the 1 in (1,2,B,a,A)...check.

If you keep going through Example 1, and the rest of the examples above, you'll see how hard it might be to create this type of order at random

I created example 1 to fit the rules on purpose. Examples 2, 3 and 4 were created at random. I was surprised to see how many edge steps I accidentally got right for those 3.

If you take some time and are careful, you can create several different DFS codes like example 1 that meet the neighborhood restriction rules, but are different from example 1. Below are 3 examples (1, 5, and 6) that meet the rules. To get these 2 other examples, I started at different locations in the graph. The easiest way to get these patterns (and the method that is actually used in gSpan) is to do the following: First you take a step forward, then try to make any backward connections possible connecting from smallest to largest (e.g. 3,0 comes before 3,1 if applicable). Then, you take another step forward (e.g. 3,4) and repeat until you're done with the graph.
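The "step forward, then make every backward connection, smallest first" recipe can be sketched as a recursive depth first search. This is only a rough Python illustration under my own assumptions: the graph is a simple undirected graph stored as adjacency lists, the three-vertex triangle below is a made-up stand-in (it is not the graph from the pictures), and the function and variable names are hypothetical.

```python
def dfs_code(adjacency, vertex_labels, edge_labels, start):
    """Build one DFS code by walking the graph depth first from `start`."""
    number = {start: 0}   # discovery number assigned to each visited vertex
    code = []

    def label(u, v):
        return edge_labels[frozenset((u, v))]

    def visit(v, parent):
        # Backward edges first: connect to already-numbered vertices, smallest number first.
        back = sorted(
            (w for w in adjacency[v] if w in number and w != parent),
            key=number.get,
        )
        for w in back:
            code.append((number[v], number[w], vertex_labels[v], label(v, w), vertex_labels[w]))
        # Then take a step forward to each neighbour that has not been numbered yet.
        for w in adjacency[v]:
            if w not in number:
                number[w] = len(number)
                code.append((number[v], number[w], vertex_labels[v], label(v, w), vertex_labels[w]))
                visit(w, v)

    visit(start, None)
    return code

# Hypothetical triangle: x (type C), y (type B), z (type A).
adjacency = {"x": ["y", "z"], "y": ["x", "z"], "z": ["y", "x"]}
vertex_labels = {"x": "C", "y": "B", "z": "A"}
edge_labels = {
    frozenset(("x", "y")): "c",
    frozenset(("y", "z")): "a",
    frozenset(("z", "x")): "b",
}
print(dfs_code(adjacency, vertex_labels, edge_labels, "x"))
```

Starting from a different vertex, or listing the neighbours in a different order, gives a different (but still rule-abiding) code for the same graph, which is exactly why an ordering between codes is needed.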

So, it appears we've gotten closer to figuring out the 1 way we should represent a graph, but we're not quite there. To fix this problem the creators of the gSpan algorithm have an ordering system that will allow us to figure out if example 1 is "less than" example 5, or if example 5 is "less than" example 6. This ordering system is great because once we have this, we just define THE representation of the graph (the one we'll be using in the gSpan algorithm) as the minimum DFS code possible. This order is called DFS Lexicographic Order. You may remember from other posts that lexicographic is just a fancy way of saying the order is going from smallest to largest (e.g. 1,2,3 or A, B, C...). To do this we also make use of the fact that we have given codes to our vertex types and edge types. Remember how the blue circles in our examples are 'A', and the dotted green lines are 'c'? 'A' is a vertex label and 'c' is an edge label. Here are the rules that govern this order that will help us get our "minimum" DFS code; I'm going to write them in a very simplified pseudo-code format.

Let X be one version of the graph's DFS code and Y be another version for the same graph

Assume that X > Y to start
Start by comparing the first edge (edge 0) of both DFS codes
Start a loop for comparisons going through all the edges

    Check each of the following rules against the current edge of X and the current edge of Y; if one applies, set X < Y and exit the loop

        1. Is X a backward edge and Y a forward edge?
        2. Is X a backward edge, Y a backward edge, and the 2nd vertex of X < 2nd vertex of Y?
        3. Is X a backward edge, Y a backward edge, 2nd vertex of X = 2nd vertex of Y, and the edge label of X is less than the edge label of Y?
        4. Is X a forward edge, Y a forward edge, and the 1st vertex of Y < the 1st vertex of X?
        5. Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, and the label for the first vertex of X is less than the label for the first vertex of Y?
        6. Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, the first vertex label of X = the first vertex label of Y, and the edge label for X is less than the edge label for Y?
        7. Is X a forward edge, Y a forward edge, the 1st vertex of X = the 1st vertex of Y, the first vertex label of X = the first vertex label of Y, the edge label for X = the edge label for Y, and the 2nd vertex label of X < the 2nd vertex label of Y?

    If you've made it to this section of code, you haven't proven that X is less than Y yet
    If there's another edge left in the graph to check, increment up one edge (e.g. go from edge 0 to 1)

End Loop
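Here is a rough Python rendering of the loop above, with the seven rules numbered in the order they are listed. It assumes edges are 5-item tuples (start vertex, end vertex, start label, edge label, end label) and that labels compare alphabetically; the function names are mine. It follows the simplified pseudo-code literally, meaning it keeps scanning whenever no rule fires on an edge. A stricter implementation of the paper's order would decide at the first edge where the two codes differ, so treat this as a sketch of the procedure in this post rather than of the paper.

```python
def is_forward(edge):
    """An edge (i, j, ...) is forward when i < j, and backward otherwise."""
    return edge[0] < edge[1]

def rule_that_fires(x, y):
    """Return the number (1 to 7) of the first rule showing x < y, or None."""
    xi, xj, x_from, x_edge, x_to = x
    yi, yj, y_from, y_edge, y_to = y
    if not is_forward(x) and is_forward(y):
        return 1
    if not is_forward(x) and not is_forward(y):
        if xj < yj:
            return 2
        if xj == yj and x_edge < y_edge:
            return 3
        return None
    if is_forward(x) and is_forward(y):
        if yi < xi:
            return 4
        if xi == yi:
            if x_from < y_from:
                return 5
            if x_from == y_from:
                if x_edge < y_edge:
                    return 6
                if x_edge == y_edge and x_to < y_to:
                    return 7
    return None

def first_proof_x_less_than_y(code_x, code_y):
    """Return (edge index, rule number) for the first rule that fires, or None."""
    for index, (x, y) in enumerate(zip(code_x, code_y)):
        rule = rule_that_fires(x, y)
        if rule is not None:
            return index, rule
    return None

# Only the edges quoted in this post; the full codes live in the pictures.
example_1 = [(0, 1, "C", "c", "B"), (1, 2, "B", "a", "A")]
example_5 = [(0, 1, "A", "a", "B"), (1, 2, "B", "c", "C")]
example_6 = [(0, 1, "B", "c", "C")]
print(first_proof_x_less_than_y(example_1, example_5))
print(first_proof_x_less_than_y(example_5, example_6))
```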

Again, like a lot of things in data mining, this might look complex, but when you see it in a couple of examples it is pretty easy to follow. If we apply this to Example 1, 5, and 6 above you'll see what I mean.

Let's let example 1 be DFS code X and example 5 be DFS code Y in the code above. For the first edge, they are both forward edges with 0,1. So, rules 1 through 4 in the code don't apply. When we look at rule 5, we see that the 'C' in (0,1,C,c,B) is "greater than" the 'A' in (0,1,A,a,B) so rule 5 doesn't apply either. Because the first vertex label 'C' and 'A' aren't equal, rules 6 & 7 don't apply. So, comparing the first edge (edge 0) didn't give us the answer. We move on to the next edge (edge 1). For X we have (1,2,B,a,A) and for Y we have (1,2,B,c,C). The first two items match and they're forward edges so we jump all the way down to rule 5. We see that the first vertex labels match too so we go to rule 6. When we look at rule 6 we see that the edge 'a' is "less than" the edge 'c' so we get to say that X<Y, or Example 1 is "less than" Example 5. We don't have to check the rest of the edges because we break out of the loop once we find just one example of one of the 7 rules being met.

Let's compare example 5 (X) and example 6 (Y). Starting at edge 0, we see that they're both forward edges with (0,1,...), so rules 1-4 don't apply. When we look at rule 5 we see that the 'A' from (0,1,A,a,B) is "less than" the 'B' from (0,1,B,c,C) which satisfies rule 5. This makes example 5 "less than" example 6; the loop breaks and we have our answer.

So, for our 3 examples, we know that example 1 < example 5 and example 5 < example 6. Just like other places where inequalities are used, this information tells us that example 1 < example 6, or to be complete example 1 < example 5 < example 6.

With the information and understanding we've covered in this post we are armed to tackle the gSpan algorithm itself. When I complete that blog post, I'll add a link to it right HERE. Hope this helps somebody out there!

Generalized Sequential Pattern (GSP) Mining

2015-03-10T21:06:00.003-04:00

This is going to be my first post about sequential data pattern mining. I'm starting this post by explaining the concept of sequential pattern mining in general, then I'll explain how the generalized sequential pattern (GSP) algorithm works along with its similarities to the Apriori method.

Sequential Pattern Mining

If you're a store owner and you want to learn about your customers' buying behaviour, you may not only be interested in what they buy together during one shopping trip. You might also want to know about patterns in their purchasing behaviour over time. "If a customer purchases baby lotion, then a new-born blanket, what are they likely to buy next?" With information like this, a store owner can create clever marketing strategies to increase sales and/or profit.

To get this data, the Apriori algorithm has to be modified to account for this time delay between transactions. For this portion of the discussion I'll be referencing a technical paper by Rakesh Agrawal and Ramakrishnan Srikant entitled "Mining Sequential Patterns". Say you start with a database full of transactions that looks something like this.

Suppose that the transaction data represents the day of the month that the item was purchased. As humans, we can see some repeat customers, and make some guesses about some of the repeat patterns that might exist, but we want an automated approach. An easy way to find sequential patterns is to use a modified version of the process developed for the Apriori algorithm, which is meant for single purchases. To do this we just sort the table first by customer ID, and then by transaction time stamp. We'll get something that looks like this.
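If you wanted to do this sort-and-group step in code, it could look something like the Python sketch below. The rows are hypothetical (they are not the transactions from the table above), and the helper names are my own.

```python
from itertools import groupby

# Hypothetical (customer_id, day_of_month, items) rows, assumed for illustration.
transactions = [
    (2, 5, {"B"}),
    (1, 3, {"A"}),
    (1, 9, {"B", "C"}),
    (2, 1, {"A", "D"}),
    (1, 12, {"D"}),
]

def customer_sequences(rows):
    """Sort by customer ID, then by time stamp, and collect each customer's item sets in order."""
    ordered = sorted(rows, key=lambda row: (row[0], row[1]))
    return {
        customer: [frozenset(items) for _, _, items in group]
        for customer, group in groupby(ordered, key=lambda row: row[0])
    }

def format_sequence(sequence):
    """Write a sequence like <A(BC)D>, with items bought together in parentheses."""
    parts = []
    for element in sequence:
        items = "".join(sorted(element))
        parts.append(f"({items})" if len(element) > 1 else items)
    return "<" + "".join(parts) + ">"

for customer, sequence in customer_sequences(transactions).items():
    print(customer, format_sequence(sequence))
```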

You'll also notice an extra column at the right that takes the sorted information and turns it into customer sequences. The reason for this is that we can utilize this format to data mine sequential patterns much like patterns within transactions are mined (e.g. basket analysis). How people format the items in the sequence varies a little bit among the couple of technical papers I've read, but let me explain how I've formatted it above. Every sequence of transactions starts with a '<' and ends with a '>'. That's all those symbols mean. The next thing you need to know is that any time 2 or more items (letters in my example) are surrounded by parentheses (e.g. "(AB)", which might be an apple and an orange), those items were purchased at the same time. If any items are not surrounded by parentheses, then that means that when those items were bought, the customer didn't buy anything else with them. So if we had a sequence like <A(BC)DE(FG)> that would mean that there were 5 different transactions. The customer first bought A, 2nd B and C together, 3rd D, 4th E, and 5th F and G together. Also, I'm not sure if you noticed, but I always have the items within parentheses ordered alphabetically. It is common to sort the items bought together from least to greatest (the fancy technical term for this is lexicographic order). With that understanding, now we can start talking about how the GSP Algorithm works.

GSP Algorithm

The Generalized Sequential Pattern algorithm was created from a simpler algorithm for mining sequences, but it has some extra bells and whistles added so it can be more flexible for different situations. To explain the process, I'm going to start with the basics, then add the bells and whistles at the end. The beginning of the GSP algorithm is basically the same as the Apriori Algorithm, because for 1 item, there really is no order. So, we find all of the items in our customer ID sequence database that meet the minimum support threshold. We'll use the data above and say that the minimum support is 2 in this case. If that's true we get the table below.

There are a couple of tricky things here. The more obvious one is that item E did not have enough support to meet the threshold, so it gets removed. Based on the Apriori principle, we are able to conclude that because E is not frequent, then any sequence with E in it would not be frequent either. So, we don't need to keep track of it anymore.

The next tricky thing is that the support for B is listed at 5, not 6, even though the 3rd customer bought item B twice. This practice is also borrowed from the Apriori algorithm. In the Apriori algorithm, if a customer buys 2 candy bars at once, then we only count 1 candy bar when calculating the support, because we count transactions. The same applies here, except instead of counting how many transactions contain an item, or itemset, we are counting how many customers have an item/sequence.
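Counting each item once per customer is easy to get wrong, so here is a small Python sketch of that first pass. The customer sequences are invented for the example (the real ones are in the table above), and `item_support` is just a name I picked.

```python
from collections import Counter

def item_support(sequences):
    """Count, for each item, how many customers bought it at least once."""
    support = Counter()
    for sequence in sequences.values():
        items_seen = set().union(*sequence)   # each item counts once per customer
        support.update(items_seen)
    return support

# Hypothetical customer sequences: customer 1 buys B twice, but it only counts once.
sequences = {
    1: [{"A"}, {"B"}, {"B"}],
    2: [{"A", "B"}],
    3: [{"E"}],
}
min_support = 2
support = item_support(sequences)
frequent_items = sorted(item for item, count in support.items() if count >= min_support)
print(dict(support), frequent_items)
```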

Once we have these items we start the iterative process of generating larger patterns from these items and checking if they have support. This is where things get really interesting. In Apriori, the total possible combinations would be AB, AC, AD, AF, AG, BC, BD, BF, BG, CD, CF, CG, DF, DG, FG. When generating sequences, that is not nearly sufficient because the order matters, AND we can have sequences where single items repeat (e.g. I bought A, then the next day I bought A again). So we have to create tables that look like this

Because order matters, there are a lot more options. Oh and in case you forgot, there is also the possibility of 2+ items being bought in the same transaction

The diagonal values are blank here because they're already covered in the first 2-item table. Just to make sure you understand, (AA) isn't there because if A and A were bought in the same transaction it would only be counted once, so we never have two of the same item together within parentheses. The lower left is also blank, but this is because of the convention I already described of always listing items purchased together in ascending order.
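The two tables can be generated in a few lines of Python, which is also a handy way to double check the candidate count. This sketch uses the six frequent items from this example (A, B, C, D, F and G); the string formatting is just my own convention.

```python
from itertools import combinations, product

frequent_items = ["A", "B", "C", "D", "F", "G"]

# Ordered pairs, repeats allowed: bought on two separate occasions (e.g. <AA>, <AB>, <BA>).
ordered_pairs = [f"<{x}{y}>" for x, y in product(frequent_items, repeat=2)]

# Unordered pairs, no repeats: bought together in one transaction (e.g. <(AB)>).
same_transaction = [f"<({x}{y})>" for x, y in combinations(frequent_items, 2)]

candidates = ordered_pairs + same_transaction
print(len(ordered_pairs), len(same_transaction), len(candidates))  # 36 15 51
```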

So, now that we understand all of that, we can check our 51 possible 2-item sequences against the database to see if they meet the support threshold. I'll walk you through the first row of 2-item sequences to show you how this is done (see the table below).

In the first column, we see that there are no customers that ever bought item A on more than 1 occasion, so support there is 0. For <AB>, you can see that customers 1 and 5 bought those items in that sequence. For <AC>, notice that customer 4 bought (AB) together then C, but we get to pick out just A from that combined transaction to get our <AC> sequence. Notice that the items don't have to be right next to each other. It's OK to have other items in between. <AD> follows similar logic. For <AF> we don't count customers 3 or 4 because although they have A and F, they aren't in the right order. To be counted here, they need to be A, then F. When we look for the support of <FA> those will get counted. There is one last rule you need to know for counting support of sequences. If you are looking for support of a sequence like <FG>, then customer 1 doesn't count. It only counts if F is bought before G, NOT at the same time. If you work through the rest of the 2-item sequences that weren't purchased together (i.e. not ones like (AB)), you get support counts that look like this
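These counting rules boil down to a subsequence check: each element of the pattern has to be found inside a later transaction than the previous element. Below is a minimal Python sketch, assuming each transaction is a set of items; the three customer sequences are made up to mirror the situations described above and are not the actual database.

```python
def supports(customer_sequence, pattern):
    """True if the pattern's elements appear, in order, inside the customer's sequence."""
    position = 0
    for wanted in pattern:
        # Skip transactions that do not contain every item of this pattern element.
        while position < len(customer_sequence) and not wanted <= customer_sequence[position]:
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1   # the next pattern element must come from a later transaction
    return True

def support_count(sequences, pattern):
    """Number of customers whose sequence contains the pattern."""
    return sum(1 for seq in sequences if supports(seq, pattern))

# Hypothetical customers, assumed for illustration.
bought_together_then_c = [{"A", "B"}, {"C"}]      # like <(AB)C>
f_before_a = [{"F"}, {"D"}, {"A"}]                # like <FDA>
f_and_g_together = [{"F", "G"}]                   # like <(FG)>

print(supports(bought_together_then_c, [{"A"}, {"C"}]))   # A picked out of (AB), then C
print(supports(f_before_a, [{"A"}, {"F"}]))               # wrong order for <AF>
print(supports(f_and_g_together, [{"F"}, {"G"}]))         # same transaction, so not <FG>
```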

You see that I have struck out any 2-item sequences that do not meet the support threshold of 2. Now let's take a look at the 2-item sets that could have been purchased together (e.g. (AB)). This one is a little bit easier, because there are only 5 times in our database when 2 items were purchased at the same time. They were (FG), (AB), (AB), (BC) and (DE). Only (AB) shows up twice so we keep it and get rid of the rest. Now we have these 2-item sequences left to work with:

Now we start the next iteration looking for 3-item sequences. This is where I'm going to start referring to "Mining Sequential Patterns: Generalizations and Performance Improvements" by Ramakrishnan Srikant and Rakesh Agrawal. In that paper they describe a process for generating candidate sequences that I think is pretty cool, now that I understand it. You take all of the remaining sequences from the last step (AB, AC, AD, AF, AG, BC...), then you remove the first item from the sequence. For AB, we remove A and we're left with B. Do this for all of the sequences (see 2nd column in table below). Then you do the same thing, but remove the last item (3rd column in table below). This is pretty straightforward except when you're dealing with a sequence that ends in a multiple item purchase like (AB). When that happens, you have to remove all of the possible items one at a time and treat them separately. So if you're removing the "first" item from (AB) you remove A and get B leftover, and then you remove B and get A leftover. Hopefully I haven't lost you in that explanation. If I did, I think reviewing this table will demonstrate what I'm talking about.

Now that we have this information (if we were programming this we wouldn't actually create a table...it's just to make it easier to see and understand for this blog post), we combine the sequences together where their "-1st" and "-Last" columns match. So if we're starting from the top, we see that the 2-item sequence AB matches up with BC, BD, BF, BG and (AB) to create ABC, ABD, ABF, ABG and A(AB). The A(AB) one is tricky because you have to remind yourself that the order within the parentheses is just convention and it could easily be written A(BA) which makes it easier to see that the B's match up. Working through the rest of the table this way (being careful not to create any duplicate candidates) we populate the "3-seq after join" column in the table below
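For the simple case the join is only a couple of lines of code. The Python sketch below is deliberately restricted to sequences made of single-item purchases written as plain strings (so "AB" means A, then B); it ignores same-transaction elements such as (AB), which means it will not produce candidates like A(AB). The function name is my own.

```python
def join_candidates(frequent):
    """Join k-item sequences into (k+1)-item candidates where "-1st" matches "-Last"."""
    candidates = set()
    for left in frequent:
        for right in frequent:
            # left without its first item must equal right without its last item
            if left[1:] == right[:-1]:
                candidates.add(left + right[-1])
    return sorted(candidates)

# A handful of the frequent 2-item sequences mentioned above.
print(join_candidates(["AB", "BC", "BD", "BF", "BG"]))
```

Using a set takes care of the "don't create duplicate candidates" warning automatically.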

The next step is to prune these 3-item sequences. Many of them have portions of their 3-item sequences that aren't supported from the previous round of 2-item sequences. A good example of this is the candidate sequence AFA. AF and FA are supported 2-item sequences, but AA is not a supported 2-item sequence, so this one gets pruned out. This pruning is based on the same Apriori principle that we have used in prior posts. This pruning helps save computing time because we don't want to comb through the database to find the support for a sequence that we know won't meet the support threshold. Now that we have this slimmed down list of 3-item sequences, we scan the database again to determine support just as it was described above for 2-item sequences. When that is completed we end up with the 3-item sequences in the last column of the table above.
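The pruning test can be sketched in Python as "drop each item in turn and make sure what is left was frequent in the previous round". As with the join, this sketch only handles sequences of single-item purchases written as strings, and the function name is an assumption of mine.

```python
def survives_pruning(candidate, frequent_smaller):
    """Keep a candidate only if every sequence made by dropping one item is frequent."""
    for drop in range(len(candidate)):
        shorter = candidate[:drop] + candidate[drop + 1:]
        if shorter not in frequent_smaller:
            return False
    return True

# AF and FA are frequent, but AA is not, so AFA gets pruned.
print(survives_pruning("AFA", {"AF", "FA"}))
# AB, AC and BC are all frequent, so ABC moves on to the database scan.
print(survives_pruning("ABC", {"AB", "AC", "BC"}))
```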

At this point it is just "lather, rinse, repeat" as my shampoo bottle suggests. Take that list of 3-item sequences and generate 4-item sequences using the same method. We remove the first and last items and put them in the "-1st" and "-Last" columns below.

Then we look for sequences that have a "-1st" value that matches a "-Last" value of another sequence. In this case we only find matches for "BF" and "BG". You should prune the sequences if they aren't supported by 3-item sequences that are supported/frequent, but in this case they're fine. Then we check the database for support to finalize the list. If you do that you end up with information similar to the table below.

You can see that, at this point, this example gets pretty simple. If you try the iteration again, the trick of removing the first item and last item yields no matches and you're done. After that you would just output all of the supported/frequent sequences to the user, which in this case would be the following (as long as I don't make a copy and paste mistake :/ ).

Bells and Whistles
I promised to describe the bells and whistles for this process as well. Here's a list of extras that can be added

Taxonomies/Hierarchies: Harry Potter and the Sorcerer's Stone is part of a series by J.K. Rowling, which is part of the children's fantasy genre. Taxonomies like this can be incorporated so that more general sequences can be found if the more specific ones aren't supported. The way that the algorithm handles these taxonomies is to add in the higher level items (e.g. J.K. Rowling is higher level than Harry Potter) to the combined transactions. For example if you had a sequence like this <A(1, B)A> then you would translate that into a sequence like <(A, Letter, Symbol)(1, Number, B, Letter, Symbol)(A, Letter, Symbol)> if you had a taxonomy like the one here:
Window Size: If you've had a customer for years, maybe it isn't that interesting if they support a particular pattern because they have so much buying behaviour on record that they might just support it by chance. Window sizes limit the span of time when transactions are deemed to support a sequence. (e.g. I only want buying patterns for pregnant women during a 9 month window)
Max-gap: This parameter is defined by the user to filter out large gaps in data sequences. Let's say a customer buys an iPod, then even though they are within the specified window size, they have a large gap between that purchase and a new set of headphones. To a business owner, this gap might make them "uninteresting" as a customer to market to, so the GSP algorithm can filter this out as it is running and checking for support
Min-Gap: Think of this as answering the question "how much time would I let pass between transactions before I would consider 2 purchases separate elements in my sequences?" An example of this would be me at Home Depot. I go to start a project on a Friday evening and buy what I think I need. It isn't until Saturday about noon that I realize I had no idea what I REALLY needed and go back for more stuff. Some store owners may want to treat these as one "transaction" or event. If the min-gap was set to 24 hours in this case, all of my purchases would be grouped together to look something like this <(Friday night stuff, Saturday noon stuff)>.
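As a taste of how one of these extras might be coded, here is a small Python sketch of the taxonomy expansion from the first bullet. The parent links are my reading of the <A(1, B)A> example (A and B are Letters, 1 is a Number, and both are Symbols), so treat the `parent` dictionary and the function names as assumptions.

```python
# Assumed taxonomy: each item or category points to its parent category.
parent = {
    "A": "Letter",
    "B": "Letter",
    "1": "Number",
    "Letter": "Symbol",
    "Number": "Symbol",
}

def with_ancestors(item, parent):
    """Return the item together with every category above it."""
    found = {item}
    while item in parent:
        item = parent[item]
        found.add(item)
    return found

def expand_sequence(sequence, parent):
    """Add the higher level items to every transaction in the sequence."""
    return [
        set().union(*(with_ancestors(item, parent) for item in element))
        for element in sequence
    ]

# <A(1, B)A> from the example above.
print(expand_sequence([{"A"}, {"1", "B"}, {"A"}], parent))
```

After the expansion, the usual support counting can find a general pattern like "a Letter, then a Number" even when no specific pair of items is frequent.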

I admit that there are more details to how you might actually implement these bells and whistles if you had to code them. For those that are interested in that I recommend you let me know you want another post on that. Otherwise, I refer you to "Mining Sequential Patterns: Generalizations and Performance Improvements" by Ramakrishnan Srikant and Rakesh Agrawal.

For those of you with the perseverance to get to the end of this lengthy post I congratulate you! As always, I hope this was helpful for somebody out there.

Null-Invariant Measures of Interestingness

2015-03-03T21:39:00.000-05:00

In 2 of my previous posts about measures of interestingness (Lift and Chi-Square) I mentioned I would be writing a post about null-invariant measures of interestingness. I'm a man of my word so here you go. The problem I presented in those posts can be summarized with these 2 tables.

If you click back to those previous posts (links above) you'll see that the lift for the first table was 0.714 and the lift for the 2nd table would be calculated as 0.929. The chi-square value for the first table was 0.586 and the value for the 2nd is calculated as 0.022. For both these measures we get different numbers when the size of the bottom right cell ("null" count) is changed. The table below shows this information in a slightly different view.

Since we're pattern mining, and we're really only interested in the interaction between the two items (apples and oranges in this example), we'd love to have a way to measure interestingness that doesn't include that "null" count in the calculation at all. Data scientists call this type of an interestingness measure "null-invariant". Let's see what they've come up with to help us out here.

Based on a white paper entitled "Selecting the Right Interestingness Measure for Association Patterns" written by Pang-Ning Tan, Vipin Kumar and Jaideep Srivastava in 2002, there are 3 measures for interestingness that have this "null-invariant" property that we're looking for. The three that they found are Confidence, Cosine, and Jaccard. The Confidence measure actually has at least 3 variations to it now that I'm aware of so we'll cover 5 different null-invariant measures in total here.

Confidence
If you've read my post about support and confidence, you may remember near the end that I mentioned that confidence gives different values based on the way it's calculated (e.g. the confidence for Apples → Oranges may not be the same as the confidence for Oranges → Apples). If that is the case, which one should you pick? With so many potential patterns that can be found in data already, it is preferable for data scientists to find a way to combine these 2 options to get 1 answer instead of having to duplicate all of their results to show both sides of the same coin all the time. To do this, they have taken the original confidence equation P(A|B), or P(B|A) depending on how you look at it, and created the following alternate forms

At this point in time it is my duty as your guide to understanding data mining in a simple way to calm you down if the stuff in the definition column gave you heart palpitations. All three of these are actually VERY easy to understand if you already understand confidence. I'm actually going to work from the bottom of this list up. Max Confidence basically says, I'm just going to pick the biggest value of confidence between the 2 ways of looking at it. Kulczynski (pronounced KƏl-zin’-skee) takes the approach of averaging the 2 together instead. For All Confidence, it may not be apparent, but you're actually taking the minimum value of confidence. Confidence is just the count (or support) of records that have both items (A∩B) divided by the count of the first item in a statement like A→B. So if you're dividing by the larger of the support for A and the support for B, then you're really just finding the minimum value of confidence between the 2 options, A→B and B→A. So, like I said, it might look scary, but it's simple.
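To show how little is going on under the hood, here is a short Python sketch of the three variations. The counts are invented (30 records with A, 20 with B, 10 with both) and are not the apple and orange numbers from the tables; the function name is mine.

```python
def confidence_measures(count_a, count_b, count_both):
    """All Confidence, Max Confidence and Kulczynski from three raw counts."""
    conf_a_to_b = count_both / count_a   # confidence of A -> B
    conf_b_to_a = count_both / count_b   # confidence of B -> A
    return {
        "all_confidence": count_both / max(count_a, count_b),
        "max_confidence": max(conf_a_to_b, conf_b_to_a),
        "kulczynski": (conf_a_to_b + conf_b_to_a) / 2,
    }

# Hypothetical counts, assumed for illustration.
measures = confidence_measures(count_a=30, count_b=20, count_both=10)
print(measures)
```

Notice that the grand total of records never shows up in the function, which is what makes all three null-invariant.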

Cosine
Another good interestingness measure is the cosine function. The way that I think about the cosine function is that it is essentially the lift function in disguise, but is null-invariant. Let me show you what I mean. Here are the equations for both

You can see that the only real difference between the 2 is that Cosine has a square root in the denominator. There's a good reason for this. In my post about lift, I mentioned that to calculate lift correctly, you have to make sure you are using the support as a percent of the total records in the database. If you don't do this, the lift value is wrong. The fact that you have to divide by the grand total of all the records is also why the lift measure is susceptible to changes in the number of null records. Cosine overcomes this with a trick I was taught in intermediate school

When I first saw the lift equation written out this way I really wanted to get the "grand total" values to cancel out because it would be so much cleaner. Another way of writing (or thinking about) lift is something like this: (AB/G)/((A*B)/G^2), where AB is the count of A & B together, A is the count of A, B is the count of B, and G represents the grand total. Using fraction math, you can cancel out one of the grand totals, but not the 2nd one, leaving you with (AB)/(A*B/G). Cosine takes care of this problem. When we take the square root of the denominator, we can cancel out the grand total from the function entirely, making the measure null-invariant. It looks like this: (AB/G)/sqrt((A*B)/G^2), which simplifies to AB/sqrt(A*B).
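You can check the cancellation numerically with a few lines of Python. The counts below are made up (A = 30, B = 20, AB = 10), and the only thing that changes between the two runs is the grand total G, which is the same as adding more "null" records.

```python
from math import sqrt

def lift(count_a, count_b, count_both, grand_total):
    """(AB/G) / ((A/G) * (B/G)): one grand total survives the cancellation."""
    return (count_both / grand_total) / ((count_a / grand_total) * (count_b / grand_total))

def cosine(count_a, count_b, count_both, grand_total):
    """(AB/G) / sqrt((A/G) * (B/G)): the grand total cancels out completely."""
    return (count_both / grand_total) / sqrt((count_a / grand_total) * (count_b / grand_total))

# Hypothetical counts; only the grand total differs between the two rows.
for grand_total in (100, 1000):
    print(grand_total, lift(30, 20, 10, grand_total), cosine(30, 20, 10, grand_total))
```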

Jaccard

The Jaccard function is defined as |A∩B|/|A∪B|. To explain what that means let's go to Venn diagram world.

As a reminder, that upside-down U symbol is an intersection and is represented by the orange area in the diagram above. The right-side-up U symbol is a union and is represented by all of the area that is either yellow, orange or red. The '|' symbols in the formula are just a notation that is used to indicate that we want the count of items in that set, instead of the set itself. Now that that is fresh in our minds again, the numerator of the Jaccard is the orange area, and the denominator is the yellow-orange-red area, NOT double counting the orange area because of the perceived overlap. With that understanding, you can see why Jaccard is null-invariant; it doesn't use any data from the blank space, the null records, to be calculated. I could make the white space in the Venn diagram above HUGE and the Jaccard would still have the same value.
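Here is the same idea as a tiny sketch using Python sets. The transaction IDs are made up; I only picked them so that 8 transactions have apples, 7 have oranges and 2 have both, like the apple and orange example.

```python
def jaccard(set_a, set_b):
    """|A ∩ B| / |A ∪ B| for two sets of transaction IDs."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical transaction IDs, chosen only to match the counts 8, 7 and 2
apple_ids = set(range(1, 9))    # IDs 1..8
orange_ids = set(range(7, 14))  # IDs 7..13, so 7 and 8 contain both

print(jaccard(apple_ids, orange_ids))  # 2 / 13
```

Transactions with neither item never appear in either set, so adding a million of them can't change the answer.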

Comparisons

Lastly, let's take a look at our apple and orange example one more time with all of these different measures of interestingness.

First of all, in the 2nd row, you can see that all of the measures we've added to this table aren't affected by the different "null" number. I also have added a couple of rows to demonstrate how the different measures change with different values in the table. In row 3 you can see what a positive correlation looks like, instead of a negative one. In row 4, you can see what no correlation looks like. It's interesting to note that for Jaccard no correlation is represented by a value of 1/3. Then in rows 5 and 6 I showed the extremes of these null-invariant measures...all of them are bound between 0 and 1. 0 means STRONG negative correlation, and 1 means STRONG positive correlation.

Row 7 is there as a tale of caution. Based on an astute comment on this post, I revised the table above so that you can compare rows 1, 2, and 3. They all have the same values except for the "null" number. If the "null" number gets too big, like in row 7, then "someone who has bought an orange is far more likely than the general population to buy an apple". This is not immediately obvious, and can be a stumbling block when using null-invariant measures of interestingness.

I think that sums it up. I hope that helps!

Chi-Squared

2015-02-28T15:33:00.005-05:00

Another method of "interestingness" that can be used in data mining is the chi-square test. This test is actually one that I have a fair amount of experience with from my six sigma and manufacturing background. I'll use a simple example from my past to walk you through how it is calculated first, then give some warnings about ways it can be misused.

Suppose you have a factory that makes widgets and have 2 machines in the manufacturing process that perform the same step, say they're both plastic injection molding machines. When parts come out you have an inspector that classifies the parts as good or bad. Over time, you collect all of this data and come up with a table like the one below.

It looks like they produced roughly the same quantity of bad parts, but machine 2 also made fewer parts overall. What chi-squared will do is help us determine if that difference is (statistically) significant. The first thing we need to do is determine what we would expect if the 2 machines had the same quality. To get this expectation we use the values in the row total and column total. If there were no real difference between the 2 then we would expect there to be a good ratio of 605/657=92.08% for both machines. To get the bad ratio (defect rate) we just take 1 minus the good ratio and we get 7.92%. Now we just need to account for the fact that the 2 machines produced different quantities. With these expected good/bad ratios we can calculate how many good parts we expect if we produce 400 or 257 parts. We would expect that machine 1 would produce 400*92.08% = 368.34 good parts and 400*7.92% = 31.66 bad parts. We would expect that machine 2 would make 257*92.08% = 236.66 good parts and 257*7.92% = 20.34 bad parts. This gives us an expected table that looks like this.

The formula to calculate any expected value in the table above is [Row Total]*[Column Total]/[Grand Total]. If you look back to the logic in the paragraph above, you'll see that is exactly what we did.

Now that we have expected values, we can calculate the chi-square statistic. This is done by taking each cell (e.g. good parts from machine 1 as one example) and calculating (observed value - expected value)^2/(expected value). For the upper left cell this would be (375-368.34)^2/368.34 = 0.1203. After doing this for each cell we get a table that looks like this.

Then you just have to add these up to get your chi-square statistic; in this example it's 3.89. A chi-square statistic doesn't mean much unless you know how many degrees of freedom you have. Don't know what degrees of freedom means? That's OK. For now, all you need to know is that for a chi-square test it is equal to (# of rows - 1)*(# of columns - 1). We just have a 2x2 table here so we get (2-1)*(2-1) = 1 degree of freedom. The 3.89 chi-squared value and the 1 degree of freedom are used to look up a p-value. I think of a p-value as a probability value. When you're looking at a p-value, you are looking at the probability of seeing a difference at least this big if nothing interesting were going on in your data. So, if you want to find something interesting, you're hoping for REALLY small p-values. In order to get the p-value I almost always use MS Excel. The Excel function would look like this "=CHIDIST(3.89,1)". For our problem, we get a p-value of 0.0486. This can be interpreted as a 4.86% chance of seeing a difference this large if nothing interesting were happening here. A common threshold that statisticians use for this p-value is 5%. Since our 4.86% is less than 5%, we would say that this difference is statistically significant.
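If you don't have Excel handy, the whole calculation fits in a few lines of Python. This is only a sketch under a couple of assumptions: the cells other than the 375 are ones I backed out from the totals quoted above (400, 257, 605 and 657), and the p-value shortcut with erfc only works for 1 degree of freedom, which is all a 2x2 table needs.

```python
import math

# Rows: machine 1, machine 2. Columns: good, bad.
# 25, 230 and 27 are derived from the totals in the text, not read off the table.
observed = [[375, 25], [230, 27]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected value for each cell: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# With exactly 1 degree of freedom the upper tail is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2), degrees_of_freedom, round(p_value, 4))
```

For tables bigger than 2x2 you would need a proper chi-square distribution function from a statistics library instead of the erfc shortcut.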

Now that we know the mechanics of how to calculate it, let's talk briefly about the intuition behind the numbers. If 2 features (e.g. good/bad and machine 1/2) have nothing interesting going on what is the chi-square value going to be? To start, you would expect the values in the expectations table to be exactly the same as the data you started with. Once you know that, you also know that all of the values in the chi-square table will be zero; (observed-expected)^2/expected will give you 0 divided by something, which is zero. The chi-square value can never be negative because the numerator is a squared value, and the denominator is an expected number of positive counts. The least interesting thing possible is 0, and the most interesting thing would be...some ridiculously large number (theoretically infinity), but in real life you don't ever get infinity as the answer.

Now, after all that explanation, I've got to let you down a little bit because chi-square isn't very good for data mining applications for the same reason why the lift measure has problems. If we use the same example that I used in my lift post, we have transactions for apples and oranges like this

If I follow the instructions above to calculate chi-square for the first table, I get 0.586. If I do the same thing for the 2nd table, I get 714,283.7...hmmm. In theory the interaction between apple and orange purchases is the same in both tables, but the chi-square statistic gets very confused by the double null transactions. That is why null-invariant measures for interestingness are so important when mining data patterns (another post to come soon explaining these measures).

Lift

2015-02-26T22:02:00.001-05:00

Lift is an objective measure of "interestingness" that has been used in various fields including statistics. It is also an option when we are data mining. In case you've run across this measure and don't fully understand it, let me give you a quick summary of what it is, how it's calculated and some of its properties.

Much like the properties of support and confidence, lift attempts to use a numerical calculation to determine how important a pattern or correlation is. If we use the table below as a simple example we can kind of see that people who buy oranges are less likely to buy an apple and vice versa (this is called a negative correlation), but in a data mining process we want to have an automated way to see this. Lift is one option to automate and determine this.

To calculate lift, first find the support for both items together (in this case 2/20 = 10%) and this will be your numerator. Notice that I used the "% of total" version of support. If you don't do this you will get a different answer and it will be wrong. For the denominator, multiply the total support for both items together (bottom left and top right total values); in this case that would be 8/20 * 7/20 = 0.14. The lift for buying apples and oranges together would be 0.10/0.14 = ~0.714 for this example.

So what does this 0.714 mean really? If the lift were to work out to equal exactly 1, then that would mean that there is no correlation at all between the 2 items. Since our lift is less than 1, it means that there is a negative correlation (that's what we could see with our own intuition before we started). If the lift turns out to be greater than 1, then there is a positive correlation. If you look at how lift is calculated you will notice that because all of the values that go into the fraction are positive counts (or fractions of positive counts), the value of lift always has to be positive (>0). It can get really BIG though. If I change the table above to have lots of transactions without apples or oranges, then I can get a really big number for my lift

If you do the same calculation we did above on this table, you get a lift of 357,143.32. This huge swing in the lift value is the greatest limitation of lift in data mining. The only thing I changed in the data I was analyzing was the number of transactions that didn't have apples or oranges. Intuitively this shouldn't make any difference whether the correlation is interesting or not. That is why data mining has developed other measures of interestingness that are called null-invariant. Null-invariant measures aren't sensitive to that lower right value in the table. Eventually I'll write a blog post about those and add a link here, but that won't be tonight.
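Here is a quick sketch of that calculation so you can play with the numbers yourself. The function name is mine. The small-table counts (2 together, 8 and 7 on their own, 20 total) are from the example above; the grand total of 10,000,013 for the big table is my assumption, picked because it reproduces the 357,143.32 I quoted.

```python
def lift(count_a, count_b, count_both, grand_total):
    """Lift using the "% of total" version of support for every term."""
    support_both = count_both / grand_total
    support_a = count_a / grand_total
    support_b = count_b / grand_total
    return support_both / (support_a * support_b)


small_table = lift(8, 7, 2, grand_total=20)
# Assumed grand total: the same 13 non-null transactions plus 10,000,000 nulls
big_table = lift(8, 7, 2, grand_total=10_000_013)

print(round(small_table, 3), round(big_table, 2))
```

The apple and orange counts are identical in both calls; only the grand total changed, and the lift went from below 1 to hundreds of thousands.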

Core Patterns of Colossal Patterns

2015-02-24T21:16:00.003-05:00

Professor Jiawei Han at the University of Illinois and some of his colleagues created a method of data mining to find what are called colossal patterns. "Colossal" in this context basically means really big patterns with lots of different items included. As an example of this, think about people working in biostatistics looking for large sequences of DNA that are interesting. The "Pattern-Fusion" method that they created has 2 parts to the algorithm.

1. Use a regular pattern generation method (e.g. Apriori, FP Growth, etc.) to create the first couple levels of patterns that are supported. A common level of complexity for this step to stop is when the algorithm has 3-item patterns (e.g. ABC), but this can be selected by the user when this algorithm is run.
2. Take these large-ish patterns and combine them together to make bigger patterns faster than incrementally looking for patterns that only have 1 more item.

The reason why this method is so valuable can be seen if you look at how many combinations (patterns) are possible as the number of items increases. The table below shows the number of combinations possible if you have 5, 10, or 15 items in each column, and pick 1-15 items in the rows.

If you're really only interested in finding the BIG patterns that the data supports, you don't really want to get bogged down in looking at 6,435 patterns that have 7 items in them (see the last column above). Plus, this tabular example above is VERY simplified, how many sequences are in a strand of DNA? Oh yeah! Old algorithms might get bogged down forever in that mess in the middle for that type of problem.
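If you want to regenerate a table like that one yourself, Python's math.comb does the counting. This is just a sketch of how I would lay it out (the variable names are mine); math.comb returns 0 when you try to pick more items than you have, which takes care of the empty cells.

```python
from math import comb

item_counts = (5, 10, 15)  # the three columns
table = {
    picked: [comb(total, picked) for total in item_counts]
    for picked in range(1, 16)  # the rows: pick 1 to 15 items
}

for picked, row in table.items():
    print(picked, row)
```

The row for 7 picked items ends in 6,435, and the numbers in the middle of the last column are the "mess" that Pattern-Fusion tries to jump over.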

Core Pattern and Pattern Robustness
Now that I've summarized the conceptual process, I'm going to explain core patterns and pattern robustness. Core patterns are the ones that the algorithm "likes" to use in order to create the BIG patterns it's really looking for. I got my butt kicked by these 2 concepts during my quiz for professor Han's Coursera course, so I figured I should probably take the time to really understand them and explain them simply for everybody else.

In Han's textbook, Data Mining: Concepts and Techniques, he says that...
"for a pattern α, an itemset β⊆α is said to be a τ-core pattern of α if |Dα|/|Dβ| ≥ τ, 0 < τ ≤ 1, where |Dα| is the number of patterns containing α in database D." emphasis added.

Say what?!? For those of you out there who are not fluent in mathspeak, I will translate. For a pattern named Beta to be a core pattern of pattern Alpha it needs to meet a couple requirements.

Criterion 1
Beta needs to be a subset of Alpha (β⊆α). In Venn diagram world this looks like the diagrams below. The line under that 'c' looking symbol means that β can actually be as big as α (kind of like a less-than-or-equal-to symbol), and therefore they can be the same set of items and still satisfy this criterion.

Criterion 2

That complicated equation needs to have a value greater than or equal to τ. What is tau, other than a Greek letter? It's just a value between 0 and 1 that the user of the algorithm picks so that there is a criterion to rate patterns against. If the calculated value in the equation is really high, we're more certain the pattern being considered is a core pattern; if it's really low...not so much. That explains tau, but now we need to understand the equation.

|Dα| is defined as "the number of patterns containing α in database D". Although the symbols might be confusing, this is actually relatively easy to calculate. Just count up all of the records, transactions, etc. in your database that have the pattern α. You need to make sure that you're looking at sub-patterns too. Say one transaction has ABDFE, and you're looking for ABF (this is α in this example), this transaction counts for this calculation. The same process is followed for β but based on the first criterion, β is going to be a smaller subset pattern of α or the same pattern as α. So in the quick example above, β might be A, AB, AF, BF, or even ABF. The whole point of this ratio is to give a measure for how important β is in the make-up of the bigger pattern α. If there are 100 records that contain α, and 250 records that contain β, then this core pattern "strength", as I'll call it, would be 0.4 (P.S. don't get confused by the Venn diagrams above here. Those diagrams are for the items in the patterns themselves, NOT for the frequency of the patterns in the database you're mining). If tau were set by the user as 0.25, then β would be considered a 0.25-core pattern of α.

Example
Now let's use an example to clarify this and show how to calculate core pattern robustness. Suppose we have a database with some patterns that have the counts as shown in the table below

Let's assume that we set tau to be equal to 0.5. So we'll need to do the core pattern calculation explained above for all of the subsets of each item in this list to see if they are a core pattern. For the first pattern, ACE, there are 10 instances in the database (this one is simple, but others are a little tricky). Now we figure out the counts for A, C, E, AC, AE, and CE in the database. A=20, C=40, E=10, AC=20, AE=10, CE=10, and we already know ACE=10. So, for each of these patterns we can calculate the ratio ACE/A=0.5; ACE/C=0.25; ACE/E=1; ACE/AC=0.5; ACE/AE=1; ACE/CE=1; and ACE/ACE=1. So based on our criteria, tau = 0.5, we have core patterns of A, E, AC, AE, CE, and ACE; C got cut out because it only had 0.25.
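The ACE walk-through above can be sketched in a few lines. The counts are the ones I just listed; the function name and the dictionary layout are my own choices, so take this as one way to do the bookkeeping rather than how Pattern-Fusion is actually implemented.

```python
# Database counts for ACE and each of its sub-patterns, from the paragraph above
counts = {"A": 20, "C": 40, "E": 10, "AC": 20, "AE": 10, "CE": 10, "ACE": 10}


def core_patterns(alpha, counts, tau):
    """Return {beta: ratio} for every sub-pattern beta with |Dα|/|Dβ| >= tau."""
    ratios = {
        beta: counts[alpha] / count
        for beta, count in counts.items()
        if set(beta) <= set(alpha)  # criterion 1: beta is a subset of alpha
    }
    # criterion 2: keep only ratios that reach the threshold
    return {beta: ratio for beta, ratio in ratios.items() if ratio >= tau}


cores = core_patterns("ACE", counts, tau=0.5)
print(cores)
```

C is the only sub-pattern missing from the result, because 10/40 = 0.25 doesn't reach 0.5.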

BCD on the next line is similar, but for BCD the count is 20 because BCD has 10 transactions in the 2nd row and 10 transactions in ABCDF in the 4th row; told you there were trickier ones. Follow the same process as the above paragraph for the rest of the calculations. If you do this, you should come up with a table that looks something like this.

As an exercise, try to figure out which patterns didn't make the cut. There are three of them, and I already showed you one of them. Obviously if the value of tau were set higher, we would have cut out a lot more patterns and the table would have a lot fewer entries in the last column.

Core Pattern Robustness

Now, we need to explain pattern robustness. If you have the information above already, pattern robustness is pretty easy to figure out. In mathspeak, an example of how robustness might be stated goes something like this "A pattern α is (d, τ) robust if d is the maximum number of items that can be removed from α for the resulting pattern to remain a τ-core pattern." (Han's textbook again). All that really means is that you're trying to answer the question, "how many items can I remove from the pattern I'm looking at and still have what's left over qualify as a core pattern?". For the first pattern in the table above, ACE, we have 3 items A, C, and E. If we take away 1 item, say A, we can see that CE is still a core pattern in the table. If we take away a 2nd item, C, E is still a core pattern. If we took away E, we'd have nothing so that's no good. By definition the null set, or nothing, can't be a core pattern. So for ACE we were able to take away 2 items and still have a core pattern. So, if we were to translate that example to mathspeak we would be able to say that pattern ACE is (2, 0.5)-robust. Using similar logic we could add another column to our table stating robustness of each of the rows. Please double check me...at the time of this writing it is getting late and my mental powers are waning. :)
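Once you have the list of core patterns, robustness is just a matter of finding the smallest one. Here's a minimal sketch for ACE with tau = 0.5, using the core patterns worked out in the example (A, E, AC, AE, CE and ACE). The function name is my own.

```python
def robustness(alpha, core_set):
    """d = the most items you can remove from alpha and still have a core pattern."""
    smallest_core = min(
        len(beta)
        for beta in core_set
        if beta and set(beta) <= set(alpha)  # the empty set never counts
    )
    return len(alpha) - smallest_core


# 0.5-core patterns of ACE from the worked example
core_set = {"A", "E", "AC", "AE", "CE", "ACE"}

d = robustness("ACE", core_set)
print(f"ACE is ({d}, 0.5)-robust")
```

Since single items like A and E are core patterns of ACE, you can strip away 2 of the 3 items, which matches the (2, 0.5) answer above.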

I hope this helps people out there understand something that took me quite a while to understand from just a couple of sentences in a textbook.

Closed Itemsets

2015-02-17T21:20:00.004-05:00

When we are looking for patterns in data sets, often the number of patterns that can be found can be huge. To demonstrate this I'll use a simple set of examples. Let's say that you have one transaction with items A and B (I know this wouldn't give you much information, but it proves my point). In this transaction there are 3 patterns/combinations: A, B and AB. Now let's say that the transaction has A, B, and C in it. This transaction allows for 7 patterns: A, B, C, AB, AC, BC, and ABC. If it had A, B, C, D, then there would be 15 patterns: A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD. You can see that the number of patterns that can be created grows MUCH faster than the number of items in the transaction. This growth pattern is based on the mathematical concept of combinations (wikipedia).
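You can let Python list the patterns for you with itertools.combinations. This is just a small sketch (the function name is mine) that reproduces the 3, 7 and 15 counts above; in general a transaction with n items has 2^n - 1 non-empty patterns.

```python
from itertools import combinations


def all_patterns(items):
    """Every non-empty combination of the items in one transaction."""
    return [
        "".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


for transaction in ("AB", "ABC", "ABCD"):
    patterns = all_patterns(transaction)
    print(transaction, len(patterns), patterns)
```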

Also, any data set that anybody would actually care about would have more than one transaction, probably thousands, if not millions or billions. This explosion of complexity that comes from all of the possible patterns that can be found in data is the motivation for ways of compressing the possible patterns/combinations. That is what closed itemsets are all about.

CLOSED ITEMSETS

Now I'm going to have to get a little technical here, but I'm going to translate for you too so don't worry. The definition of a closed pattern goes like this, "A pattern (itemset) X is closed if X is frequent, and there exists no super-pattern Y ⊃ X, with the same support as X" (J. Han, Pattern Discovery in Data Mining, Coursera Lecture Material)

Let's break this definition down. "itemset" is just a group of items. It could be one item (A), 2 items (AB), 3 items (ABC)... k items. People sometimes refer to itemset size by telling you what k equals. k is just a variable that represents the size/complexity of the pattern you're looking at. By "frequent", the definition is saying that this itemset, X, you're looking at meets the minimum support requirements, so it might be interesting. The whole bit in the definition about "super-pattern Y ⊃ X" basically means that all of X is contained in Y. In Venn diagram world, if Y is a super-pattern of X then it looks like this

super-pattern Venn diagram

In our example above, an example of this "super-pattern" notion would be like saying that ABC is a super-pattern of AB. The notation Y ⊃ X is a fancy mathematical way of saying the same thing. The way I think about a closed pattern is that it's the biggest, most complex, pattern I can find at various levels of support. It isn't exactly like that, but it will become clearer with an example.

Let's suppose we have a simple set of transactions like this

Simple transaction table

If you are looking for patterns that have support greater than or equal to 1, there would be a lot of combinations possible for just 2 transactions. Based on the work we already did above, there are 15 combinations for the first transaction. Transaction 20 also has 15 combinations, but there is some overlap in possible combinations in transactions 10 and 20. To visualize this, let's list them out in a table.

Combination Table

If you look closely at the combination table above, you can find some repeats if you compare transaction 10 and 20. To make this easier to see, I've rearranged the entries a little bit here.

Expanded Combination Table

The green highlighted combinations are found in both transactions. Based on this, there is support of 1 for every combination in the expanded combination table, except for C, D, and CD which have support of 2. This is where closed itemsets come in handy. If you remember the Apriori Principle, if a more complex combination meets the minimum support requirements, then the subset, or simpler patterns, will also meet those requirements. In this example, if I make the statement that I know that CD has support of 2, then I automatically know that C and D also have support of at least 2. In this case C and D have a support of exactly 2 so CD is one of my closed patterns in this data. Let's revisit the definition of a closed pattern to make sure that's true. There are 2 requirements for a pattern to be a closed pattern

It's frequent; yep CD meets that criteria
No super-pattern with the same support as CD; There are bigger patterns in the data, but none of them have support of 2 so we're good here too

What's so awesome about this is that I can say I have closed pattern CD with support of 2, and I automatically know that C and D also have support of 2 because of the Apriori principle. We can use this one closed pattern to compress three patterns without losing any information!

Now let's look for some other closed patterns in the data, so you can get the hang of it. The way I explained closed itemsets before was that they are normally the most complicated patterns that are frequent. In our table above we've got ABCD and CDEF to work with that seem to fit that description. So which one is a closed pattern? Both actually. Let's look at the 2 rules for ABCD

It's frequent; in our example we only need frequency of 1, so we're good here
No super-pattern with the same support as ABCD; CDEF has the same support as ABCD, but CDEF is not a super-pattern of ABCD.

The same logic is applied to CDEF. Going back to Venn diagram world the whole set of closed patterns in this data looks like this, where S is the support for each closed pattern.

Closed pattern Venn diagram

If you use the Apriori principle, and these 3 closed patterns, you can reconstruct the entire data table we started with without losing any information. Pretty awesome right?
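If you'd like to check the closed patterns by brute force, here is a sketch that counts every combination and then applies the definition directly. I'm reading the two transactions as ABCD (transaction 10) and CDEF (transaction 20), which is what the example implies; the function names are mine, and listing every combination like this is only sensible for toy data.

```python
from itertools import combinations

# Assumed contents of the two transactions in the example
transactions = {10: set("ABCD"), 20: set("CDEF")}
min_support = 1


def support_counts(transactions):
    """Count how many transactions contain each possible combination."""
    counts = {}
    for items in transactions.values():
        for size in range(1, len(items) + 1):
            for combo in combinations(sorted(items), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def closed_patterns(counts, min_support):
    """Frequent patterns with no super-pattern that has the same support."""
    frequent = {p: s for p, s in counts.items() if s >= min_support}
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and s == t for q, t in frequent.items())
    }


closed = closed_patterns(support_counts(transactions), min_support)
result = {"".join(sorted(p)): s for p, s in closed.items()}
print(result)
```

Out of the 27 distinct combinations in the table, only ABCD, CDEF and CD survive.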

MAX-PATTERN ITEMSETS

There is another way to compress pattern data called max-patterns. They are almost exactly the same as closed patterns, except that we don't care about whether other patterns have the same support. A max-pattern still has to satisfy the minimum support threshold though. The easiest way to explain this is by using the example above for closed patterns. In that example we have closed patterns ABCD, CD, and CDEF. CD is a distinct closed pattern because it has a support of 2 and the others only have support of 1. This is despite the fact that it is actually a subset, or part of, ABCD and CDEF. If you're looking for max-patterns you don't care about this distinction so you only get ABCD and CDEF as max-patterns; that's because CD is part of the larger patterns ABCD and/or CDEF. A max-pattern will give you more pattern compression, but you can see that even in this very simple example, you're losing some of the information that was originally contained in the raw data because now we don't know how much support CD, C, or D had.
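Here's a short sketch of that last step. It starts from the three closed patterns in the example and keeps only the ones that aren't part of a bigger pattern. Starting from the closed patterns (instead of all frequent patterns) is a shortcut I'm taking; it works because every max-pattern is also a closed pattern. The function name is my own.

```python
# Closed patterns and their support, from the example above
closed = {"ABCD": 1, "CD": 2, "CDEF": 1}


def max_patterns(patterns):
    """Keep the patterns that are not a proper subset of any other pattern."""
    item_sets = {name: frozenset(name) for name in patterns}
    return sorted(
        name
        for name, items in item_sets.items()
        if not any(items < other for other in item_sets.values())
    )


print(max_patterns(closed))
```

CD drops out, and with it goes the only record that C, D and CD had support of 2.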

Apriori Principle

2015-02-17T19:59:00.001-05:00

When making sure that all of the patterns in a set of data meet the minimum support requirements, we want to find all of the patterns that are supported, and not waste time looking at patterns that aren't. This seems simple, but in much larger data sets, it can become difficult to keep track of which patterns are supported and which aren't. The Apriori Principle helps with this.

To explain, let's use the data in this table and assume that the minimum support is 2.

We start by looking for single items that meet the support threshold. In this case, it's simply A, B, C, D, and E, because there are at least 2 of each of these in the table. This is summarized in the single item support table below

single item support table

Next, we take all of the items that meet the support requirements, everything so far in this example, and make all of the patterns/combinations we can out of them: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE. When we list all of these combinations in a table, and determine the support for each, we get a table that looks like this.

2 items support before filtering

Several of these patterns don't meet the support threshold of 2, so we remove them from the list of options.

2 item support table

At this point, we use the surviving items to make other patterns that contain 3 items. If you logically work through all of the options you'll get a list like this: ABC, ABD, ABE, BCD, BCE, BDE (Notice that I didn't list ABCD, or BCDE here because they are 4 items long).

Before I create the support table for these let's look at these patterns. The first one, ABC, was created by combining AB and BC. If you look in the 2 item support table (before or after filtering), you'll find that AC doesn't have the minimum support required. If AC isn't supported, a more complicated pattern that includes AC (ABC) can't be supported either. This is a key point of the Apriori Principle. So, without having to go back to the original data, we can exclude some of the 3-item patterns. When we do this, we eliminate ABC (AC not supported), ABD (AD not supported), ABE (AE not supported), BCE (CE not supported) and BDE (DE not supported). This process of removing patterns that can't be supported because their subsets (or shorter combination) aren't supported is called pruning. This pruning process leaves only BCD with a support of 2.

3 item support table
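The candidate generation and pruning steps above can be sketched like this. I'm starting from the five 2-item patterns that survived filtering (AB, BC, BD, BE and CD, which is what you get after dropping AC, AD, AE, CE and DE). The function names are mine, and this only shows the pruning logic; the survivors would still need their support counted against the original data.

```python
from itertools import combinations

# 2-item patterns that met the minimum support, per the walk-through above
frequent_pairs = {frozenset(p) for p in ("AB", "BC", "BD", "BE", "CD")}


def generate_candidates(frequent, size):
    """Join pairs of frequent patterns whose union has exactly `size` items."""
    return {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}


def prune(candidates, frequent, size):
    """Drop any candidate with a (size - 1)-item subset that isn't frequent."""
    return {
        cand
        for cand in candidates
        if all(frozenset(sub) in frequent for sub in combinations(cand, size - 1))
    }


candidates = generate_candidates(frequent_pairs, 3)
survivors = prune(candidates, frequent_pairs, 3)

print(sorted("".join(sorted(c)) for c in candidates))
print(sorted("".join(sorted(c)) for c in survivors))
```

The first line prints the same six candidates listed above, and the second shows that only BCD makes it through pruning.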

The final list of all of the patterns that have support greater than or equal to 2 is summarized here.

Takeaways:

1. The Apriori Principle can be used to simplify the pattern generation process when mining patterns in data sets
2. If a simple pattern is not supported, then a more complicated one with that simple pattern in it cannot be supported (e.g. if AC isn't supported, there is no way that ABC is supported)
3. You can also look at takeaway 2 in the opposite direction. If a complicated pattern meets the minimum support requirements, all of the simpler patterns that can be created from that complicated pattern must be supported. (e.g. if ABC is supported, then AB, BC, AC, A, B, and C all have to be supported)

Support and Confidence in Pattern Discovery

2015-02-12T21:15:00.001-05:00

When analyzing patterns in data, what we are really looking for are patterns that are interesting. There are subjective ways to determine if data is interesting, but data analysis can be sped up significantly by creating objective measures for "interesting". When looking for patterns, associations and correlations in data, many algorithms will use the objective measures of support and confidence. These concepts are easier to understand looking at an example.

Example: Suppose we happen to have a small sample of the transaction records from a grocery store. The transactions might look something like this:

A quick glance at this table shows that there seems to be a pattern of buying milk at the same time bread is bought, but let's figure out what the support and confidence for this is. The support for a pattern between milk and bread is the number of transactions where they show up together.

Basically this is the equivalent of an AND logic statement. So, for our example above the support for Milk U Bread would be 3. In larger datasets, it makes more sense to divide this number by the total number of transactions so we can get a feeling for what percentage of transactions have these items together. If we did that the support would be 60%. Support does not have to be just for 2 items; we can figure out support for 1 item alone (support for Salsa = 1, or 20%) or for several items together (support for Flour, Milk and Bread = 1, or 20%).

Confidence is defined as P(Y|X). If it's been a couple years since you were in a statistics class, that probably went over your head, but that's OK...don't be afraid. In math speak P(Y|X) means "probability of Y given X". In English it means, if you already know that you have something, say milk = X, in your transaction, then what's the probability that you also have something else, say bread = Y? So if you have milk...

What percentage of the transactions that have milk have bread also?

So to calculate the Confidence that if you have milk, you have bread too, just take the support for milk and bread (we counted 3 above) and divide it by the support for milk (it's 4 in our list). That gives a confidence of 75%. The mathematical notation for this association is Milk --> Bread (60%, 75%). The 60% is the support for the pair pattern, and the 75% is the confidence. Notice that this calculation can be done both ways. What if I know I have bread and want to know the confidence I have of there being milk also? The support for the combination of the two items is still 3, but the support for bread on its own is also 3. That means the confidence is 100% based on our data. So Bread --> Milk (60%, 100%). It's obvious from the numerical results that the Venn diagrams I drew above are definitely NOT to scale, but it makes it easier to visualize the difference between the different items and their intersection.
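Here is a small sketch of both calculations. The five transactions below are a hypothetical stand-in for the table in this post; I built them only so that they reproduce the counts quoted in the text (milk in 4, bread in 3, both in 3, salsa in 1, and flour with milk and bread in 1). The function names are my own.

```python
# Hypothetical transactions, chosen only to match the counts quoted in the text
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "flour"},
    {"milk", "bread"},
    {"milk", "salsa"},
    {"flour"},
]


def support(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for basket in transactions if set(items) <= basket)


def confidence(x, y, transactions):
    """P(Y|X): support for X and Y together divided by support for X alone."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


pair_support = support({"milk", "bread"}, transactions) / len(transactions)
milk_to_bread = confidence({"milk"}, {"bread"}, transactions)
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)

print(f"Milk --> Bread ({pair_support:.0%}, {milk_to_bread:.0%})")
print(f"Bread --> Milk ({pair_support:.0%}, {bread_to_milk:.0%})")
```

Swapping which item comes first changes the denominator, which is why the two confidences are different even though the support is the same.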

Purpose of this Blog

2015-02-10T20:20:00.000-05:00

To those of you who may stumble upon this blog, I want to describe the purpose of it. First let me introduce myself. At the time of this writing, I am a Senior Data Scientist at Carvana (an online retailer of used cars) and focus on vehicle purchasing analytics. I also created and run a business called Price Monster that provides automatic/dynamic pricing analytics for the self-storage industry. I've also spent about a decade of my career working in various large manufacturing companies including GE and Canadian Solar, almost always working in manufacturing quality. This gave me a heavy dose of Six Sigma training and the statistics understanding that comes with it. Even before joining GE, while getting my master's degree in mechanical engineering, I was fascinated by the power of various methods of learning from data, especially design of experiments and optimization methods.

The more I climbed the corporate ladder, the more I realized that I missed the awe and excitement that came from gleaning something important and powerful out of data that nobody could see before. I realized that I love to use the data generated from real life processes to create software/programs that help the organizations "see" what is really happening, and help them react to the issues that really matter. Being able to create this type of understanding is limited by the types of data analysis techniques one knows. The more I learn about data mining and associated topics, the more I want to understand all of the data analysis options that are out there so I can pull the right tool from the toolbox when the time comes to analyze a particularly tricky data set. That is what this blog is about.

I plan to use this blog to document what I am learning as I take various online courses, read books, and do research on the subject of data mining. I plan to make my posts simple. Having read some of the papers and taken some classes on the subject already, there are plenty of PhD's out there that teach at the level of their understanding. It may be easy for them to understand, but it doesn't make it clear for the rest of us. This blog is going to be my personal solution for that problem. I hope it helps me...and you. ;)