Simple Data Mining: April 2015

Saturday, April 25, 2015

How to Mine Frequent Patterns in Graphs with gSpan including a Walk-thru Example

In this blog post, I'm going to explain how the gSpan algorithm works for mining frequent patterns in graphs. If you haven't already read my introductory post on this subject, please click HERE to read the post that lays the foundation for what will be described in this post.

Getting Started
To explain how this algorithm works, we're going to work through a simplified data set so that we can see each step and understand what is going on. Suppose that you have 5 graphs that look like this

Don't get confused by the greek characters here. I used random symbols to show that this method can be used for any type of graph (chemicals, internet web diagrams, etc.) even something completely made up like what I did above. The first thing we need to do with this graph data set is count how many vertices and edges we have of each type. I count 7 alpha(α) vertices, 8 beta(β) vertices, and 14 lambda(λ) vertices. There are 15 edges that are solid blue, 13 that are double red, and 8 that are dotted green. Now that we have these counts we need to order the original labels for vertices and edges into a code that will help us prioritize our search later. To do this we create a code that starts with the vertices and edges with the highest counts and then descends like this

We could have used numbers instead of letters here, it doesn't really matter as long as you can keep track of the symbols' order based on frequency. If you relabel the graphs with this order you get something like this

You can see that the vertex labels now have A, B, C and each edge is labelled a, b, or c to match the new label legend we created.

Finding Frequent Edges
For this example we will assume that we have a minimum support frequency of 3 (i.e. 3 of the 5 graphs have to have the pattern for us to consider it important). We start by looking at individual edges (e.g. A,b,A) and see if they meet this minimum support threshold. I'll start by looking at the first graph at the top. There is an edge that is (B,a,C) and I need to count how many of the graphs have an edge like this. I can find it in the 1st graph, 2nd graph, 3rd graph, and 4th graph so it has support of 4; we'll keep that one. Also notice that I had the option of writing that edge as (C,a,B) as well, but chose to write it with B as the starting vertex, because of the order we established above. We move on a little and check (A,a,C) and find that it can only be found in the first graph; it gets dropped. We check (A,c,C) and find it in the first 4 graphs as well; keep it. If we continue going through all of the graphs being careful not to repeat any edge definitions we'll get a list of frequent single edges that looks like this

You can see that we only have 3 edges that are frequent (this will simplify our lives in this example greatly). We need to sort the edges that are still frequent in DFS lexicographic order (See prior blog post for explanation of this). After sorting we end up with this list of edges to consider going forward

Pay attention to the fact that I added the "0,1" to the beginning of these edge definitions. These are the starting and ending points of the edges. We'll "grow" the rest of the frequent sub-graphs from these edges in a way similar to the method described in my prior blog post on this subject.

gSpan starts with the "minimum" edge, in this case (0,1,A,b,B), and then tries to grow that edge as far as it can while looking for frequent sub-graphs. Once we can't find any larger frequent sub-graphs that include that first edge, then we move on to the 2nd edge in the list above (0,1,A,c,C) and grow it looking for frequent sub-graphs. One of the advantages of this is that we won't have to consider sub-graphs that include the first edge (0,1,A,b,B) once we've moved on to our 2nd edge because all of the sub-graphs with the first edge will have already been explored. If you didn't follow that logic, it's OK; you'll see what I mean as we continue the example.

You might notice that the last graph in our set of graphs doesn't have any edges that are frequent. When we're starting with edge (A,b,B) we keep track of which graphs have that edge present. In our example that's graphs 1, 2, 3, & 4. Now that we have that "recorded" we start the growth from (0,1,A,b,B) by generating all of the potential children off of the 'B' vertex.

Growing from the 1st Edge
Before we grow our first edge, you need to understand how gSpan prioritizes growth. Let's assume that we have a generic subgraph that looks like this already (don't worry about how we got it yet, it will be come clear in a second)

I have removed the colors and labels from the diagram above to explain how gSpan grows without confusing you with labels. The numbers represent the order the vertices were added. If you are at this point in the algorithm, you will try to grow this graph in the following order of priority.

So the first priority is always to link the current end vertex back to the first vertex (the one labelled 0). If that's not possible, we see if we can link it back to the 2nd vertex (the one labelled 1). If we had a bigger sub-graph, we would continue trying to link to the 3rd, 4th, etc. if it's possible. If none of those are possible then our next (3rd) priority is to grow from the vertex we're at (labelled 4 above), and grow the chain longer. If that doesn't work, we go back to the most recent fork, take another step back and then try to grow from there (see how 5 is growing from vertex labelled 1 in the 4th priority). If that doesn't work, then we take another step up the chain and try to grow from there (5th example above). The 4th and 5th examples above also generalize for longer chains. If there were several vertices above the closest fork in the chain, you would progressively try to grow off of each of those vertices until you get back to the root (vertex labelled '0').

With that understanding in mind, we can finally start growing from our first edge. Our first edge is (0,1,A,b,B). If we look at the priority explanation above, we can tell that the 1st and 2nd growth priorities are not really options, so we take the 3rd priority and try to grow from vertex 1, or the B vertex in our example. Since we only have 2 frequent edges that have 'B' vertices in them our two options are as follows:

Option 1 Option 2
(0,1,A,b,B) (0,1,A,b,B)
(1,2,B,b,A) (1,2,B,a,C)

Now we are going to check for support of these 2 options. We only need to look in the graphs that we determined contain the frequent edges as described earlier. When we look for Option 1 we find it in the following graphs left that we're considering:

Notice that we can find the first option in 3 different places in the first graph, but we can't find it in any of the other graphs, so the support is only 1. That's a dead end, so we check option 2

You can see that we can find this pattern in multiple places in all of the 4 graphs we have left (I didn't show all of the places it is found in the first graph). The support for this sub-graph is 4 so it's a keeper. Also, all 4 graphs we looked at have this sub-graph present so we keep those too. That ends that round of edge growth so we take option 2 from above and try to grow from it. Remembering the priorities of growth, we know we need to try to connect the last vertex (2 or 'C' in our example) to the root (0 or 'A' in our example) if we can. If we look at the list of frequent edges to choose from, we see an edge that is (A,c,C) which could be written as (2,0,C,c,A). This would make the connection we need. None of the other frequent edges meet this criteria, so we check to see if (0,1,A,b,B); (1,2,B,a,C); (2,0,C,c,A) is frequent

This time this sub-graphs occurs twice in the 1st, 2nd and 4th graph. It only occurs once in the 3rd graph. Regardless, the support count for this sub-graph is still 4 so we've got a keeper again. Now we have a graph that looks something like this.

Since we have already connected the last vertex ('C') to the root ('A') and to the vertex after it ('B'), our next priority is to grow from the last vertex we created. The frequent edges we can try to add onto 'C' are (B,a,C) and (A,c,C). However, since we're working from 'C' they would be better written this way: (2,3,C,a,B) and (2,3,C,c,A). We'll start by checking if adding (2,3,C,a,B) works because it is the lowest in terms of DFS lexicographic order. Here's what we find.

As it turns out, this 4 edge pattern can only be found in 2 of the 4 graphs we have left so that doesn't work. Now we'll check if (2,3,C,c,A) works out.

Nope, that's only in the first graph so it's not frequent either. Since neither of these worked out, we need to go to our next priority (4th) which is growing from the vertex a step up from the last attempted branch. When you look at the sub-graph we've grown so far you can tell we need to try to grow from 'B' now. That being said, we take a look at our options for frequent edges that contain 'B' vertices and we see that we should try to add (B,a,C) or (B,b,A). Here's what trying to add (B,a,C) looks like in this scenario

Let's see if we can find that in any of the graphs

As it turns out, we CAN find this sub-graph in 3 of the 4 graphs we have left so that's a keeper. At this point in time I want you to notice that I am going to ignore the other edge that could grow from 'B' until later. This is because gSpan is a depth first algorithm, so once you get success you try to go deeper. When we hit a dead end, we will come back to all of these straggling options to see if they actually work. If you're familiar with computer programming, this type of algorithm is implemented with recursion. Our next step based on our priority rules is going to be to try to link our vertex 3 ('C') to the root, or the 'A'. Again the only frequent edge we can choose from to do this is (C,c,A). This option would look like this

At this point, remember that the last sub-graph that we found that was frequent could only be found in the first 3 graphs in our set, so now we only have to look for this new graph in those three graphs

This only exists in one of our graphs, so we go to our next growth priority. Since there is already a link between our 'C' at vertex 3 and the 'B' and vertex 1, the 2nd priority isn't an option, so we try to grow/extend from the 'C' at vertex 3 again. We can either add (C,a,B) or (C,c,A). Adding (C,a,B) would look like this

If we look in the 3 graphs we have left we find this

That's a dead end because we only have 1 graph with this 5 edge sub-graph. We back up and try adding (C,c,A) instead. I'm not going to do the visuals for this, but it can't be found in any of the graphs we have left either. Since we hit a dead end again here, we go to to our 5th priority growth option: growing from the root again. If we try to grow from the root 'A' we can either use (A,b,B) or (A,c,C). Starting with (A,b,B) we would get a sub-graph that looks like this

This can't be found in any of our graphs, so we try adding (A,c,C) as our last ditch effort

This can't be found either, so we have reached the end of this depth first search. Now we need to go back and take care of the straggler options that we didn't pursue because we were going for depth first. The last straggling option we left was when we were considering adding (B,b,A) to vertex 1.

Remember, at that point in time, we were still looking at 4 of the 5 original graphs so when we look for this sub-graph in those 4 graphs this is what we find.

We only find it in one graph, so that turns into a dead end too.

This Section Until "2nd Edge (A,c,C)" is a Correction (Thanks to Elife Öztürk)
If we look back further in our algorithm we find that we also haven't checked all of the growth priorities off of the first frequent edge (0,1,A,b,B). Our first growth came from using the 3rd priority growth strategy, but we haven't checked the 4th/5th priority grown strategies there yet. If we do, we can see that it might be possible to make a link like (0,1,A,b,B) and (0,2,A,c,C). Let's see if we can find that in the graphs

We can see that in all four of our remaining graphs so we'll keep this sub-graph.

Our first priority for growth on this sub-graph would be to connect back to the root, but we don't have any double bonds (sorry for the chemistry reference) in our example so that won't work. The 2nd priority is also impossible, so let's check the 3rd. Can we grow an edge off of the C node? Our 2 options would be (2,3,C,a,B) and (2,3,C,c,A). Adding (2,3,C,a,B) looks like this.

When we search through the graphs, we find this in 2 places

Notice that I'm not counting the instances in the 2nd and 3rd graphs. This is because the only way you can get find this subgraph in those graphs is to wrap the graph such that the 1 B node is the same as the 3 B node. The reason why we number the nodes is to make sure they're unique, plus we already found the sub-graph that creates that ABC triangle previously...thus the beauty of the algorithm :). Since there are only 2 graphs with this sub-graph, the minimum support of 3 is not met.

So now let's check for the other growth option (2,3,C,c,A)

That can only be found in one of the graphs

So now we've exhausted our growth options off of the first edge (A,b,B).

2nd Edge (A,c,C)
Before you go crazy thinking that the other 2 edges we need to check will take just as long, let me tell you that we have already done WAY more than half of the work on this problem. Now that we have exhaustively checked all of the graphs for frequent sub-graphs that contain the edge (A,b,B), we can trim down our search space significantly. What I mean is we can delete edges in graphs that contain (A,b,B) and only look for frequent patterns in what is left. If we do this, we get 4 graphs (remember we already got rid of graph 5 because it contained no frequent edges) that look like this.

These graphs are much simpler. I also wanted to point out that I removed any edges from these graphs that were not frequent. We could (should) have done this earlier when we were doing our search on the first edge, but I didn't want to complicate things at that time. Now, much like last time, we start with edge (0,1,A,c,C) and try to grow from 'C'. We can either use (1,2,C,a,B) or (1,2,C,c,A). If we add (1,2,C,a,B) we can find that sub-graph in the following places

All 4 graphs have this sub-graphs, so it's a keeper and we'll go with it (remember we haven't checked adding (1,2,C,c,A) yet. The next step is to try to grow from the last vertex back to the root. If we did this we would have to use the edge (B,b,A), but that edge has been removed from all of our graphs because we have already mined all of the frequent sub-graphs with (A,b,B) in them. So, our only other growth option is to try to extend the chain from 'B'. The only option left to grow from 'B' is (B,a,C) because (B,b,A) has already been checked. So we get a sub-graph like this.

This can be found in the following graphs

That's 3 out of the 4 graphs, which meets our support requirement so we'll keep it. Our first growth priority from here is to link node 3 ('C') back to the root. That would be adding a (3,0,C,_,A) link. The only link possible for this in our data set is (3,0,C,c,A) which would give us this subgraph to look for

The only place this subgraph is found is in our 2nd example:

So, that's a dead end. If we back up, our next growth priority would be to connect the C at node 3 back to the C at node 1. A quick look at the graphs we have tells me that's not going to work. The next priority after that is to add another node after node 3. We could look for a subgraph with C,a,B added to the end that looks like this.

However, that subgraph is only found here:

So, we back up again and try our 4th priority, which is growing a link off of the B at node 2. A quick look at the graphs above show that won't work. Lastly, we'll try to grow further up the graph, like a new node coming off the C at node 2 or a new node off the A at node 0. There's a couple options for growth off of node 2, but I can tell it's only supported by our first graph. There are no additional options for growth from the root node 'A'. so we're done exploring this graph branch.

Now we need to step back to when we had the graph (0,1,A,c,C), (1,2,C,a,B) and try to grow again from there. The last priority for growth for this subgraph is to grow from the root. The only way we can grow from the root is to add (A,c,c), which would look like this.

Checking the four graphs we have left, we only find this sub-graph in our 2nd graph

Since this is only found in one place, it's a dead end and we need to backtrack to other options we didn't pursue earlier. We only had one of those when we were adding our 2nd edge to (A,c,C). So when we go back and try to add (1,2,C,c,A) to (0,1,A,c,C) we look at our graphs and find that it is only present in 2 of our graphs

As it turns out these are all of our options starting with edge (A,c,C). See, I told you the 2nd edge would go faster than the first.

3rd Edge (B,a,C)
Now we have to check the last of our frequent edges to see if there is any way that we can grow a frequent sub-graph with just this edge. Let's start by removing all of the (A,c,C) edges from our graphs to see what we're left with.

I could go through all of the same procedure I have been using for these remaining graphs, but it gets so simple at this point that there really isn't a reason to. If you're using a program runing the gSpan algorithm, it will be disciplined and do the whole search though. From what is left, we can see that (0,1,B,a,C); (0,2,B,a,C) is frequent, but anything larger doesn't work out. We hit our last dead end and now we can report all of the sub-graphs that we found that were frequent.

I have them organized by the edge they were grown from. The last thing we have to do is translate our DFS codes back to the original symbols. This is easy enough and the user would get an output like this.

That's all there is to it. I know that this post got a little long, but for me, being able to see the whole problem worked out helps me learn a TON. I hope it helps somebody else out there too. I did gloss over some details of the programming, etc. but this is essentially what's happening in gSpan. Let me know if you have any additional questions.

Saturday, April 18, 2015

Simple PageRank Algorithm Description

This blog post will give a simple explanation of the original PageRank algorithm This is the algorithm that differentiated Google from the web search incumbents in the late 1990's (e.g. Yahoo!, Altavista, etc.) and helped to make it what it is today. I will be using material I learned in ChengXiang Zhai's "Text Retrieval and Search Engines" Coursera course, as well as the original technical paper on PageRank by Sergei Brin and Larry Page entitled "The PageRank Citation Ranking: Bringing Order to the Web", 1998.

For this blog post I'm going to use a VERY simplified example of the web so that we can work through it and understand what is happening. The example below has only 6 web pages, but has the complexity we need for our discussion.

We'll consider each of the items in the network (e.g. A, B, C) to be webpages. PageRank can actually be used for other things like social networks, etc., but we'll stick with the web page version for this post. Each black arrow represents a link from where the arrow starts to where the arrow ends. For example, webpage B has a link to webpage A, and that's it. Webpage A has links to webpage B, and D. These links, and the somewhat hidden information that they contain, are the most important part of the PageRank algorithm.

Before PageRank, there were actually lots of people that tried to use the information from these links to help with search. These mostly focused on the count of how many links a webpage had pointing to it. If we use this type of thinking page A seems to be the most important with 4 links to it, C and D come in 2nd place with 2, B comes in 3rd with 1 and E and F come in dead last with 0. This may seem to make sense, but there is something interesting that this method misses. Since page A has so many links, it's pretty obvious it is important, but since it's important, the links that page A has to other pages are more important that other links in the network. PageRank takes this into account.

Imagine you are web-surfing in the diagram above. Say you pick a web page at random and then read it, find a link on that webpage and click it, read that webpage, find a link on that webpage and click it...lather, rinse, repeat...until you get bored, or stuck. At this point you decide to not follow a link from the page you're on, but to just pick a webpage at random from the web. Once you're on this new random webpage you start reading and clicking on links all over again. If you did all of this, this is essentially what the PageRank algorithm does. It follows this type of random web-surfing and then calculates the probability of arriving at all of the pages in the web.

Now that I've explained that process conceptually, lets get into the math that matches this type of random web-surfer behaviour. If we start by creating a grid/table where we put a 1 for links from one webpage to another, we'd get something like this

This is helpful because now we can do some analysis on this table, but it's not really what we need. If you're at a webpage, we want to be able to determine how likely you are to pick one of the links on the webpage. If there is only one link on the webpage, this is easy; 100%. If there are 2 or more, the PageRank algorithm just divides this probability up evenly. If there are 2 links, each link has a 1 out of 2 chance of getting clicked (as a side note, in my opinion it would be more interesting if these probabilities were altered based on where the links were on the screen, or some other characteristic like font size/color). If we slightly alter our table above to give these probabilities, instead of just the links, it will look like this

This matrix is called the transition matrix (later we'll refer to this matrix as M). As mentioned above when describing the random web-surfer, there are 2 parts to this algorithm: clicking links and jumping to pages without links. The transition matrix is useful for the link clicking part of the calculation. Let's say that we start at time t=0 by randomly selecting one of the webpages to start from (each page has a 1 out of 6 chance of being the page we start on) and we want to figure out the probability that our next clicked link will take us to webpage A. To do this we would calculate this

For our example this equals 1/2, 50%, or 0.5 (however you want to write it). The method can be generalized to all of the webpages after we pick a starting webpage. The generalize version of this equation looks like this

Don't freak out, I'm going to explain what all the symbols mean. The part of the left of the equal sign is just saying that you're going to be calculating the probability that the next click (t+1) is going to land on page dj. 'j' is a subscript that helps you keep track of what page you're calculating the probability for. In our transition matrix above, if j=3, then dj = C. The right side is a summation starting at 1 and going to N, where N is the number of pages in your web (in our case N=6). Mij is the element from the transition matrix in row i and column j. pt(di) is the probability at the current time that you have arrived at page di. Since we're still on our first step (t=0), we set pt(di) equal to 1/6 in our example. Now we can calculate the probability of being on all of the pages in our web (not just A) at t=1. The result looks like this

You can see that because there are no webpages that point to pages E and F, we have 0 probability of being on those pages after we've clicked just once. The PageRank algorithm gets to rankings of all of the pages after many repetitions of what we just did. I'll walk you through another step (t=1 to t=2) so you get the feel of it. Let's calculate the probability that we're on page A at t=2.

This is very similar to the step from t=0 to t=1. The first value in each multiple is from the transition matrix (M) and the 2nd value comes from the probability of being at each page at time t=1. Once you get the hang of this, it can be repeated quickly and converges to a final answer very quickly. Here's how our web diagram converges.

When we get to about 21 cycles, we see that nothing is really changing and now we have values that can tell us how important each of the webpages are. Based on this example, page A is most important, then pages B and D tie for 2nd. This is not what we got when we just counted how many links each website had. Page B only has 1 link, but it is linked by page A which is the most important, so that link counts for more. Remember though, that this isn't the whole page rank algorithm. We have to add in the random jumping to a different page now. To do this we have to set a probability that, at any given time, we'll "get bored" and jump to another page. Let's say that we have a 15% chance of getting bored on the page we happen to be on. We'll call this value alpha (α=15%)

Here you can see that the first term in the probability is basically the same, but that it get's multiplied by (1-α). This is because we need to discount this probability by the amount that we get get bored and randomly pick a page. The 2nd term is adding in the probability of randomly jumping to another page. The 2nd term could be simpler; it could just be α*(1/N). It is written the way it is because this equation can be turned into a matrix math problem, which I'll show you later. For not just go with it.

If we start over at t=0, using this new form of the equation, we can calculate a couple examples and show you how this equation converges as well. If we want to calculate the probability of being at webpage A at t=1 we would do the following:

You can see that in the first term in the first line, I just used the value we calculated in our prior example without jumping. I also expanded out the second summation so that it is easier to see. The next 2 lines substitute the actual values into the equation and simplify to an answer. If you follow this calculation for the other webpages you get this.

I'll walk you through one more step and then show you the whole table and how it converges to a final answer. If we want to calculate the probability of being at webpage A at time t=2 we get this

The first thing I want you to notice is that I couldn't just use the value from the old method for the first summation. This is because the probability of being at webpage A at t=1 has changed. So, we get to do all of the math this time. Remember that the M terms here come from the transition matrix. Because we're calculating the probability of being at webpage A, we take the values from column A of the transition matrix. The rest is basically the same as the last example. We can do this for all of the webpages (I did it in Excel for this post) and get the table below

You can see that this table converges where values really aren't changing any more by the time it gets to t=17. The decision of when to stop iterating is based on a user defined variable. In the original PageRank white paper, the authors suggest that you loop through 't' until a value, delta (δ) gets smaller than this user defined value epsilon (Ɛ). To calculate delta you need to know this.

The first line shows how to calculate delta, but if you're anything like me, you may have never been exposed to an "L1 Norm" before. Each of the R terms inside the double bars is a vector containing the probabilities we've calculated for all of the webpages at the given time. So R1 = {0.45, 0.095833, 0.2375, 0.1666, 0.025, 0.025} using the clicking and jumping values from our most recent example. R2 would be {0.389792, 0.21625, 0.117083, 0.226875, 0.025, 0.025}. To calculate our delta value we first have to subtract each term in R1 from R2, then we will take the L1 norm of the result. R2 - R1 = {-0.06021, 0.120417, -0.12042, 0.060208, 0, 0}. Now we need to calculate the L1. This is shown in the last equation in the list above. Basically, you just take the absolute value of every term in the vector (R2-R1) and add them all up. For our example from t=1 to t=2, we get delta equal to 0.36125. If I add this delta value to the last column of the convergence table you can see how this value gets smaller as we iterate.

I promised earlier I'd show you how the equations to calculate page rank can be turned into a matrix equation. If you're familiar with linear algebra, this will be interesting to you. The other MAJOR advantage of having equations in matrix form is that the processing time in a computer is MUCH faster for matrix math; this is because ridiculously smart people have optimized matrix math routines in most math libraries so that they get the right answers while minimizing processing time. We want to be able to stand on the shoulders of these giants. So, here's how the equations can get morphed into matrix equations

If you know how to use matrices in Matlab, a programming language, Excel, etc. This last equation can be used to efficiently calculate each iteration.

Before I end this post, I need to explain one more thing. If our web diagram happened to look like the diagram below we would have a problem

The only thing that I changed is that the arrow that was going from F to C, is now going from C to F. The reason why this is a problem is that F isn't pointing to anything. If we randomly select F as our starting point, how can we click on a link to move forward??? We can't. A simple way to solve this problem is for there to be an exception in the PageRank algorithm. If the page has no links to other pages, we don't allow the option to click on a link, we force the algorithm to randomly jump to another page. In essence, in this special case, we set alpha equal to 1. This keeps the algorithm from getting hung up in a dead end.

Well, I think that wraps it up. Let me know if you have any additional questions about PageRank that I can answer for you. As always, I hope this helps somebody out there!

Pages

Saturday, April 25, 2015

How to Mine Frequent Patterns in Graphs with gSpan including a Walk-thru Example

Saturday, April 18, 2015

Simple PageRank Algorithm Description