Saturday, May 2, 2015

How to Deal with Mixed Data Types when Performing K-means Clustering

When we look at a group of dots on a page, most of us can intuitively put these dots into groups that appear to "belong" together because they seem to be close to each other.  In data mining we call these clusters.  Clustering with the human eye works pretty well for problems in 2 or sometimes 3 dimensions.  As an example, if you were to look at the chart below, how would you cluster/group the points together?

I think that most of us would group the 3 points in the upper left corner together, and then group the points on the right side together.  Or maybe you'd group the upper left together, and then create 2 groups on the right side (one for the upper 2 dots, and one for the lower 2).  As you can see, this practice is not really definitive, but there is a lot of information to be gained from data sets by grouping data points together.  Therefore, data mining seeks to make this process of finding the best clusters objective so that we can perform this analysis on multi-dimensional data with little to no human interaction.

Sometimes our data is not purely numerical.  We may have to deal with data that has categories like male/female, ethnic background, etc.  Or, maybe our data set has ordered categories (ordinal data) like t-shirt size (small, medium, large, extra large), or relative temperature (cold, warm, and hot).  We would like to still be able to cluster our data points, even though all of the data types don't fit the same cookie cutter mould.  To explain this, I'm going to start by showing a simple example of the k-means algorithm with just numerical data.  Then, we'll add in a category and an ordinal data type to see how we can perform the same process with different data types.

Since we already have a simple 2-dimensional set of data points above, we'll stick with that example and use the k-means algorithm to solve for the best clusters.  The 'k' in k-means is a user defined variable that controls how many clusters we're going to look for.  To keep things simple we'll set k=2, which means that when we're done, we'll have the data points in the example split and assigned to 2 groups.

To mathematically define these 2 clusters, k-means uses 2 points in space that aren't on the graph yet.  These points are called cluster centers, or centroids.  Think of a centroid as the center of gravity of all of the points assigned to it.  When we're all done, we want the distance between these centroids and the points assigned to them to be minimized.  If we do that, we'll get centroids that have data points tightly packed around them, which is intuitively what we are looking for when we are doing this by eye.  The original k-means algorithm starts by randomly picking values for these cluster centers and then iteratively improves their position to minimize the distance as described above.  For our example, I have randomly placed 2 cluster centers (red and green).  These cluster centers are also referred to as seeds.  The red seed is located at (2.2087, 4.1076) and the green seed is located at (4.8579, 0.3341).



The question is how we will define "distance" from the data points to these seeds.  There are multiple ways to define the distance between 2 points.  To visualize a couple of them, let's say that we have randomly picked two cluster centers (in red and green below) and we're trying to calculate the distance to one of the other blue data points.


A very simplistic way of calculating the distance between 2 data points is to use the "Manhattan" distance.  Think of this as how far you would have to walk to get somewhere in Manhattan, New York where it's all city blocks and you can't just jump over or walk through buildings.  This is also called the L1 norm.  I described how to calculate this in my PageRank blog post.  If you're curious about it you can learn more there.

Another way of calculating distance is called the Euclidean distance.  This is like the saying "the shortest distance between 2 points is a straight line".  To calculate this distance we use the square root of the sum of the squares.  If we have 2 points in space where point 1 = (x1,y1) and point 2 = (x2,y2), then the Euclidean distance between them looks like this


There are actually several other distance metrics we could use, but since it is fairly simple and it was the one used in the original paper on k-means, we'll stick with the Euclidean distance in our example.
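If you like to see things in code, here is a minimal Python sketch of both distance measures (the function names are my own choices, not from any particular library):

```python
import math

def manhattan_distance(p, q):
    # L1 norm: the sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean_distance(p, q):
    # L2 norm: the square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance from the data point (1, 5) to the red seed at (2.2087, 4.1076)
print(euclidean_distance((1, 5), (2.2087, 4.1076)))  # about 1.50
```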

Now that we know how we're going to calculate distance and we know where our seeds are located, we can start the iterative process of finding the best location for the centroids.  The process goes like this (I'll give a short code sketch of this loop right after the list)

  1. Calculate the distances between all of the data points and all of the cluster centers
  2. Assign each data point to the cluster center closest to it
  3. Calculate the average x and y values for the points assigned to each cluster center
  4. Set each cluster center's location to be equal to the average of the points assigned to it
  5. Check to see if any data points have been assigned to a different cluster center
    1. If so, go back to the top and repeat
    2. If not, then end the algorithm, you've got your answer
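Here is a rough Python sketch of that loop, assuming 2-D tuples of numbers and the euclidean_distance helper from the earlier sketch (it actually works for any number of dimensions).  It's just to make the steps concrete, not production code:

```python
def kmeans(points, seeds, max_iters=100):
    centers = [tuple(s) for s in seeds]
    assignments = [None] * len(points)
    for _ in range(max_iters):
        # Steps 1 & 2: assign each point to its nearest cluster center
        new_assignments = [
            min(range(len(centers)),
                key=lambda c: euclidean_distance(p, centers[c]))
            for p in points
        ]
        # Step 5: if no assignment changed, we're done
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Steps 3 & 4: move each center to the average of its assigned points
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # empty clusters are discussed later in the post
                centers[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centers, assignments
```

Calling kmeans(points, [(2.2087, 4.1076), (4.8579, 0.3341)]) with the seven data points from the chart should walk through the same iterations we do by hand below.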

If you didn't follow that, it will become clear as we work through the example.  For step 1, I'll show you a couple of the calculations by hand and then jump to a table of all of the distances.  If we take the data point at (1,5), the distance from that point to the red seed at (2.2087,4.1076) is


You can see that I also gave you the answer for how to calculate the distance from (1,5) to the green cluster center (4.8579, 0.3341) in the image above.  At this point, I'm going to assume that we can all replicate this until we're sick of it, so I'll just give you the rest of the answers in this table
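If you'd rather have Python fill in that table, a quick sketch using the helper from earlier (the points are the seven from the chart; their order here is arbitrary):

```python
points = [(2, 3), (1, 5), (3, 5), (7, 5), (7, 2), (8, 4), (9, 1)]
red_seed, green_seed = (2.2087, 4.1076), (4.8579, 0.3341)

for p in points:
    d_red = euclidean_distance(p, red_seed)
    d_green = euclidean_distance(p, green_seed)
    closest = "red" if d_red < d_green else "green"
    print(p, round(d_red, 4), round(d_green, 4), closest)
```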


Now that we have that completed we can assign the points to their nearest cluster center.  The following points get assigned to the red cluster center: (2,3); (1,5); (3,5); (7,5).  All the rest get assigned to the green cluster center.



Now we need to take the average values of the points assigned to each cluster center.  For the red center, the average X value is (2+1+3+7)/4 = 3.25 and the average Y value is (3+5+5+5)/4 = 4.5.  For the green cluster, the average X value is (9+8+7)/3 = 8 and the average Y value is (1+4+2)/3 = 2.3333.  Now we assign these average values to be the new locations of the 2 cluster centers.  So the new location of the red cluster center is (3.25, 4.5) and the new location of the green cluster center is (8, 2.3333).

Since this is the first iteration, it's obvious that we have changed the assignments of the data points to the cluster centers (they were previously unassigned).  So, we have to repeat the process.  With our new cluster centers of (3.25, 4.5) and (8, 2.3333), the distances between the points and the cluster centers looks like this


When you look to see which cluster is closest to each data point, you can see that some of the assignments are going to change.  The data point at (7,5) is now closer to the green cluster center than the red cluster center.


Now we take the average values of all of the data points assigned to red and green.  The new location of the red center is (2,4.3333) and the new location of the green center is (7.75, 3).


If you do the distance calculations again and assign the points to their nearest centers, as we have in prior iterations, you'll find that none of the data points get assigned to a different center even though the locations of the centers moved a little bit.  We now have our final answer.  The red cluster center is located at (2, 4.3333) and points (1,5), (2,3) and (3,5) are assigned to the red cluster.  The green cluster center is located at (7.75, 3) and points (7,2), (7,5), (8,4), and (9,1) are assigned to the green cluster.

In a nutshell, that's how k-means works.  It can be expanded to many many dimensions using similar calculations.  For 'k' different dimensions the Euclidean distance calculation looks like this


Don't get confused by the letters in this formula.  You just keep adding squares of differences for each dimension you're trying to compute the distance for.  So if you're trying to calculate the distance between 2 points in 4 dimensional space like (1,2,3,4) and (10,9,8,7), you would get

Once you adapt this to your problem, the rest of the algorithm is basically the same.
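For what it's worth, the euclidean_distance sketch from earlier already works in any number of dimensions, so the 4-dimensional example above is a one-liner (the answer is the square root of 164, or about 12.81):

```python
# sqrt((1-10)^2 + (2-9)^2 + (3-8)^2 + (4-7)^2) = sqrt(81 + 49 + 25 + 9) = sqrt(164)
print(euclidean_distance((1, 2, 3, 4), (10, 9, 8, 7)))  # about 12.806
```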

The selection of the seed cluster centers is also very important.  As I mentioned, I randomly picked the values in this example.  If we randomly pick different values, depending on the "shape" and structure of the data, we can get different answers.  The originator of the algorithm suggested that you just run the algorithm several times with different random starting values and then pick the answer that minimizes the total distance between cluster centers and their assigned data points.  The k-means++ algorithm improves on this, but that's not part of the scope of this post.
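Here's a hedged sketch of that restart idea, reusing the kmeans function from earlier.  The seeding scheme (uniform draws from the bounding box of the data) and the run count are my own choices, not anything prescribed by the algorithm:

```python
import random

def total_distance(points, centers, assignments):
    # Sum of distances from each point to its assigned cluster center
    return sum(euclidean_distance(p, centers[a])
               for p, a in zip(points, assignments))

def kmeans_restarts(points, k=2, runs=10):
    best = None
    for _ in range(runs):
        # Random seeds drawn from the bounding box of the data
        seeds = [tuple(random.uniform(min(c), max(c)) for c in zip(*points))
                 for _ in range(k)]
        centers, assignments = kmeans(points, seeds)
        score = total_distance(points, centers, assignments)
        if best is None or score < best[0]:
            best = (score, centers, assignments)
    return best
```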

The last thing I need to tell you before we change topics is what you need to do if for some reason no data points get assigned to one of your cluster centers.  If this happens, basically, the cluster center is too far away from the data.  One simple way to deal with this is just to randomly assign a new location to that cluster center again and continue the algorithm.  This might make the algorithm wander around a little more looking for the solution, but it's better than only getting 1 cluster center when you were looking for 2.

Dealing with Different Data Types
Now what do you do if your data set looks like this?


Before you answer "uuuhh... give up?" read the rest of this post.

I know that there are some of you out there that thought "white and black aren't colors; they're shades" when you looked at the table above.  OK Mr. Smarty Pants, you're right but let it go for now.

The most important part of adding in different data types is figuring out a way to measure "distance" between these variables.  We'll start by describing how this can be done with the size category above.  Since size has a sort of intuitive order to it, where medium is bigger than small and large is bigger than medium, we can create a pseudo-distance fairly easily.  The easiest way to think of this is to just make the lowest value ("small" in our case) equal to 0 and the highest value ("large") equal to 1.  Then we just divide up the space between 0 and 1 equally to fit in any other values we have in our list.  Since we only have 3 values in our example, this is easy and we set "medium" to 0.5.  If you have a longer list of values, you can use this general equation to assign values to your labels between 0 and 1.

In this equation, Zi is a pseudo-location like an X or Y value.  The ri term is the rank of the label in the list you're trying to convert, and M is the total number of labels in the list.  So, for our example, M=3; if I'm trying to calculate Zi for "medium" then ri is equal to 2, ri for "small" would be equal to 1, and ri for "large" would be 3.  Take some time to plug those numbers into the equation above to make sure you can see how this gives you the values we already intuitively assigned in the paragraph above.
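In Python, that mapping (rank ri out of M labels goes to (ri - 1)/(M - 1)) is a couple of lines; the list of labels just has to be given in its natural order:

```python
def ordinal_to_number(label, ordered_labels):
    # Map the ranks 1..M onto the interval [0, 1]: z = (r - 1) / (M - 1)
    r = ordered_labels.index(label) + 1
    M = len(ordered_labels)
    return (r - 1) / (M - 1)

sizes = ["small", "medium", "large"]
print([ordinal_to_number(s, sizes) for s in sizes])  # [0.0, 0.5, 1.0]
```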

Now that we have these "location" values for size, we can treat the size variable just like the X and Y variables.  We'll have to modify X and Y soon too, but there'll be more on that later.

Now how do we deal with the pesky color value?  Some of you out there may think that color could be done the same way as size because we can use the light frequency spectrum values for order.  Let's just assume instead, for demonstration purposes, that the colors come from a color wheel with no beginning and end.  That means we have to treat them like categories.  For categories, we don't transform the values ("white", "red", or "black") into numbers, but we use a different distance calculation.  We say that if 2 points have the same color, the category distance between them is 0.  If the 2 points have different colors, the category distance between them is 1.  So the distance between "white" and "white" is 0, between "white" and "red" is 1, and between "white" and "black" is 1 too.  You can see that it is more difficult to be "close" to a point here if there are multiple values to choose from in a category.
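That matching rule is about as simple as a distance function can get:

```python
def category_distance(a, b):
    # 0 if the two labels match, 1 if they don't
    return 0 if a == b else 1

print(category_distance("white", "white"))  # 0
print(category_distance("white", "black"))  # 1
```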

You may have noticed that our transformations of size and color both have a scale from 0 to 1.  If we just leave the values of X and Y the way they are, their distances will appear much larger than the distances for size and color just because of their scale.  To solve this problem we can rescale X and Y to a range close to 0 to 1.  Once we do this, all of the different variables (X, Y, size, color) will get roughly the same distance weighting.  A simple way to scale numerical variables like X and Y is to compute the Z-score.  The Z-score will likely look very familiar to you if you've been through a statistics course.

The Z will replace our X and Y values in our table above after this transformation.  The x in the equation is a specific value of X or Y in the table above.  Mu (μ) is the average (mean) value of X or Y, and sigma (σ) is the standard deviation of X or Y.  For example, if I want to transform the values in the Y column, I need the average (1+4+2+3+5+5+5)/7 = 3.5714..., and the standard deviation, which equals 1.6183... (if you want more understanding of how to calculate a standard deviation you can easily find this on Wikipedia here).  Now that I have those 2 values, I'll calculate the first couple Z-scores for the Y variable.
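If you want to check the numbers, here is a small sketch that computes all of the Y Z-scores; note that the 1.6183 above is the sample standard deviation (dividing by n - 1), so that's what the code uses:

```python
def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divide by n - 1), matching the 1.6183 above
    std = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / std for v in values]

y_values = [1, 4, 2, 3, 5, 5, 5]
print([round(z, 3) for z in z_scores(y_values)])
```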



If we complete all of these calculations, we can transform our table into something that we can use to measure distance with all of these different data types.


Now that we've adjusted all of the variables, we're ready to introduce the mixed variable type distance equation.  It works pretty much like the Manhattan distance described earlier.


The first time I saw this I just about fell over from all of the subscripts and superscripts in this equation.  It looks WAY more complicated than it actually is.  Think of it as a weighted average of distance.  The Dij on the left is the distance between points i and j (e.g. point #'s 1 and 5 in our table above).  The summations on the top and bottom are just summing up over all of the attributes/variables we are analyzing in our table.  So P would be equal to 4 because we have 4 attributes (Zx, Zy, Zsize, and Color).  The Wij terms are user selected weights that can be used to push the clustering to consider some attributes to be more important than others.  In our example, we'll set these equal to 1 for simplicity.  Once we do that, the formula is really just an average (mean) Manhattan distance of all of the attributes in our table.  If you didn't follow that, you'll see what I mean when we find the new cluster centers for our example.
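Putting the pieces together, here's a sketch of that mixed-type distance with all of the weights set to 1.  The "types" list is just my own bookkeeping for which attributes are scaled numbers and which are categories:

```python
def mixed_distance(point_i, point_j, types, weights=None):
    # Numeric attributes contribute |difference|; categories contribute 0 or 1
    if weights is None:
        weights = [1] * len(types)
    dists = [abs(a - b) if t == "numeric" else category_distance(a, b)
             for a, b, t in zip(point_i, point_j, types)]
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)

types = ["numeric", "numeric", "numeric", "category"]  # Zx, Zy, Zsize, color
# Just to exercise it: the distance between the two seed points picked a little further down
print(mixed_distance((-0.614, 0.107, 0.5, "red"),
                     (-0.160, -0.116, 1, "black"), types))  # about 0.544
```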

If I plot the Zx and Zy variables on a chart (I can't plot 4 dimensions and 3 dimensions is a little tricky) it would look like this.


You can see that it is not so easy to visually identify where the cluster centers are going to end up. This is why we solve this with math instead of by eye.  To start, just like last time, we randomly select new seed cluster centers and then calculate the distances from all of the data points to these cluster centers.  Suppose that I pick the seed locations to be (-0.614, 0.107, 0.5, "red") and (-0.160, -0.116, 1, "black").  I can plot these on the chart like this


Notice that I changed the size labels from words to their coded number values in this version of the graph.  The seeds I "randomly" selected have values for Zsize that matched exactly with "medium" and "large", but there is no restriction that ordinal attribute seeds match one of the values in the series.  I could have just as easily picked a value like 0.33356.  Of course the category attribute values have to be selected from the available options ("white", "red" or "black").  Now we just need to compute some distances.  Like last time, I'll show you how to calculate the distances from the 2 seeds to the point in the top left corner, and then give the rest of the answers.


Notice that I am using absolute values for the distance calculations.  When you use the Euclidean distance you don't need to worry about this because a squared number will always be positive.  The other thing you need to notice is that for the last term in the numerator, I just did the color category distance comparison in my head and put it in there.  The first seed (the red diamond...don't get confused here with the colors of the cluster centers and the labels) had its color label equal to "red" and the data point in the top left is "red" too, so the color distance between them is 0.  For the green seed, it was labelled "black" so the color distance between it and the top left data point is 1.  If you continue to calculate the distances between the cluster centers and all of the data points, you'll end up with a table that looks like this


I highlighted in green the distances that are lower for each data point.  We'll use these lower distances to assign each point to a cluster center.  So points 1, 2, 4, and 6 are assigned to the green cluster center and points 3, 5, and 7 are assigned to the red cluster center.  Now, we have to calculate the new cluster centers based on the means of all of the attributes.  We do this the exact same way we did it before, except for the color attribute.  Zx, Zy, and Zsize are all now numerical and we just calculate the mean of the values assigned to each cluster.  For the color attribute, we don't actually take the mean.  Instead we take the mode.  If you don't remember what a mode is, it's just the value that comes up most often in the data.  So for the red cluster (points 3, 5, and 7) we have 2 points that are "red" and 1 point that is "white".  So we'll make the new cluster center point "red".  Similarly, the green cluster center has points that are "white", "black", "black", and "black".  So the new green cluster center is assigned to be "black".  There is a possibility that there is a tie between 2 values.  If that ever happens, the easiest thing to do is just to create an arbitrary tie-breaker rule like "always prefer white over red, and prefer red over black".  This rule will create some bias in the results, but in a large data set this will happen pretty infrequently, so it is not much of a concern.  If we do all of the mean calculations and assign the new colors to the cluster centers, we get new cluster centers that look like the last 2 rows of the table below.
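In code, that update step might look like this; the tie-break order is just an illustration of the arbitrary rule described above, and it assumes every color appears in the tie-break list:

```python
from collections import Counter

def update_center(members, types, tie_break=("white", "red", "black")):
    new_center = []
    for column, t in zip(zip(*members), types):
        if t == "numeric":
            new_center.append(sum(column) / len(column))  # mean for numeric attributes
        else:
            counts = Counter(column)
            top = max(counts.values())
            # Mode, with ties broken by the fixed preference order
            new_center.append(min((label for label, c in counts.items() if c == top),
                                  key=tie_break.index))
    return tuple(new_center)
```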


If we plot this on the graph, we can see the shift in the location of the cluster centers.


If we go through the same distance calculation process as before we see the distances come out like this

You may notice that although the distances are different, the assignments are the same.  The red cluster center is still closest to points 3, 5, and 7 and the green cluster center is closest to points 1, 2, 4, and 6.  That's our signal to stop the algorithm and give the final assignments.  They look like this


In 2D space this final answer to the clustering problem doesn't look so awesome, but as far as I can tell with my limited 4-dimensional vision it's the optimal solution with the seeds we started with.  The total distance from the cluster centers to the assigned data points can be calculated from the distances in the table above.  Red distances = 0.465 + 0.592 + 0.686 = 1.743  Green distances = 0.996 + 0.311 + 0.437 + 0.591 = 2.335.  The total distance is 4.078.  As mentioned, we have no guarantee that this is the absolute best way to cluster these data points.  I used Excel to run this clustering hundreds of times and found an optimal solution whose total distance was only 3.229.  The clustering for that solution looks like this.


This looks a lot like the solution for the non categorical problem.  It seems that the clusters we found in that example are still pretty significant in this example.

That summarizes how to do K-means clustering with numerical and non-numerical data.  I hope this has helped somebody out there.  Let me know if you have any comments or questions about anything in this post.

Saturday, April 25, 2015

How to Mine Frequent Patterns in Graphs with gSpan including a Walk-thru Example

In this blog post, I'm going to explain how the gSpan algorithm works for mining frequent patterns in graphs.  If you haven't already read my introductory post on this subject, please click HERE to read the post that lays the foundation for what will be described in this post.

Getting Started
To explain how this algorithm works, we're going to work through a simplified data set so that we can see each step and understand what is going on.  Suppose that you have 5 graphs that look like this

Don't get confused by the Greek characters here.  I used random symbols to show that this method can be used for any type of graph (chemicals, internet web diagrams, etc.), even something completely made up like what I did above.  The first thing we need to do with this graph data set is count how many vertices and edges we have of each type.  I count 7 alpha (α) vertices, 8 beta (β) vertices, and 14 lambda (λ) vertices.  There are 15 edges that are solid blue, 13 that are double red, and 8 that are dotted green.  Now that we have these counts, we need to order the original labels for vertices and edges into a code that will help us prioritize our search later.  To do this we create a code that starts with the vertices and edges with the highest counts and then descends like this


We could have used numbers instead of letters here; it doesn't really matter as long as you can keep track of the symbols' order based on frequency.  If you relabel the graphs with this order you get something like this
You can see that the vertex labels now have A, B, C and each edge is labelled a, b, or c to match the new label legend we created.
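If you're coding along, the relabelling step is just a frequency count followed by a sort.  Here's a rough sketch using my own graph representation (each graph is a list of vertex labels plus a list of (vertex index, vertex index, edge label) tuples); it assumes no more than 26 distinct labels of each kind:

```python
from collections import Counter
from string import ascii_uppercase, ascii_lowercase

def relabel_by_frequency(graphs):
    vertex_counts, edge_counts = Counter(), Counter()
    for vertices, edges in graphs:
        vertex_counts.update(vertices)
        edge_counts.update(label for _, _, label in edges)
    # Most frequent vertex label becomes 'A', most frequent edge label becomes 'a', etc.
    vertex_map = {lab: ascii_uppercase[i]
                  for i, (lab, _) in enumerate(vertex_counts.most_common())}
    edge_map = {lab: ascii_lowercase[i]
                for i, (lab, _) in enumerate(edge_counts.most_common())}
    return vertex_map, edge_map
```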

Finding Frequent Edges
For this example we will assume that we have a minimum support frequency of 3 (i.e. 3 of the 5 graphs have to have the pattern for us to consider it important).  We start by looking at individual edges (e.g. A,b,A) and see if they meet this minimum support threshold.  I'll start by looking at the first graph at the top.  There is an edge that is (B,a,C) and I need to count how many of the graphs have an edge like this.  I can find it in the 1st graph, 2nd graph, 3rd graph, and 4th graph so it has support of 4; we'll keep that one.  Also notice that I had the option of writing that edge as (C,a,B) as well, but chose to write it with B as the starting vertex, because of the order we established above.  We move on a little and check (A,a,C) and find that it can only be found in the first graph; it gets dropped.  We check (A,c,C) and find it in the first 4 graphs as well; keep it.  If we continue going through all of the graphs, being careful not to repeat any edge definitions, we'll get a list of frequent single edges that looks like this
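Counting support for a single edge is the same kind of bookkeeping; here's a sketch under the same graph representation as above (each graph counts at most once toward support, no matter how many copies of the edge it contains):

```python
def edge_support(edge_label_triple, graphs):
    # edge_label_triple is e.g. ("B", "a", "C"); count the graphs that contain it
    # in either direction, since the graphs are undirected
    u, e, v = edge_label_triple
    count = 0
    for vertices, edges in graphs:
        for i, j, label in edges:
            if label == e and {vertices[i], vertices[j]} == {u, v}:
                count += 1
                break  # one occurrence per graph is enough for support
    return count
```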

You can see that we only have 3 edges that are frequent (this will simplify our lives in this example greatly).  We need to sort the edges that are still frequent in DFS lexicographic order (See prior blog post for explanation of this).  After sorting we end up with this list of edges to consider going forward


Pay attention to the fact that I added the "0,1" to the beginning of these edge definitions.  These are the starting and ending points of the edges.  We'll "grow" the rest of the frequent sub-graphs from these edges in a way similar to the method described in my prior blog post on this subject.

gSpan starts with the "minimum" edge, in this case (0,1,A,b,B), and then tries to grow that edge as far as it can while looking for frequent sub-graphs.  Once we can't find any larger frequent sub-graphs that include that first edge, then we move on to the 2nd edge in the list above (0,1,A,c,C) and grow it looking for frequent sub-graphs.  One of the advantages of this is that we won't have to consider sub-graphs that include the first edge (0,1,A,b,B) once we've moved on to our 2nd edge because all of the sub-graphs with the first edge will have already been explored.  If you didn't follow that logic, it's OK; you'll see what I mean as we continue the example.

You might notice that the last graph in our set of graphs doesn't have any edges that are frequent.  When we're starting with edge (A,b,B) we keep track of which graphs have that edge present.  In our example that's graphs 1, 2, 3, & 4.  Now that we have that "recorded" we start the growth from (0,1,A,b,B) by generating all of the potential children off of the 'B' vertex.

Growing from the 1st Edge
Before we grow our first edge, you need to understand how gSpan prioritizes growth.  Let's assume that we have a generic subgraph that looks like this already (don't worry about how we got it yet, it will become clear in a second)

I have removed the colors and labels from the diagram above to explain how gSpan grows without confusing you with labels.  The numbers represent the order the vertices were added.  If you are at this point in the algorithm, you will try to grow this graph in the following order of priority.


So the first priority is always to link the current end vertex back to the first vertex (the one labelled 0).  If that's not possible, we see if we can link it back to the 2nd vertex (the one labelled 1).  If we had a bigger sub-graph, we would continue trying to link to the 3rd, 4th, etc. if it's possible.  If none of those are possible then our next (3rd) priority is to grow from the vertex we're at (labelled 4 above), and grow the chain longer.  If that doesn't work, we go back to the most recent fork, take another step back and then try to grow from there (see how 5 is growing from vertex labelled 1 in the 4th priority).  If that doesn't work, then we take another step up the chain and try to grow from there (5th example above).  The 4th and 5th examples above also generalize for longer chains.  If there were several vertices above the closest fork in the chain, you would progressively try to grow off of each of those vertices until you get back to the root (vertex labelled '0').
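To keep the priorities straight, here's a small sketch that lists the growth options in the order described above.  Here rightmost_path is the list of DFS vertex numbers from the root down to the most recently added vertex; this mirrors the description in this post rather than the formal definition in the gSpan paper:

```python
def growth_options(rightmost_path):
    """Yield candidate extensions in the priority order described above."""
    rightmost = rightmost_path[-1]
    # First priorities: try to link the end vertex back to the root, then to the
    # next vertices down the path (the direct parent is skipped because that
    # edge already exists)
    for v in rightmost_path[:-2]:
        yield ("backward", rightmost, v)
    # Next priority: grow a new vertex off the end of the chain
    yield ("forward", rightmost)
    # Remaining priorities: grow a new vertex off each earlier vertex on the
    # path, working back up toward the root
    for v in reversed(rightmost_path[:-1]):
        yield ("forward", v)

print(list(growth_options([0, 1, 4])))
```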

With that understanding in mind, we can finally start growing from our first edge.  Our first edge is (0,1,A,b,B).  If we look at the priority explanation above, we can tell that the 1st and 2nd growth priorities are not really options, so we take the 3rd priority and try to grow from vertex 1, or the B vertex in our example.  Since we only have 2 frequent edges that have 'B' vertices in them our two options are as follows:

Option 1           Option 2
(0,1,A,b,B)       (0,1,A,b,B)
(1,2,B,b,A)       (1,2,B,a,C)

Now we are going to check for support of these 2 options.  We only need to look in the graphs that we determined contain the frequent edges, as described earlier.  When we look for Option 1, this is what we find in the graphs we're still considering:


Notice that we can find the first option in 3 different places in the first graph, but we can't find it in any of the other graphs, so the support is only 1.  That's a dead end, so we check option 2


You can see that we can find this pattern in multiple places in all of the 4 graphs we have left (I didn't show all of the places it is found in the first graph).  The support for this sub-graph is 4 so it's a keeper.  Also, all 4 graphs we looked at have this sub-graph present so we keep those too.  That ends that round of edge growth so we take option 2 from above and try to grow from it.  Remembering the priorities of growth, we know we need to try to connect the last vertex (2 or 'C' in our example) to the root (0 or 'A' in our example) if we can.  If we look at the list of frequent edges to choose from, we see an edge that is (A,c,C) which could be written as (2,0,C,c,A).  This would make the connection we need.  None of the other frequent edges meet this criteria, so we check to see if (0,1,A,b,B); (1,2,B,a,C); (2,0,C,c,A) is frequent


This time this sub-graph occurs twice in the 1st, 2nd and 4th graphs.  It only occurs once in the 3rd graph.  Regardless, the support count for this sub-graph is still 4 so we've got a keeper again.  Now we have a graph that looks something like this.


Since we have already connected the last vertex ('C') to the root ('A') and to the vertex after it ('B'), our next priority is to grow from the last vertex we created.  The frequent edges we can try to add onto 'C' are (B,a,C) and (A,c,C).  However, since we're working from 'C' they would be better written this way: (2,3,C,a,B) and (2,3,C,c,A).  We'll start by checking if adding (2,3,C,a,B) works because it is the lowest in terms of DFS lexicographic order.  Here's what we find.


As it turns out, this 4 edge pattern can only be found in 2 of the 4 graphs we have left so that doesn't work.  Now we'll check if (2,3,C,c,A) works out.


Nope, that's only in the first graph so it's not frequent either.  Since neither of these worked out, we need to go to our next priority (4th) which is growing from the vertex a step up from the last attempted branch.  When you look at the sub-graph we've grown so far you can tell we need to try to grow from 'B' now.  That being said, we take a look at our options for frequent edges that contain 'B' vertices and we see that we should try to add (B,a,C) or (B,b,A).  Here's what trying to add (B,a,C) looks like in this scenario

Let's see if we can find that in any of the graphs


 As it turns out, we CAN find this sub-graph in 3 of the 4 graphs we have left so that's a keeper.  At this point in time I want you to notice that I am going to ignore the other edge that could grow from 'B' until later.  This is because gSpan is a depth first algorithm, so once you get success you try to go deeper.  When we hit a dead end, we will come back to all of these straggling options to see if they actually work.  If you're familiar with computer programming, this type of algorithm is implemented with recursion.  Our next step based on our priority rules is going to be to try to link our vertex 3 ('C') to the root, or the 'A'.  Again the only frequent edge we can choose from to do this is (C,c,A).  This option would look like this


At this point, remember that the last sub-graph that we found that was frequent could only be found in the first 3 graphs in our set, so now we only have to look for this new graph in those three graphs


This only exists in one of our graphs, so we go to our next growth priority.  Since there is already a link between our 'C' at vertex 3 and the 'B' at vertex 1, the 2nd priority isn't an option, so we try to grow/extend from the 'C' at vertex 3 again.  We can either add (C,a,B) or (C,c,A).  Adding (C,a,B) would look like this

If we look in the 3 graphs we have left we find this


That's a dead end because we only have 1 graph with this 5 edge sub-graph.  We back up and try adding (C,c,A) instead.  I'm not going to do the visuals for this, but it can't be found in any of the graphs we have left either.  Since we hit a dead end again here, we go to our 5th priority growth option: growing from the root again.  If we try to grow from the root 'A' we can either use (A,b,B) or (A,c,C).  Starting with (A,b,B) we would get a sub-graph that looks like this


This can't be found in any of our graphs, so we try adding (A,c,C) as our last ditch effort


This can't be found either, so we have reached the end of this depth first search.  Now we need to go back and take care of the straggler options that we didn't pursue because we were going for depth first.  The last straggling option we left was when we were considering adding (B,b,A) to vertex 1.


Remember, at that point in time, we were still looking at 4 of the 5 original graphs so when we look for this sub-graph in those 4 graphs this is what we find.


We only find it in one graph, so that turns into a dead end too.

This Section Until "2nd Edge (A,c,C)" is a Correction (Thanks to Elife Öztürk) 
If we look back further in our algorithm we find that we also haven't checked all of the growth priorities off of the first frequent edge (0,1,A,b,B).  Our first growth came from using the 3rd priority growth strategy, but we haven't checked the 4th/5th priority growth strategies there yet.  If we do, we can see that it might be possible to make a link like (0,1,A,b,B) and (0,2,A,c,C).  Let's see if we can find that in the graphs



We can see that in all four of our remaining graphs so we'll keep this sub-graph.


Our first priority for growth on this sub-graph would be to connect back to the root, but we don't have any double bonds (sorry for the chemistry reference) in our example so that won't work.  The 2nd priority is also impossible, so let's check the 3rd.  Can we grow an edge off of the C node?  Our 2 options would be (2,3,C,a,B) and (2,3,C,c,A).  Adding (2,3,C,a,B) looks like this.



When we search through the graphs, we find this in 2 places

Notice that I'm not counting the instances in the 2nd and 3rd graphs.  This is because the only way you can find this subgraph in those graphs is to wrap the graph such that the 1 B node is the same as the 3 B node.  The reason why we number the nodes is to make sure they're unique, plus we already found the sub-graph that creates that ABC triangle previously...thus the beauty of the algorithm :).  Since there are only 2 graphs with this sub-graph, the minimum support of 3 is not met.

So now let's check for the other growth option (2,3,C,c,A)

That can only be found in one of the graphs

So now we've exhausted our growth options off of the first edge (A,b,B).

2nd Edge (A,c,C)
Before you go crazy thinking that the other 2 edges we need to check will take just as long, let me tell you that we have already done WAY more than half of the work on this problem.  Now that we have exhaustively checked all of the graphs for frequent sub-graphs that contain the edge (A,b,B), we can trim down our search space significantly.  What I mean is we can delete all of the (A,b,B) edges from the graphs and only look for frequent patterns in what is left.  If we do this, we get 4 graphs (remember we already got rid of graph 5 because it contained no frequent edges) that look like this.


These graphs are much simpler.  I also wanted to point out that I removed any edges from these graphs that were not frequent.  We could (should) have done this earlier when we were doing our search on the first edge, but I didn't want to complicate things at that time.  Now, much like last time, we start with edge (0,1,A,c,C) and try to grow from 'C'.  We can either use (1,2,C,a,B) or (1,2,C,c,A).  If we add (1,2,C,a,B) we can find that sub-graph in the following places


All 4 graphs have this sub-graph, so it's a keeper and we'll go with it (remember we haven't checked adding (1,2,C,c,A) yet).  The next step is to try to grow from the last vertex back to the root.  If we did this we would have to use the edge (B,b,A), but that edge has been removed from all of our graphs because we have already mined all of the frequent sub-graphs with (A,b,B) in them.  So, our only other growth option is to try to extend the chain from 'B'.  The only option left to grow from 'B' is (B,a,C) because (B,b,A) has already been checked.  So we get a sub-graph like this.


This can be found in the following graphs


That's 3 out of the 4 graphs, which meets our support requirement so we'll keep it.  Our first growth priority from here is to link node 3 ('C') back to the root.  That would be adding a (3,0,C,_,A) link.  The only link possible for this in our data set is (3,0,C,c,A) which would give us this subgraph to look for

The only place this subgraph is found is in our 2nd example:

So, that's a dead end.  If we back up, our next growth priority would be to connect the C at node 3 back to the C at node 1.  A quick look at the graphs we have tells me that's not going to work.  The next priority after that is to add another node after node 3.  We could look for a subgraph with C,a,B added to the end that looks like this.


However, that subgraph is only found here:


So, we back up again and try our 4th priority, which is growing a link off of the B at node 2.  A quick look at the graphs above shows that won't work.  Lastly, we'll try to grow further up the graph, like a new node coming off the C at node 1 or a new node off the A at node 0.  There are a couple of options for growth off of node 1, but I can tell they're only supported by our first graph.  There are no additional options for growth from the root node 'A', so we're done exploring this graph branch.

Now we need to step back to when we had the graph (0,1,A,c,C), (1,2,C,a,B) and try to grow again from there.  The last priority for growth for this subgraph is to grow from the root.  The only way we can grow from the root is to add (A,c,C), which would look like this.

Checking the four graphs we have left, we only find this sub-graph in our 2nd graph


Since this is only found in one place, it's a dead end and we need to backtrack to other options we didn't pursue earlier.  We only had one of those when we were adding our 2nd edge to (A,c,C).  So when we go back and try to add (1,2,C,c,A) to (0,1,A,c,C) we look at our graphs and find that it is only present in 2 of our graphs


As it turns out these are all of our options starting with edge (A,c,C).  See, I told you the 2nd edge would go faster than the first.

3rd Edge (B,a,C)
Now we have to check the last of our frequent edges to see if there is any way that we can grow a frequent sub-graph with just this edge.  Let's start by removing all of the (A,c,C) edges from our graphs to see what we're left with.


I could go through all of the same procedure I have been using for these remaining graphs, but it gets so simple at this point that there really isn't a reason to.  If you're using a program running the gSpan algorithm, it will be disciplined and do the whole search though.  From what is left, we can see that (0,1,B,a,C); (0,2,B,a,C) is frequent, but anything larger doesn't work out.  We hit our last dead end and now we can report all of the sub-graphs that we found that were frequent.



I have them organized by the edge they were grown from.  The last thing we have to do is translate our DFS codes back to the original symbols.  This is easy enough and the user would get an output like this.


That's all there is to it.  I know that this post got a little long, but for me, being able to see the whole problem worked out helps me learn a TON.  I hope it helps somebody else out there too.  I did gloss over some details of the programming, etc. but this is essentially what's happening in gSpan.  Let me know if you have any additional questions.