Tuesday, March 10, 2015

Generalized Sequential Pattern (GSP) Mining


This is my first post about sequential pattern mining.  I'll start by explaining the concept of sequential pattern mining in general, then walk through how the generalized sequential pattern (GSP) algorithm works and how it relates to the Apriori method.

Sequential Pattern Mining
If you're a store owner and you want to learn about your customers' buying behaviour, you may not only be interested in what they buy together during one shopping trip.  You might also want to know about patterns in their purchasing behaviour over time.  "If a customer purchases baby lotion, then a new-born blanket, what are they likely to buy next?"  With information like this, a store owner can create clever marketing strategies to increase sales and/or profit.

To find patterns like these, the Apriori algorithm has to be modified to account for the time gaps between transactions.  For this portion of the discussion I'll be referencing a technical paper by Rakesh Agrawal and Ramakrishnan Srikant entitled "Mining Sequential Patterns".  Say you start with a database full of transactions that looks something like this.


Suppose that the transaction time represents the day of the month that the item was purchased.  As humans, we can spot some repeat customers and make guesses about repeat patterns that might exist, but we want an automated approach.  An easy way to find sequential patterns is to use a modified version of the process the Apriori algorithm uses for single transactions.  To do this we just sort the table first by customer ID, and then by transaction time stamp.  We'll get something that looks like this.


You'll also notice an extra column at the right that takes the sorted information and turns it into customer sequences.  The reason for this is that we can use this format to mine sequential patterns much like patterns within transactions are mined (e.g. basket analysis).  How people format the items in the sequence varies a little bit among the technical papers I've read, but let me explain how I've formatted it above.  Every sequence of transactions starts with a '<' and ends with a '>'.  That's all those symbols mean.  The next thing you need to know is that any time 2 or more items (letters in my example) are surrounded by parentheses (e.g. "(AB)", which might be an apple and an orange), those items were purchased at the same time.  If an item is not surrounded by parentheses, then the customer didn't buy anything else in that transaction.  So a sequence like <A(BC)DE(FG)> would represent 5 different transactions: the customer first bought A, 2nd B and C together, 3rd D, 4th E, and 5th F and G together.  Also, I'm not sure if you noticed, but I always have the items within parentheses ordered alphabetically.  It's a common convention to sort the items bought together from least to greatest (the fancy technical term for this is lexicographic order).  With that understanding, we can start talking about how the GSP algorithm works.
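Before getting into the algorithm, here's one way this notation might be represented if you wanted to play with it in code.  This is just my own sketch in Python (the papers don't prescribe a data structure): a customer sequence is a list of transactions, and each transaction is a set of items.

# My own illustrative representation (not from the papers): a sequence is a
# list of transactions, and each transaction is a frozenset of items.
# <A(BC)DE(FG)> becomes [{'A'}, {'B','C'}, {'D'}, {'E'}, {'F','G'}].
example_sequence = [frozenset('A'), frozenset('BC'), frozenset('D'),
                    frozenset('E'), frozenset('FG')]

def pretty(sequence):
    """Render a sequence in the <A(BC)DE(FG)> notation used in this post,
    sorting the items inside each transaction lexicographically."""
    parts = []
    for transaction in sequence:
        items = ''.join(sorted(transaction))
        parts.append(items if len(items) == 1 else '(' + items + ')')
    return '<' + ''.join(parts) + '>'

print(pretty(example_sequence))   # prints <A(BC)DE(FG)>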

GSP Algorithm
The Generalized Sequential Pattern algorithm was built on a simpler algorithm for mining sequences, with some extra bells and whistles added so it can be more flexible for different situations.  To explain the process, I'm going to start with the basics, then add the bells and whistles at the end.  The beginning of the GSP algorithm is basically the same as the Apriori algorithm, because for a single item there really is no order.  So, we find all of the items in our customer sequence database that meet the minimum support threshold.  We'll use the data above and say that the minimum support is 2.  With that threshold we get the table below.

There are a couple of tricky things here.  The more obvious one is that item E did not have enough support to meet the threshold, so it gets removed.  Based on the Apriori principle, we can conclude that because E is not frequent, any sequence containing E would not be frequent either.  So, we don't need to keep track of it anymore.

The next tricky thing is that the support for B is listed as 5, not 6, even though the 3rd customer bought item B twice.  This practice is also borrowed from the Apriori algorithm.  In the Apriori algorithm, if a customer buys 2 candy bars at once, we only count 1 candy bar when calculating the support, because we count transactions.  The same applies here, except instead of counting how many transactions contain an item or itemset, we are counting how many customers have an item/sequence.
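To make the "count customers, not purchases" rule concrete, here's a rough sketch of how the 1-item counts could be computed, using the list-of-transactions representation from the earlier snippet.  The little database here is hypothetical (the real one is in the table above); it's just to show that a repeated item still counts once per customer.

from collections import Counter

# hypothetical database: customer ID -> sequence (list of frozensets of items)
customer_sequences = {
    1: [frozenset('A'), frozenset('B'), frozenset('FG')],
    2: [frozenset('B'), frozenset('B'), frozenset('C')],
}

def item_support(customer_sequences):
    """Each customer contributes at most 1 to an item's count, no matter how
    many times they bought it, because we union all of their transactions."""
    support = Counter()
    for sequence in customer_sequences.values():
        support.update(set().union(*sequence))
    return support

def frequent_items(customer_sequences, min_support=2):
    return {item for item, count in item_support(customer_sequences).items()
            if count >= min_support}

print(item_support(customer_sequences))   # customer 2's repeated B counts once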

Once we have these items we start the iterative process of generating larger patterns from these items and checking if they have support.  This is where things get really interesting.  In Apriori, the total possible combinations would be AB, AC, AD, AF, AG, BC, BD, BF, BG, CD, CF, CG, DF, DG, FG.  When generating sequences, that is not nearly sufficient because the order matters, AND we can have sequences where single items repeat (e.g. I bought A, then the next day I bought A again).  So we have to create tables that look like this
Because order matters, there are a lot more options.  Oh, and in case you forgot, there is also the possibility of 2+ items being bought in the same transaction.
The diagonal values are blank here because they're already covered in the first 2-item table.  Just to make sure you understand, (AA) isn't there because if a customer bought two A's in the same transaction it would only be counted once, so we never have two of the same item together within parentheses.  The lower left is also blank, but that's because of the convention I already described of always listing items purchased together in ascending order.
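If you wanted to generate those candidates in code rather than on paper, a sketch might look like the following (my own illustration): every ordered pair of frequent items, including a repeat like <AA>, plus every unordered pair as a single transaction.  For 6 items that gives the 36 + 15 = 51 candidates mentioned below.

from itertools import product, combinations

def candidate_2_sequences(items):
    """Candidate 2-sequences from the frequent 1-items: ordered candidates
    like <AB> are two one-item transactions; "together" candidates like (AB)
    are a single two-item transaction."""
    ordered = [(frozenset(a), frozenset(b)) for a, b in product(items, repeat=2)]
    together = [(frozenset(a + b),) for a, b in combinations(sorted(items), 2)]
    return ordered + together

print(len(candidate_2_sequences('ABCDFG')))   # 36 ordered + 15 together = 51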

So, now that we understand all of that, we can check our 51 possible 2-item sequences against the database to see if they meet the support threshold.  I'll walk you through the first row of 2-item sequences to show you how this is done (see the table below).

In the first column, we see that no customer ever bought item A on more than one occasion, so the support there is 0.  For <AB>, you can see that customers 1 and 5 bought those items in that sequence.  For <AC>, notice that customer 4 bought (AB) together and then C, but we get to pick out just the A from that combined transaction to get our <AC> sequence.  Notice that the items don't have to be right next to each other; it's OK to have other items in between.  <AD> follows similar logic.  For <AF> we don't count customers 3 or 4 because, although they have A and F, they aren't in the right order.  To be counted here, they need to buy A, then F.  When we look for the support of <FA>, those customers will get counted.  There is one last rule you need to know for counting the support of sequences.  If you are looking for the support of a sequence like <FG>, then customer 1 doesn't count.  It only counts if F is bought before G, NOT at the same time.  If you work through the rest of the 2-item sequences that weren't purchased together (i.e. not ones like (AB)), you get support counts that look like this.
You can see that I have struck out any 2-item sequences that do not meet the support threshold of 2.  Now let's take a look at the 2-item sets that could have been purchased together (e.g. (AB)).  This one is a little bit easier, because there are only 5 times in our database when 2 items were purchased at the same time.  They were (FG), (AB), (AB), (BC) and (DE).  Only (AB) shows up twice, so we keep it and get rid of the rest.  Now we have these 2-item sequences left to work with:
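The workhorse in this step is deciding whether a candidate is contained in a customer's sequence under the rules above: the transactions have to appear in order, gaps are fine, and an element like (FG) only matches a transaction that contains both items.  Here is a small sketch of that check, my own code again, using the representation from the earlier snippets.

def contains(candidate, customer_sequence):
    """True if every element of `candidate` can be matched, in order, to a
    strictly later transaction that is a superset of it.  Gaps are allowed;
    an element like {'F','G'} only matches a transaction containing both."""
    position = 0
    for element in candidate:
        while (position < len(customer_sequence)
               and not element <= customer_sequence[position]):
            position += 1
        if position == len(customer_sequence):
            return False              # ran out of transactions
        position += 1                 # the next element must match later
    return True

def sequence_support(candidate, customer_sequences):
    """Number of customers whose sequence contains the candidate."""
    return sum(contains(candidate, seq) for seq in customer_sequences.values())

# e.g. <FG> is not supported by a customer who only bought (FG) together:
print(contains((frozenset('F'), frozenset('G')), [frozenset('FG')]))   # False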


Now we start the next iteration looking for 3-item sequences.  This is where I'm going to start referring to "Mining Sequential Patterns: Generalizations and Performance Improvements" by Ramakrishnan Srikant and Rakesh Agrawal.  In that paper they describe a process for generating candidate sequences that I think is pretty cool, now that I understand it.  You take all of the remaining sequences from the last step (AB, AC, AD, AF, AG, BC...), then you remove the first item from the sequence.  For AB, we remove A and we're left with B.  Do this for all of the sequences (see the 2nd column in the table below).  Then you do the same thing, but remove the last item (3rd column in the table below).  This is pretty straightforward except when you're dealing with a sequence that ends in a multiple-item purchase like (AB).  When that happens, you have to remove each of the possible items one at a time and treat the results separately.  So if you're removing the "first" item from (AB), you remove A and are left with B, and then you remove B and are left with A.  Hopefully I haven't lost you in that explanation.  If I did, I think reviewing this table will demonstrate what I'm talking about.
Now that we have this information (if we were programming this we wouldn't actually create a table...it's just to make it easier to see and understand for this blog post), we combine sequences together where the "-1st" column of one matches the "-Last" column of another.  So, starting from the top, we see that the 2-item sequence AB matches up with BC, BD, BF, BG and (AB) to create ABC, ABD, ABF, ABG and A(AB).  The A(AB) one is tricky because you have to remind yourself that the order within the parentheses is just a convention; it could just as easily be written A(BA), which makes it easier to see that the B's match up.  Working through the rest of the table this way (being careful not to create any duplicate candidates), we populate the "3-seq after join" column in the table below.
The next step is to prune these 3-item sequences.  Many of them contain 2-item subsequences that weren't supported in the previous round.  A good example of this is the candidate sequence AFA.  AF and FA are supported 2-item sequences, but AA is not, so this one gets pruned out.  This pruning is based on the same Apriori principle that we have used in prior posts, and it saves computing time because we don't want to comb through the database to find the support for a sequence that we already know won't meet the support threshold.  Now that we have this slimmed-down list of 3-item sequences, we scan the database again to determine support, just as described above for the 2-item sequences.  When that is completed we end up with the 3-item sequences in the last column of the table above.
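For anyone who'd rather see that in code than in a table, here's a sketch of my reading of the join-and-prune step.  It builds the "-1st" and "-Last" variants by dropping one item at a time (treating each item of a multi-item transaction separately, as described above), joins sequences whose variants match, and then prunes candidates that contain an infrequent subsequence.  It's a simplified illustration of the idea, not the paper's exact pseudocode.

def drop_one(index, seq):
    """All ways to drop a single item from the transaction seq[index].
    Yields (dropped_item, item_shared_its_transaction, shortened_sequence)."""
    for item in sorted(seq[index]):
        remaining = seq[index] - {item}
        shortened = list(seq)
        if remaining:
            shortened[index] = remaining
        else:
            del shortened[index]
        yield item, bool(remaining), tuple(shortened)

def gsp_join(frequent):
    """Join frequent (k-1)-sequences into candidate k-sequences wherever a
    "-1st" variant of one equals a "-Last" variant of another."""
    candidates = set()
    for s1 in frequent:
        for s2 in frequent:
            for _, _, minus_first in drop_one(0, s1):
                for item, shared, minus_last in drop_one(len(s2) - 1, s2):
                    if minus_first != minus_last:
                        continue
                    if shared:   # item came out of a multi-item transaction
                        candidate = s1[:-1] + (s1[-1] | {item},)
                    else:        # item was a transaction on its own
                        candidate = s1 + (frozenset({item}),)
                    candidates.add(candidate)
    return candidates

def prune(candidates, frequent):
    """Apriori-style pruning: keep a candidate only if every subsequence
    obtained by dropping one item is itself frequent (this is why A(AB) goes:
    dropping its B leaves AA, which wasn't a frequent 2-sequence)."""
    frequent = set(frequent)
    return {c for c in candidates
            if all(sub in frequent
                   for i in range(len(c))
                   for _, _, sub in drop_one(i, c))}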

At this point it is just "lather, rinse, repeat" as my shampoo bottle suggests.  Take that list of 3-item sequences and generate 4-item sequences using the same method.  We remove the first and last items and put them in the "-1st" and "-Last" columns below.

Then we look for sequences whose "-1st" value matches the "-Last" value of another sequence.  In this case we only find matches for "BF" and "BG".  You should prune any candidates whose 3-item subsequences aren't all frequent, but in this case they're fine.  Then we check the database for support to finalize the list.  If you do that you end up with information similar to the table below.


You can see that, at this point, this example gets pretty simple.  If you try the iteration again, the trick of removing the first item and the last item yields no matches, and you're done.  After that you would just output all of the supported/frequent sequences to the user, which in this case would be the following (as long as I didn't make a copy and paste mistake :/ ).
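Putting all of the iterations together, the whole procedure is a fairly short loop.  Here's a skeleton that reuses the helper functions from my earlier sketches (frequent_items, candidate_2_sequences, sequence_support, gsp_join and prune; the names are my own, not the paper's), again just as an illustration.

def gsp(customer_sequences, min_support=2):
    """Skeleton of the GSP loop: frequent 1-sequences, then alternate
    candidate generation, pruning, and a support-counting pass over the
    database until no candidates survive."""
    items = frequent_items(customer_sequences, min_support)
    frequent = [(frozenset({item}),) for item in sorted(items)]
    results = list(frequent)
    k = 2
    while frequent:
        if k == 2:
            candidates = set(candidate_2_sequences(sorted(items)))
        else:
            candidates = prune(gsp_join(frequent), frequent)
        frequent = [c for c in candidates
                    if sequence_support(c, customer_sequences) >= min_support]
        results.extend(frequent)
        k += 1
    return results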

Bells and Whistles
I promised to describe the bells and whistles for this process as well.  Here's a list of extras that can be added:

  1. Taxonomies/Hierarchies: Harry Potter and the Sorcerer's Stone is part of a series by J.K. Rowling, which is part of the children's fantasy genre.  Taxonomies like this can be incorporated so that more general sequences can be found if the more specific ones aren't supported.  The way the algorithm handles these taxonomies is to add the higher-level items (e.g. J.K. Rowling is higher level than Harry Potter) to the transactions.  For example, if you had a sequence like <A(1, B)A>, you would translate it into a sequence like <(A, Letter, Symbol)(1, Number, B, Letter, Symbol)(A, Letter, Symbol)> if you had a taxonomy like the one here:
  2. Window Size:  If you've had a customer for years, maybe it isn't that interesting if they support a particular pattern, because they have so much buying behaviour on record that they might support it just by chance.  Window sizes limit the span of time within which transactions are deemed to support a sequence (e.g. I only want buying patterns for pregnant women during a 9-month window).
  3. Max-gap:  This parameter is defined by the user to filter out large gaps in data sequences.  Let's say a customer buys an iPod and then, even though they are within the specified window size, there is a large gap between that purchase and a new set of headphones.  To a business owner, this gap might make them "uninteresting" as a customer to market to, so the GSP algorithm can filter this out as it runs and checks for support.
  4. Min-gap: Think of this as answering the question "how much time would I let pass between transactions before I would consider 2 purchases separate elements in my sequences?"  An example of this would be me at Home Depot.  I go to start a project on a Friday evening and buy what I think I need.  It isn't until Saturday about noon that I realize I had no idea what I REALLY needed, and I go back for more stuff.  Some store owners may want to treat these as one "transaction" or event.  If the min-gap was set to 24 hours in this case, all of my purchases would be grouped together to look something like this: <(Friday night stuff, Saturday noon stuff)>.  (See the sketch after this list for a rough idea of how these time parameters might be checked.)
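To give a flavour of how the gap parameters change the support check, here's a sketch of a containment test over timestamped transactions.  It assumes each customer's data is a list of (time, items) pairs, enforces min-gap and max-gap between consecutive matched elements greedily, and leaves out the window-size merging of nearby transactions, so treat it as a simplified illustration of the idea rather than the algorithm from the paper.

def contains_with_gaps(candidate, timed_sequence, max_gap=None, min_gap=0):
    """Greedy containment check over (time, items) transactions: consecutive
    matched elements must be more than `min_gap` apart and, if `max_gap` is
    given, at most `max_gap` apart.  Window-size merging is not modelled."""
    position, last_time = 0, None
    for element in candidate:
        matched = False
        while position < len(timed_sequence):
            time, items = timed_sequence[position]
            position += 1
            gap_ok = last_time is None or (
                time - last_time > min_gap
                and (max_gap is None or time - last_time <= max_gap))
            if gap_ok and element <= items:
                last_time = time
                matched = True
                break
        if not matched:
            return False
    return True

# e.g. an iPod on day 1 and headphones on day 30 fail a max_gap of 14 days:
purchases = [(1, frozenset({'iPod'})), (30, frozenset({'headphones'}))]
print(contains_with_gaps((frozenset({'iPod'}), frozenset({'headphones'})),
                         purchases, max_gap=14))   # False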
I admit that there are more details to how you might actually implement these bells and whistles if you had to code them.  If you're interested in that, let me know you'd like another post on it.  Otherwise, I refer you to "Mining Sequential Patterns: Generalizations and Performance Improvements" by Ramakrishnan Srikant and Rakesh Agrawal.

For those of you with the perseverance to get to the end of this lengthy post I congratulate you!  As always, I hope this was helpful for somebody out there.

18 comments:

  1. Really very informative! Thanks a ton! :D
  2. The best blog I have seen about GSP
  3. This really helped me to understand, thank you so much!
  4. WoW it really helped me to understand.
  5. thank you sir ! now i got it. thank you for taking the pain to create such a long blog
  6. F(AB)D is also one of the candidate 4-sequence, it comes from ABD + F(AB)
     Reply: I AM SORRY, it comes from F(AB) + ABD
  7. very helpful sir thank you! but can you please explain why A(AB) gets pruned out.
     Reply: AA is a subsequence of A(AB) but it's not one of the 2-item sequences.
  8. Thanks a ton for such a walkthrough is really tedious to make but super helpful!
  9. very helpful thanks a lot
  10. Thanks a lot, perfect!!
  11. Awesome article. I’m taking Data Mining and our instructor recommends this article to us.