
Saturday, April 11, 2015

Probabilistic Retrieval Model: Basics, Query Likelihood and Smoothing

This post discusses a different way (compared to the vector space model) to rank documents when performing text retrieval, called the probabilistic retrieval model.  It bases its formulas on probability theory instead of the rules of thumb that were developed for the vector space model through trial and error.  I'll explain the basics of the probabilistic model, discuss some of its limitations, derive a "smoothing" implementation (Jelinek-Mercer), and then give an example of how it all works.

If it has been a while since you've taken statistics, or you never have, I'm hoping to make this first section easy for you to follow.  When it comes to retrieving the right document during a search, we can think of the best documents as the ones that have the highest probability of being relevant to the searcher.  The trick comes when we have to define that probability.  The mathspeak version of this probability is written as "p(R=1|d,q)".  Translated into English, that reads "the probability that R=1 (i.e. the document is relevant) given that we have document d and query q".  If we can efficiently calculate this probability, we can rank the documents in our search by their probability of being relevant.

The problem with the probability p(R=1|d,q) is that it is very hard, if not impossible, to calculate from the information we have when a user submits a query.  So the data mining community has developed a substitute probability that gives us basically the same effect: the probability that the query entered by the user was randomly generated from a relevant document.  That probably doesn't make sense yet, so let me explain what a unigram is and then give you an example.

When you analyze text, you have to decide how the text will be divided up for analysis.  If the user enters a query for "hard drive", are you going to treat "hard" and "drive" as 2 separate variables, or are you going to treat them as one?  How do you know which one you should use?  For the simplest models we treat each word as a different variable and assume that each word has nothing to do with the other words.  This assumption is called statistical independence.  It basically means that you're assuming there is no correlation between the words in the query.  This is obviously NOT true, but as it turns out, accepting this assumption gives pretty good results anyway, so we'll go with it.  Each word in the query gets a fancy name: a unigram (which basically means 1 word).  If you bunched words into groups of 2 they would be 2-grams, etc.

Query Likelihood Example
Now it's time for that example I promised.  Suppose you entered the query {hard drive test} and
you're looking at document D4 = {...as part of the factory acceptance, every unit gets a hard drive test...}.  This document contains each of the query words once.  Think of the document D4 as a bag of marbles where each document word (unigram) is a different marble in the bag.

Now randomly take a marble out: what's the probability that the marble was one of the query words?  It's the number of times the word appears in the document divided by the number of words in the document.  If D4 is only 20 words long, then the probability of pulling out the word "hard" is 1/20.  The probability of pulling out "drive" or "test" is also 1/20 for each term.  If you're familiar with statistics, this is random sampling WITH replacement (put the marble back in the bag after you pick one), because the probability of pulling out "drive" isn't 1/19 after I've picked my first word.  With that understanding, we can calculate the probability of generating the query from the document we're looking at: it's just the probabilities of pulling out each word multiplied together, or (1/20) x (1/20) x (1/20) = 1/8000.  Using this methodology, you can do the same calculation for every document in the collection; the highest ranked document will be the one that produces the highest probability.  If you want the mathspeak version of this probability, it looks like this, "p(q|d,R=1)".  This basically assumes that all of the documents are relevant (we know they're not) and gives the probability that each one generated your query.
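To make that concrete, here's a rough Python sketch of the unigram query likelihood calculation.  The document text is a made-up 20-word stand-in for D4, and the function name is just for illustration:

```python
from collections import Counter

def query_likelihood(query, document):
    """p(q|d): probability of drawing the query words (with replacement)
    from the document's bag of words under a unigram model."""
    words = document.lower().split()
    counts = Counter(words)
    doc_len = len(words)
    prob = 1.0
    for term in query.lower().split():
        # count of the term in the document divided by the document length
        prob *= counts[term] / doc_len
    return prob

# A 20-word stand-in for D4; each query word appears exactly once
d4 = ("as part of the factory acceptance every unit gets a "
      "hard drive test before we box it up for shipment")
print(query_likelihood("hard drive test", d4))  # (1/20)^3 = 1/8000 = 0.000125
```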

Now that we have that understanding, what happens when a method like the one above runs into a document that doesn't contain one of the query words?  Think about it...the probability calculated will be 0.  That's because if 1 query word is missing from the document, then one of the terms that gets multiplied in will be something like (0/20), which equals 0.  Multiplying anything by 0 gives you 0, so this seems to penalize the document WAY too much.  What if this document had the phrase "hard drive examination" instead of "hard drive test"?  Would we really want to rank that document at the very bottom?  I think not!  There's a way to fix this problem, but I'm going to have to explain some formulas before I do that.

Query Likelihood Formulas
Some words are very rare in documents.  If you happen to be searching for a rare term, then the probability of finding this term will be very small.  This small probability gets multiplied with other fractions, so it's pretty likely that you'll end up with tiny values for your query probability.  In a computer, when values get too close to 0 there is a growing risk that round-off error will become significant and distort your results.  To avoid this, the magical properties of the logarithm come to save the day.  If you look at the chart below for log(x), you will see that as x increases, log(x) also increases (by the way, I assume log base 10 here and through the rest of this post).  It's not a linear increase, but when it comes to ranking things, that's OK: if X is greater than Y, then log(X) is greater than log(Y).  Also notice that the chart spans values where x is between 0 and 1, as probabilities are required to be.


The other thing that is SO cool about the logarithm is that log(A*B) = log(A) + log(B).  If we apply this rule to the probabilities we're multiplying together, then for the example above (1/20) x (1/20) x (1/20) becomes log(1/20) + log(1/20) + log(1/20).  It doesn't look very different here, but this minimizes the round-off problem in a computer.  In mathspeak, when we have a lot of terms that get multiplied together we use a large pi symbol.  The way we calculated the probability of a query being generated from a document above would look like this
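In symbols, with n being the number of words in the query, it should be roughly:

f(q,d) = p(q|d) = \prod_{i=1}^{n} \frac{c(w_i, d)}{|d|}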

f(q,d) is just a ranking function that depends on the query, q, and the document, d.  The fraction to the right is the count of a query word in the document divided by the number of words in the document.  This fraction can also be written as p(wi|d), the probability of word wi given document d.  That BIG pi symbol means that you multiply all of the fractions to its right together, from the first word in the query (i=1) to the last word in the query (there are n words in the query).  If we take the logarithm of this formula, then we get this
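Roughly, the two equivalent forms are (V being the vocabulary, i.e. all the words in the collection):

f(q,d) = \sum_{i=1}^{n} \log \frac{c(w_i, d)}{|d|}

f(q,d) = \sum_{w \in V} c(w,q) \, \log \frac{c(w, d)}{|d|}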


Both of these formulas are equivalent, but before your head explodes, calm down and we'll walk through each one slowly.  For both of them, we still have f(q,d) as the ranking function.  We don't use log(f(q,d)) because there's no need: we've already shown that the order is maintained when we take the logarithm.  In the top equation, we have the same fraction on the right, but we take the logarithm of that fraction, and instead of multiplying all of these fractions from i=1 to n, we add them all up.  That's what the BIG sigma symbol means.  Now, the difference between the first line and the 2nd one is the subscript on the sigma symbol and the term c(w,q).  The subscript on the 2nd sigma means that we sum over all of the words in the vocabulary (all the words in the collection of documents).  The reason we can do this without messing everything up is that we multiply each term by c(w,q), which is the count of the word in the query.  So when 'w' in the summation equals "hard", then c(w,q)=1 and in effect we're saying this one counts/matters.  If the 'w' in the summation equals "banana", which is not part of our query, then c(w,q)=0 and we add nothing to our ranking function.  At this point you may be thinking: why would anybody add that complexity to the equation?  We'll see why this makes some of the notation easier to understand in the upcoming sections.

Language Model Smoothing
If we plot the probability of a word being selected in a document using the query likelihood model on one axis and the word number on a 2nd axis, we might get something that looks like this

As described earlier, if the word doesn't exist in the document, then there is 0 probability that it can be picked.  This is probably not desirable, because there might be a document that matches very closely but does not contain one of the words in the query.  What would be better is if we could adjust the curve a little to look more like the green line below

Notice that near the end of the curve there are non-zero values for p(w|d).  These non-zero values will help solve the problem of query terms that don't show up in a document.  Since this green curve represents the average probabilities for all the words in the document collection, we wouldn't want to just use this curve for p(w|d).  If we did, all of the documents would have the same score and the calculations would be pointless.  What we really want is a way to kind of take a weighted average between the actual document we're ranking and the whole collection of documents.  You can imagine that if we did this, the bumpy blue curve would become much smoother, thus the name "smoothing" for this approach.  One method of doing this is called the Jelinek-Mercer (JM) model.

To get us to the JM model we've got to go through a derivation first.  I've tried to make this derivation as visual as possible.  If you don't really care about how the equation is derived, you can just skip down a little bit.  But, if you're feeling adventurous, here's what I've come up with to explain it.

The top equation is the one we have already explained.  The next step down splits this equation into 2 parts: a weighted probability for the words found in the document and a weighted probability for the words not found in the document, which we estimate from the rest of the collection.  You'll notice that the probability of words found in the document turned into a weird looking Pseen.  I'm going to ask you to just ignore this for now; we'll explain it more later.  On the 3rd line we split the 2nd term of the 2nd line, because summing over all of the words not found in the document is the same as summing over all the words in the collection and then subtracting the sum over the words found in the document.  On the 4th line we use one of the properties of the logarithm to split the alpha term out from the p(w|C) term.  Once we've done all of this, we combine and reorganize all the terms in the last line.  The first term in the last line takes advantage of the fact that log(a)-log(b)=log(a/b).  For the 2nd term, since alpha is a constant, we can simplify the notation using |q|, the count of words in the query.  The last term on the last line is just carried over from the 4th line.
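Putting that last line into symbols, it should come out roughly to this, where alpha (α_d) is the weight placed on the collection language model p(w|C):

f(q,d) = \sum_{w \in d \cap q} c(w,q) \, \log \frac{p_{seen}(w|d)}{\alpha_d \, p(w|C)} + |q| \log \alpha_d + \sum_{w} c(w,q) \, \log p(w|C)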

Now that we have this equation on the last line above, we can adapt it to the JM method of smoothing.  To do this we need to define the Pseen weighting term that showed up in the derivation

The JM method starts by defining how it wants to perform the weighting between the probabilities from the document and the collection.  Essentially it is giving a new definition for p(w|d), or the probability of a word given a document.  Here's how this is defined in the JM method.
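In symbols, the JM smoothed probability should be roughly the interpolation

p(w|d) = (1 - \lambda) \, \frac{c(w,d)}{|d|} + \lambda \, p(w|C)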
Let's start by saying that lambda (λ) is a user-selected variable that ranges between 0 and 1 (it's a different way of writing the alpha from the earlier equations).  The first term is the weighted probability of a word based on data from the document, and the 2nd term is the weighted probability of a word based on the collection of documents.  In the 2nd half of the first term, where we have that fraction, we're just taking the count of the word in the document divided by the count of words in the document.  You can see that if we set lambda to a value close to 1, we will basically just be taking the probability of a word from the document collection.  With this one definition, we can now derive the rest of the JM ranking function.


If you're one of those people who don't care about derivations (it's OK, I used to be one of them), just look at the last equation line and use it.  If not, here's a quick explanation.  The first equation row just uses the definition of Pseen to simplify that fraction a little bit.  Notice that lambda gets substituted for alpha: up until now we have been using alpha as our generic term for the weighting between the document and the collection, and since the JM method defines this weight as lambda, we just swap alpha out for lambda.  After obtaining a simplified fraction for that weird Pseen term, we substitute it into the ranking function in the 2nd line.  We also get to completely remove the last 2 terms of this ranking function because they're constant.  The last term is a probability of a word in the collection, and that will be the same for any document in the collection.  The 2nd to last term is based on the number of words in the query, which is the same for every document we're trying to rank as well.  So, there's no need to calculate these values if we're only interested in ranking documents.  They get the X, and we end up with a simpler equation on the last line.  The sum in this final equation is over the words that appear in both the query and the document.
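That final JM ranking function should come out roughly to

f(q,d) = \sum_{w \in q \cap d} c(w,q) \, \log \left( 1 + \frac{1-\lambda}{\lambda} \cdot \frac{c(w,d)}{|d| \, p(w|C)} \right)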

Now that we have this equation, let's finally do a simple example with it to show how to use it.  Let's say that we have the same query and documents used in the post about the vector space model:

q = {hard drive test}
D1 = {...it's hard to determine...};  |D1|=365
D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...};  |D2|=50
D3 = {...before I bought my new car, I took it out for a test drive...};  |D3|=75
D4 = {...as part of the factory acceptance, every unit gets a hard drive test...};  |D4|=50
D5 = {...that was a hard test...a standardized test is designed to be hard for...};  |D5|=230

To do all of the calculations we need to know the probability of finding the query words in the collection of documents.  We take the count of the query words in all the documents and divide them by the total number of words in our collection.  For example, the word "hard" shows up 5 times in our collection of documents, and there are 365+50+75+50+230=770 words in the collection.  So the probability of "hard" in the collection is 5/770 = 0.006494 = p("hard"|collection).  If we do the same for drive and test we get p("drive"|collection) = 0.005195 and p("test"|collection) = 0.005195.

The only thing left before we rank some documents is picking a value for lambda.  Let's just say that we use 0.5 for this example.  For the first document we would get the following.


Notice that I only included 3 terms here because the other terms have values of c(w,q) that are equal to zero.  This is because there are only three query terms.  All of the other words in the collection aren't in the query, so their value is zero.  This is actually a big weakness of the JM method.  It doesn't actually solve the problem where there are terms missing from our query.  Instead, it smooths out the probability of the terms that are in our query with the probability from the collection.  To solve the problem where we don't include a term in our query, you have to use another method like Dirichlet Prior or BM25.  These are examples of other smoothing and ranking methods that the data mining community has created.  If we continue our example using the JM method, we get the following ranking values for the documents in our collection:
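Here's a rough Python sketch that carries out this calculation for the example collection.  It computes the full smoothed log-likelihood over the query terms, which is rank-equivalent to the simplified formula above (the dropped terms are the same for every document), and it only needs the query-term counts and document lengths from the example:

```python
import math

LAMBDA = 0.5  # user-selected weight on the collection model

# query-term counts visible in each example document, plus document lengths |d|
docs = {
    "D1": ({"hard": 1}, 365),
    "D2": ({"hard": 1, "drive": 2}, 50),
    "D3": ({"drive": 1, "test": 1}, 75),
    "D4": ({"hard": 1, "drive": 1, "test": 1}, 50),
    "D5": ({"hard": 2, "test": 2}, 230),
}

# collection probabilities p(w|C) out of the 770 words in the whole collection
p_coll = {"hard": 5 / 770, "drive": 4 / 770, "test": 4 / 770}

def jm_score(query_terms, term_counts, doc_len, lam=LAMBDA):
    """Sum over query words of log10[(1-lam)*c(w,d)/|d| + lam*p(w|C)]."""
    score = 0.0
    for w in query_terms:
        p_doc = term_counts.get(w, 0) / doc_len
        score += math.log10((1 - lam) * p_doc + lam * p_coll[w])
    return score

for name, (term_counts, doc_len) in docs.items():
    print(name, round(jm_score(["hard", "drive", "test"], term_counts, doc_len), 3))
```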
We can sort these scores from largest to smallest and output the documents to the user.  That's basically how it all works.  I think that wraps it up. If you have any other specific questions about this method, please say so in the comments and I'll see what I can do to augment this explanation to cover it.  As always I hope this helped somebody out there!


Saturday, March 28, 2015

Simple Text Retrieval Vector Space Model Explanation

This post is going to be the beginning of my coverage of the content in the Text Retrieval and Search Engines course taught by Chengxiang Zhai at the University of Illinois through Coursera.  I'm going to review the basic vector space model for text retrieval, and hint at some of the evolutions of this model and why they're important.

Vector Space Model
I'm not sure how many of you out there took linear algebra courses, or know much about vectors, but let's discuss this briefly; otherwise you'll be completely lost.  A vector space doesn't look like outer space; it looks more like this if you look at a simple 2-dimensional vector.

The vector here starts at the origin of the graph and then extends out to the point (1,1).  In text retrieval, the 2 axes of the graph might represent 2 search terms you're looking for (think googling "hard drive").  The vector on the graph represents the query itself.  Here, the fact that the vector points to (1,1) means that the query contains 1 instance of the term "hard" and 1 instance of the term "drive", and this could also be represented as {hard, drive}.  On the same graph, we can add documents that are in our database (which could also be the world wide web).  There are a couple of ways to do this, but we'll start simple.  The easiest way is just to search through the document(s) and determine if each search term is present in the document.  If it is, assign that term a 1; if not, assign it a 0.  This is called the bit vector approach.  Assume that the following represents 2 documents with only the relevant text included.

   D1 = {...it's hard to determine....}
   D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}

Using the 1 and 0 rule explained above, I would assign D1 the vector (1,0) and D2 the vector (1,1).  In this version of text retrieval these vectors are called bit vectors.  Plotting these on the vector space graph we get this.

'q' here represents our "query".  You can see that D2 matches the query perfectly using this system, while D1 seems to be going in a different direction.  Data scientists use this "closeness" of the vectors to measure similarity between the search terms and the documents reviewed.  To do this, they use the dot product.  A dot product is just a fancy way of saying you multiply the terms together element by element and then add up the results.  If you're familiar with MS Excel, this is like using the SUMPRODUCT() function.  For our example above, the dot product between the query and D1 is 1x1 + 1x0 = 1.  For D2 we get 1x1 + 1x1 = 2.  The dot product value can be used to rank the documents.  "Closeness" is scored like a basketball game, the higher the better, so you can see that D2 ranks higher than D1.
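Here's a tiny Python sketch of that bit vector dot product (the function name is just for illustration):

```python
def dot(a, b):
    """Dot product: multiply element by element, then sum (like SUMPRODUCT)."""
    return sum(x * y for x, y in zip(a, b))

q  = (1, 1)  # query {hard, drive}
d1 = (1, 0)  # D1 contains "hard" but not "drive"
d2 = (1, 1)  # D2 contains both terms

print(dot(q, d1))  # 1
print(dot(q, d2))  # 2 -> D2 ranks higher than D1
```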

Now I want to step back and explain the dot product a little more so you can get some intuition about why it works well for measuring the similarity of vectors.  If you look up the definition of a dot product on Wikipedia, you'll find that the dot product (A·B) can also be calculated this way: A·B = ||A|| ||B|| cosθ.  The ||A|| represents the length of the vector A.  If you remember back to trigonometry, the length of the hypotenuse of a right triangle is calculated like this
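For a 2-dimensional vector this is just the Pythagorean theorem, roughly

||A|| = \sqrt{A_x^2 + A_y^2}

so, for example, the length of the vector (1,1) is \sqrt{1^2 + 1^2} = 1.414...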

If you use this thinking and apply it to vector space, you can see that the query and the documents form right triangles, so we can calculate the lengths of the vectors using this method.  When we do this we get ||D1||=1 and ||D2||=||q||=1.414..., where ||q|| represents the length of the query vector.  Just because the lengths are equal doesn't mean that the vectors are the same.  That is why the dot product definition includes the cosθ term.  If the value of theta (θ) is close to 0, then cosine is close to 1; as theta gets closer to 90 degrees, cosine gets closer to 0.  So, if you have two vectors that point in almost the same direction, the cosine between them will be ~1.  This allows the full ||A|| ||B|| product to contribute to the relevance score.  If the two vectors are 90 degrees apart from each other, the ||A|| ||B|| gets multiplied by something close to 0, essentially eliminating the effect.

Now that we have that simple understanding of vectors and dot products, let's look at this in 3 dimensions.  In the diagram below we have a 3 word query that creates 3 dimensions to compare documents against.  Here we're assuming the query is {hard, drive, test}.

If we look at several documents, we might have documents that contain the following text:
D1 = {...it's hard to determine...}
D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}
D3 = {...before I bought my new car, I took it out for a test drive...}
D4 = {...as part of the factory acceptance, every unit gets a hard drive test...}
D5 = {...I think I failed, that was a hard test...a standardized test is designed to be hard for...}

When I put these documents in the 3 dimensional vector space diagram it looks like this

You can see the angles between the query vector and the document vectors and get a feel for which documents match closely.  The actual dot product calculations for these query/document comparisons turn out like this using the binary bit vector approach described earlier.

Based on the dot product values, it's obvious that D4 is the best match.  You can also see this in the last column that gives you the angle between the vectors.  To calculate this angle I had to calculate the length of the vectors. This is VERY similar to doing it in 2 dimensions.  3D length is calculated like this
A, B, and C are the perpendicular lengths in the vector space (the distances along the hard, drive, and test dimensions).  You just square them all, add them up, and take the square root.  This same principle applies to higher dimensions (e.g. the length of a 4 dimensional vector is the square root of the sum of the 4 squared terms/dimensions).
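In symbols, the 3D length should be roughly

||D|| = \sqrt{A^2 + B^2 + C^2}

so, for example, the bit vector for D4 is (1,1,1) and its length is \sqrt{1^2 + 1^2 + 1^2} = 1.732...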

Term Frequency
The bit vector (1's and 0's) approach is a very simplistic model, however, and there are times when this approach doesn't give you very good results.  Let's do the same problem over again with what is called the term frequency approach.  This method counts the frequency of each of the query terms found in the document, instead of just determining if the term exists in the document or not.  If we apply this rule to the problem above, the vector space looks like this

And the overall dot product calculation table looks like this

In bit vector mode, D2, D3, and D5 were all considered equally relevant.  Using term frequency, we are able to say that D5 is a better match than D2, and D2 is a better match than D3.  The problem, though, is that now D5 looks like it's a better match than D4, which doesn't make sense.  One method to deal with this problem is to discount the score impact of terms that occur frequently across documents.  In our case, if we discounted "hard" or "test", we might find that D4 comes out on top again.  This method is called inverse document frequency, and I'll write another post about that later.  When I do, the link to it will be HERE.
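Here's a quick Python sketch comparing the two scoring modes on the five example documents; the vectors just encode the term counts visible in the document snippets above:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = (1, 1, 1)  # (hard, drive, test)

# term frequency vectors taken from the visible document text
tf = {
    "D1": (1, 0, 0),
    "D2": (1, 2, 0),
    "D3": (0, 1, 1),
    "D4": (1, 1, 1),
    "D5": (2, 0, 2),
}

for name, vec in tf.items():
    bit_vec = tuple(1 if c > 0 else 0 for c in vec)
    print(name, "bit vector:", dot(query, bit_vec), "term frequency:", dot(query, vec))

# bit vector: D4 = 3 beats D2 = D3 = D5 = 2
# term frequency: D5 = 4 now outranks D4 = 3, which is the problem IDF addresses
```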

As always, I hope this post helps somebody out there.