Saturday, March 28, 2015

Simple Text Retrieval Vector Space Model Explanation

This post is going to be the beginning of my coverage of the content in the Text Retrieval and Search Engines course taught by Chengxiang Zhai at the University of Illinois through Coursera.  I'm going to review the basic vector space model for text retrieval, and hint at some of the evolutions of this model and why they're important.

Vector Space Model
I'm not sure how many of you out there took linear algebra courses, or know much about vectors, but let's discuss this briefly, otherwise you'll be completely lost.  Vector space doesn't look like outer space, it looks more like this if you look at a simple 2-dimensional vector.

The vector here starts at the origin of the graph and then extends out to the point (1,1).  In text retrieval, the 2 axes of the graph might represent 2 search terms you're looking for (think googling "hard drive").  The vector on the graph would represent the query itself.  Here, the fact that the vector points to (1,1) means that the query contains 1 instance of the term "hard" and one instance of the term "drive and this could also be represented as {hard, drive}.  On the same graph, we can add documents that are in our database (could also be the world wide web).  there are a couple ways to do this, but we'll start simple.  The easiest way to do this is just search through the document(s) and determine if the search terms are present in the document.  If they are, assign that term a 1, if not assign it a 0.  This is called the bit vector approach.  Assume that the following represents 2 documents with only the relevant text included.

   D1 = {'s hard to determine....}
   D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}

Using the 1 and 0 rule explained above, I would assign D1 the vector (1,0) and D2 the vector (1,1).  In this version of text retrieval these vectors are called bit vectors.  Plotting these on the vector space graph we get this.

'q' here represents our "query".  You can see that D2 matches the query perfectly using this system.  D1 seems to be going in a different direction.  Data scientists use this "closeness"of the vectors to measure similarity between the search terms and the documents reviewed.  To do this, they use the dot product.  A dot product is just a fancy way of saying you multiply the terms together and then add up the sum.  If you're familiar with MS Excel, this is like using the sumproduct() function.  For our example above, the dot product between the query and D1 is 1x1+1x0=1.  For D2 we get 1x1+1x1=2.  The dot product value can be used the rank the documents.  "Closeness" is ranked like a basketball score, the higher the better, so you can see that D2 ranks higher than D1.

Now I want to step back and explain the dot product a little more so you can get some intuition behind why the dot product works well for measuring the similarity of vectors.  If you look up the definition of a dot product on Wikipedia you'll find that the dot product (AB) can also be calculated this way AB = ||A|| ||B|| cosθ.  The ||A|| represents the length of the vector A.  if you remember back to trigonometry.  The length of the hypotenuse of a right angle is calculated like this

If you use this thinking and apply it to vector space, you can see that for the query and the document we have right angle triangles and we can calculate the length of the vectors using this method.  When we do this we get ||D1||=1, and  ||D2||=||q||=1.414... where ||q|| represents the length of the query vector.  Just because the lengths are equal doesn't mean that the vectors are the same.  That is why the dot product definition includes the cosθ term.  If the value of theta (θ) is close to 0, then cosine is close to 1; as theta gets closer to 90 degrees, cosine gets closer to 0.  So, if you have two vectors that are pointed in almost the same direction, then the cosine between them will be ~1.  This allows the ||A|| ||B|| to add to the relevance score.  If the two vectors are 90 degrees apart from each other the ||A|| ||B|| will get multiplied by something close to 0, essentially eliminating the effect.

Now that we have that simple understanding of vectors and dot products lets look at this in 3 dimensions.  In the diagram below we have a 3 word query that creates 3 dimensions to compare documents against.  Here we're assuming the query is {hard, drive, test}.

If we look at several documents, we might have documents that contain the following text:
D1 = {'s hard to determine...}
D2 = {...this hard drive has 100GB of memory...make sure the drive is fully installed...}
D3 = {...before I bought my new car, I took it out for a test drive...}
D4 = { part of the factory acceptance, every unit gets a hard drive test...}
D5 = {...I think I failed, that was a hard test...a standardized test is design to be hard for...}

When I put these documents in the 3 dimensional vector space diagram it looks like this

You can see the angles between the query vector and the document vectors and get a feel for which documents match closely.  The actual dot product calculations for these query/document comparisons turn out like this using the binary bit vector approach described earlier.

Based on the dot product values, it's obvious that D4 is the best match.  You can also see this in the last column that gives you the angle between the vectors.  To calculate this angle I had to calculate the length of the vectors. This is VERY similar to doing it in 2 dimensions.  3D length is calculated like this
A, B, and C are perpendicular lengths in the vector space (distance along hard, test, and drive dimensions).  You just square them all and take the square root.  This same principle applies to higher dimensions (e.g. the length of a 4 dimensional vector is the square root of sum of the 4 squared terms/dimensions).

Term Frequency
The bit vector (1's and 0's) approach is a very simplistic model however, and there are times when this approach doesn't give you very good results.  Let's the do the same problem over again with what is called the term frequency approach. This method counts the frequency of each of the terms found in the query, instead of just determining if the term exists in the document or not.  If we apply this rule to the problem above, the vector space looks like this

And the overall dot product calculation table looks like this

In bit vector mode, D2, D3, and D5 were all considered equally relevant.  Using term frequency we are able to say that D5 is a better match than D2, and D2 is a better match than D3.  The problem though is that now D5 looks like it's a better match than D4, which doesn't make sense.  One method to deal with this problem is to discount the score impact of terms that occur frequently in documents.  In our case, if we discounted "hard", or "test", we might find that D4 comes out on top again.  This method is called inverse document frequency and I'll write another post about that later.  When I do, the link to it will be HERE.

As always, I hope this post helps somebody out there.