Monday, 12 September 2016

Multi Word Search Queries

We've already developed a simple way of searching an index to see which documents contain a word.

We even refined the way we want the search results, first prioritising the documents which the highest word count, and then by using our more sophisticated relevance measure.

The challenge now is .. how do we handle queries with more than one word?



More Than One Query Word

Let's have a think about this. Imagine we have 3 documents and an small index of only 3 words, shown here:


You can see, for example, that the word rice doesn't appear in doc1. You can see that it appears in doc2 with a high relevance score. Doc1 seems to feature cake a lot!

What do we want to happen when we give our computer a search query with more than one word - say "rice cake"? That's not a simple question. So let's take it slow.

We don't want our toolkit to ignore one word or the other. That's why we provided both. Good - that's a start, at least.

Do we want our toolkit to consider each word on its own, and return the two sets of results? So we could see the scores for the word rice, and note that doc2 came top, and similarly see that that doc1 came top for the word cake. We could then .. maybe order doc1 above doc2 because the score for cake was 0.5, higher than the 0.4 for rice.

Well ..  that could work, and it's not a terrible start ... but why did we provide two words in the first place? Because we wanted to find results which were related to both words together.

That's our first clue as to how to handle the query. We want to prioritise documents which are related to both search query words - both "rice" and "cake" in our example.

How do we do this? Well. we could look at an index based on 2-grams and try to find "rice cake" in there. That's actually a very good answer - but fails if the two words are not always right next to each other. And anyway, we often find ourselves having to work with an index of 1-grams not 2-grams, or 3-grams, etc ...

Let's have another go using the 1-gram index above. Could we combine these scores in some way that reflects our intention when we issues the search?



Combining Scores

As always, let's start with the simplest thing we can do, and only make things more complex if we really really need to.

A very simple thing we can do is simply add the scores that each query word has for a document.

Have a look again at the index:


Using the words rice and cake:

  • doc1 has a combined score of 0.0 + 0.5 = 0.5
  • doc2 has a combined score of 0.4 + 0.1 = 0.5
  • doc3 has a combined score of 0.3 + 0.3 = 0.6


So this means doc3 wins! ... not doc2 which had the highest score for rice alone, and not doc1 with had the highest score for cake.

You can see how, doc1 might have been abut cakes, but maybe not rice cakes. And doc2 might have been a rice dish but not rice cakes.

So let's try this idea of adding relevance scores for each search word in a query ... often called search terms.



Code

The code to implement this is really simple .. which shows you the power and ease of Python and pandas.

# query string to list of search terms
search_query_list = search_query.split()

# do query
documents = relevance_index.loc[search_query_list]

# sum the scores
results = documents.sum()

# filter out those with score of zero
results = results[results > 0]
return results.sort_values(ascending=False)

The query is still a simple lookup of the index data frame. The sum is also really simple. The results can contain zeros (word not present) so we get rid of those, again very simple. We finally sort the results in descending order, so the highest scores are first.



Results

Let's try it and see what happens:

  • A search for "saffron rice" places 05.txt top for both indices. That's good because it is in fact the recipe for "rice with saffron".
  • A search for "bread soaked in milk" results in both indices returning 17.txt and 03.txt which both contain bread (crumbs) soaked in milk. But the relevance index returns 00.txt as the top result. This is not ideal, especially as the document doesn't contain the word milk but is boosted because the document contains a high concentration of the word bread

That second observation suggests we need to refine the combination of scores. We'll come back to this later, as we've made a good start for now.

No comments:

Post a Comment