Wednesday 24 August 2016

Indexing, Search and Relevance - From Scratch!

A very common thing that people want to do with text is to try to find things in it. That's called search, and we're going to make our own search engine!

Of course, we'll start with a very basic one, but the core ideas are the same across complex search engines.

We deliberately won't be using a third-party library to do this - we'll do it from scratch ourselves using basic Python, because that's the best way to learn.


The Index at the Back of the Book

Let's start with a familiar example. Almost everyone has used a book which has an index at the back. Here's an example:


You can recall how easy it is to use. If I wanted to find the page(s) mentioning the word "ant" I'd follow the Ant entry along to find that it occurs on pages 4 and 5. Easy!

Some things, like "boat", only appear once in the book, on page 2. Other things, like "banana", appear on several pages: 4, 5 and 8.

So using the index is easy .. but how do we make an index in the first place?

Let's stay with our book example ... we go through each word, in each sentence, on each page, and for every interesting word, we make a note of which page it appeared on. As we scan through the pages, we'll be building up the index. In the above example, we'll encounter the word "ant" on page 4, ... then again on page 5, so we'll update the index to add that second occurrence. By the time we've reached the end of the last page, we'll have a full index.

We do have to examine every word on every page... which is a bit laborious and boring. Luckily, we have computers to do laborious work for us.
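That page-by-page scan can be sketched in a few lines of Python. The pages and words below are made up to match the example above:

```python
# hypothetical example data: page number -> interesting words on that page
pages = {
    4: ["ant", "banana"],
    5: ["ant", "banana"],
    2: ["boat"],
    8: ["banana"],
}

# scan every word on every page, noting which page each word appeared on
index = {}
for page_number, words in pages.items():
    for word in words:
        if word not in index:
            index[word] = []
        index[word].append(page_number)

print(sorted(index["ant"]))     # [4, 5]
print(sorted(index["banana"]))  # [4, 5, 8]
```

By the time the loop reaches the last page, the index is complete - exactly the process we described for the book.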


Indexing Text

So let's apply the same ideas we use for indexing a real book to our computer world.

We will again scan through the text, word by word, and keep a note of where we found each word. The following illustrates this:


You can see the word "saw" being scanned and entered into the index. You can also see the word "tree" being entered into the index too.

But have you noticed a difference from the real book example? Instead of noting down which page we found a word on, we note down which document it came from. What's happening?

We actually have a choice about which one we do. It depends on how useful noting down the page, or the document, will be. If we only have one document, a single book, then noting down the page is enough. Some documents don't have pages at all, so we can't do that anyway. For now, let's keep things simple and just note which document in a collection a word was found in.


Searching

Once we've done the hard work to build an index, searching is super easy. We just look it up in the index!

Using the example above, if we wanted to search for the word "tree", we'd look it up in the index .. follow the dots along .. and see that it was in "document 1". Easy peasy!
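As a sketch, searching really is just a dictionary lookup. The index contents here are made up to match the example above:

```python
# a tiny made-up index: word -> list of documents it was found in
index = {
    "saw":  ["document 1"],
    "tree": ["document 1"],
}

# a hit: follow the entry along to the matching documents
print(index["tree"])            # ['document 1']

# a miss: .get() lets an unindexed word return an empty list instead of an error
print(index.get("banana", []))  # []
```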

Now some readers will say that search isn't that easy, and is in fact very complicated and sophisticated. Well, yes, it can be complicated and sophisticated ... but what we've just done is very simple, works, and is at the core of the more advanced methods.


Relevance

One problem that we will try to solve, even at this very early stage, is the problem of search result relevance.

Imagine doing a search for the word "banana", and the results coming back telling us that it is to be found in 258 documents. That's cool, and very thorough ... but it doesn't help us decide which documents might be relevant to us, or which ones to start looking at first.

This is actually quite a hard problem which nobody has solved perfectly .. but, we can apply a very simple idea that works quite well in sorting these 258 results in order of relevance.

Let's walk through the thinking ourselves ... imagine documents that are actually about bananas. Then imagine documents that mention a banana, but only do so in passing, and aren't actually about bananas. It's a good bet that documents actually about bananas will mention the word several times .. many times.

Maybe we can have a theory that says, the more a word is mentioned in a document, the more likely that document is about that word. Sounds reasonable enough!

If we believe this theory, then we can sort our search results with the documents that had the word many times at the top. Even if we don't believe this theory is perfect, we can still imagine that sorting in this way will give good results on many occasions.

So what do we need to change in our simple indexing and search steps above?

Well, when we're indexing, we should keep track of each and every occurrence of a word, even if it was in the same document. This way we keep a count of how often a word appears in a document, not just the fact that it did. Look at the following, which illustrates this:


You can see the word "owl" is noted in the index as appearing three times in document 1, and once in document 2. The word "tree" appears just once in each document.

If we did a search for the word "owl" you can see that putting document 1 above document 2 would be helpful.
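The owl example can be written down directly as a small index that records every occurrence. The data below just mirrors the example:

```python
# an index that records every occurrence of a word (data mirrors the owl example)
index = {
    "owl":  ["document 1", "document 1", "document 1", "document 2"],
    "tree": ["document 1", "document 2"],
}

# document 1 mentions "owl" three times, document 2 only once,
# so document 1 should be ranked higher for a search for "owl"
hits = index["owl"]
print(hits.count("document 1"))  # 3
print(hits.count("document 2"))  # 1
```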

So we have a very simple, but fairly effective, way of ranking (sorting) the results by relevance.

Some Simple Code

Let's look at some simple code, and talk about it.

Building the Index
The following shows the Python code that creates the index. It's very simple .. aside from the import, it really is just a few lines of code!!

import collections

# start with an empty index: each word will map to a list of document names
index = collections.defaultdict(list)

# update the index: note the document's name once per occurrence of each word
for word in doc_words_list:
    index[word].append(document_name)
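To see those lines in action, here's a self-contained sketch. The document names and word lists are made up; in the real code they'd come from reading files:

```python
import collections

index = collections.defaultdict(list)

# hypothetical documents: name -> the words found in them
documents = {
    "01.txt": ["owl", "saw", "owl", "tree", "owl"],
    "02.txt": ["owl", "tree"],
}

# index each document in turn, noting every occurrence of every word
for document_name, doc_words_list in documents.items():
    for word in doc_words_list:
        index[word].append(document_name)

print(index["owl"])   # ['01.txt', '01.txt', '01.txt', '02.txt']
print(index["tree"])  # ['01.txt', '02.txt']
```

Notice that "owl" is recorded three times for 01.txt - one entry per occurrence, which is exactly what we'll need for relevance ranking.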

So what's going on here?
Let's start with the idea of a Python dictionary which allows us to store values and associate them with keys. Here's an example:

d = {"John": "London", "Mary": "New York"}

which associates John with London, and Mary with New York.  We can query the dictionary like this:

d["John"]
"London"

You can see how a dictionary could work as an index .. instead of John we have the word, and instead of London we have the document where it exists. In fact we need a list because there may be many documents that a word exists in.

So that update step works through every word in the list of words supplied and adds it to the index, using the word as the key and the document name as the value - just like a normal Python dictionary. So what's the append? Well, a normal Python dictionary would complain with a KeyError if we tried to look up, and append to, a key that didn't exist. This is where defaultdict comes in handy. The first time we touch a new key, it creates a brand new entry with an empty list as the value, ready for us to append to. You can read more about defaultdict in the Python documentation.
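Here's a quick illustration of the difference between a plain dict and a defaultdict:

```python
import collections

# a plain dictionary complains if we try to append to a key that doesn't exist
plain = {}
try:
    plain["ant"].append(4)
except KeyError:
    print("KeyError - no entry for 'ant' yet")

# a defaultdict quietly creates a brand new empty list for us to append to
auto = collections.defaultdict(list)
auto["ant"].append(4)
print(auto["ant"])  # [4]
```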

That first line simply sets the type for the index and tells Python that we want the values to be lists, rather than, say, numbers.

All that in just a few lines of elegant code!

Querying the Index
The following shows the code for querying the index.

# query the index for a single search word
def search_index(index, search_query):
    # do query - look up the list of matching documents
    matching_documents = index[search_query]
    # count occurrences of each document in the matches
    matching_documents_counter = collections.Counter(matching_documents)
    # return matching documents, ordered by most occurrences first
    return matching_documents_counter.most_common()

The first line is easy .. we simply query our index using the search query string. Remember the index is basically a dictionary, so we provide the key, and get back the value ... which is the list of matching documents. Remember the reason why it is a list - a word could be found in many documents.
The next line uses another cool tool from the Python collections toolset. The Counter() simply counts the number of times items appear in the list. In this case the matching documents. So if document 2 appears 5 times, we'd get a count of 5. Then we return that tally, ordered by the most common first. Simple!
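Here's Counter in action on a made-up list of matching documents, like the one our index lookup returns:

```python
import collections

# made-up lookup result: one entry per occurrence of the search word
matching_documents = ["06.txt", "05.txt", "06.txt", "04.txt", "06.txt", "05.txt"]

# tally the occurrences and order by the most common first
ranked = collections.Counter(matching_documents).most_common()
print(ranked)  # [('06.txt', 3), ('05.txt', 2), ('04.txt', 1)]
```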

Here's an example output:

tmt.index_search.search_index(cr.content_directory, "rice")
[('06.txt', 5), ('05.txt', 4), ('04.txt', 1)]

Here we've searched for the word "rice". The results show that it occurs in documents 06.txt, 05.txt and 04.txt. You can also see that it occurs 5 times in 06.txt, 4 times in 05.txt and only once in 04.txt.
So the first result 06.txt is most likely to actually be about rice. Yummy!


GitHub

The full code is on GitHub, together with a notebook illustrating how these functions can be used:



Enjoy!
