Saturday 8 October 2016

Co-occurrence .. Refined ... Part 1/2

Many of our recent experiments have been based on the idea of co-occurrence .. two words appearing right next to each other.

We assumed that if two words co-occurred they must be related.

That's a powerful theory .. but it isn't perfect. Two key problems are:

  • Interesting words often sit next to boring words .. "apple in", "slice the", "John and" ...
  • Related words are often not right next to each other, but instead 2 or 3 words apart .. "apple and banana", "Hillary spoke to Obama" .. the related pairs apple/banana and Hillary/Obama are separated by other words.


What do we do?



Longer Reach Co-Occurrence

We don't need to throw away our notion of co-occurrence. All we need to do is make it a tiny bit more sophisticated to take into account the above observation that related words are sometimes a short distance apart.

There are loads of ways we can do this .. but as always, let's start simple, and only make things more complex if we need to.

Option 1: One or Two Words Apart
The simplest idea is to extend the reach of co-occurrence from words being right next to each other to allowing one word between them:


What we're saying here is that if two words are right next to each other, or they have one word between them, then we consider them to be co-occurrent.
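To make that concrete, here's a minimal sketch in Python of how such counting might work .. illustrative code with names of my own choosing, not the toolkit's actual implementation:

```python
from collections import Counter

def cooccurrence_up_to_2(words):
    # count pairs that are adjacent (offset 1) or have
    # exactly one word between them (offset 2)
    counts = Counter()
    for i, word in enumerate(words):
        for offset in (1, 2):
            if i + offset < len(words):
                counts[(word, words[i + offset])] += 1
    return counts

words = "slice the apple and banana into pieces".split()
print(cooccurrence_up_to_2(words).most_common(3))
```

With this, the pair (apple, banana) is counted even though "and" sits between them.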

That would help with the problem of boring words sitting in between, and also with interesting related words being once removed.

Why stop at one word between them? ...

Option 2: One, Two, Three, Four ... N Words Apart
Why not have 2 words between co-occurrent words? Or 3? Or 4 .. maybe N words?
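The counting sketch above generalises directly .. the window size N just becomes a parameter (again, an illustration rather than the toolkit's own code):

```python
from collections import Counter

def cooccurrence_up_to_n(words, n):
    # count pairs of words that are at most n positions apart
    counts = Counter()
    for i, word in enumerate(words):
        for offset in range(1, n + 1):
            if i + offset < len(words):
                counts[(word, words[i + offset])] += 1
    return counts
```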


That would deal with related words that were further apart... but... that also begins to strain the original theory that related words are close together.

So we need to somehow give more credibility to the idea that words that are closer together are related, and less credibility to the idea that words that are further apart are related.

That's not to say words that are far apart can't be related .. it just means that, without any other clues, and given a large corpus of text, we're more likely to believe that more closely co-occurrent words are related.

Option 3: Weights for N Words Apart
We can easily translate this idea into a calculation:

  • Set a window of length N, which limits how far apart two words can be and still be considered co-occurrent.
  • Use a maths function, like exp(-x²), which gives more numerical weight to closer words than to more distant ones.
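Here's a minimal sketch of that weighted scheme. One detail the formula leaves open is exactly what x measures .. I've assumed here that x is the number of words sitting between the pair, so adjacent words get the full weight of exp(0) = 1:

```python
import math
from collections import Counter

def weighted_cooccurrence(words, n):
    # score pairs up to n positions apart, weighted by exp(-x^2),
    # where x is the number of words between the pair (an assumption)
    scores = Counter()
    for i, word in enumerate(words):
        for offset in range(1, n + 1):
            if i + offset < len(words):
                x = offset - 1
                scores[(word, words[i + offset])] += math.exp(-x ** 2)
    return scores
```

With that choice, adjacent words score 1.0, once-removed words exp(-1) ≈ 0.37, and twice-removed words exp(-4) ≈ 0.02 .. the weight falls away quickly, which is exactly the behaviour we wanted.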


That looks like a good way forward ... let's try it ...



Results

The following shows the resulting force-directed graph for our Italian Recipes mini-corpus, using the above extended-reach co-occurrence with a window of 3 (words can be up to 3 apart).

Here's a plot of the top 500 words by co-occurrence ..


It's a bit busy .. but we can see some nodes which are connected to many others .. sadly, they're the boring stop-words.

Here's the same data, but only showing the smaller set of words with a higher co-occurrence value .. more than 4.0 here.
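Filtering like that is just a matter of dropping the weaker pairs .. a one-line sketch, with 4.0 as the threshold used here:

```python
# keep only pairs whose accumulated score exceeds the threshold
strong_pairs = {pair: score for pair, score in scores.items() if score > 4.0}
```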


Yup - that makes it even clearer that we need to deal with those pesky stop-words.

As a bit of a short-cut, we'll use the rather brutal method of removing stop-words we developed earlier, tmt.word_processing.remove_stop_words().
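For readers following along without the toolkit, the effect is a simple set-membership filter .. a stand-in sketch, with a tiny illustrative stop-word list (the real tmt function and its word list may differ):

```python
# tiny illustrative stop-word list .. the toolkit's list is much longer
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "into", "with"}

def remove_stop_words(words):
    # drop any word that appears in the stop-word set
    return [w for w in words if w.lower() not in STOP_WORDS]
```

Here's the result: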


This is really much better now that we've cleared out the stop-words. We can see words that should be related, for example:

  • bread -- crumbs
  • salt -- pepper
  • tomato -- sauce, tomato -- paste, white -- sauce
  • cut -- pieces
  • cold -- water
  • olive -- oil
  • egg -- yolk

That validates our idea .. and gives us the clearest insights yet ... we just need to deal with those stop-words!

