Loads of Tools, But No Shed
A survey of the many guides out there will give you lots of methods for processing text to gain some insight into its meaning. There will be methods like:
- word frequency
- document clustering
- co-occurrence matrices
- synonym searching
- ... and so on
It really feels like there are lots of methods, or tools, on offer for you to use, but there doesn't seem to be an overall idea or theory which ties them all together. To put this another way, there is no larger conceptual framework within which we can place each of these tools - so we can see which is appropriate to use, and when.
Natural Language is Messy
This isn't surprising - because the data, the natural language text itself, isn't a mathematically precise and consistent thing. Human language, and the way we use it, is a messy, organic, incomplete and inefficient scheme ... never designed for crisp, perfectly precise computation. This leaves us with conceptual frameworks which fall into two camps:
- probabilistic - ignoring any underlying structure and simply working with the likelihood of an answer, based on counting how often it has happened before
- structural - trying to make use of underlying structure - either known to be true, or suspected to be true - to find answers
Theory -> Model -> Method
With this book, which is firmly focused on being accessible to those new to the subject, we won't present methods like the above in an ad-hoc fashion. That would be unsettling, and wouldn't really give you confidence that we know what we're doing. Instead we'll try really hard to follow a pattern for each idea:
- Start with a Theory - an idea that we think is true, or even know to be true.
- Use this theory to define a Model which is useful in a computational sense.
- Explain the Method, or algorithm, that we use to do calculations with that model.
This should be much better than simply throwing a load of methods at readers. Even if we fail to fit every tool into a nice, complete theory of natural language - we can at least explain what model a method is supposed to work with, and which theory (truth or assumption) that model is trying to reflect.
This transparency also allows you to improve on a method for a model, and come up with a better one. Or to have more than one method for the same model (e.g. gradient descent vs random search). It even allows you to disagree with the model chosen for a theory, and come up with your own (e.g. frequent words vs rare words). It's even possible to have more than one theory, each one humbly not trying to be all-encompassing, but instead targeting a specific part of natural language.
Example
Here's a simple illustration of the above idea.
- Theory: The key concepts of a document are the ones mentioned a lot.
- Model: Word frequency indicates the key concepts.
- Method 1: Count the occurrences of each word; the most frequent words are the most meaningful concepts.
- Method 2 (alternative): Count the words, but include their synonyms in the counts too.
The second method is a refinement of the first.
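To make this concrete, here's a minimal sketch in Python of both methods. The sample text and the tiny one-entry synonym map are invented purely for illustration - a real synonym list might come from a thesaurus:

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat by the cat"
words = text.lower().split()

# Method 1: count word occurrences - the most frequent words
# are taken to be the key concepts
counts = Counter(words)
print(counts.most_common(3))    # [('the', 4), ('cat', 2), ('sat', 2)]

# Method 2: as Method 1, but fold synonyms into a single concept.
# This one-entry map is a toy stand-in for a real thesaurus.
synonyms = {"dog": "cat"}
merged = Counter(synonyms.get(w, w) for w in words)
print(merged.most_common(3))    # [('the', 4), ('cat', 3), ('sat', 2)]
```

Notice how "the" wins in both cases - which leads neatly into the objection below.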
You may agree with the theory but disagree with the model - saying that it is not word frequency which best models that theory. You may point out that words like "the", "and" and "or" will occur most frequently, and that these words don't indicate any concept at all. This may lead you to an alternative model, such as words which appear rarely within a paragraph, but do appear in many paragraphs - which counteracts the distorting effect of boring words like "and".
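Here's a rough sketch of that alternative model. The scoring formula is just one guess at how to reward words that are spread across many paragraphs while penalising sheer repetition - it's not a definitive measure, and the sample paragraphs are again invented:

```python
from collections import Counter

paragraphs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a cat and a dog met in the park",
]

spread = Counter()   # how many paragraphs each word appears in
total = Counter()    # how many times each word occurs overall
for para in paragraphs:
    words = para.lower().split()
    total.update(words)
    spread.update(set(words))

# One possible score: spread**2 / total rewards appearing in many
# paragraphs and penalises heavy repetition within them - so 'the'
# (everywhere, but often) now scores below 'cat' (everywhere, once).
scores = {w: spread[w] ** 2 / total[w] for w in total}
top = sorted(scores.items(), key=lambda kv: -kv[1])[:3]
print(top)    # [('cat', 3.0), ('dog', 2.0), ('the', 1.8)]
```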
You may even disagree with the theory ... :)