The Natural Language Challenge
Trying to extract meaning from unstructured natural language text is different from, and often more challenging than, working with structured numbers or labels. There are several reasons for this.
- There is no strict, clear structure in natural language that gives meaning - no columns or rows, lists or arrays, column headers or field labels.
- Even the small amount of structure (grammar) that exists in natural language is often broken. Human languages have exceptions, ambiguity, multiple spellings, idiomatic phrases, regional expressions, ... and even more difficult things like sarcasm and irony.
Read literally, "I ain't done nuffin'" is a double negative: it could mean the blue bird has not done nothing, that is, he's done everything!
Natural language wasn't designed to be an efficient, unambiguous and precise language that we could easily compute with.
It grew organically over time, with more chaos than people sometimes expect. It was only about 400 years ago that the spelling of common English words started to settle down.
Messy Ambiguity
Have a look at the following piece of natural language:
Fruit flies like a banana.
What does it mean? Do fruit flies, a kind of insect, like a banana as a meal? Or does fruit, when thrown, have the aerodynamics of a banana ... and land with a splat!? The meaning is ambiguous.
Do we give up? No! We accept the reality of natural language, and try to develop algorithms that are useful, good enough, even if occasionally the messiness of human language breaks them.
What Do We Compute With?
OK, if we are going to forge ahead and try to compute with natural language, what are the things we manipulate and do calculations with?
As we said at the top, for structured data this is much easier. Numbers are easy to calculate with. We can sum them, find averages, cluster them, etc. Structured text is easy enough too - the names of candidates people vote for, the class names of flowers in the Iris dataset, the names of regions ... and so on. We can count them, create sets of them, and sometimes order them or group similar ones together. In all these cases, each item of structured data is a number or a label, and each has a precise meaning.
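To make the contrast concrete, here is a minimal sketch of how little work structured data needs. The numbers and candidate names are invented purely for illustration; only Python's standard library is used.

```python
from collections import Counter
from statistics import mean

# Hypothetical structured data, invented for this example.
measurements = [3, 5, 7, 9]                            # numbers
votes = ["alice", "bob", "alice", "carol", "alice"]    # labels

# Numbers: we can sum them and average them directly.
total = sum(measurements)
average = mean(measurements)

# Labels: we can count them, build a set of them, and order them.
vote_counts = Counter(votes)
candidates = sorted(set(votes))

print("total:", total, "average:", average)
print("votes per candidate:", dict(vote_counts))
print("candidates:", candidates)
```

Every item here is a number or a label with one precise meaning, which is why a few built-in functions are enough.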
It is worth asking what the unit of computation for natural language text should be. Is it a letter? A word? A sentence? A paragraph? A document? A collection?
We could answer this by looking at what computers actually do themselves. Computers are nothing more than a bunch of electrical switches. The electricity flow is either switched on or off - which naturally represents 0 and 1 in binary numbers. You'll remember these from school: binary 001 is one, 010 is two, 101 is five.
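If you want to check those binary values yourself, a short sketch like this will do it. The helper name to_binary is our own, not part of any library.

```python
# Convert the binary strings from the text into ordinary integers.
for bits in ["001", "010", "101"]:
    value = int(bits, 2)   # the second argument is the number base
    print(bits, "->", value)


def to_binary(number, width=3):
    """Return number as a binary string, zero-padded to the given width."""
    return format(number, "0{}b".format(width))


print(5, "->", to_binary(5))
```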
The letters you see on your screen as you read this are represented inside your computer or smartphone using a character numbering code that has been in place since the 1960s, called ASCII. In this scheme A is 65, B is 66 and lowercase z is 122. You can see all of the characters at ascii-code.com. If you watch traffic on your network, you'll see these codes flying past as they travel to and from the internet. (Don't do this without permission on a network that isn't yours!)
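You can look up these character codes with Python's built-in ord and chr functions. The word "Hello" below is just a sample string chosen for illustration.

```python
# Look up the codes mentioned in the text.
for ch in ["A", "B", "z"]:
    print(ch, "is", ord(ch))

# Turn a whole word into its list of character codes ...
codes = [ord(ch) for ch in "Hello"]
print(codes)

# ... and back into text again.
text = "".join(chr(code) for code in codes)
print(text)

# The same numbers are the bytes that get stored or sent over a network.
raw = "Hello".encode("ascii")
print(list(raw))
```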
If ASCII characters are how computers store and transmit text, then maybe we should use letters as the basic unit for computing with natural language?
Let's remind ourselves of what we're trying to do. We're trying to extract meaning from natural language text. When we humans read text, we don't understand a word until we've seen all the letters of it. In fact, our minds can correct the spelling of a word because internally we refer back to an existing notion of what the whole word should be. This suggests it is whole words, not letters, that should be the unit of computation.
Simple tasks like changing all the letters to lower case can be done letter by letter. But that's only because the task doesn't care what the words mean.
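As a rough sketch, here is lower-casing done one letter at a time. It assumes plain ASCII text, where each uppercase letter sits exactly 32 codes below its lowercase partner. The function name to_lower is our own, and in practice you would simply call Python's built-in str.lower().

```python
def to_lower(text):
    """Lower-case ASCII text one character at a time (illustration only)."""
    result = []
    for ch in text:
        if "A" <= ch <= "Z":
            # Uppercase ASCII letters are 32 below their lowercase forms.
            result.append(chr(ord(ch) + 32))
        else:
            # Everything else is copied across unchanged.
            result.append(ch)
    return "".join(result)


print(to_lower("Fruit Flies"))
```

Notice that the loop never needs to know where one word ends and the next begins.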
So we're arriving at the conclusion that words are the smallest unit of computation for text mining.
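If words are the unit, the first practical step is splitting text into words. Here is one simple way to do it, a sketch rather than a definitive tokeniser. The pattern below is our own assumption: it only keeps runs of the letters a to z and apostrophes, so it will mishandle numbers, hyphens and accented letters.

```python
import re


def words(text):
    """Split text into lowercase words, dropping punctuation and spaces."""
    return re.findall(r"[a-z']+", text.lower())


# A sample sentence for illustration.
tokens = words("Fruit flies like a banana.")
print(tokens)
```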
We said smallest unit quite carefully there. In some cases, we humans can't understand the meaning of a word without looking at the words around it. Have a look at the following, which illustrates homonyms - words which are written the same but mean different things.
We can only tell the meaning of the word saw from the words around it. They tell us that the first saw is a cutting tool for wood, and the second saw is the past tense of the verb to see. Similarly, the first branch is a part of a tree, and the second branch is a local establishment of a larger bank organisation.
So ... does all this mean we were wrong to think that the unit of computation is a word on its own?
No. Like many things in natural language text mining, the theories are good up to a point. So for us, using a word as the basic unit is good for many cases, but will break down when words need context. In that case, we need to take into account, somehow, the words around the word we're interested in.
In some cases, even a phrase, or even a whole sentence won't make sense on its own, and we'll need to reach further and look at the whole paragraph or even the whole document!
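One simple way to take the surrounding words into account is to collect a small window of neighbours on each side of the word we're interested in. The sketch below is only an illustration: the function name context, the window size of 2 and the sample sentence are all our own inventions.

```python
def context(tokens, index, window=2):
    """Return the words just before and just after tokens[index]."""
    start = max(0, index - window)          # don't run off the front
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left, right


# An invented sentence in which the word "saw" appears twice.
tokens = ["she", "saw", "the", "saw", "on", "the", "bench"]

for i, word in enumerate(tokens):
    if word == "saw":
        left, right = context(tokens, i)
        print(left, "[saw]", right)
```

The two occurrences of saw come back with different neighbours, and it is those neighbours an algorithm would need to use to tell the two meanings apart.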
Conclusion
- The smallest unit of computation for text mining is a word, not a letter.
- Sometimes we'll have to extend beyond a word to establish meaning, looking at the words around it.