Monday 4 February 2019

Force Directed Graphs in Jupyter Notebook

We've previously used interactive force directed graphs inside our Jupyter notebooks to show the strength of co-occurrence between words, for example.

It took a fair bit of work to figure out how to use d3.js, to transform data from a pandas dataframe into a network graph using networkx, and then to render an animated interactive graph that worked inside a Jupyter notebook cell.

We're going to do that again because d3.js has moved on and changed how it works.

This blog post is a way for me to think more clearly, and also to share what I learn with others, as there don't seem to be many easy-to-follow guides or tutorials on using d3.js for newcomers online.


Drawing In a Jupyter Notebook Cell

We first have to understand how to get our own code to draw something in a jupyter notebook cell.

The following picture shows a jupyter notebook on the right. You can see a cell into which we type our python instructions, followed by a cell which contains any output of those instructions.


We want the output cell to contain an interactive animated graph of nodes. That means it can't be a static image but something that is constantly running and reacting to user clicks and dragging. The jupyter notebook is a web page and the only thing that can continue to run inside a web page is javascript. The graph needs to be drawn, and animated, by javascript.

There is a popular and capable javascript framework for drawing interactive graphs, d3.js.

We next need to ask how we get that javascript, including the d3.js library, into that output cell.

Luckily, the jupyter notebook python API includes the ability to emit HTML for display in an output cell. This is key. It means we can call a function from our notebook which results in HTML being generated and rendered in the output cell.

All that's left is to make sure that HTML is suitably minimal and includes a reference to the d3.js library and our own javascript which contains the description of the graph, and functions for dealing with user interaction.

Let's summarise what we've just described:

  • python code in a notebook cell can call an external imported function
  • that function can use a jupyter notebook API function to create HTML that is rendered in the output cell
  • that HTML can be fairly minimal but contains a reference to the d3.js graph drawing library
  • that HTML also contains our own javascript which describes the graph to be drawn using the d3.js library



d3fdgraph Module

Let's create a module to be imported into our notebook, which provides a function that renders the HTML and javascript.

Let's call it d3fdgraph, for d3 force directed graph. The name isn't amazing, but module names should be short, whereas function names can be fuller and more descriptive.

Here's the file structure of the module:

test_graph.ipynb
    
d3fdgraph/
    __init__.py
    d3fdgraph.py

The test_graph.ipynb is our notebook. The directory d3fdgraph is the module. Inside that directory is an __init__.py, which python packages require. The main module code will be in d3fdgraph.py.

The file __init__.py contains a simple instruction to expose a single function plot_force_directed_graph().

from .d3fdgraph import plot_force_directed_graph

Let's now write that function.


Developing A Module

We're going to be developing that d3fdgraph module iteratively, trying things and testing things as we go.

Normally when we write python code, changes take effect immediately. However, this isn't always true when using python modules in a jupyter notebook, because changing the code doesn't trigger a reload of the module into our notebook.

Using %autoreload at the top of our notebook we can force modules to be reloaded before we run python code in notebook cells.

%load_ext autoreload

%autoreload 2

This means we can edit our code and see the changes take effect in our notebook.


Simple HTML Test

Let's start with a really simple plot_force_directed_graph() to get started and make sure things work.

import IPython.core.display

def plot_force_directed_graph():
    html = "Hello World"
    
    # display html in notebook cell
    IPython.core.display.display_html(IPython.core.display.HTML(html))
    pass

The function is very simple. All it does is create a text string with the words "Hello World", and uses the notebook API to render it in the output cell.

It's actually a two step process. First a display object is created from the html using IPython.core.display.HTML(). This is then rendered using IPython.core.display.display_html().

Let's try it.



That looks like it worked.

Let's add some HTML tags to check that we're not just displaying plain text.  Let's wrap bold and italic tags around the text.

html = "<b><i>Hello World</i></b>"

Let's run that function again.


That works, and confirms that HTML tags are actually being rendered.


Loading HTML From a Resource File

We could use a longer HTML string that includes javascript inside <script> tags, but that's a bit inelegant. It's neater to have our HTML and javascript in separate files which our python code loads.

We can keep our HTML in d3fdgraph.html and our javascript in d3fdgraph.js. Our files now look like this.

test_graph.ipynb

    
d3fdgraph/
    __init__.py
    d3fdgraph.py
    d3fdgraph.html
    d3fdgraph.js

We could load the content directly using the usual with open() as instructions, but because we're making a python module we should follow the guidelines for python packaging and use pkg_resources.

Our d3fdgraph.py now looks like this.

import IPython.core.display
import pkg_resources
    
def plot_force_directed_graph():
    
    # load html from template files
    resource_package = __name__ 
    html_template = 'd3fdgraph.html'
    html = pkg_resources.resource_string(resource_package, html_template).decode('utf-8')
    
    # display html in notebook cell
    IPython.core.display.display_html(IPython.core.display.HTML(html))
    pass

Let's put the following very simple text into the d3fdgraph.html:

<div>force-directed graph</div>

Let's test this works.


Loading html from a resource file worked.


Simple Javascript Test

Let's check that we can run javascript in the rendered notebook cell.

The notebook API provides a similar approach for rendering (running) javascript. Our javascript code is passed to the IPython.core.display.Javascript() function, which creates an object for IPython.core.display.display_javascript() to render.

# display (run) javascript in notebook cell
IPython.core.display.display_javascript(IPython.core.display.Javascript(data=js_code))

We can load our javascript from the d3fdgraph.js file just like we loaded our html.
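So inside plot_force_directed_graph() the extra step might look like this - a minimal sketch, assuming the same pkg_resources approach we used for the html template:

# load javascript from template file
js_template = 'd3fdgraph.js'
js_code = pkg_resources.resource_string(resource_package, js_template).decode('utf-8')

# display (run) javascript in notebook cell
IPython.core.display.display_javascript(IPython.core.display.Javascript(data=js_code))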

Let's change our HTML in d3fdgraph.html so the <div> element has an id which we can find later.

<div id="p314159"></div>
<div>force-directed graph</div>

And let's have some very simple javascript in d3fdgraph.js  which finds that <div> by its id and overwrites its content with a new message.



var x = document.getElementById("p314159");
x.innerHTML = "Hello from Javascript";

Let's try it.


That worked.

This shows that our javascript not only runs, but successfully interacts with the html elements rendered previously.

This replacement of existing HTML elements is at the core of how we'll be using d3, so it's worth repeating what happened - we used javascript to locate an HTML element and update it.


Loading D3.js

Jupyter uses require.js to manage javascript libraries. We can use it to pull in d3.js at the top of our d3fdgraph.js like this:

require.config({
    paths: {
        d3: 'https://d3js.org/d3.v5.min'
    }
});

This pulls in version 5 of the d3.js library, and associates the d3 reference with it.

We can then use the library through the d3 reference in this rather convoluted way:

require(["d3"], function(d3) {
    console.log(d3.version);

});

Here we're testing that the d3.js library successfully loaded by getting it to print out its version to the javascript browser console.


That worked.


Simple D3 Graph

Let's now use the d3.js library to create a simple graph.

The data format that d3 expects to start from is JSON.

We need a dictionary with two main entries "nodes" and "links". These are text strings, not variables.

Each of these keys "nodes" and "links" points to a list. You won't be surprised that "nodes" points to a list of nodes, and "links" points to a list of links.

Have a look at the following code which creates this data structure.

    // data
    const data = {
        "nodes" : [
        {"id": "apple"},
        {"id": "banana"},
        {"id": "orange"},
        {"id": "mango"}
        ],
        "links" : [
        {"source": "apple", "target": "banana", "weight": 1},
        {"source": "apple", "target": "orange", "weight": 2},
        {"source": "banana", "target": "orange", "weight":3},
        {"source": "orange", "target": "mango", "weight":3}
        ]
    };

You can see the list of nodes enclosed in square brackets, like all javascript lists. What's more, each node is itself a dictionary with "id" as the key and the name of the node as the value, eg "apple" and "banana". The d3 way of working expects this key-value structuring of the data.

The list of links is more interesting. Each link is also a dictionary, and contains a "source", "target" and "weight" key. The source and target keys point to the nodes which are the source  (start) and target (end) of that link. That makes sense. The weight is something we've added as an extra attribute to the data which we can choose to use. We'll use the weight to decide how thick the link lines are drawn.

The following diagram summarises the data in both graph form and JSON form.


Let's get started. First let's extract out the nodes and links from that data for neater code later:

 // extract links and nodes from data
 const links = data.links;
 const nodes = data.nodes;

Now we start to use the power of d3. The following code finds our <div> element which has the id p314159, and inside it appends an <svg> element with a width and height, which we set earlier in our javascript code to 800 and 600.

// select HTML element and attach SVG to it
const svg = d3.select('#p314159')
    .append("svg")
    .attr("width", width)
    .attr("height", height);

This has now inserted an svg element in our html into which we can start to draw our graph of nodes and links.

We could write lots of javascript code to work through the nodes and links data and add an svg circle or line for each one. Luckily, d3 provides a powerful and convenient way to do this. Sadly it's not that well explained, so we'll try to talk it through.

const node = svg.append("g")
            .attr("stroke", "#aaa")
            .attr("stroke-width", 1.5)
        .selectAll("circle")
        .data(nodes)
        .enter().append("circle")
            .attr("r", 5)
            .attr("fill", "#f00")
            .attr("cx", () => Math.random()*width)
            .attr("cy", () => Math.random()*height);

That's a lot of code there. The first thing to understand is that we're chaining javascript instructions so the object returned by one is passed to the next. So the first thing that happens is a <g> group element is added to the previously created svg element, with stroke and stroke-width attributes. Those attributes will apply to everything in the group.

The next bit selectAll("circle") looks like it is trying to find a circle in the <g> group, but we haven't added any yet. What's going on? This is a d3 idiom, a way of working. If it doesn't find a circle, it creates a virtual circle selection to be fleshed out by later code.

The next bit is data(nodes) which does a lot despite looking simple. It binds the data in the nodes list to this circle selection. The next few instructions turn parts of that data into visual elements. The enter() is used to enter the previously created virtual selection and append an svg <circle> element for each item in the data list. You can see that attributes for circle radius, fill and location are added. The coordinates aren't set to a specific value but instead to randomly chosen numbers. That Math.random() is wrapped in a function because d3 expects one; otherwise it wouldn't generate a new random number for each data item.

The arrow operator => is fairly new to javascript. You can find out more at this tutorial.

Let's see the results so far.


That worked. We have four nodes, one for each fruit data item, placed at random on the canvas. The key point here is we used the d3 methods for working through the data, and not writing our own code to do it.

Let's add the fruit names to each node. We could print the name as svg text on the canvas next to the node. Another way is to simply add a <title> element to each node which shows when the pointer is hovering over the node.

// add names to nodes
    node.append("title")
            .text(d => d.id);

This code looks like it only adds one <title> to the node selection, but it actually iterates over all of the nodes using the arrow function d => d.id. That d is automatically created when the data is bound to the selection earlier, and refers to each data item. The d.id is therefore the value referred to by the "id" key in the dictionary of data items in node.

Running the code now lets us see a hover-over tooltip for each node, like this one showing "banana".


Now we want to get d3 to draw the nodes so that they are connected as described by links, and arranged so that the stronger link weights pull nodes closer together. To do this we need to create a d3 simulation.

// create simulation
const simulation = d3.forceSimulation(nodes)
.force("link", d3.forceLink(links).id(d => d.id))
.force("charge", d3.forceManyBody())
.force("center", d3.forceCenter(width / 2, height / 2));

You can see this makes use of d3.forceSimulation which is passed the nodes it will be working on. You can also see the simulation of forces is configured with three main settings. We're telling the simulation how the nodes are connected using the links list. We're also telling the simulation what kind of forces to apply, and in this case it is d3.forceManyBody which causes nodes to attract or repel each other. The additional force d3.forceCenter ensures the nodes are centred around the centre of the canvas and don't fly off into the distance.

You can read more about d3 force simulations here.

To use this simulation, we need to apply it somewhere. There are several options but because we want to make our graph interactive, that is allow each node to be dragged, we need to connect it to event handlers for dragging nodes.

Let's first attach the call to the drag handler to our nodes. It's just an extra function added to our previous chain of instructions creating each node. We've also removed the initial random locations as the simulation will calculate its own positions.

const node = svg.append("g")
            .attr("stroke", "#aaa")
            .attr("stroke-width", 1.5)
        .selectAll("circle")
        .data(nodes)
        .enter().append("circle")
            .attr("r", 5)
            .attr("fill", "#f00")
            .call(drag(simulation));

Now we can define our drag function.

/// dragging nodes
    const drag = simulation => {

        function dragstarted(d) {
        if (!d3.event.active) simulation.alphaTarget(0.3).restart();
            d.fx = d.x;
            d.fy = d.y;
        }

        function dragged(d) {
            d.fx = d3.event.x;
            d.fy = d3.event.y;
        }

        function dragended(d) {
        if (!d3.event.active) simulation.alphaTarget(0);
            d.fx = null;
            d.fy = null;
        }

        return d3.drag()
          .on("start", dragstarted)
          .on("drag", dragged)
          .on("end", dragended);
    }

There's a lot going on here, but essentially we're registering event handlers on each node which are triggered when a drag is started, when a drag is happening, and when a drag has ended. You can see that the dragged() handler sets the fx and fy attributes of each data point to be the coordinates of the pointer.

The alpha value is a cooling factor which decreases as the simulation proceeds, to slow down the movement. If it didn't, the nodes might continue to wobble forever. So when a drag starts we set the alpha target to 0.3 to re-heat the simulation, and when the drag ends we set it back to 0 so the simulation can cool down again.

You can read more about drag events here.

Finally we need to register a tick handler so we can update the svg circles and lines as the simulation runs and updates the positions based on the forces it is working with. A tick is just a unit of time used by the simulation to incrementally update the position of the nodes. The reason we have to catch this and update the svg elements is that the simulation doesn't change the svg elements itself. The link selection below refers to the svg lines for the graph links, created in the same way as the node circles - we'll see that code in the graph style section later.

// update svg on simulation ticks
    simulation.on("tick", () => {
        link
            .attr("x1", d => d.source.x)
            .attr("y1", d => d.source.y)
            .attr("x2", d => d.target.x)
            .attr("y2", d => d.target.y);

        node
            .attr("cx", d => d.x)
            .attr("cy", d => d.y);
    });

Let's run the code to see if all this works.


It does!

This is great but took a lot of work. The d3 system is very capable and granular, but in my opinion isn't that well explained. I still have lots of questions about how it really works, despite going through this exercise of trying to understand it line by line.

Before we move on, I forgot to use the link weights to decide how close the nodes should be to each other. A stronger weight means the nodes need to be closer, that is, a shorter target distance for the layout to try to achieve.

// create simulation
const simulation = d3.forceSimulation(nodes)
.force("link", d3.forceLink(links).id(d => d.id).distance(d => 50 / d.weight))
.force("charge", d3.forceManyBody())
.force("center", d3.forceCenter(width / 2, height / 2));

And the resulting graph now reflects the weights.


The distance between the mango and orange node is shorter because that link has a weight of 3. The distance between the apple and banana nodes is long because the weight is 1.


Data from Python

So far we've used data that was hardcoded in our javascript. What we really want is for our javascript to consume data from our python code.

How do we connect the two? One easy way is to actually do a text replacement after we load the javascript code from file. This avoids the need for complicated always-listening servers or fragile message passing mechanisms.

If we say we will always pass a three-column pandas dataframe to our plot_force_directed_graph() function, then we can use the names of the columns as the source, target and link weight identifiers.

If we change our javascript template code to use the following placeholders when defining the data, we can search for them and replace them later.

// data
const data = {
    "nodes" : %%nodes%%,
    "links" : %%links%%
};

We'll also change the name of the attribute used to set a link's distance to a placeholder.

// create simulation
    const simulation = d3.forceSimulation(nodes)
    .force("link", d3.forceLink(links).id(d => d.id).distance(d => 50 / d.%%edge_attribute%%))
    .force("charge", d3.forceManyBody())
    .force("center", d3.forceCenter(width / 2, height / 2));

In our plot_force_directed_graph() function we can replace those placeholders with actual data and the name of the link weight attribute.

Let's have a look at a minimal dataframe that contains some node and link data.
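There's nothing special about the dataframe - something like the following would do. This is just an illustrative sketch using the same fruit data as before; the column names node1, node2 and weight are my own choice, as the function only uses the column positions:

import pandas

# three column dataframe: source node, target node, link weight
node1_node1_weight = pandas.DataFrame({
    'node1': ['apple', 'apple', 'banana', 'orange'],
    'node2': ['banana', 'orange', 'orange', 'mango'],
    'weight': [1, 2, 3, 3]
})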


It's the same data we've been working with. Our plot_force_directed_graph() function can take the name of the third column to replace the %%edge_attribute%%. That's easy enough. But how do we turn the first two columns into the json data needed by d3?

We use the networkx module to convert the dataframe into a graph, and export the json.

# column names for node source and target, and edge attributes
node_source_name = node1_node1_weight.columns.values[0]
node_target_name = node1_node1_weight.columns.values[1]
link_edge_name = node1_node1_weight.columns.values[2]

# convert node1_node1_weight to graph
graph = networkx.from_pandas_edgelist(node1_node1_weight, source=node_source_name, target=node_target_name, edge_attr=link_edge_name)

# convert graph nodes and links to json, ready for d3
graph_json = networkx.readwrite.json_graph.node_link_data(graph)
graph_json_nodes = graph_json['nodes']
graph_json_links = graph_json['links']

Because we've separated out the nodes and links, we don't need the const data dictionary - we can just have the links and nodes populated directly.

// links and nodes data
const links = %%links%%;
const nodes = %%nodes%%;
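With the placeholders in place, the python function just needs to substitute them before running the javascript. Here's a minimal sketch, assuming we use the standard json module to serialise the lists produced by networkx above:

import json

# replace placeholders in the javascript template with real data
js_code = js_code.replace('%%nodes%%', json.dumps(graph_json_nodes))
js_code = js_code.replace('%%links%%', json.dumps(graph_json_links))
js_code = js_code.replace('%%edge_attribute%%', link_edge_name)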

Let's try it, and also add more data to the dataframe.


And the result is what we expect, a graph with two loops, because that's what the data contains.



Graph Style

We can move the style of the nodes and links to a CSS stylesheet in the HTML template file, and refer to the classes by adding class attributes to the links and nodes groups using d3.

// add links to svg element
const link = svg.append("g")
        .attr("class", "links")
    .selectAll("line")
    .data(links)
    .enter().append("line")
        .attr("stroke-width", d => Math.sqrt(d.weight));

To add node name labels that move with the nodes, we need a different svg structure. We need to place the svg circle and svg text elements together in an svg group, one for each node. That means changing how we build the svg.

const node = svg.append("g")
        .attr("class", "nodes")
    .selectAll("g")
    .data(nodes)
    .enter().append("g");
    
    
const circle = node.append("circle")
        .attr("r", 4.5)
        .call(drag(simulation));
    
// svg text labels for each node
    const label = node.append("text")
        .attr("dx", 10)
        .attr("dy", ".35em")
        .text(d => d.id);

It's easier to see what's changed as a picture.


Let's see the results of applying this to data from the recipes data set, keeping only word pairs where the co-occurrence is more than 1.


That's a rather nice graph, showing which words co-occur with other words, and more importantly, we can see clusters of related words.

And here's the interactivity working.


Great!


Unique Plots

A problem emerged when a notebook had more than one of these plots. To keep them unique we add a unique id placeholder to the HTML template, which the python function replaces with a unique id generated for every call. This keeps multiple plots separate from each other, and ensures d3 selections find the right HTML element.
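Here's a minimal sketch of the idea - the %%unique-id%% placeholder name and the use of random numbers are just illustrative choices:

import random

# generate an id for this plot and substitute it into the html and javascript templates
unique_id = str(random.randrange(0, 999999))
html = html.replace('%%unique-id%%', unique_id)
js_code = js_code.replace('%%unique-id%%', unique_id)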


Next Steps

Now that we have the basics working, we'll refine the aesthetics and design of the plots as we use them for actual data.

We'll also package up the module so it can be installed with pip.


More Resources

  • You can find a really helpful overview of how d3 expects to be used in these two youtube tutorials: [1], [2].
  • D3 In Depth tutorial that bridges the gap between the basic introductions and the official documentation: https://d3indepth.com/

Saturday 15 September 2018

Stop Word Lists - Use With Care

Stop word lists are lists of words which are removed from text before further analysis because they are considered to be uninformative, only adding noise and volume to the text data.

Many practitioners will simply download a list and use that list without further thought.

Several open source tools come with stop word lists built into the software - popular tools such as nltk, scikit-learn, spacy and gensim.

The following shows a family tree of stop word lists and how we think they are related:



The Problem 1

I came across a paper which explores both the quality issues of stop word lists in open source tools, and the unsuitability of some lists for the further processing you might be doing.




The paper explores how:

  • Some lists are much shorter or longer than others - with no documentation about how the stop words are created. Various methods exist including manual construction, word frequency, or word tf-idf measures.
  • Some lists have spelling errors - eg fify / fifty -  probably as a result of manual intervention or refinement.
  • Some lists have very obvious words missing - again without explanation of how the list was intended to be created. Scikit-learn contains has but not does, very but not really, and get but not make.
  • Some lists have words that are surprisingly included - the scikit-learn list contained the words system and cry, for example, without explanation.


We should ideally only use stop word lists if we understand the rationale and method by which they were constructed. Too often this documentation is missing.


The Problem 2

Another deep problem is that stop words need to match our method for tokenising text data. Consider the following:

  • If our tokeniser breaks isn't into isn and t, then the words isn and t should be in the stop word list. 
  • If our tokeniser converts isn't into isnt then our stop word list should contain isnt.


You can see that if there is a mismatch between the stop words and how the tokeniser works, then our stop word filtering won't work.
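Here's a tiny illustration of the mismatch, using two home-made tokenisers - this is a sketch with illustrative code, not any particular library's tokeniser:

import re

text = "this isn't informative"

# a simple regex tokeniser splits isn't into isn and t
print(re.findall(r"\w+", text))
# ['this', 'isn', 't', 'informative']

# a tokeniser which strips apostrophes first produces isnt instead
print(re.findall(r"\w+", text.replace("'", "")))
# ['this', 'isnt', 'informative']

A stop word list built for the first tokeniser needs isn and t, while one built for the second needs isnt.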

Again, many open source stop word lists don't include this description as part of their documentation.


Conclusions

There are two conclusions:

  • We need to understand the rationale and method by which stop word lists are constructed - to explain why some words appear and some don't. This documentation is often missing for open source lists.
  • We need to match the tokenisation method with the stop words. Again, this documentation is often missing.



Tuesday 31 July 2018

Mind Map of Intuitive Text Mining

I've spent some time over the last month thinking about what content should be in the new book Intuitive Text Mining.

This blog post is a way for me to organise my own thinking - and for readers to provide suggestions for improving it.

Get in touch by email intuitivetextmining at gmail dot com or @intuitivetext.

Click the following mind map to enlarge:



New and Old Ideas

I want to keep things as simple and easy as possible, but still introduce the reader to some of the cool modern ideas - like the application of neural networks, comparative vector embeddings and dimension reducing visualisation - that have fuelled the resurgence of text mining.

Many of the older traditional texts don't cover these cool ideas, so there is a gap in the market.

Dan Jurafsky and James Martin's upcoming book Speech and Language Processing 3rd Edition is looking like an excellent book - you can get a preview here: https://web.stanford.edu/~jurafsky/slp3/


Part 0 - Introduction and Challenge


Big Data, Bigger Text Data
I'll introduce the idea of a world overflowing with data, and that historically a lot of the algorithms developed are for structured numerical data, not the unstructured natural language text data which - although messy, imprecise and ambiguous - reflects the breadth of our human cultural experience - from grand literary heights to informal chit-chat, from important evidence in legal cases to collections of scientific knowledge, and even writing going back thousands of years into history.

There are huge rewards for anyone able to extract the meaning, patterns and insights from this messy ocean of text data.


Measuring Similarity and Meaning
I'll use an example to illustrate directly the challenge of analysing text by asking:

  • which of these three things are closest to each other - 1.1, 1.5 and 9.3?
  • which of these three things are closest to each other - apple, orange, cake?

We humans can easily find the answers, but working out how to get a computer to do it pinpoints the challenge of working with natural language text. What's the distance between apple and cake? Why is it bigger than the distance between apple and orange?

There are deeper challenges. What does "we saw her duck" mean - does it mean we saw her lower herself, or that we saw her water-loving bird? Humour and sarcasm add even more difficulty.


Unsupervised vs Supervised vs Expert Systems
Because natural language is so difficult, many algorithms for working with it are:

  • supervised - a human that understands language laboriously labels data to train a machine learning algorithm
  • expert systems - make use of human knowledge of language to create rules that go a long way towards deconstructing natural language - parsing sentences into grammatical elements, or stemming words to a common root, for example.

I don't like these approaches. They are often a practical necessity, but ultimately they involve humans, which means the process of analysing text is not as automatic as it could be. Humans also make errors, and people have subjective and inconsistent interpretations of text, making them an unreliable oracle. And many of the "rules" developed as expert systems are not perfect, sometimes far from perfect. Stemming words is a good example of what I don't like. Sentiment analysis is the cliched application of text analytics, but much of it relies on rules created by humans or human-labelled training data. So in the book I'll acknowledge this and focus on unsupervised methods as much as possible.

A huge benefit of unsupervised methods is that they can be used in new domains, or even different human languages, because they don't make the same assumptions or require as much human preparatory work as supervised and expert methods do. Try using the English Porter stemmer on French. Also, for very large data sets, any human involvement becomes impractical.


Theory > Model > Method
Because natural language isn't precise, consistent and unambiguous - any method of working with it is necessarily going to have to be based on assumptions and simplifications.

Rather than pretend we didn't make assumptions and simplifications, we should be very clear about them. This way we can revisit these assumptions and perhaps improve them.

More than that, we can separate out the theory (assumptions, simplifications, suppositions) from the method we use to implement that theory. This way we can improve or try different methods and keep the same theory. Actually we can separate that out even further, by thinking about the model (data structures) which best represents the theory, and then detailed method (code) that works with that model. This means someone else could come up with a better model, or better code for the same model.

Everyone talks about vector word embeddings .... what are the assumptions, simplifications and suppositions behind why we think they work?


Approach
The book isn't about learning tensorflow or nltk or any other library or toolkit. The firm focus of the book is to understand how and why we do text mining. At the very least readers will develop an intuition for what the popular algorithms are doing.

Writing our own small code to implement the ideas is a good way for them to sink in, and also for us to understand their limits. In effect we'll develop our own text mining toolkit, but that will be a purely educational exercise, not something that replaces the much more robust and popular toolkits you might use for real projects.

The focus of the book won't be to teach a programming language, unlike my previous books which have devoted lots of space and time to that.

We'll use Python because it is easy, popular and pretty much the de-facto toolset for data science. Python code is really easy to read, unlike many other languages, and this is important for anyone trying to understand what's going on.

We'll need an extension to Python to work with arrays of data, and numpy is the mature standard. Pandas wraps more convenience around numpy.

Something I still need to investigate is how to process large amounts of data. Do we scale up and use a big machine in google's compute cloud? Or do we scale out and use lots of nodes, which only works for algorithms that can be parallelised. Dask seems to be the standard in the python data science toolset for doing this.


Part 1 - First Ideas


Word Frequency & Important Words
We'll start with the simplest possible kind of text mining. We'll take a passage of text and ask ourselves what it is about - without reading the whole passage. Even better - how would we get a computer to tell us what the text passage is about, without displaying the whole text for us to read?

A really simple idea is that the most frequent word, or words, are going to be the most important words. Instead of reading 100, 1000, or even 10,000 words, we can just read the top 3 most frequent words, and this should give us an idea of what the text was about.

Although the idea is not perfect, it is very powerful given its simplicity - a feature worth celebrating in algorithms.

In developing the idea of word frequency, we'll need to maintain a count of how often words occur .. which leads us to the idea of an index.
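As a taste of how simple this can be, here's a minimal sketch using python's Counter - the text is illustrative:

from collections import Counter

text = "the cat sat on the mat and the cat slept"
word_counts = Counter(text.split())

# the top 3 most frequent words give a rough idea of what the text is about
print(word_counts.most_common(3))
# e.g. [('the', 3), ('cat', 2), ('sat', 1)]

It also hints at a problem we'll meet shortly - the most frequent word here is the boring word the.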

We'll tackle the issue of word count vs frequency normalised for document length. A simple idea, but the notion is applicable to more sophisticated ideas later.

We'll start to explore visualisation as a way of understanding the results of a text mining process. A simple visualisation of frequent words is a word cloud.

This is also the point at which we can start to think about multiple text passages (documents) in a collection (corpus) and applying the algorithm to the whole corpus and visualising the results.


Stop Words - Yuk!
We'll look at an obvious weakness of this simple algorithm - words that are frequent but not informative. We'll explore excluding these stop-words, but also discuss why this isn't a purely unsupervised approach - and how we'll come back to this later with the idea of word relevance.


Search
We'll look at the idea of finding things in our text - search - and think about how we might make our own simple search engine. Thinking about how we might do it ourselves using a pencil and paper, we'll introduce the idea of a reverse index - just like the index we might find at the back of many books.

We'll experiment with simple searches, and think about how to order the results, because some clearly are better than others. We can use the idea of word frequency as a very simple way of ordering (ranking) search results.
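A reverse index is easy to sketch as a python dictionary mapping each word to the documents it appears in - the documents here are illustrative:

# a minimal sketch of a reverse (inverted) index
documents = {
    0: "apple pie recipe",
    1: "orange and apple salad",
    2: "chocolate cake recipe",
}

index = {}
for doc_id, text in documents.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# which documents mention apple?
print(index["apple"])
# {0, 1}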


Co-occurrence
After the very simple ideas so far, we'll increase the sophistication just a little and look at words that occur together - co-occurrence. The idea is that instead of looking at individual word frequency, we look at word pairs, which might be more informative. Worth a try...

Co-occurrence is an ideal point to introduce a new form of visualisation - a graph of connected nodes. Graphs are great for quickly understanding data where each item is linked in some way to many others.

We'll also look at extending co-occurrence beyond words that are right next to each other, to words that co-occur within a longer window. The words apple and orange co-occur, but often with another word between them.

We'll also look at co-occurrence as a different way of doing search. If we search for the word apple, should we also find documents with the word orange (but not apple)? That's a sophisticated idea!


Exploring Language Use
This sophisticated idea of co-occurrence allows us to understand how language is used. It can help us discover new terms for themes we're interested in. For example, we might discover the term arachnid because it co-occurs with words that also co-occur with spider. Visualising co-occurrence with graphs makes this exploration much easier than looking at lists of numbers - and demonstrates the power of graphs.


Generating Sentences
We'll have some fun using co-occurrence to automatically generate sentences. We'll see how we do need those boring/stopwords to make sentences more realistic.


Part 2 - More Interesting Ideas


Relevance
In Part 1 we used the idea of word frequency as a very simple and surprisingly effective way of identifying the important themes in some text. But there was a major weakness - that many frequent words are boring, not interesting, and don't convey any insights. Words like the, and, is. We previously hacked a solution by simply excluding them - and that wasn't an unsupervised method.

We'll introduce the idea of balancing words that are frequent in some documents with words that appear in all documents and so are likely not to be that interesting - term frequency vs inverse document frequency, TF-IDF. This isn't perfect but an improvement.

We'll start with our own simple ratio of frequency TF / DF, and then see why some people use logs to take into account probability and information content (Shannon).
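Here's a minimal sketch of that simple ratio, with illustrative documents:

# term frequency in one document divided by document frequency across the corpus
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the mat was red",
]
tokenised = [doc.split() for doc in documents]

# document frequency - how many documents each word appears in
df = {}
for words in tokenised:
    for word in set(words):
        df[word] = df.get(word, 0) + 1

# term frequency for the first document
tf = {}
for word in tokenised[0]:
    tf[word] = tf.get(word, 0) + 1

# words appearing in every document have their score pulled down relative to their raw count
relevance = {word: tf[word] / df[word] for word in tf}
print(relevance)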

We'll also try to apply the idea to words frequent in a smaller window vs appearing in the entire document - something I haven't seen in other text books.


N-Grams
We'll reconsider the basic unit of computation, which we assumed was the word. Perhaps pairs of words, 2-grams, are better? Or 3-grams .. or n-grams captured using a sliding window. Intuitively, pairs of words would better capture meaning than isolated words. We'll experiment to see if this works - and regenerate things like word clouds and relevance rankings.

We'll discuss what's too small an atom - a letter? And what's too big - a whole sentence? We'll approach the idea of the smallest units that can be compared.


Co-Occurrence Sliding Window
Following the idea of n-grams to look beyond one word, we'll revisit the powerful idea of co-occurrence and recognise that words which appear close together are often related even if they aren't directly next to each other. We can use a sliding window to capture co-occurrence that happens beyond the immediate neighbour.

We'll also discuss how a hard cut-off window would exclude words that should really be considered. We'll try the idea of a sliding window which decays the relative importance of words the further apart they are.


Similarity
We'll ask an apparently innocent question - what is similarity? What makes two words similar? Our discussion up to this point suggests that words which share commonly co-occurring words are probably similar. "You shall know a word by the company it keeps" is the famous quote.

After some calculations by hand, we'll see that we can use vector calculations to work out the similarity between two words using co-occurrence matrices. This will be the introduction to words as vectors, embeddings.
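Here's a minimal sketch of that vector calculation, using cosine similarity on made-up co-occurrence counts:

import numpy

# each vector counts how often a word co-occurs with some chosen context words
apple = numpy.array([3, 0, 2, 1])
orange = numpy.array([2, 1, 2, 0])

# cosine similarity - closer to 1.0 means the words keep similar company
similarity = numpy.dot(apple, orange) / (numpy.linalg.norm(apple) * numpy.linalg.norm(orange))
print(similarity)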

We can create nice visualisations using force-directed graphs to link similar words, and use stronger/closer links for very similar words.


Clustering
Grouping similar things is a common technique in data mining. Our use of force-directed graphs means we don't necessarily need to perform clustering as it happens as part of the graph visualisation. However, the process of exploring a graph is manual - and a more unsupervised approach is to generate a number of clusters.

We'll look at hierarchical and k-means clustering. The k-means method isn't perfect and has weaknesses - it doesn't know the right number of clusters, and struggles with non-discrete non-circular clusters, which we'll solve later using methods which reduce dimensions and try to encourage discrete grouping.


Part 3 - Even More Interesting Ideas


Topic Extraction and Summarisation
In Part 3 we'll look at slightly more advanced topics. The ability to extract key - but distinct - themes from a body of text is clearly useful. We'd be able to quickly see what the main topics in a large set of text were without having to read it all.

We'll look at the idea of dimension reduction which forces the high dimensional text data onto a smaller dimensional space, and in the process loses lots of information .. but hopefully keeps the most important features. There are several ways of doing this, and we'll talk about Singular Value Decomposition (SVD) as a way of doing topic extraction fairly easily. We'll compare it with similar ideas mentioned often in other guides, like Latent Semantic Indexing (LSI) and principal component analysis (PCA).


Dimension Reduction and Visualisation
Text data, if we think about it in terms of co-occurrence or word frequency, is naturally high dimensional because we have lots of possible terms (words) to consider relationships between.

The task of reducing the massive dimensionality of that data is important, not just for extracting topics, but more generally to explore patterns in that data. Visualisation is key - we can't easily visualise very high dimensional data, which means we can't easily understand it.

We'll look at the popular t-SNE (and successor) methods for squishing data into 2 dimensions, losing the correctness of location in the process, but preserving the notion of nearness and distance between data points.

We'll explore the use of self-organising maps (SOM), a form of neural network training, to see how they can effectively reduce the dimension of data. We'll see that for smaller data sets the results are similar to k-means clustering, but for larger more complex data, the method preserves topological relations.


Topological Data Analysis and Logical Structure
Continuing the theme of analysing high dimensional data, we'll look at topological data analysis (TDA).

This is a fairly new field, or at least a new use of the mathematics, and there are notable examples of its application to text data. I'll have to learn about it first!

The promise of discovering logical structure in data is intriguing - we'll see if it emerges. An example of a logical hierarchy is this - a dog is a kind of animal, and a cat is a kind of animal. Being able to discover such structures in text would be really useful, especially if we could do it without the very-supervised sentence parsing and entity extraction methods.


Which Features? Which Relations?
Having worked through several methods for analysing data, we'll return to the basic question of which features to use as the basic atoms.

Why do we use words as the basic unit of text analysis? What is so special about using a document as the first partition of a body of text? Are n-grams the ideal unit, or skip-grams perhaps?

Which relationships on those atoms are most useful? Frequency and co-occurrence are by far the most popular, but are there others? Perhaps a different approach would help us better separate the interesting atoms from the boring ones?


Part 4 - Fun Projects


Text Generation
We'll have some fun using the machine learned models we've come across to generate text. We'll look at everything from very simple markov chains, which simply model the probability of the next word in a sequence, to more sophisticated neural networks which learn sequences.
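Here's a minimal sketch of a first-order markov chain generator, using an illustrative scrap of text:

import random

text = "the cat sat on the mat and the cat slept on the mat"
words = text.split()

# map each word to the list of words that follow it
transitions = {}
for current_word, next_word in zip(words, words[1:]):
    transitions.setdefault(current_word, []).append(next_word)

# generate a short sequence by repeatedly picking a random following word
word = "the"
generated = [word]
for _ in range(6):
    word = random.choice(transitions.get(word, words))
    generated.append(word)

print(" ".join(generated))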

If we can, we'll also look at the generative adversarial network (GAN) approach which pits one generative network against a classifier trying to outwit it. The results from other domains, such as image synthesis are particularly impressive (but pixel/picture errors are more easily forgiven in that domain).


Interesting Data Sets
We'll apply some of the techniques we've developed to interesting data sets, such as the Hillsborough Independent Panel report, or the Iraq War Inquiry's Chilcot Report. We'll uncover interesting links within the text, as well as extract key topics - hoping some of them are valid but perhaps surprising.


Other Languages
As a test of our unsupervised approach, we'll apply our methods to another language. Spanish is a good one (because I know a little bit).

To test our ideas even further, we'll apply them to old Spanish, perhaps Don Quixote, again to see if they work, and to see if we can discover anything interesting.


Part 5 - Appendices

We'll have appendices gently explaining the mathematics of simple and more involved ideas we've used in the book, including:

  • logarithms for probabilistic scaling (tf/idf)
  • SVD matrix decomposition
  • k-means clustering and hierarchical clustering
  • SOM map generation
  • neural network training
  • t-SNE method


Wednesday 10 May 2017

Extracting Topics with SVD at PyData London 2017

I was lucky enough to do a talk at PyData London 2017 on topic extraction with SVD.

The talk was aimed at novice's and tried to gently introduce the idea, it's value, and also an intuition about how the SVD method works.

Here are the slides: https://goo.gl/u3Jg2O (or click the image)


And here's the video from the conference: