Intuitive Text Mining: February 2019

We've previously used interactive force directed graphs inside our Jupyter notebooks to show, for example, the strength of co-occurrence between words.

It took a fair bit of work to work out how to use d3.js, to transform data from a pandas dataframe to a network graph using networkx, and then render an animated interactive graph that worked inside a Jupyter notebook cell.

We're going to do that again as d3.js has moved on and changed how it works.

This blog post is a way for me to think more clearly, and also to share what I learn with others, as I couldn't find many easy-to-follow guides or tutorials online on using d3.js for newcomers.

Drawing In a Jupyter Notebook Cell

We have to first understand how we get our own code to draw something in a jupyter notebook cell.

The following picture shows a jupyter notebook on the right. You can see a cell into which we type our python instructions, followed by a cell which contains any output of those instructions.

We want the output cell to contain an interactive animated graph of nodes. That means it can't be a static image but something that is constantly running and reacting to user clicks and dragging. The jupyter notebook is a web page and the only thing that can continue to run inside a web page is javascript. The graph needs to be drawn, and animated, by javascript.

There is a popular and capable javascript framework for drawing interactive graphs, d3.js.

We next need to ask how we get that javascript including the d3.js library into that output cell.

Luckily, the jupyter notebook python API includes the ability to emit HTML for display in an output cell. This is key. It means we can call a function from our notebook which results in HTML being generated and rendered in the output cell.

All that's left is to make sure that HTML is suitably minimal and includes a reference to the d3.js library and our own javascript which contains the description of the graph, and functions for dealing with user interaction.

Let's summarise what we've just described:

python code in a notebook cell can call an external imported function
that function can use a jupyter notebook API function to create HTML that is rendered in the output cell
that HTML can be fairly minimal but contains a reference to the d3.js graph drawing library
that HTML also contains our own javascript which describes the graph to be drawn using the d3.js library

d3fdgraph Module

Let's create a module to be imported into our notebook, which provides a function that renders the HTML and javascript.

Let's call it d3fdgraph, for d3 force directed graph. The name isn't amazing, but module names should be short, whereas function names can be fuller and more descriptive.

Here's the file structure of the module:

test_graph.ipynb

d3fdgraph/
    __init__.py
    d3fdgraph.py

The test_graph.ipynb is our notebook. The directory d3fdgraph is the module (strictly speaking, a python package). Inside that directory is an __init__.py, which marks the directory as a regular python package and runs when it is imported. The main module code will be in d3fdgraph.py.

The file __init__.py contains a simple instruction to expose a single function plot_force_directed_graph().

from .d3fdgraph import plot_force_directed_graph

Let's now write that function.

Developing A Module

We're going to be developing that d3fdgraph module iteratively, trying things and testing things as we go.

Normally when we write python code, changes take effect immediately. However, this isn't always true when using python modules in a jupyter notebook, because changing the code doesn't trigger a reload of the module into our notebook.

Using %autoreload at the top of our notebook we can force modules to be reloaded before we run python code in notebook cells.

%load_ext autoreload

%autoreload 2

This means we can edit our code and see the changes take effect in our notebook.

Simple HTML Test

Let's start with a really simple plot_force_directed_graph() to make sure things work.

import IPython.core.display

def plot_force_directed_graph():
    html = "Hello World"

    # display html in notebook cell
    IPython.core.display.display_html(IPython.core.display.HTML(html))

The function is very simple. All it does is create a text string with the words "Hello World", and use the notebook API to render it in the output cell.

It's actually a two step process. First a display object is created from the html using IPython.core.display.HTML(). This is then rendered using IPython.core.display.display_html().

Let's try it.

That looks like it worked.

Let's add some HTML tags to check that we're not just displaying plain text. Let's wrap bold and italic tags around the text.

html = "<b><i>Hello World</i></b>"

Let's run that function again.

That works, and confirms that HTML tags are actually being rendered.

Loading HTML From a Resource File

We could use a longer HTML string that includes javascript inside <script> tags but that's a bit inelegant. It's neater to have our HTML and javascript in separate files which our python code loads.

We can keep our HTML in d3fdgraph.html and our javascript in d3fdgraph.js. Our files now look like this.

test_graph.ipynb

d3fdgraph/
    __init__.py
    d3fdgraph.py
    d3fdgraph.html
    d3fdgraph.js

We could load the content directly using the usual with open() as instructions, but because we're making a python module we should follow the guidelines for python packaging and use pkg_resources.

Our d3fdgraph.py now looks like this.

import IPython.core.display
import pkg_resources

def plot_force_directed_graph():
    # load html from template files
    resource_package = __name__
    html_template = 'd3fdgraph.html'
    html = pkg_resources.resource_string(resource_package, html_template).decode('utf-8')

    # display html in notebook cell
    IPython.core.display.display_html(IPython.core.display.HTML(html))

Let's put the following very simple text into the d3fdgraph.html:

<div>force-directed graph</div>

Let's test this works.

Loading html from a resource file worked.
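pkg_resources does the job here, but setuptools now marks it as deprecated in favour of the standard library. A minimal sketch of the same load using importlib.resources, assuming Python 3.9 or later, might look like the following. The helper name load_text_resource is hypothetical and isn't part of the module above.

```python
import importlib.resources


def load_text_resource(package, filename):
    """Return the text of a file that ships inside an importable package."""
    # files() gives a path-like object rooted at the package directory
    resource = importlib.resources.files(package).joinpath(filename)
    return resource.read_text(encoding="utf-8")
```

Inside d3fdgraph.py this could be called as load_text_resource(__package__, 'd3fdgraph.html'). Passing __package__ rather than __name__ is deliberate, because older versions of importlib.resources expect a package and not a module inside one.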

Simple Javascript Test

Let's check that we can run javascript in the rendered notebook cell.

The notebook API provides a similar approach for rendering (running) javascript. Our javascript code is passed to the IPython.core.display.Javascript() function, which creates an object for IPython.core.display.display_javascript() to render.

# display (run) javascript in notebook cell
IPython.core.display.display_javascript(IPython.core.display.Javascript(data=js_code))

We can load our javascript from the d3fdgraph.js file just like we loaded our html.

Let's change our HTML in d3fdgraph.html so the <div> element has an id which we can find later.

<div id="p314159"></div>
<div>force-directed graph</div>

And let's have some very simple javascript in d3fdgraph.js which finds that <div> by its id and overwrites its content with a new message.

var x = document.getElementById("p314159");

x.innerHTML = "Hello from Javascript";

Let's try it.

That worked.

This shows that our javascript not only runs, but successfully interacts with the html elements rendered previously.

This replacement of existing HTML elements is at the core of how we'll be using d3, so it's worth repeating what happened: we used javascript to locate an HTML element and update it.

Loading D3.js

Jupyter uses require.js to manage javascript libraries. We can use it to pull in d3.js at the top of our d3fdgraph.js like this:

require.config({
    paths: {
        d3: 'https://d3js.org/d3.v5.min'
    }
});

This pulls in version 5 of the d3.js library, and associates the d3 reference with it.

We can then use the library through the d3 reference in this rather convoluted way:

require(["d3"], function(d3) {
    console.log(d3.version);
});

Here we're testing that the d3.js library successfully loaded by getting it to print out its version to the javascript browser console.

That worked.

Simple D3 Graph

Let's now use the d3.js library to create a simple graph.

The data format that d3 expects to start from is JSON.

We need a dictionary with two main entries "nodes" and "links". These are text strings, not variables.

Each of these keys "nodes" and "links" points to a list. You won't be surprised that "nodes" points to a list of nodes, and "links" points to a list of links.

Have a look at the following code which creates this data structure.

// data
const data = {
    "nodes" : [
        {"id": "apple"},
        {"id": "banana"},
        {"id": "orange"},
        {"id": "mango"}
    ],
    "links" : [
        {"source": "apple", "target": "banana", "weight": 1},
        {"source": "apple", "target": "orange", "weight": 2},
        {"source": "banana", "target": "orange", "weight": 3},
        {"source": "orange", "target": "mango", "weight": 3}
    ]
};

You can see the list of nodes enclosed in square brackets like all javascript lists. What's more, each node is itself a dictionary with "id" as the key and the name of the node as the value, eg "apple" and "banana". The d3 way of working expects this key-value structuring of the data.

The list of links is more interesting. Each link is also a dictionary, and contains a "source", "target" and "weight" key. The source and target keys point to the nodes which are the source (start) and target (end) of that link. That makes sense. The weight is something we've added as an extra attribute to the data which we can choose to use. We'll use the weight to decide how thick the link lines are drawn.
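Since we'll eventually generate this structure from python, it may help to see the same data as a python dictionary. The sketch below also checks that every link refers to a node that actually exists, which is an easy mistake to make when writing the data by hand. The helper name dangling_links is made up for illustration and isn't part of d3 or of our module.

```python
import json

data = {
    "nodes": [{"id": "apple"}, {"id": "banana"}, {"id": "orange"}, {"id": "mango"}],
    "links": [
        {"source": "apple", "target": "banana", "weight": 1},
        {"source": "apple", "target": "orange", "weight": 2},
        {"source": "banana", "target": "orange", "weight": 3},
        {"source": "orange", "target": "mango", "weight": 3},
    ],
}


def dangling_links(graph):
    """Return the links whose source or target is not a known node id."""
    node_ids = {node["id"] for node in graph["nodes"]}
    return [
        link for link in graph["links"]
        if link["source"] not in node_ids or link["target"] not in node_ids
    ]


# an empty list means every link joins two nodes that exist
print(dangling_links(data))

# the same structure serialised as JSON text, as the javascript would see it
print(json.dumps(data, indent=2))
```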

The following diagram summarises the data in both graph form and JSON form.

Let's get started. First let's extract out the nodes and links from that data for neater code later:

// extract links and nodes from data
const links = data.links;
const nodes = data.nodes;

Now we start to use the power of d3. The following code finds our <div> element which has the id of p314159, and appends an <svg> element inside it. The width and height are variables set earlier in our javascript code to 800 and 600.

// select HTML element and attach SVG to it
const svg = d3.select('#p314159')
    .append("svg")
    .attr("width", width)
    .attr("height", height);

This has now inserted an svg element in our html into which we can start to draw our graph of nodes and links.

We could write lots of javascript code to work through the nodes and links data to add an svg circle or line for each one. Luckily, d3 provides a powerful and convenient way to do this. Sadly it's not that well explained so we'll try to talk it through.

const node = svg.append("g")
    .attr("stroke", "#aaa")
    .attr("stroke-width", 1.5)
    .selectAll("circle")
    .data(nodes)
    .enter().append("circle")
    .attr("r", 5)
    .attr("fill", "#f00")
    .attr("cx", () => Math.random()*width)
    .attr("cy", () => Math.random()*height);

That's a lot of code there. The first thing to understand is that we're chaining javascript instructions so the object returned by one is passed to the next. So the first thing that happens is a <g> group element is added to the previously created svg element, with stroke and stroke-width attributes. Those attributes will apply to everything in the group.

The next bit selectAll("circle") looks like it is trying to find a circle in the <g> group, but we haven't added any yet. What's going on? This is a d3 idiom, a way of working. When there are no circles to find, it returns an empty selection, a kind of virtual circle selection to be fleshed out by later code.

The next bit is data(nodes) which does a lot despite looking simple. It binds the data in the nodes list to this circle selection. The next few instructions turn parts of that data into visual elements. The enter() call picks out the data items that don't yet have a matching element, which at this point is all of them, and append() creates an svg <circle> element for each one. You can see that attributes for circle radius, fill and location are added. The coordinates aren't set to a specific value but instead to randomly chosen numbers. That random() is wrapped in a function because d3 calls the function once for each data item. If we passed a plain value instead, it would be calculated once and every circle would get the same coordinates.
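One way to picture this data join is as a comparison, by position, between the elements already on the page and the items in the data list. The following python sketch is only an analogy for that bookkeeping, under the assumption that no key function is used. It isn't how d3 is implemented, and data_join is a made-up name.

```python
def data_join(existing_elements, data):
    """Split a by-position join into update, enter and exit parts."""
    matched = min(len(existing_elements), len(data))
    update = list(zip(existing_elements[:matched], data[:matched]))
    enter = data[matched:]               # data items with no element yet
    exit_ = existing_elements[matched:]  # elements with no data item left
    return update, enter, exit_


nodes = [{"id": "apple"}, {"id": "banana"}, {"id": "orange"}, {"id": "mango"}]

# the <g> group starts with no circles, so every data item lands in enter
update, enter, exit_ = data_join([], nodes)

# appending to the enter part creates one element per data item
circles = [{"tag": "circle", "datum": item} for item in enter]
print(len(circles))
```

Seen this way, selecting circles that don't exist yet is less strange: the empty selection is what makes every data item count as new.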

The arrow operator => is fairly new to javascript. You can find out more at this tutorial.

Let's see the results so far.

That worked. We have four nodes, one for each fruit data item, placed at random on the canvas. The key point here is we used the d3 methods for working through the data, and not writing our own code to do it.

Let's add the fruit names to each node. We could print the name as svg text on the canvas next to the node. Another way is to simply add a <title> element to each node which shows when the pointer is hovering over the node.

// add names to nodes
node.append("title")
.text(d => d.id);

This code looks like it only adds one <title> to the node selection, but it actually iterates over all of the nodes, calling the arrow function d => d.id for each one. That d refers to the data item that was bound to each element of the selection earlier. The d.id is therefore the value referred to by the "id" key in the dictionary for that node.

Running the code now lets us see a hover-over tooltip for each node, like this one showing "banana".

Now we want to get d3 to draw the nodes so that they are connected as described by links, and arranged so that the stronger link weights pull nodes closer together. To do this we need to create a d3 simulation.

// create simulation
const simulation = d3.forceSimulation(nodes)
    .force("link", d3.forceLink(links).id(d => d.id))
    .force("charge", d3.forceManyBody())
    .force("center", d3.forceCenter(width / 2, height / 2));

You can see this makes use of d3.forceSimulation which is passed the nodes it will be working on. You can also see the simulation is configured with three forces. The first, d3.forceLink, tells the simulation how the nodes are connected using the links list. The second, d3.forceManyBody, causes nodes to attract or repel each other. The third, d3.forceCenter, ensures the nodes are centred around the centre of the canvas and don't fly off into the distance.

You can read more about d3 force simulations here.

To use this simulation, we need to apply it somewhere. There are several options but because we want to make our graph interactive, that is allow each node to be dragged, we need to connect it to event handlers for dragging nodes.

Let's first attach the call to the drag handler to our nodes. It's just an extra function added to our previous chain of instructions creating each node. We've also removed the initial random locations as the simulation will calculate its own positions.

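const node = svg.append("g")
    .attr("stroke", "#aaa")
    .attr("stroke-width", 1.5)
    .selectAll("circle")
    .data(nodes)
    .enter().append("circle")
    .attr("r", 5)
    .attr("fill", "#f00")
    .call(drag(simulation));

Now we can define our drag function. Because it is declared with const, its definition has to appear in the javascript file before the node code that calls it.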

// dragging nodes
const drag = simulation => {

    function dragstarted(d) {
        if (!d3.event.active) simulation.alphaTarget(0.3).restart();
        d.fx = d.x;
        d.fy = d.y;
    }

    function dragged(d) {
        d.fx = d3.event.x;
        d.fy = d3.event.y;
    }

    function dragended(d) {
        if (!d3.event.active) simulation.alphaTarget(0);
        d.fx = null;
        d.fy = null;
    }

    return d3.drag()
        .on("start", dragstarted)
        .on("drag", dragged)
        .on("end", dragended);
}

There's a lot going on here but essentially we're registering event handlers on each node which are triggered when a drag is started, when a drag is happening, and when a drag has ended. You can see that the dragged() handler sets the fx and fy attributes of the data point to be the coordinates of the pointer, which pins the node to the pointer. Setting them back to null in dragended() releases the node again.

The simulation has a cooling factor, alpha, which decreases as the simulation proceeds, to slow down the movement. If it didn't do this, the nodes might continue to wobble forever. The alphaTarget is the value that alpha heads towards. So when we start a drag, we raise the alpha target to 0.3 and restart the simulation to warm it up, and when we finish a drag we set the target back to 0 so that alpha decreases again and the nodes settle.
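To get a feel for this cooling, here is a small python sketch of how alpha might evolve from tick to tick. It assumes the d3-force defaults are an initial alpha of 1, a minimum alpha of 0.001 below which the simulation stops, and a decay rate chosen to give a cooldown of about 300 ticks. Treat those numbers, and the function names, as assumptions made for illustration and not as a copy of the d3 code.

```python
ALPHA_MIN = 0.001
# decay rate chosen so alpha falls from 1 to ALPHA_MIN in about 300 ticks
ALPHA_DECAY = 1 - ALPHA_MIN ** (1 / 300)


def step(alpha, alpha_target):
    """Move alpha a small fraction of the way towards its target."""
    return alpha + (alpha_target - alpha) * ALPHA_DECAY


def ticks_until_cool(alpha, alpha_target, limit=10_000):
    """Count ticks until alpha drops below ALPHA_MIN, or give up at limit."""
    ticks = 0
    while alpha >= ALPHA_MIN and ticks < limit:
        alpha = step(alpha, alpha_target)
        ticks += 1
    return ticks, alpha


# target of 0: alpha decays and the simulation stops after roughly 300 ticks
print(ticks_until_cool(1.0, 0.0))

# target of 0.3, as during a drag: alpha settles near 0.3 and never cools
print(ticks_until_cool(1.0, 0.3))
```

With a target of 0.3 the loop only ends because of the limit, which matches what we see on screen: the graph stays lively for as long as a node is being dragged.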

You can read more about drag events here.

Finally we need to register a tick handler so we can update the svg nodes and lines as the simulation runs and updates its positions based on the forces it is working with. A tick is just a unit of time used by the simulation to incrementally update the position of the nodes. We have to catch each tick and update the svg elements ourselves because the simulation only updates the x and y values on the data, and doesn't touch the svg elements. The link selection used here is a set of svg line elements built from the links data in the same way as node. Its code is shown in the Graph Style section.

// update svg on simulation ticks
simulation.on("tick", () => {
    link
        .attr("x1", d => d.source.x)
        .attr("y1", d => d.source.y)
        .attr("x2", d => d.target.x)
        .attr("y2", d => d.target.y);

    node
        .attr("cx", d => d.x)
        .attr("cy", d => d.y);
});

Let's run the code to see if all this works.

It does!

This is great but took a lot of work. The d3 system is very capable and granular, but in my opinion isn't that well explained. I still have lots of questions about how it really works, despite going through this exercise of trying to understand it line by line.

Before we move on, I forgot to use the link weights to decide how close the nodes should be to each other. A stronger weight means the nodes need to be closer, that is, a shorter target distance for the layout to try to achieve.

// create simulation
const simulation = d3.forceSimulation(nodes)
    .force("link", d3.forceLink(links).id(d => d.id).distance(d => 50 / d.weight))
    .force("charge", d3.forceManyBody())
    .force("center", d3.forceCenter(width / 2, height / 2));

And the resulting graph now reflects the weights.

The distance between the mango and orange node is shorter because that link has a weight of 3. The distance between the apple and banana nodes is long because the weight is 1.
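The arithmetic behind those distances is simple enough to check by hand. This python sketch applies the same 50 / weight rule to our four links. The function name target_distance is ours, purely for illustration.

```python
links = [
    {"source": "apple", "target": "banana", "weight": 1},
    {"source": "apple", "target": "orange", "weight": 2},
    {"source": "banana", "target": "orange", "weight": 3},
    {"source": "orange", "target": "mango", "weight": 3},
]


def target_distance(weight, scale=50):
    """Stronger links get a shorter target distance."""
    return scale / weight


for link in links:
    distance = round(target_distance(link["weight"]), 1)
    print(link["source"], link["target"], distance)

# apple banana 50.0
# apple orange 25.0
# banana orange 16.7
# orange mango 16.7
```

One thing this makes obvious is that a weight of zero would mean dividing by zero, so the rule only suits data where every link has a positive weight.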

Data from Python

So far we've used data that was hardcoded in our javascript. What we really want is for our javascript to consume data from our python code.

How do we connect the two? One easy way is to actually do a text replacement after we load the javascript code from file. This avoids the need for complicated always-listening servers or fragile message passing mechanisms.

Let's say we will always pass a three column pandas dataframe to our plot_force_directed_graph() function. We can then use the names of the columns as the source, target and link weight identifiers.

If we change our javascript template code to use the following placeholders when defining the data, we can search for them and replace them later.

// data
const data = {
    "nodes" : %%nodes%%,
    "links" : %%links%%
};

We'll also change the name of the attribute used to set a link's distance to a placeholder.

// create simulation
const simulation = d3.forceSimulation(nodes)
    .force("link", d3.forceLink(links).id(d => d.id).distance(d => 50 / d.%%edge_attribute%%))
    .force("charge", d3.forceManyBody())
    .force("center", d3.forceCenter(width / 2, height / 2));

In our plot_force_directed_graph() function we can replace those placeholders with actual data and the name of the link weight attribute.
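As a rough sketch of that replacement step, the following python fills the three placeholders with plain string replacement, using json.dumps to turn python lists into text that javascript can read. The shortened template and the function name fill_template are assumptions for illustration, not the module's actual code.

```python
import json

# a cut-down stand-in for the contents of d3fdgraph.js
JS_TEMPLATE = """
const data = {
    "nodes" : %%nodes%%,
    "links" : %%links%%
};
const distance = d => 50 / d.%%edge_attribute%%;
"""


def fill_template(template, nodes, links, edge_attribute):
    """Substitute graph data and the weight attribute name into the template."""
    filled = template.replace("%%nodes%%", json.dumps(nodes))
    filled = filled.replace("%%links%%", json.dumps(links))
    return filled.replace("%%edge_attribute%%", edge_attribute)


nodes = [{"id": "apple"}, {"id": "banana"}]
links = [{"source": "apple", "target": "banana", "weight": 1}]

js_code = fill_template(JS_TEMPLATE, nodes, links, "weight")
print(js_code)
```

The resulting js_code string is what would then be handed to IPython.core.display.Javascript().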

Let's have a look at a minimal dataframe that contains some node and link data.

It's the same data we've been working with. Our plot_force_directed_graph() function can take the name of the third column to replace the %%edge_attribute%%. That's easy enough. But how do we turn the first two columns into the json data needed by d3?

We use the networkx module to convert the dataframe into a graph, and export the json.

import networkx

# column names for node source and target, and edge attributes
node_source_name = node1_node1_weight.columns.values[0]
node_target_name = node1_node1_weight.columns.values[1]
link_edge_name = node1_node1_weight.columns.values[2]

# convert node1_node1_weight to graph
graph = networkx.from_pandas_edgelist(node1_node1_weight, source=node_source_name,
                                      target=node_target_name, edge_attr=link_edge_name)

# convert graph nodes and links to json, ready for d3
graph_json = networkx.readwrite.json_graph.node_link_data(graph)
graph_json_nodes = graph_json['nodes']
graph_json_links = graph_json['links']

Because we've separated out the nodes and links, we don't need the const data dictionary. We can just have the links and nodes populated directly.
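To see the shape of data this conversion is aiming for without installing networkx, here is a plain python stand-in that builds similar nodes and links lists from simple rows. It is not the networkx implementation, it doesn't merge repeated edges the way a networkx graph would, and rows_to_node_link is a hypothetical name.

```python
def rows_to_node_link(rows, weight_name="weight"):
    """Build d3-style nodes and links lists from (source, target, weight) rows."""
    node_ids = []
    links = []
    for source, target, weight in rows:
        for node_id in (source, target):
            if node_id not in node_ids:  # keep first-seen order, no repeats
                node_ids.append(node_id)
        links.append({"source": source, "target": target, weight_name: weight})
    nodes = [{"id": node_id} for node_id in node_ids]
    return nodes, links


# the rows stand in for the three columns of the dataframe
rows = [
    ("apple", "banana", 1),
    ("apple", "orange", 2),
    ("banana", "orange", 3),
    ("orange", "mango", 3),
]

nodes, links = rows_to_node_link(rows)
print(nodes)
print(links[0])
```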

// links and nodes data
const links = %%links%%;
const nodes = %%nodes%%;

Let's try it, and also add more data to the dataframe.

And the result is what we expect, a graph with two loops, because that's what the data contains.

Graph Style

We can move the style of the nodes and links to a CSS stylesheet in the HTML template file and refer to the classes by adding class attributes to the links and nodes groups using d3.

// add links to svg element
const link = svg.append("g")
    .attr("class", "links")
    .selectAll("line")
    .data(links)
    .enter().append("line")
    .attr("stroke-width", d => Math.sqrt(d.weight));

To add node name labels that move with the nodes, we need a different svg structure. We need to place the svg circle and svg text elements together in an svg group, one for each node. That means changing how we build the svg.

const node = svg.append("g")
    .attr("class", "nodes")
    .selectAll("g")
    .data(nodes)
    .enter().append("g");

const circle = node.append("circle")
    .attr("r", 4.5)
    .call(drag(simulation));

// svg text labels for each node
const label = node.append("text")
    .attr("dx", 10)
    .attr("dy", ".35em")
    .text(d => d.id);

It's easier to see what's changed as a picture.

Let's see the results of applying this to data from the recipes data set, keeping only the word pairs where the co-occurrence is more than 1.

That's a rather nice graph, showing which words co-occur with other words, and more importantly, we can see clusters of related words.

And here's the interactivity working.

Great!

Unique Plots

A problem emerged when a notebook had more than one of these plots. To keep them unique we add a unique id placeholder to the HTML template which the python function replaces with a unique id generated for every call. This keeps multiple plots separate from each other, so that d3 selections find the right HTML element.
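A minimal sketch of that idea, assuming a placeholder called %%unique-id%% appears in both the HTML and javascript templates, could look like this. The placeholder name, the one-line templates and the helper names are all hypothetical.

```python
import uuid

# cut-down stand-ins for d3fdgraph.html and d3fdgraph.js
HTML_TEMPLATE = '<div id="%%unique-id%%"></div>'
JS_TEMPLATE = "const svg = d3.select('#%%unique-id%%').append('svg');"


def make_unique_id():
    """Return a fresh id that is safe to use in a selector such as #id."""
    # a CSS identifier can't start with a digit, so lead with a letter
    return "u" + uuid.uuid4().hex


def fill_unique_id(template, unique_id):
    """Replace the id placeholder everywhere it appears in the template."""
    return template.replace("%%unique-id%%", unique_id)


plot_id = make_unique_id()
html = fill_unique_id(HTML_TEMPLATE, plot_id)
js_code = fill_unique_id(JS_TEMPLATE, plot_id)
print(html)
print(js_code)
```

Using the same generated id in both templates within one call is what ties each block of javascript to its own <div>.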

Next Steps

Now that we have the basics working, we'll refine the aesthetics and design of the plot as we use them for actual data.

We'll also package up the module so it can be installed with pip.

More Resources

You can find a really helpful overview of how d3 expects to be used in these two youtube tutorials: [1], [2].
The D3 In Depth tutorial bridges the gap between the basic introductions and the official documentation: https://d3indepth.com/


Monday 4 February 2019

Force Directed Graphs in Jupyter Notebook