While reading a paper for class, I felt compelled to try my hand at implementing the approach it took. A lot of the time when I read things, they make some sense, but I don’t really know how much I don’t know about them until I stop reading and try doing. The paper is called *Finding and Evaluating Community Structure in Networks*, from 2003.

I read the paper a couple months ago, and the other day started thinking about all the uses for a means of picking out communities within a larger network. Basically, the paper says we should calculate the betweenness of each edge in a graph, and then remove the edges with the highest betweenness. Since betweenness measures how often an edge is crossed on a shortest path between pairs of nodes in the graph, we’ll be removing the edges that are most commonly crossed on a shortest path from node a to node b. Eventually, the original graph is split up into smaller graphs which, from the authors’ perspective, group together nodes that are more similar to each other.
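To make the edge-betweenness idea concrete, here is a minimal sketch (not the paper’s exact algorithm, which weights all shortest paths) that counts, for a toy graph of two triangles joined by a single bridge, how often each edge appears on one BFS shortest path per node pair. The bridge between the two clusters ends up with the highest count, which is exactly why removing high-betweenness edges separates communities:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """BFS returning one shortest path from start to goal (unweighted graph)."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nbr in graph[path[-1]]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(path + [nbr])
    return None

def edge_betweenness(graph):
    """Count how often each edge lies on a shortest path, over all node pairs.
    Simplification: only one shortest path per pair is counted."""
    counts = {}
    nodes = sorted(graph)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            path = shortest_path(graph, a, b)
            if path is None:
                continue
            for u, v in zip(path, path[1:]):
                edge = tuple(sorted((u, v)))
                counts[edge] = counts.get(edge, 0) + 1
    return counts

# Two triangles joined by a single bridge edge (3, 4):
graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
scores = edge_betweenness(graph)
bridge = max(scores, key=scores.get)
# Every one of the 9 cross-triangle shortest paths crosses the bridge (3, 4).
```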

So, thinking about this in terms of a real community, I figured that two subreddits a and b are connected if some user has two comments c1 and c2 that live in a and b respectively. That constitutes an edge in the graph, where each node is a subreddit. I used the python reddit wrapper to pull some submissions and comments down, from which I constructed a graph. I figured it would be neat to evaluate this on large sets of data, so I stored the graph in a redis instance. Once the data is downloaded, a python script loads the entire data set from redis and begins classifying communities (the fun part!). The following occurs:
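The graph-building step can be sketched roughly like this, with made-up (user, subreddit) comment records standing in for what the reddit wrapper would actually return. Each user’s set of subreddits is collected, and every pair of subreddits they both commented in becomes an undirected edge:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: one (username, subreddit) tuple per comment pulled down.
comments = [
    ("alice", "python"), ("alice", "learnprogramming"),
    ("bob", "python"), ("bob", "askscience"),
    ("carol", "askscience"), ("carol", "space"),
]

# Group each user's subreddits, then connect every pair they commented in.
user_subs = defaultdict(set)
for user, sub in comments:
    user_subs[user].add(sub)

edges = set()
for subs in user_subs.values():
    for a, b in combinations(sorted(subs), 2):
        edges.add((a, b))

# edges now holds one undirected edge per co-commented subreddit pair.
```

In the real pipeline these edges would then be written to redis and loaded back by the classifier script.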

- Calculate every shortest path for every pair of nodes in the graph
- For each node in the graph, find the fraction of shortest paths that pass through that node. This is node betweenness (as per wikipedia’s definition)
- Remove the edge with the highest betweenness (here, just the betweenness of the edge’s two endpoint nodes added together…maybe this approximation is flawed)
- Repeat
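The steps above can be sketched end-to-end in plain Python. This follows the post’s own recipe, including the (admittedly rough) approximation of scoring an edge by summing its two endpoints’ node betweenness, and repeats removals until the graph splits into two components:

```python
from collections import deque

def bfs_path(graph, start, goal):
    """One shortest path between start and goal, or None (unweighted BFS)."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nbr in graph[path[-1]]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(path + [nbr])
    return None

def node_betweenness(graph):
    """Fraction of pairwise shortest paths that pass through each node."""
    nodes = sorted(graph)
    hits, total = {n: 0 for n in nodes}, 0
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            path = bfs_path(graph, a, b)
            if path is None:
                continue
            total += 1
            for n in path[1:-1]:  # interior nodes only
                hits[n] += 1
    return {n: hits[n] / total for n in nodes} if total else hits

def components(graph):
    """Connected components found by flood fill."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n])
        seen |= comp
        comps.append(comp)
    return comps

def split_once(graph):
    """Remove the highest-scoring edge (summed endpoint betweenness,
    the post's approximation) and recalculate, until the graph splits."""
    while len(components(graph)) == 1:
        scores = node_betweenness(graph)
        edge = max(
            ((u, v) for u in graph for v in graph[u] if u < v),
            key=lambda e: scores[e[0]] + scores[e[1]],
        )
        graph[edge[0]].discard(edge[1])
        graph[edge[1]].discard(edge[0])
    return components(graph)

# Same toy graph: two triangles joined by the bridge edge (3, 4).
graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
parts = split_once(graph)  # the bridge goes first, leaving the two triangles
```

The recalculation inside the loop is the expensive part the post complains about later: every removal triggers a fresh all-pairs shortest-path pass.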

So then I can draw it all in arbor.js.

**You can view one of the results here.** I’m running the classifier on a much larger data set now. It’s slow, because of the step that recalculates betweenness after each edge is removed, so maybe a method like this wouldn’t work as well in production, where either you have speed requirements or massive data sets (or, you know, just profile the code that I left un-profiled). The visualization is a physicsy thing, so if you wait a bit and let it settle, the communities will begin to repel each other and you can see things better.

tl;dr: graphs and stuff. method of classifying subreddits based on users’ behavior.

Built with python and arbor.js