Open Data Day 2014

The International Open Data Day is almost over (here at least), and I was lucky enough to join the Oxford Hackathon. This was my first full-day hackathon, and although we didn’t achieve everything we set out to, it was a great day with great people. Plus I learned a lot and got to play with some interesting technologies.

You should definitely check out the international broadband price comparison visualisation, based on Google’s international broadband pricing study. Well done guys!

The day was run at the Oxford e-Research Centre, and after a leisurely start we went around the table and introduced ourselves. There was a good turnout, although I quickly realised I was the only JVM programmer. A few interesting options emerged, and I joined a group with the aim of graphing tweet counts against water levels for Oxford.

For more information, there’s a page on the opendataday.org wiki about the Oxford event.

Flood level data

The flood topic was put forward with a blog post requesting more open data on flooding, water levels, and precipitation.

Our first find was the Gov.UK Flood Data project and their Github project. This gave us some tab-separated flood data to play with, so off we went.

Mongo and Groovy munging

We weren’t really sure what we wanted to do with the data beyond our initial graph, so we set about munging it into a MongoDB store. Often the problem at events like this is that there’s no common place to host things. Fortunately there are now loads of hosted service offerings, and we used one of MongoLab’s free hosted sandbox plans. Win!

I set about writing a Groovy importer to load the tab-separated files into MongoDB, while Artur and Chris built a Python Flask frontend and a downloader for the TSV files. Groovy performed suitably well (even under my fingers) at processing and uploading the data quickly, although there wasn’t an obvious TSV parser; in hindsight a decent CSV parser configured for tabs would do the job quite nicely. Our data was pretty clean, but YMMV.
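For flavour, this is roughly the hand-rolled parsing that got us by; the file name and column names are invented here, and a real CSV library would handle quoting and escaping properly:

def map = [:]
new File('river-levels.tsv').withReader { reader ->
    // First line is the header row; zip each data row against it to build a map
    def headers = reader.readLine().split('\t') as List
    reader.eachLine { line ->
        def fields = line.split('\t') as List
        def row = [headers, fields].transpose().collectEntries { [(it[0]): it[1]] }
        map[row['stationId']] = row   // 'stationId' stands in for whatever the key column was
    }
}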

Pushing to Mongo turned out to be trivial with gmongo:

Update: thanks to Asuraphel for pointing out this needs to close the connection once it’s done.

import com.mongodb.MongoClient
import com.mongodb.MongoCredential as MC
import com.mongodb.ServerAddress
import com.gmongo.GMongo

// createMongoCRCredential takes (username, database, password)
def credential = MC.createMongoCRCredential( "user", "db", "pass".toCharArray() )
def mongoClient = new MongoClient( new ServerAddress("ds033429.mongolab.com", 33429), [ credential ] )
def mongo = new GMongo( mongoClient )

def db = mongo.getDB("CloudFoundry_8sbeqjqe_6hqm0ehp")

// map holds one parsed TSV row per key; << inserts each row as a document
map.each { key, value ->
    db.riverlevelsgroovy << value
}

mongo.close()

Feeling rushed, I pushed on with producing code and neglected to add any Spock tests after the first 30 minutes. Bad Ryan!

Recently I’ve found Gradle to be great for building, testing, deploying (to https://bintray.com/) and managing dependencies of Groovy projects. The only thing I couldn’t get it to do was run a Groovy script as part of gradle run.
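In hindsight a plain JavaExec task would probably have done the trick, since Groovy compiles each script to a class of the same name. Something like this sketch, where the task name, script class and dependency versions are all guesses for illustration:

// build.gradle: a rough sketch, not what we actually ran on the day
apply plugin: 'groovy'

repositories { mavenCentral() }

dependencies {
    compile 'org.codehaus.groovy:groovy-all:2.2.2'
    compile 'com.gmongo:gmongo:1.2'
}

// Groovy compiles a script like TsvImporter.groovy into a TsvImporter class,
// so a JavaExec task can stand in for 'gradle run'
task runImport(type: JavaExec) {
    main = 'TsvImporter'
    classpath = sourceSets.main.runtimeClasspath
}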

Grails JSON endpoint

Chris was building a frontend using Flask so we could use d3 to visualise the data. I was interested to see how quickly I could flesh out a basic Grails app and investigate the MongoDB support. Removing the Hibernate plugin and adding Mongo for GORM was trivial, and a couple of extra mapping entries sorted out the objects.
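From memory the swap in BuildConfig.groovy looked roughly like this; the plugin versions are approximate:

// grails-app/conf/BuildConfig.groovy: drop Hibernate, pull in the MongoDB GORM plugin
plugins {
    // runtime ":hibernate:3.6.10.3"   // removed
    compile ":mongodb:1.3.0"           // version is a guess for early 2014
}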

Using the new(ish) @Resource annotation on the domain object made it easy to get a simple REST endpoint listing the data (or at least the first 10 entries), but I couldn’t see a way to easily add filtering on attributes, so a RestfulController was added. Either way, it was a great win for Grails: simple JSON endpoints in 15 minutes.
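Reconstructed from memory, and with the domain fields invented here, it was something along these lines (the two classes would live in separate files in a real app):

import grails.rest.Resource
import grails.rest.RestfulController

// Annotating the domain class is all @Resource needs for a basic JSON listing
@Resource(uri = '/readings', formats = ['json'])
class WaterLevelReading {
    String region
    String station
    BigDecimal level
    Date timestamp
}

// For filtering we swapped in a RestfulController; overriding index to honour
// a ?region= query parameter is one straightforward way to do it
class WaterLevelReadingController extends RestfulController<WaterLevelReading> {
    static responseFormats = ['json']

    WaterLevelReadingController() {
        super(WaterLevelReading)
    }

    def index(Integer max) {
        params.max = Math.min(max ?: 10, 100)
        def readings = params.region ?
                WaterLevelReading.findAllByRegion(params.region, params) :
                WaterLevelReading.list(params)
        respond readings
    }
}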

The difficulty came when we wanted to filter the data. I’ve not used MongoDB before, but adding an index on the field returned instantly without speeding up the lookup, so I’m assuming we did something wrong.
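For the record, this is roughly how the index was added through gmongo, and explain() is the obvious next thing to check whether a filtered query actually uses it. I’m assuming gmongo’s usual map-to-DBObject conversion applies to these calls too, and 'Thames' is just an example value:

// Index the region field, then check whether a filtered query uses it
db.riverlevelsgroovy.ensureIndex([region: 1])

def plan = db.riverlevelsgroovy.find([region: 'Thames']).explain()
println plan['cursor']   // a BtreeCursor here means the index is being used
println plan['millis']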

MongoDB performance for the novice

Using gmongo to do an import of a reasonable amount of TSV data took ages. I’d love to hear what the obvious mistake in my code is. I guess some of it is down to the internet connection, but it took nearly an hour to create a ~20 MB MongoDB database.
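My best guess now is that inserting one document per round trip was the culprit; batching the inserts would cut the number of network round trips dramatically. Something like this, assuming gmongo converts a list of maps the same way it converts a single one (the batch size is plucked out of the air):

// Insert in batches rather than one document at a time, since each insert
// call is a network round trip to MongoLab
map.values().toList().collate(500).each { batch ->
    db.riverlevelsgroovy.insert(batch)
}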

Where I hit real problems was using GORM. We wanted to filter the objects by their region attribute, so I just did a simple WaterLevelReading.findAllByRegion(params.region). The result was a ~5 minute request, a little too slow. I’m sure there’s something simple I missed, but it was a surprising problem to hit.
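If I’m remembering the GORM for MongoDB docs correctly, the tidier fix would have been to declare the index in the domain class mapping block (untested on the day):

class WaterLevelReading {
    String region
    // ...

    // Index region so findAllByRegion doesn't have to scan the whole collection
    static mapping = {
        region index: true
    }
}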

Running out of time

Unfortunately we reached the end of the day before we could get onto the “cool” visualisation bits, but we regrouped and found that mining Twitter is actually quite difficult. Their API only returns a subset of the tweets available, and for the flood-related searches only ~30 tweets could be found, despite it being a hot topic a month ago.

Even though we didn’t complete our project, it was a great day and a chance to work with interesting people. I’ve come away with a load more ideas, things to learn, and can’t wait for the next event. Thanks to the organisers Iain and Jenny for helping the day go smoothly and making us all feel welcome.

You can find the (very rough) code on GitHub and the hackathon wiki page here.