Doing data in a journalism course

It’s a subject that isn’t going away and it’s also one that generates a huge amount of debate – data journalism. If ever there was a perfect hook to hang all of journalism’s best and worst on, it’s data journalism! But a recent flurry of tweets and a nice ‘there’s no reason not to try this stuff’ post from Matt Waite focussed on one part of the debate – how should we be doing more of this in our j-courses and who should be doing it.

It was something that Matt kicked off with a tweet:

Quite a few people pitched in (an assortment of tweets below):

There is an interesting point in there about adjunct courses – essentially but not exclusively online courses – which I think is fair. There’s no better way to put journalists (and students) off than combining maths and computers!

As I said in my response, we do ‘data’ across all of our courses and I thought I’d share an example of the kind of intro practical stuff we are doing with first years (year one of a three-year degree). It’s done in the context of a broader intro to data and journalism, and it’s developed and expanded throughout the three years (more so as we are shifting things around in the courses), including a dedicated data journalism module.

My take at this stage is that data journalism is worth considering as part of a more structured approach to journalism. The students are no doubt fed up with my ‘process into content’ mantra.

Anyway. The two slideshows below are an intro: one is the context lecture and the other is the related workshop. And, yes, I know there is a fair bit of visualization in there – charts and maps – which some data people can get quite sniffy about. We are careful to make the point that not all data is visual, but I do think a visual output can be a quick win for capturing people’s interest. It’s just the start.

Again, these are just the slides, there is the usual amount of narrative and discussion that goes with this. They are presented as is:

Let me know what you think if you get a chance.

Data Journalism in Norway

I spent some time in Bergen last week (lovely place, bloody expensive beer!) to talk to some people about a new content management system. Whilst I was there I dropped in on a seminar about data…

…dedicated to the emerging “web of data” and how it could create new possibilities in a deeply disrupted media economy.

The shorthand for this was ‘breaking out of the silos’. To underline that point, some of the organizers were running around in municipal workers’ jackets. That was a bit lost on me other than thinking Norwegian workers are pretty snappy dressers!

It turned out to be a really interesting, mixed bag of people who were fired up by the possibilities of linking open data (LOD).

Pia J.V. Josendal opened the batting with a neat presentation that was a kind of dummies’ guide to data. A few interesting things in there for me, like finding out what a triple is and also the five-star rating system for your data.

The next speaker was Hjálmar Gíslason, from a nifty website that collects data (time series at the moment) and lets you visualize it like this.

I was struck by what a cool name they had and pondered that it shows just how recent the mainstream interest in this stuff is that you could get a name like that. Hjálmar Gíslason agreed.

@ Yup, when we secured the domain in 2009, the term "data market" had hardly been coined. Imagine?!

His presentation was quite nifty too.

One presentation I couldn’t stay for but looked really interesting was Rune Smistad’s run through rNews (a proposed standard for using RDFa to annotate news-specific metadata in HTML documents). The slides are interesting but I think I missed out by not hearing the context.

There was a heavy presence of journalists but they were by no means the majority; it wasn’t a data journalism conference. But it was clear that everyone thought journalism was where the concept was getting the most traction and the most use.

The UK got a lot of love for its data-J work during the sessions, but I saw a lot of similarities in the approaches. It also showed me that there are a lot of tech people, people who understand all this triples, SPARQL and data stuff. They can see the use for it and they have a passion for getting it out there. It doesn’t matter that they are in Norway (or the UK for that matter); they just want journalists to come and do good stuff with the data they are freeing from the silos.

Making an RSS feed where there isn’t one.

I’m very taken with the general move towards more data from primary sources: councils, government orgs etc. putting stats, facts, figures and information online for us to use and mash up. Those orgs who are savvy enough to drive this stuff through RSS make it even easier for us to harvest and add an extra dimension to our newsgathering.

Of course the public sector moves slowly when it comes to IT, and it’s no surprise that the majority of orgs still hide their content away on static pages. No RSS feed to help there. So what do we do?

Well, we could resign ourselves to adding them to the list of pages that we bookmark and visit – a bit like those regular calls we make to keep our contacts book fresh; no bad thing. But another solution is to use one of the many RSS services on the web to ‘scrape’ the page for content and convert it into a feed.

Preston city council (the council nearest to me at work) has a few feeds but none around the basic operation of the council – meetings, decisions etc. This kind of thing would be great to get a feed of. So I thought I would give it a go with their published decisions page using Feed43.

No feed for the dull stuff!

The first thing I did was set the search so that it showed all results. That way any new ones would show up by default. I did this by using an * in the search box. The * is a standard operator for a wild card or ‘any matches’. So it seemed a logical punt to try it.

The next step was to copy the web address to feed my RSS maker. The URL looks complex but it contains all the information needed to drive the search.

Feed43 grabs the whole page for you to explore

The first step with Feed43 is to feed it the URL, then click Reload. It pulls in the whole page and then you get to the hard bit. The idea with feed scrapers is to give them enough information about the way the stuff you want is presented that they can ‘spot’ it and ignore the rest. This means trawling through some HTML.

You get two options:

The global search pattern looks for HTML that ‘wraps’ the content you want to make into a feed. It could be the whole table that contains the search results. But this doesn’t really help in this case.

Better to go straight to the second option, which defines the specific things to look for to identify an item to be added to the feed. Here’s what I put:

<td > <a href="{%}" title="{*}">{%}</a></td>

In Feed43 language, {*} means ‘this could be anything, just ignore it’; {%} means ‘this is important, so store it’.
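For anyone who wants to see what that pattern is doing under the hood, here’s a rough Python sketch of the same idea using a regular expression – {*} behaves like a non-greedy ‘match and ignore’ and {%} like a capturing group. The HTML snippet and URL are made up for illustration; this isn’t Feed43’s actual implementation.

```python
import re

# A made-up snippet in the style of the council's decisions table
html = ('<td class="title"> <a href="/decision?id=42" '
        'title="Link to decision details">Spatial Strategy Review</a></td>')

# Feed43's item pattern:  <td > <a href="{%}" title="{*}">{%}</a></td>
# {*} becomes a non-greedy 'ignore' group, {%} a non-greedy capture
pattern = re.compile(
    r'<td[^>]*>\s*<a href="(.*?)" title="(?:.*?)">(.*?)</a></td>')

for link, title in pattern.findall(html):
    print(link, "->", title)   # /decision?id=42 -> Spatial Strategy Review
```

Each match gives you the pair Feed43 would label {%1} and {%2}.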

I could see from the HTML that each decision in the list looked like this:

<td > <a href=";displaypref=0" title="Link to decision details for North West England Regional Spatial Strategy Partial Review Consultation">North West England Regional Spatial Strategy Partial Review Consultation</a>

So I told Feed43 to look for anything between the <td> </td> tags regardless of what ‘class=’ said. Then I told it to grab the href link as the actual weblink, ignore the title attribute and grab the text between the <a> tags to use as the item title.

Finding the useful bits on the page means working through the HTML

Clicking Extract will filter the content and show you the results. You can see they are split into {%1} for the link and {%2} for the title of the decision.

The filtered results display in a list

The last step is to define which of these makes up the key parts of the feed. You can see it’s pretty straightforward to fill in the gaps at this point. Your feed is then ready to go; all you need to do is subscribe in the normal way.

The filtered results can be added to the feed template
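If you wanted to roll your own rather than rely on Feed43, this last step – turning the extracted link/title pairs into an RSS feed – is just a bit of XML. A minimal sketch with Python’s standard library (the items, URLs and feed titles are all invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical items in the {%1} link / {%2} title shape Feed43 extracts
items = [
    ("http://example.gov/decision?id=42", "Spatial Strategy Review"),
    ("http://example.gov/decision?id=43", "Budget Consultation"),
]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Published decisions"
ET.SubElement(channel, "link").text = "http://example.gov/decisions"
ET.SubElement(channel, "description").text = "Scraped council decisions"

# One <item> per scraped decision
for link, title in items:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link

print(ET.tostring(rss, encoding="unicode"))
```

Serve the resulting XML from any web server and readers can subscribe to it like any other feed.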

Moving beyond the basics

The thing that makes scraping pages difficult is picking through the HTML. Feed43 makes this easier by limiting the number of options to filter by. But if you need to push further, you will need to explore other options. One to consider is Yahoo Pipes, which has a page-grabber option. But you will also need to invest some time in understanding regular expressions.

I think this kind of stuff is more and more important for orgs and journalists, especially when it comes to councils and government orgs. We all know how ‘mundane’ many see this stuff as (important as it is). So making it into a feed would be more conducive to newsgathering by stealth – encouraging more ‘passive-aggressive newsgathering’, as Paul Bradshaw once described it.