The Art and Science of Data-Driven Journalism

Digging into the CAR Toolbox

As is true in the trades, the arts, and the sciences, the tools data journalists choose are driven by the needs of a given project, available resources, expertise, training, and time. The tools can be divided into five rough categories: data collection, cleaning, analysis, presentation, and publishing. Cleaning data “is often the most time consuming part of the data journalism process,” said Jonathan Stray, an instructor at Columbia Journalism School, who has highlighted the widespread problem of governments publishing data locked up in the Portable Document Format (PDF) and the heroic measures needed to deal with the challenge.266
Data journalism, however, has little to do with tools and technology and everything to do with perspective and critical thinking. “You need a mindset which is about putting this in the context of the story and spotting stories, as well as having creative and interesting ideas about how you can actually collect this material for your own stories,” said Emily Bell. “It’s not a passive kind of processing function if you’re a data journalist: It’s an active speaking, inquiring, and discovery process. I think that that’s something which is actually available to all journalists.”
If you look at data journalism in the big picture, more recent technologies are part of a continuum of technologically enhanced storytelling that traces back decades.267
The canonical suite of tools for computer-assisted reporting (CAR) ran on desktops and servers: spreadsheets, databases, text editors, and statistics software. Spreadsheets were the first “killer app” for data journalism, just as VisiCalc was the first killer app for the Apple II. In many ways, they still are, even if the spreadsheets have become Web-based. Chris Amico and Laura Norton Amico’s work on Homicide Watch started as a spreadsheet and expanded over time.
“No matter how advanced our tools get, I always find myself coming back to Excel first to do simple work,” said Minkoff, a data journalist at the Associated Press. “It helps us get an overall handle on a data set.”
After spreadsheets, the second most common tool applied in the field is database software, in particular Microsoft Access, MySQL, PostgreSQL, or SQLite. A text editor, like TextMate or BBEdit, and statistics software, like SPSS Statistics, round out the basic suite of tools that have been used for CAR for many years.
Today, data journalists use Web-based tools for data collection, manipulation, analysis, and visualization, like OpenRefine, Google Fusion Tables, and Tableau. They’re also working with modern programming languages, like Python, Ruby, and JavaScript, as well as D3, a JavaScript library. “We love tools that don’t need a developer every time to create interactive content,” said Momi Peralta. “These are end user’s tools. Google Docs, spreadsheets, OpenRefine, Junar’s open data platform, Tableau Public for interactive graphs, and now JavaScript or D3.js for reusable interactive graphs tied to updated data sets.”
Tool choice brings with it the thorny issue of newsroom culture, as previously referenced, right down to an organizational DNA that venerates narrative writing, mistrusts the messy online news environment, and is slow to adopt new technologies. It wasn’t so long ago that the people in charge of a newspaper’s website worked in different departments, or even different buildings, from the reporters working a story. (That’s still true in some media companies.)
The integration of the Internet into the collection and production of the news demonstrates that traditional media institutions can and will adapt and adopt new technologies and practices. That shift will continue to accelerate globally as the advantages of data-driven storytelling become apparent.