The Art and Science of Data-Driven Journalism

Embracing Data Transparency

The Datablog and its editor set an important standard that many other data journalists continue to embrace: Show your work and share your data. I profiled the data journalism work of the Los Angeles Times in early 2013, when I interviewed news developer Ben Welsh about the newspaper’s Data Desk.[151] Its team of reporters and Web developers specializes in maps, databases, analysis, and visualization. For instance, its interactive visualization mapped how fast the Los Angeles Fire Department responds to calls.[152] Its series on 911 breakdowns in the LAFD[153] combined investigative journalism with data analysis to create important, compelling narratives that held the government accountable and demonstrated that significant issues existed in the city’s data-collection practices.

The investigation offered an ageless insight that will endure well beyond the “era of big data”: Poor collection practices and aging IT will derail any institutional efforts to use data analysis to improve performance.

The Los Angeles Times found that poor recordkeeping is holding back state government efforts to upgrade California’s 911 system. As with any database project, beware of “garbage in, garbage out,” or “GIGO.”

As Ben Welsh and Robert J. Lopez reported for the L.A. Times in December of 2012, California’s Emergency Medical Services Authority has been working to centralize performance data since 2009. Unfortunately, it’s difficult to achieve data-driven improvements or manage against perceived issues by applying big data to the public sector if the data collection itself is flawed.[154] The L.A. Times reported quality issues ranging from how response times were measured, to recordkeeping on paper, to a failure to keep records at all.

When I profiled Ben Welsh’s work in 2012, he told me this kind of project was exactly the sort of work he’s most proud of doing. “As we all know, there’s a lot of data out there,” said Welsh, “and, as anyone who works with it knows, most of it is crap. The projects I’m most proud of have taken large, ugly data sets and refined them into something worth knowing: a nut graf in an investigative story or a data-driven app that gives the reader some new insight into the world around them.”[155]

A more recent project is applying data journalism to local government accountability in Oakland, at a website called Oakland Police Beat that went live in the spring of 2014.[156] The site, a project of Oakland Local and the Center for Media Change (and funded by the Ethics and Excellence in Journalism Foundation and the Fund for Investigative Journalism), was co-founded by Susan Mernit and Abraham Hyatt, the former managing editor of ReadWrite. (Disclosure: Hyatt edited my posts there.)

Oakland Police Beat is squarely aimed at shining sunlight on the practices of Oakland’s law enforcement officers. Its first story out of the gate pulled no punches, finding that Oakland’s most decorated officers were responsible for a high number of brutality lawsuits and shootings.

The site also demonstrated two important practices that deserve to become standard in data journalism: explaining the methodology behind its analysis, including source notes, and (eventually) publishing the data behind the investigation. ProPublica does it, the Datablog does it, and so does the Los Angeles Times. The Times Data Desk set a high bar in its investigation of ambulance response times by not only making sense of the data, but also releasing the data behind the open source maps of California’s emergency medical agencies into the public domain as part of the series.[157] This wasn’t the first time the team made code available, nor the last. (Just visit the Data Desk’s GitHub account for proof.)[158] As Welsh noted in a post about the series,[159] the Data Desk has “previously written about the technical methods[160] [used] to conduct [the] investigation, released the base layer created for an interactive map of response times,[161] and contributed the location of LAFD’s 106 fire station[s] to the Open Street Map.”[162]

As Scott Klein put it:

“If it’s done well, people have a really big appetite to see the data for themselves. Look how many people understand (and love) incredibly sophisticated and arcane sports statistics. We ought to be able to trust our readers to understand data in other contexts too. If we’ve done our jobs right, most people should be able to go to our Prescriber Checkup news application,[163] [look up their] doctors, and see how their prescribing patterns compare to their peers, and understand what’s at play and what to do with the information they find.”

Follow-through on this kind of thinking is what really made me sit up and take notice of The Upshot,[164] the New York Times’ new data-driven website. It made editorial decisions to share how reporters found the income data,[165] link to the data set, and share both the methodology[166] behind the forecasting model and the code for it on GitHub.[167] [...] journalism[168] is practiced in 2014, and sets a high standard right out of the gate for future interactives at The Upshot and for other sites that might seek to compete with its predictions.

I was not alone in my positive assessment of the content, presentation, and strategy of the Times’ new site: Over at the Guardian Datablog, James Ball published an interesting analysis of data journalism, as seen through the initial foray of The Upshot; FiveThirtyEight; and Vox, the “explanatory journalism” site Ezra Klein, Melissa Bell, and Matt Yglesias, among others, launched in the spring of 2014.[169] [...] with respect to his points about audience, diversity, and personalization.
The point that is particularly important is the one I’ve made repeatedly above, that data journalists should try to be open about the difficult, complicated process of reporting on data as a source:

“Doing original research on data is hard: It’s the core of scientific analysis, and that’s why academics have to go through peer-review to get their figures, methods, and approaches double-checked. Journalism is meant to be about transparency, and so should hold itself to this standard, at the very least.

“This standard is especially true for data-driven journalism, but, sadly, it’s not always lived up to: Nate Silver (for understandable reasons) won’t release how his model works, while FiveThirtyEight hasn’t released the figures or work behind some of their most high-profile articles.

“That’s a shame, and a missed opportunity: Sharing this stuff is good, accountable journalism, and gives the world a chance to find more stories or angles that a writer might have missed.

“Counterintuitively, old media is doing better at this than the startups: The Upshot has released the code driving its forecasting model, as well as the data on its launch inequality article. And the Guardian has at least tried to release the raw data behind its data-driven journalism since our Datablog launched five years ago.”

In May of 2014, the backlash to data journalism is still growing, as more academics, economists, and statisticians read and react to the style and format of the pieces published at Vox, FiveThirtyEight, and The Upshot. The reaction, however, is to the brand and form of data journalism practiced there, in which an author consults data, available research, and charts to examine a question or story, combines them relatively rapidly, and presents the result in a series of charts or maps wrapped in narrative text. This form departs from the slower-moving investigative features and news applications produced in preceding years.
A survey comparing the data publishing habits of these three sites shows that none is meeting the standard set by the Guardian Datablog or ProPublica.

Of the 290 items published in the catch-all FiveThirtyEight RSS feed, available since the site launched in March of 2014, 114 are features.[170] [...] of these stories has been uploaded to its data directory on GitHub; that’s a transparency rate of only 3.4 percent.[171]

[...] and transparency regarding a story on inequality, publishing the data and the model used to analyze it to its GitHub account. The Times has since published data for another story and open sourced code for a Ruby gem that extracts press releases and statements by members of Congress.[172]

[...] data used in its stories, although Vox Media has updated the code for Chorus, its content management system, over 62,000 times.[173]

[...] released in raw form, particularly if it contains personal or private details. Resource constraints may mean that scrubbing data properly isn’t possible, which would argue against release. Practices can change too: The Guardian Datablog stopped publishing open data into its data store in 2014.[174]