The Art and Science of Data-Driven Journalism

Data and Ethics

In recent years, more local, state, and national governments have begun proactively releasing public sector data in hopes of stimulating economic activity, improving services, or enhancing transparency and accountability. When these data sets detail performance, spending, budgeting, or services but do not include deliberations or policy decisions (which is to say, how power or influence is exercised), journalists have to keep digging, scraping, and investigating.

There are good reasons for journalists to be careful about a complete embrace of open government data, at least with respect to the data’s relationship to government transparency. There’s now considerable ambiguity regarding open government, as a 2012 paper, “The New Ambiguity of ‘Open Government,’” by Princeton scholars David Robinson and Harlan Yu explored. From their abstract:

Open technologies involve sharing data over the Internet, and all kinds of governments can use them, for all kinds of reasons. Recent public policies have stretched the label “open government” to reach any public sector use of these technologies. Thus, “open government data” might refer to data that makes the government as a whole more open (that is, more transparent), but might equally well refer to politically neutral public sector disclosures that are easy to reuse, but that may have nothing to do with public accountability. Today a regime can call itself “open” if it builds the right kind of website, even if it does not become more accountable or transparent. This shift in vocabulary makes it harder for policymakers and activists to articulate clear priorities and make cogent demands.[299]

As skeptical data journalists know, there’s a difference between open data that’s proactively disclosed by governments and data buried in PDFs released in response to Freedom of Information Act requests or lawsuits by media companies and advocates.

That said, there’s much to be gained, including benefits for data journalists, by pitching a big tent for open government, as Joshua Goldstein and Jeremy Weinstein argued in a response to Yu and Robinson in the UCLA Law Review. They wrote:

It is difficult to disagree with Yu and Robinson’s narrowest claim. Greater clarity about the complementary but distinct objectives of these different movements, and the likely impact of the specific governmental policies they advocate, is undoubtedly a good thing.

But saying that open data and open government can exist without the other, is not the same as saying that they should. Drawing on our respective experiences as a partner in Kenya’s Open Data effort and as a key architect of President Obama’s multilateral Open Government Partnership, we argue that the growing ties between the open data and open government movements, particularly in developing countries, can benefit both agendas.[300]

[...] releases was prevalent among most data journalists interviewed for this report, although it was coupled with ample caution and caveats.

“I can’t find any downsides of more data rather than less,” said Sarah Cohen of the New York Times, “but I worry about a few things.” First, Cohen emphasized, there’s the issue of whether data is created open from the beginning, and the consequences of “sanitizing” it before release. “The demand for structured, nicely scrubbed data for the purpose of building apps can result in fake records rather than real records being released,” Cohen said. “USASpending.gov is a good example of that: we don’t get access to the actual spending records like invoices and purchase orders that agencies use, or the systems they use to actually do their business.
Instead we have a side system whose only purpose is to make it public, so it’s not a high priority inside agencies and there’s no natural audit trail on it. It’s not used to spend money, so mistakes aren’t likely to be caught.”

Second, there’s the question of whether information relevant to an investigation has been scrubbed for release, said Cohen:

We get the lowest common denominator of information. There are a lot of records used for accountability that depend on our ability to see personally identifiable information (as opposed to private or personal information, which isn’t the same thing). For instance, if you want to do stories on how farm subsidies are paid, you kind of have to know who gets them. If you want to do something on fraud in Federal Emergency Management Agency claims, you have to be able to find the people and businesses who get the aid. But when it gets pushed out as open government data, it often gets scrubbed of important details and then we have a harder time getting them under the Freedom of Information Act because the agencies say the records are already public.

To address those issues, Cohen recommends getting more source documents, as a historian would. “I think what we can do is to push harder for actual records, and to not settle for what the White House wants to give us,” she said. “We also have to get better at using records that aren’t held in nice, neat forms; they’re not born that way, and we should get better at using records in whatever form they exist.”

Much of the time, government data is “dirty,” with missing metadata, incorrect fields, or gaps in collection. Journalists have to extract data from PDFs, validate it, and clean up data sets[301] [...] out, and then present it in context.

If the capacity to practice is there, data journalism can deliver notable results. For instance, ProPublica’s “Recovery Tracker”[302] [...] projects is one of the best examples of the practice in action.
Another gold standard for data journalism is the Pulitzer Prize-winning “Toxic Waters”[303] [...] that project makes it a difficult act to follow, though Times developers are working hard with projects like “Inside Congress.”[304]

[...] doing and what data journalists are working on is inescapable. Both are focused on putting data to work for the public good, whether it’s in the public interest, for profit, in the service of civic utility or, in the biggest crossover, government accountability.[305]

[...] Peralta:

The open data movement and hacktivism can accelerate the application of technology to ingest large sets of documents, complex documents, or large volumes of structured data. This will accelerate and help journalism extract and tell better stories, but also bring tons of information to the light, so everyone can see, process, and keep governments accountable.

The way to go for us now is use data for journalism but then open that data. We are building blocks of knowledge and, at the same time, putting this data closer to the people, the experts and the ones who can do better work than ourselves to extract another story or detect spots of corruption.

It makes lots of sense for us to make the effort of typing, building data sets, cleaning, converting, and sharing data in open formats, even organizing our own “datafest” to expose data to experts. Open data will help in the fight against corruption. That is a real need, as here corruption is killing people.

To do so will require that data journalists and civic coders alike apply powerful tools to the explosion of digital bits and bytes from government, business, and our fellow citizens. Data journalism could not be more timely given the massive amounts of government data being released, particularly in light of persistent quality issues.

“Open government data means that more people can access and reuse official information published by government bodies,” said Bounegru. “This in itself is not enough.
It is increasingly important that journalists can keep up and are equipped with skills and resources to understand open government data. Journalists need to know what official data means, what it says, and what it leaves out.”

That requires journalists to possess both numeracy and digital literacy if they’re going to interrogate the data. “Only by equipping journalists with the skills to use data more effectively can we break the current asymmetry, where our understanding of the information that matters is mediated by governments, companies, and other experts,” said Bounegru. “In a nutshell, open data advocates push for more data, and data journalists help the public to use, explore, and evaluate it.”

Open data needs to find people, not vice versa. For that to happen, supporting and extending the capacity of the media to practice data-driven journalism is a fundamental part of the equation. The role that the Fourth Estate plays in holding governments to account in the 21st century is no less pressing than in decades past. If anything, given how power is gathered and exercised in secret around the world, it’s more so. There’s a long history of elected officials and government staff trying to keep information that shows fraud, undue influence, embarrassing behavior, or outright criminality from coming to the public’s attention. That’s true today as well. To preserve such evidence, data journalists will also need to protect data securely, just as editors have historically protected human sources. When great investigative work is paired with data journalism, remarkable outcomes bloom.

“We took narrative reports from nursing home inspections and made them searchable[306] [...] government doesn’t allow,” said Ornstein, a senior reporter at ProPublica.
The resulting data-driven tool, which enables people to shop for nursing homes online,[307] [...] service journalism, giving people a way to make more informed decisions and adding an accountability mechanism for businesses and government in the process.

At ProPublica, the data journalism team is conscious of deep linking into news applications, with the perspective that the visualizations produced from such apps are themselves a form of narrative journalism. With great data visualizations, readers can find their own way and interrogate the data themselves. Moreover, distinctions between a news story and a news app are dissolving as readers increasingly consume media on mobile devices and tablets. One approach to providing useful context is the “Ion” format at ProPublica.org, where a project like “Eye on the Stimulus”[308] is a hybrid between a blog and an application. On one side of the Web page, there’s a news river. On the other, there are entry points into the data itself. The challenge of this approach is that a media outlet needs data specialists to work closely with the investigators, or needs them to become one and the same.

While that’s true regardless of the context, building data-driven capacity will necessarily start at different levels in different media cultures and climates. “Investigative journalism in Africa, like in many other places, tends to be scoop-driven, which means that someone has leaked you a set of documents,” said Justin Arenstein, a Knight International fellow embedded with the African Media Initiative (AMI)[309] [...] very few systematic, analytical approaches to analyzing broader societal trends,” he said. “You’re still getting a lot of hit-and-run reporting.
That doesn’t help us analyze the societies we’re in, and it doesn’t help us, more importantly, build the tools to make decisions.”

The strategy that Arenstein and the AMI are pursuing diverges from the news applications and data visualizations that are common outcomes of data journalism in Europe and the United States. They don’t just tell a story but give people a tool to understand a specific area, make a decision, and then take action. Arenstein emphasized the need to think deeply about how journalists use data in investigations, as opposed to treating it as raw material for a visualization. The strongest commonalities between the work Code for Kenya is doing and ProPublica’s work in the United States, in fact, lie in using data to support and augment investigative work, mapping the relationships of the powerful, and funding projects on extractive industries.

“We’re finding something that maybe you’re starting to see inklings of elsewhere as well: Data journalism doesn’t have to be the product,” he said. “Data journalism can also be the route that you follow to get to a final story. It doesn’t have to produce an infographic or a map.”