The Art and Science of Data-Driven Journalism

New Tools to Wrangle Unstructured Data

The rapid expansion in the amount of unstructured data,268 need for this kind of expertise in-house. When the Guardian’s data team was faced with making sense of the WikiLeaks cables, it took months to work through them.269 hammering governments to give us data in columns and rows,” said Cohen. “I think we’re increasingly seeing that stories just as likely (if not more likely) come from the unstructured information that comes from documents, audio and video, tweets, other social media, from government and non-government sources.”

Making sense of all of that data is both a huge opportunity and an immense challenge for newsrooms. Once upon a time, it was difficult for investigators to find information relevant to answering a question. Today, in many (if not all) scenarios, the opposite is true, particularly in a world where readers have access to search engines. That has shifted the value that journalists can add, from finding information to making sense of what’s actually happening, processing, analyzing and vetting data, and finding signal in the digital noise.

That new landscape is precisely why the Knight News Challenge gave $1.5 examine data.270 Project,271 newsroom with a set of open source,272 oriented at making it easier for journalists to use and analyze data, and Overview,273 cleaning, visualizing, and interactively exploring large documents and data sets, acting as a kind of “editorial search engine.”274 Stray, Overview’s project manager and a research fellow at the Tow Center, describes it as an organizational structure for data.275 bread-and-butter issues for newsrooms struggling to manage data. As of March 2014, PANDA had been installed in 25 newsrooms around the United States.

“It’s a pain to search across data sets, but we also have this general newsroom content management issue,” said Brian Boyer, the product manager for PANDA and head of NPR’s News Applications team. “The data stuck on your hard drive is sad data.
Knowledge management isn’t a sexy problem to solve, but it’s a real business problem. People could be doing better reporting if they knew what was available. Data should be visible internally.”

Boyer thinks the trends toward big data in media are clear, and that he and other hacker journalists can help their colleagues not only to understand it, but to thrive. “There’s a lot more of it, with government releasing its stuff more rapidly,” he said. “The city of Chicago is releasing a lot of it. We’re going for increased efficiency, to help people work faster and write better stories. Every major news org in the country is hiring a news app developer right now. Or two. For smaller news organizations, it really works for them. Their data apps account for the majority of their traffic.”

Once such databases are up and running, journalists can apply analytical tools to produce evidence-driven reporting. The difficulty ProPublica had with building the “Dollars for Docs” project puts the scale of that work into perspective, from converting PDFs to dirty data, to fact-checking correlations within the massive databases.276 read Dan Nguyen’s guide to scraping data,277 Klein’s style guide for news apps,278 exploration of “how data sausage is made.”279 journalists start working more with data, they have more choices for tools than ever before. There is also powerful new data-journalism software coming online, from analysis to visualization tools. As Eric Newton highlighted at the Knight Foundation, many of these new tools help journalists gather, clean, analyze, and publish data and do not require sophisticated programming knowledge to use.280 the head of the Knight-Mozilla News Technology Partnership for Mozilla, wrote last year, journo-coders are now taking social coding “to a whole new level.”281

Just as civic software282 baked into government, open source is playing a pivotal role in the practice of data journalism.283 While many news developers are agnostic with respect to which tools they use to get a job done, the people who are building and sharing tools for data journalism are often doing it with open source code.

While some of that open source development has been driven by the requirements of the Knight News Challenge, which funded the PANDA and Overview projects, there’s a broader collaborative spirit evidenced in the interstitial communication on Twitter, GitHub, and mailing lists that connect the data-driven journalism community around the world.

Members of newsrooms that compete on beats are working together on code. For instance, New York Times and Washington Post developers are teaming up284 database.285 Data journalists from WNYC, the Chicago Tribune and the Spokesman-Review are collaborating on building a better interface for Census data.286 helped build the Internet are building out civic infrastructure.287 newsroom stack,288 be fiercely committed to “showing your work.” For data journalists, that means sharing your source data, methodology, and code, not just a notebook. To put it another way, “code, don’t tell.”289