The Art and Science of Data-Driven Journalism

Shifting Context

While it’s easy to get excited about gorgeous data visualizations or a national budget that’s now more comprehensible to citizens, the use of data journalism in investigations that stretch over months or years is one of the most important trends in media today. Powerful Web-based tools for scraping, cleaning, analyzing, storing, and visualizing data have transformed what small newsrooms can do with limited resources. The embrace of open source software and agile development practices, coupled with a growing open data movement, has breathed new life into traditional computer-assisted reporting (CAR). Collaboration across newsrooms and a focus on publishing data and code that show your work differentiate the best of today’s data journalism from the CAR of decades ago. By automating tasks, one data journalist can increase the capacity of those with whom she works in a newsroom and create databases that may be used for future reporting. That’s one reason (among many) that ProPublica can win Pulitzer Prizes without employing hundreds of staff.

“We live in an age where information is plentiful,” said Derek Willis, a journalist and developer at the New York Times. “Tools that can help distill and make sense of it are valuable. They save time and convey important insights. News organizations can’t afford to cede that role.”

Data journalism can be created quickly or slowly, over weeks, months, or years. Either way, journalists still have to confirm their sources, whether they’re people or data sets, and present them in context. Using data as a source won’t eliminate the need for fact-checking, adding context, or reporting that confirms the ground truth. Just the opposite, in fact. Data journalism empowers watchdogs and journalists with new tools.
It’s integral to a global strategy to support investigative journalism that holds the most powerful institutions and entities in the world accountable, from the wealthiest people on Earth, to those involved in organized crime, multinational corporations, legislators, and presidents.^37^

The explosion in data creation and the need to understand how governments and corporations wield power have put a premium upon the adoption of new digital technologies and development of related skills in the media. Data and journalism have become deeply intertwined, with increased prominence given to presentation, availability, and publishing. Unfortunately, in recent years, attacks on the press have also grown,^38^ while global press freedoms^39^ have diminished to the lowest levels in a decade.^40^

Around the world, a growing number of data journalists are doing much more than publishing data visualizations or interactive maps. They’re using these tools to find corruption and hold the powerful to account. The most talented members of this journalism tribe are engaged in multi-year investigations that look for evidence that supports or disproves the most fundamental question journalists can ask: Why is something happening? What can data, married to narrative structure and expert human knowledge, tell us about the way the world is changing? Along with delivering the accountability journalism that democracies need to provide checks and balances (speaking truth to and about the powerful), data journalists are also, in some cases, building the next generation of civic infrastructure out of public domain code and data. Such code might include open source survey tools,^41^ an open election database,^42^ Census data,^43^ [...] playgrounds.^44^

[...] principles that built the Internet and World Wide Web,^45^ and strengthened by peer networks^46^ between data journalists and civil society.
The data and code in these efforts (small pieces, loosely joined by the social Web and application programming interfaces) will extend the plumbing of digital democracy in the 21st century.

“I’m really hopeful that by making data about these facets of our communities more accessible to journalists, we’ll make it easier for them to report stories that help readers unpack the complexity,” said Ryan Pitts, a developer journalist at Census Reporter, in an interview. “Narrative along with this kind of data is a really powerful combination. I think it’s the kind of thing a community needs before it can get at the really important question: So what do we do about this?”^47^

For the most advanced practitioners, data journalism is a powerful tool that integrates computer science, statistics, and decades of learning from the social sciences in making sense of huge databases. At that level, data journalists write algorithms to look for trends and map the relationships of influence, power, or sources.

As they find patterns in the data, journalists can compare the signals and trends they discover to the shoe-leather reporting and expert sources that investigative journalists have been using for many decades, adding critical thinking and context as they go. In addition to asking hard questions of people, journalists can now interrogate data as a source. “What’s different about practicing data journalism today, versus 10 or 20 years ago, was that from the early 1990s to mid-2000s, the tools didn’t really change all that much,” said Matt Waite, a journalism professor at the University of Nebraska who co-created PolitiFact, the Pulitzer Prize-winning website:

The big change was we switched from FoxPro to Access for databases. Around 2000, with the [U.S.] Census, more people got into GIS. But really, the tools and techniques were pretty confined to that tool chain: spreadsheet, database, GIS. Now you can do really, really sophisticated data journalism and never leave Python.
There’s so many tools now to do the job that it’s really expanding the universe of ideas and possibilities in ways that just didn’t happen in the early days.

Newsrooms, nonprofits, and developers across the public and private sector are all grappling with managing and getting insight from the vast amounts of data generated daily. Notably, all of those parties are tapping into the same statistical software, Web-based applications, and open source tools and frameworks to tame, manage, and analyze this data. “Five years ago, this kind of thing was still seen in a lot of places at best as a curiosity, and at worst as something threatening or frivolous,” said Chase Davis. He continued:

Some newsrooms got it, but most data journalists I knew still had to beg, borrow, and steal for simple things like access to servers. Solid programming practices were unheard of. Version control? What’s that? If newsroom developers today saw Matt Waite’s code when he first launched PolitiFact, their faces would melt like Raiders of the Lost Ark.

Now, our team at the Times runs dozens of servers. Being able to code is table stakes. Reporters are talking about machine-frickin’-learning, and newsroom devs are inventing pieces of software that power huge chunks of the Web. The game done changed.^48^

The market for data journalists is booming. New media outlets are competing for eyeballs with sites from the Mirror and the Atlantic Media Group, The Economist’s Data Blog, the Guardian Datablog, The Upshot from the New York Times, and a forthcoming data-driven site from the Washington Post. A growing number of tools, online platforms, and development practices have transformed the field, from the use of Google and Amazon’s clouds, to the creation and maturation of open source software and the proliferation of open data resources around the globe.