The Art and Science of Data-Driven Journalism

Recommendations and Predictions

The world needs journalists with these skills more than ever. The same trends changing journalism and society^322^ have the potential to create significant social change throughout the world, as nation states move from conditions of information scarcity to abundance, causing vast disruptions to governance and governments.

Journalists have always needed to be able to write, interview, and fact-check their work. Today, photography, social media, video editing, and mobile devices have already become integral elements of the toolkits of many journalists. Whether news developers are rendering data in real time^323^ or improving news coverage with data,^325^ journalism still must tell a story, solve a problem, or speak truth to power. Smartphones, notebooks, cameras, social media, and data sets can extend investigations in important ways.

In the near future, expect basic data-science skills to become baked into how investigative journalists gather sources, find evidence, and present their findings, from building databases, to creating visualizations, to applying powerful analytical software. Along with those skills, journalists will still need to apply critical thinking and show how they reached conclusions.

While the need is acute and journalism schools are responding, significant cultural, fiscal, and technical barriers to the adoption of data journalism and digital skills remain. In May of 2014, a new report^326^ from the Duke Reporters’ Lab at the DeWitt Wallace Center for Media and Democracy in the Sanford School of Public Policy surveyed 20 newsrooms to find which digital tools are still missing. The top-line conclusions from Mark Stencel, Bill Adair, and Prashanth Kamalakanthan painted a sobering picture of an industry in flux. The report found that many U.S. newsrooms aren’t taking advantage of new, low-cost digital tools for reporting and presenting journalism, instead continuing to use familiar methods and practices.
Its authors suggest that journalism awards and popular media conferences have created the perception that the adoption of digital tools and data journalism is more prevalent than it is. While local newsroom leaders told the researchers that budget, time, and people were their primary constraints, deeper infrastructure and cultural issues are hindering adoption. The report describes an industry with a gap between “have and have-nots,” with national organizations experimenting with data journalism and new digital tools while local newsrooms are not.

“The local newsrooms that have made smart use of digital tools have leaders who are willing to make difficult trade-offs in their coverage,” write the authors. They prioritize stories that reveal the meaning and implications of the news over an overwhelming focus on chasing incremental developments. They also think of the work they can do with digital tools as ways to tell untold stories, not “bells and whistles,” wrote the authors.

Finberg noted^327^ that the report’s conclusions support findings of Poynter’s recent “Core Skills for the Future of Journalism” report, which was based on a broader sample of the industry: more than 2,900 respondents, including media organization professionals, independent or freelance journalists, educators, and students. “Professional journalists in legacy media rated new digital skills as much less important than traditional skills,” he wrote. “Educators, students, and independent journalists rated digital skills as much more important than the professionals.”

Finberg’s discussion of the report’s findings and data journalism is a reality check on the challenges that remain for its adoption, revealing a schism between educators and professionals:

The ability to find and make sense of information is almost the definition of newsgathering, so it seems safe to call this an essential skill for the beginning journalist.
We asked professionals and educators to rate the importance of two key aspects of newsgathering that require this ability. Both the ability to analyze and synthesize large amounts of data and the ability to interpret statistical data were rated as more important by educators than by professionals.

When it comes to the ability to analyze and synthesize large amounts of data, a little more than half (55 percent) of professionals said that this was important to very important. Almost three-fourths (73 percent) of educators said it was important to very important.

The response to the question about the ability to “interpret statistical data and graphics” was similar: 59 percent of professionals and 80 percent of educators called this skill important to very important.

Given the large amounts of data available on the Internet and the growing importance of presenting information in a pleasing and informative visual manner, the gap between educators and professionals is disturbing. The ability to make sense of our complex world by distilling meaningful information from the vast river of data is one of the great values professional journalists can offer their audience.

The third report, on innovation at the New York Times,^328^ was written for an internal audience, not public consumption. After the document leaked online in May of 2014 to BuzzFeed and Mashable, however, it was hailed by Joshua Benton, the director of Harvard’s Nieman Journalism Lab, as “one of the key documents of this media age.”^329^ There is a tremendous amount of insight and introspection in the 97-page report, which surveyed the media landscape of today in depth, drawing on interviews with dozens of staff at the New York Times and dozens more with outside observers, including this author. I spoke with a researcher from the Times’ team last year about the paper’s approach to digital journalism, editorial analytics, social media and data, along with my own reading, sharing, and commenting habits.
The report paints a picture of an extraordinary organization housed within an institution and business grappling with the same fundamental shifts that broader society is enduring in the 21st century, struggling at times to escape a 20th century legacy of tools, infrastructure, and culture. Even though the digital audience of the New York Times is larger than its print readership (31 million unique visitors a month online versus 1.6 million total daily circulation), the daily editorial workflow described remains focused on the paper, not the pixel. The report described the routine of a newsroom focused upon Page One and an incentive structure in which reporters are measured against their A1 stories. Instead of going “digital first” over the last decade, the publisher and leadership have continued to focus on the print edition. As the report notes, the paper currently derives three quarters of its revenues from print. That focus has costs, however: The report cites a failure to convert the 14.7 million articles in the Times’ archive into structured data. Not doing so has meant that the newspaper is not capitalizing on one of its primary assets by making it more discoverable through search, sharing through social media, and data mining.

There are many reasons to think that “The Gray Lady” could become much more than she used to be in the years ahead. The first redesign of the Times’ website in eight years went live in January of 2014, optimized for mobile devices and integrating native advertising. The parent company was profitable in the first quarter of 2014. In March of 2014, the Times expanded its digital offerings^330^ to include NYT Now, a lower-priced mobile app sold to iPhone users that summarizes the day’s top stories, and Premier, which offers expanded access to behind-the-scenes stories, ebooks, videos, and crosswords. The Times may also explore events, a lucrative concern for other media companies. As noted earlier, The Upshot launched in April of 2014, to general acclaim.
The Upshot’s team includes the graduate student in statistics who helped to build the news quiz on dialect while he was an intern at the Times.^331^ The quiz is among the most read and shared content in the history of the Times’ website. In May of 2014, the Times launched a lovely closed beta^332^ of a cooking Web application with more than 16,000 recipes. If the outlet can build a personalized recipe recommendation engine on top of its decades of dining and cooking archives, the platform could have tremendous potential. The new executive editor of the New York Times, Dean Baquet, endorsed the report and the digital-first strategy contained in it, both internally and publicly, once it leaked online. Whether he and his colleagues can execute against its recommendations remains to be seen.

The conclusions of these three reports, however, should still be sobering. The Times may be fine, but other papers will not be. Newsrooms face tight budgets, deep-set cultural challenges, liabilities and debt, and historic lows in public trust. On the positive side, there is a tremendous upside for adoption and use of current tools and vast green fields for digitally native media organizations to experiment, create, and find audiences, as billions of people come online for the first time globally.

So what should we watch for next, and where? The following list of recommendations and predictions sketches out what to expect in the next decade and where publishers will need to adjust.

1) Data will become even more of a strategic resource for media.

If text is the next frontier in data journalism,^333^ [...] telling stories more effectively, enabling digital journalism and digital humanities to merge in the service of a more informed society.^334^ [...] sources for trusted data.^335^ [...] will be hosted by media organizations and leveraged as an asset. In some cases, media companies may be able to sell access to their archives and APIs.
Given the sensitivity of some data sets and the responsibility news organizations hold to confidential sources and whistleblowers, the media will need to improve its security practices. Recent widespread hacking incidents at major newspapers around the United States highlight the need for improvement.^336^

2) Better tools will emerge that democratize data skills.

Even though the resources to learn data journalism are improving daily, there’s still a high barrier to entry for people with no experience practicing it. That’s changing as more powerful resources come online. Many of these tools for creating or presenting data-driven journalism will come from startups or nonprofits, like CartoDB, DocumentCloud, Timeline.js, Mapbox, Frontline SMS, Zeega, Kimono, Amara, and DataWrapper. Other tools will be provided by technology giants, like Google, Amazon, and Esri, as free Web services and open source code, or with enterprise licensing and API fees. Uncertainty about sustainability will drive foundations to fund tools and platforms, including pilot projects, entrepreneurial ventures, or components of open source civic infrastructure. The rest of the tools will be built by independent news hackers, university students, and data journalists as passion projects aimed at scratching someone’s itch; these may well end up helping many other people solve similar problems. Just as publishing text and editing photography or videos became accessible to hundreds of millions of people, analyzing and presenting data in maps, apps, and visualizations will become easier to do as well.

3) News apps will explode as a primary way for people to consume data journalism.

There have been hundreds of millions of iPhones, iPads, and Android devices sold in recent years, with billions of lower cost devices to follow as more of humanity goes online on mobile broadband networks.
According to the Pew Internet & American Life Project, 42 percent of American adults over 18 years old owned a tablet in January of 2014.^337^ Designing stories, videos, and news applications for the growing number of readers using smartphones, phablets (a new class of mobile phones designed to straddle the functionality of phone and tablet), tablets, and laptops will only become more important to media organizations. That puts a premium on data journalists who can create apps, lightweight data visualizations, and story presentations that are optimized for mobile devices. Increasing demand for apps, quizzes, and interactive games will make news application developer a highly sought-after specialty at media companies. Despite the growth in news apps, the narrative story format will endure as a complement to the news app, the summary for a blog, and access to the underlying data and model.

4) Being digital first means being data-centric and mobile-friendly.

As more and more people access the Internet and consume media on mobile devices, adopting a data-centric approach to collecting and publishing journalism will only grow in importance. The need to flexibly deliver content to multiple platforms and formats means that application programming interfaces (APIs) that can supply data to any platform will continue to be a smart investment for organizations, particularly if they seek to be digital first. The Washington Post, NPR, and the New York Times have already moved in this direction. Others will follow, or lead. Media companies will be competing for attention, and for advertising and subscription dollars, with technology giants like Google, Yahoo, and Facebook, and with startups that publish or curate user-generated content, along with vast amounts of data underpinning information services like mapping, shopping, or search.
Facebook’s Paper app, Google Play, Yahoo’s News Digest, Narrative Science, Flipboard, and the automated information services yet to be created will be strong competition for media companies in the future.

5) Expect more robo-journalism, but know that human relationships and storytelling still matter.

We will see wunderkinds apply computational journalism to finding secrets and creating knowledge at vast scale, just as data scientists do in Silicon Valley, quants do on Wall Street, or spooks do at the National Security Agency. “Robo-journalism” for commodity news from services like Narrative Science is already in the wild and will grow in use, particularly for areas that might have been previously uncovered by a beat reporter or for which a full-time journalist is no longer economically viable. Wearable computers, drones, sensors, and algorithms are going to play a bigger role in the gathering of data and consumption of media.

Despite changes in technology, humans will still matter in building relationships and making data into stories relatable to people. While the platforms and toolkits for journalism are evolving and the sources of data are expanding rapidly, many things haven’t changed. The ethics that have long guided the choices of the profession remain central to the journalists working today, as NPR’s new ethics guide makes clear.^338^

“[...] learn how to hide their actions from open data,” said Stray. “Personal relationships and skepticism will continue to be extremely important.”

6) More journalists will need to study the social sciences and statistics.

“Philosophically, I think data journalism shares something with social science and also there’s a real connection with the digital humanities,” said Jonathan Stray, who teaches the subject at Columbia. “The emphasis is not just algorithms, but what do these algorithms tell us?
How should we interpret all this fancy output?” These questions have been integral to how sociologists, anthropologists, and ethnographers have conducted research for decades, particularly with respect to data collection and statistics. This means that if members of the media seek to practice data journalism, they’ll need to be numerate, ethical, and thoughtful about the biases embedded in the data they’re interrogating. This is not a new idea, given how deeply Philip Meyer’s “precision journalism” is grounded in applying social science to investigative reporting, but everyone who wishes to practice and publish sound data journalism is going to need to understand it. Social scientists and biologists alike know that the sources for data and conditions under which it is collected will shape and bias any subsequent research conclusions made from it. To serve broad audiences, data journalists have to go beyond acquiring and cleaning data to understanding its provenance and source. Then, they’ll need to make sure that its presentation doesn’t tell a different story than the data itself allows.

None of that is easy for people trained as scientists, much less journalists. Some projects and analyses may exceed the technical competence or subject matter expertise of select members of the media. Collaborating with academia and technologists will be preferable to flawed journalism, analyses, maps, or visualizations that mislead readers, given the impact that inaccurate conclusions would have upon trust in the authors or publications.

7) There will be higher standards for accuracy and corrections.

Getting a fact wrong or screwing up a quote can sink a news story, leading to a correction or even retraction. Making a mistake in an algorithm or interpretation of data can similarly undermine the entire premise of an act of data journalism.
The mistakes and errors made in a post that sought to map kidnappings in Nigeria offer an instructive case study.^339^ The post relied on data sourced from the Global Database of Events, Language and Tone (GDELT). As the correction to the story acknowledged, the journalism that was published was fundamentally flawed because the journalist failed to see that the data represented the rate of media stories as a proxy for the rate of kidnappings, did not account for duplicated reports, and used a default location if none was given. Decontextualizing the GDELT data led to a flawed post.^340^

Data journalists should expect numerate readers who are interested not only in the data behind stories and the analysis used to arrive at conclusions, but also in trying to reproduce them. For instance, a FiveThirtyEight story on the Bechdel test in movies earned in-depth scrutiny from Brian Keegan, who was able to replicate the findings. What that means in practice is that any media company that publishes such work should have a corrections policy in place for data journalism.^341^

An economist at Nesta, upon encountering examples of bad data journalism,^342^ offered four suggestions to improve upon the form:

1. Choose the right stories: In cases like this, a well-written review of the scholarly literature is likely to better inform public debate. Otherwise, stick to (a) lightweight but fun topics or (b) fast-moving topics yet to attract academic attention.
2. Embrace complexity: No interesting causal relationship involves only two variables.
3. Use statistics intelligently: A scatterplot of two variables with a least-squares regression line is not “doing statistics.” Bad statistics is worse than no statistics.
4. Finally, be modest: If you have so many caveats as to completely undermine any conclusion, then don’t offer a conclusion.

8) Competency in security and data protection will become more important.
In the United States, email hosted on private sector servers outside of a media company’s control does not have the same legal protections as email within an office. Until the Electronic Communications Privacy Act is reformed, journalists should be cautious about hosting sensitive email or data on other platforms. People practicing data journalism or civic hacking need to know about the Computer Fraud and Abuse Act (CFAA),^343^ along with proposals for its reform.^344^ Members of the public who are unsure of the legality of data access or use, and don’t have the legal resources of major media organizations behind them, should think twice or thrice before clicking.

In general, journalists must consider when it’s appropriate to scrape data, access data, or store it, and when it is not. Does the story require storing personal information? If so, such sensitive data will need to be protected with the same vigor that journalists have protected confidential sources. Unfortunately, the information security practices of many media companies are not as robust as they will need to be to prevent determined intrusions by organized crime or nation states. For more on data security, ethics, privacy, and journalism, consult the Tow Center’s white paper on the subjects.

9) Audiences will demand more transparency on reader data collection and use.

Automated, personalized advertising or native advertising will be part of some living stories and news apps. The creators of these platforms will have to carefully consider the context for matching ads with content. Editorial and business departments are going to run up against difficult conversations about data access and sharing, with respect to audience analytics. Nonprofit organizations may not rely on advertising, instead taking underwriting or sponsorships, but they too will face pressure from funders and foundations to quantify their audiences and the impact of their journalism with data.
As editors, reporters, and publishers learn more about who is reading, sharing, and commenting on journalism through gathering data, they’ll have to decide how transparent they’ll be with readers about data collection and usage.

10) Conflicts over public records, data scraping, and ethics will surely arise.

For good or ill, we’re likely to see more controversial online maps and interactive apps that show donations, votes, contributions, permits, convictions, and other public records. Along with voluntary disclosures, the data will be scraped, FOIA’ed, or otherwise sourced from government publications, agencies, and websites. Over time, much more of this data will end up in private hands, along with media, nonprofits, foundations, snarky online media outlets, and hacker collectives like Anonymous. Some of the resulting maps and charts will no doubt be found to be incorrect, made so by incompetence or malicious intent, resulting in misidentified people who will be subject to harassment or worse.

In turn, governments will try to deny access to data, heavily redact documents, demand takedowns, and criminalize scraping or API calls. They will apply filtering or extra-legal censorship through pressure on payment processors, seize servers, or even direct denial-of-service attacks. Companies may deny access to their platforms for apps or services that use controversial data, similar to when Apple rejected an app showing drone strikes,^345^ [...] hackers if they find data breaches or unprotected data online.^346^ [...] with more closed governments and constricted information flows is likely to be explosive. Open data is not enough:^347^ Investigative journalism will remain essential.

In the United States we’ll run into more difficult First and Fourth Amendment issues as a result of all of this. It’s going to be extremely messy. The chilling effects of mass surveillance on digital journalism will continue to be an issue for years to come.
Just as sources may not trust the idea of a private conversation with a reporter, the provenance of data may be difficult to mask. As a public comment^348^ from Columbia Journalism School and the MIT Center for Civic Media to the Review Group on Intelligence and Communications Technologies convened by President Barack Obama highlighted, mass surveillance makes investigative journalism much harder:

Put plainly, what the NSA is doing is incompatible with the existing law and policy protecting the confidentiality of journalist-source communications. This is not merely an incompatibility in spirit, but a series of specific and serious discrepancies between the activities of the intelligence community and existing law, policy, and practice in the rest of the government. Further, the climate of secrecy around mass surveillance activities is itself actively harmful to journalism, as sources cannot know when they might be monitored, or how intercepted information might be used against them.

11) Collaborate with libraries and universities as archives, hosts, and educators.

The government shutdown in the United States in the fall of 2013 demonstrated the need for media organizations and civil society to back up government data. At the time, many nonprofits, foundations, and individuals acted to preserve and mirror what they could. Around the rest of the globe, data sources may be even more tenuous. In the years to come, journalists, universities, tech companies, businesses, and local governments will share a messy ecosystem of APIs and public and private databases. There’s already an emerging geocommons around OpenStreetMap, supported by rapidly improving open source tools and an emerging geojournalism specialty. One strategy that may be fruitful is for city, county, and state governments to engage local media, universities, and libraries in public or civic data hosting and preservation.^349^ Librarians have long been stewards of knowledge, in the forms of books and periodicals.
As such, they and their institutions are well placed to host data for the public good, although legislators and executives will have to think through the economics of them doing so.

12) Expect data-driven personalization and predictive news in wearable interfaces.

In 2013, the most popular online content at the BBC was an economic class calculator. User-centric apps and services will enable people to understand how a given story or policy applies to them, their children, or their business. These kinds of news apps and data-driven platforms like Homicide Watch hint at what lies ahead. The current state of the art only scratches the surface of the ways that data will be personalized for individual readers as the use of analytics grows in media companies, helping editors get smarter. As people express their interests through searches, clicks, saves, and shares, algorithms will use the data generated to suggest related editorial content and match it with advertising for relevant businesses or services. Recommendation engines will improve across media companies, and be followed by predictive news that uses social network analysis to suggest stories to users.

Over the next decade, a new wave of mobile computing will provide new platforms for nimble media companies to publish stories, from iWatches, to Google Glass, to smart appliances and wearable interfaces connected to an Internet of Things. Some of these wearables won’t just display data: They’ll collect it. Such data will include health data, geolocation, and air quality, which can then be used in citizen science and monitoring projects. They’ll be part of a rich fabric of connected devices that, when combined with people, cellphones, and civic media, will enable citizens to monitor infrastructure^350^ and water quality in China, extending into networked civil society. The data generated from them will be rich source material for journalists to investigate and share.
Drones and sensors are both part of this picture and represent rich topics for more experimentation and inquiry, as explored by my colleague Fergus Pitt in his own research and workshops at the Tow Center.

13) More diverse newsrooms will produce better data journalism.

Diversity has been a challenge in the media for decades. Although far more minorities and women work in professional journalism than a century ago, a 2013 survey by the American Society of News Editors (ASNE) found that of the 38,000 working at 1,400 U.S. newspapers, 4,700 are minorities.^351^ [...] organizations found that 63 percent of them had no minorities at all.^352^ [...] First Look Media, and other news startups garnered criticism in the spring of 2014,^353^ with the National Association of Black Journalists expressing concern regarding the lack of diversity.^354^ These concerns are particularly relevant in the data journalism space, given the broader issues with women in technology that have become evident in recent years. Online and off, misogyny and discrimination endure in the industry, along with subtler sexism and racism.

The challenge that editors face in hiring a diverse team of data journalists is structural, reflecting broader societal issues. As of 2010, 18 percent of undergraduates receiving degrees in computer science were women, according to the National Center for Women & Information Technology.^355^ In 2013, just 0.4 percent of all female college freshmen said they intended to major in computer science.^356^ It did not come as a surprise when Nate Silver said that 85 percent of the applicants to FiveThirtyEight were men.

There are reasons, however, to be cautiously optimistic about diversity in data journalism: Interviews with women and minorities in the United States suggest that the communities that have grown up around computer-assisted reporting over the decades may be more accepting of different faces than others in the technology world, perhaps because of the culture focused on peer-to-peer learning that celebrates mentorship.
“NICAR is a pretty healthy place to be a non-white, non-male person working in journalism,” said Tasneem Raja. “I can’t speak to issues of class, ability, gender identity, and other types of difference, other than to say we’re almost definitely less good at them, and that needs to change.” She went on:

I don’t have experience with the way folks in this community handle issues of inclusion issues when they come up, but I have seen evidence of folks working preemptively to create environments that are less exclusionary than the norm in Web development, quantitative analysis, the visual arts, or journalism. Maybe it’s because there haven’t been that many of us webby data journos till recently. Data journalists are pragmatic by nature, and maybe it just didn’t make sense to alienate potential swaths of new recruits.

That’s not to say everything is rainbows and sunshine, but I’m gonna take a rare moment of optimism here and say that I’m proud to represent this community, because in my experience, it’s genuinely committed to inclusion.

No matter the country in which a media company operates, making an effort to include more women; minorities; gay, lesbian, bisexual, and transgender individuals; and people from multiple socioeconomic backgrounds will improve the work product and work environment. A diverse staff diminishes stereotypes and produces second-order reflection on unconscious biases, which in turn can lead to improved, more equitable evaluation of work, performance, promotion, and compensation. The absence of women, minorities, or GLBT persons in startups, media organizations, development teams, and in editorial or product leadership positions can signal to others that they aren’t welcome. Recruiting and hiring differently pays off: Media organizations that have diverse staffs are likely to produce better journalism, from story choice to source selection. Research suggests that teams with both men and women on them are more profitable and innovative.
According to the National Center for Women & Information Technology, mixed-gender teams produced information technology patents that are cited 26 percent to 35 percent more often than the norm. As the demographics of the United States shift, stories and data that focus upon minorities, women, and the GLBT community will also gain more audience share, which in turn will create a business opportunity for media companies. That’s true around the globe as well. Given the opportunity, women and minorities have produced world-class data journalism. The world needs more of them, along with anyone else who wants to treat data as a source.

14) Be mindful of data-ism and bad data. Embrace skepticism.

Journalism will survive the death or diminishment of its institutions, as the Tow Center’s report on post-industrial journalism explored.^357^ Journalists who integrate technology, data, and narrative skills into their work will play critical roles in societies around the world, from holding the powerful accountable to connecting people with information. As people struggle to make sense of what matters or is true in a tsunami of new media, data journalism will be held up as a way to provide trustworthy insights to debunk pseudoscience, propaganda, misinformation, and online rumor. Just as yellow journalism, penny papers, and tabloids created a market opportunity that led to the creation of a more rigorous, ostensibly objective brand of journalism at the New York Times 160 years ago, today’s fast-moving, chaotic media environment creates opportunities to publish data journalism as a corrective to punditry.

There are rocks and stormy waters ahead here, however, created by bad data journalism.
The early 21st century has seen the growth of “data-ism,”^358^ in which knowledge can be derived through analysis of the huge amounts of data now generated by various sources.^359^ It has antecedents in variants of positivism, the philosophy of science that holds information derived from logical (algorithmic) and mathematical analysis of data and sensory experience is the source of authoritative knowledge; and scientism, the belief that the scientific method can be applied universally. All have a critical weakness: Bad data, biased data, and flawed experiments can and will be used ignorantly or cynically to twist the truth, mislead, or misinform, even by journalists who wish to do the opposite. Even good data and solid research may be misrepresented or mistaken, a risk that will grow if journalists are pushed to create data visualizations or analyses without training in information design, statistics, and social science.

Data has led many numbers-driven executives astray, in business, government, media, or academia.^360^ Journalists need to interrogate data just as they would human sources, checking facts and assumptions, comparing results, and documenting the process and results of their investigations as a social scientist or biologist would. Complemented by human wisdom and intuition, data journalism still won’t save the world or news, but it will help us all understand it better.
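
As one illustration of what interrogating data can look like in practice, the two mechanical pitfalls named in the GDELT correction under recommendation 7 (duplicated reports and a default location assigned when none was given) can be checked before anything is mapped. What follows is a minimal sketch, not a description of how any newsroom or GDELT itself works: the rows are invented, the field names (`date`, `description`, `lat`, `lon`) are hypothetical rather than an actual GDELT schema, and the placeholder coordinates are an assumption that would have to be replaced by whatever the data set's documentation specifies.

```python
# Hypothetical placeholder a geocoder might assign when a report names no
# specific place. The real value must come from the data set's documentation.
DEFAULT_LOCATION = (10.0, 8.0)

# Invented rows for illustration only. Each row is one media report of an
# event, not the event itself.
reports = [
    {"date": "2014-04-14", "description": "School abduction", "lat": 10.9, "lon": 12.8},
    {"date": "2014-04-14", "description": "school abduction ", "lat": 10.9, "lon": 12.8},
    {"date": "2014-04-14", "description": "School abduction", "lat": 10.0, "lon": 8.0},
    {"date": "2014-05-02", "description": "Roadside kidnapping", "lat": 10.0, "lon": 8.0},
    {"date": "2014-05-06", "description": "Village raid", "lat": 11.5, "lon": 13.7},
]


def audit(rows, default_location=DEFAULT_LOCATION):
    """Count how far the raw rows are from distinct, mappable events."""
    seen = set()
    distinct = []
    for row in rows:
        # Crude duplicate key (an assumption): same date and same normalized
        # description are treated as reports of the same event.
        key = (row["date"], row["description"].strip().lower())
        if key not in seen:
            seen.add(key)
            distinct.append(row)

    # Events whose coordinates equal the placeholder carry no real location.
    defaulted = [r for r in distinct if (r["lat"], r["lon"]) == default_location]

    return {
        "raw_rows": len(rows),
        "distinct_events": len(distinct),
        "duplicate_rows": len(rows) - len(distinct),
        "default_location_events": len(defaulted),
        "mappable_events": len(distinct) - len(defaulted),
    }


summary = audit(reports)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy sample, five raw rows shrink to three distinct events, only two of which have a usable location. The third pitfall, treating the volume of media coverage as a proxy for the rate of events, cannot be caught by code like this; it requires understanding how the data was collected.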