By Lucas Graves

Introduction

Questions of evidence and truth are at the center of journalism, and also strangely invisible in day-to-day news routines. Reporters sometimes speak as if the truth “is something that rises up by itself, like bread dough,” write Bill Kovach and Tom Rosenstiel in The Elements of Journalism. “Rather than defend our techniques and methods for finding truth, journalists have tended to deny that they exist.”1

The problem of sourcing can mean anything from locating the documents that will make or break a story, to finding someone willing to be quoted saying what everybody knows to be true. While Bob Woodward and Carl Bernstein’s book, All the President’s Men, made the two-source rule part of the basic vocabulary of American journalism, in practice standards of evidence have varied widely from outlet to outlet, and from story to story. (Even at the Washington Post of the 1970s the standards of the Watergate investigation were an exception, not the rule. At least one Post editor has said the two-source requirement was a myth.)2

Meanwhile, standards of evidence have eroded in the “hamster wheel” of contemporary journalism.3 Around-the-clock news on cable and the Internet has quickened the news cycle and invited formats that defy tight editorial control. As professional journalists cede their gatekeeping role, the practice of holding back untested claims can seem not just infeasible but antiquated.

At first consideration, the turn toward data- and sensor-driven reporting seems to ease some of these tensions. Reporters who gather raw data about the social or natural worlds appear less dependent on human sources, less vulnerable to manipulation or to their own biases, and less tempted to fall back on “he said, she said” story construction. The kinds of questions asked and answered in sensor journalism seem to fall outside personal interests and ideologies. In practice, of course, data is never raw, and sensors don’t speak for themselves. Reporters working on these kinds of stories may depend on experts, not just to interpret results but also to conceive and plan investigations. Most important, errors, biases, and assumptions can be trickier to account for as they are pushed farther upstream into the design of field tests, the calibration of sensors, and the definition of what counts as data.

Sensor Evidence and Reporting Work

The field of sensor journalism is new and open enough to encompass reporting projects that are otherwise very different, from aerial surveillance with unmanned drones, to measuring the shifts in soil temperature that awaken seventeen-year dormant cicadas. One useful starting point is to identify the use of evidence in stories built around various ways of monitoring the human and natural environment, and then to investigate the role sensed data plays in reporting and story construction. A sensor may be a hardware device monitoring changes in water levels in a Wisconsin stream, or noise levels in New York City. It may be a simple computer script tracking volumes of BitTorrent traffic, for instance among Academy Awards Best Picture contenders. In each of these cases the sensor acts as an instrument for transforming sensed information into a predefined category of data used to buttress reported claims—about the environmental effects of frac-sand mining, for instance, or socioeconomic patterns in noise pollution, or even a correlation between Oscar nominations and illicit online film swapping.
In other words, the sensor testifies to facts about the world that in another story would rest on some human or institutional authority, or on the reporter’s own observation. Usually the sensor is tailor-made to offer a very particular kind of testimony and nothing more.

In this vein, sensors share important characteristics with a tool reporters know far better: the opinion poll. In the United States, newspapers and magazines helped to pioneer opinion polling in the early 20th century; independent pollsters such as Gallup and Roper began to develop more rigorous techniques in the 1930s. Polls are not sensors, of course. A CBS News/New York Times phone survey run out of a North Carolina call center is not automated in the same way as a soil temperature gauge. But from the reporter’s point of view the poll and the thermometer do a similar sort of work. Each transforms some aspect of the world into newsworthy data. Structurally, a front-page story on the latest Quinnipiac survey may have less in common with other campaign-trail coverage than it does with a piece reporting on environmental data. In both cases the numbers themselves often supply the news value and the headline. Interviews and other reporting work serve mainly to interpret or contextualize those findings. And well-established categories measured regularly over time—whether consumer confidence or mean annual temperature—become story genres in their own right, deserving coverage whenever the latest data becomes available.

These examples underscore the link between sensor journalism and the wider turn toward data-driven reporting, today and in the past. The lines between sensing data, mining data, and analyzing or interpreting data can be difficult to draw. Consider two recent landmarks of data journalism: the 2005 series on Florida’s disappearing wetlands in the St. Petersburg Times, and the 2011 reports by San Francisco’s Center for Investigative Reporting (CIR) on substandard school construction in California’s seismic danger zones. In an obvious way these reports relied on data from two of the most advanced, and expensive, sensor networks developed over the 20th century: orbital imaging satellites and seismographic monitoring stations. But in each case, obtaining data from these sensors marked only the beginning of a long and difficult reporting process.

The St. Petersburg Times (now Tampa Bay Times) investigation began with the straightforward idea of using software to analyze and compare satellite imagery from 1990 and 2003 to show exactly how many acres of Florida wetlands were being lost to development—a figure neither the state nor the federal government could supply. “I went into it with a simplistic idea of glorified subtraction,” the story’s co-author Matt Waite explains in an interview. In the end the analysis took ten months. The satellite images were hand-coded into 100 separate pixel groups, based on three bands of electromagnetic spectrum that are especially sensitive to moisture in plants, and then compared using a custom-built algorithm. This yielded the shocking figure of one million acres lost—until Waite realized he had made a mistake: The tide was lower in the later image. “I had just wasted tons of time and money to show when the tide was out,” he recalled. “I thought I was fired.”
But drawing on other datasets allowed the investigation to continue. The final published analysis revealed 84,000 acres of wetlands lost, based on reconciling the satellite imagery with data such as land-use records and maps of urban land cover. The results were further verified against higher-resolution aerial photographs, available for some areas, and by sending reporters out to visit a random sample of individual sites. Three professors were asked to review the analysis before the story went live, and the paper published a detailed methodology online.

The “On Shaky Ground” series at the Center for Investigative Reporting unfolded in a similar fashion. The investigation began in 2009 when reporter Corey Johnson, working on a story pegged to the 20th anniversary of the Loma Prieta earthquake, placed a records request with California’s Division of the State Architect (DSA), which enforces safety standards in school buildings. To his surprise, the DSA returned a spreadsheet listing more than 9,000 out-of-compliance schools. This launched a nineteen-month effort by CIR to build a comprehensive database of unsafe schools in seismic hazard zones, drawing on three separate datasets: the DSA’s school safety records, the index of school building locations maintained by the Department of Education, and seismic data from the California Geological Survey. As in the wetlands investigation, messy official data—misspelled school names, inaccurate building locations, and even shifting definitions of seismic hazards—had to be painstakingly cleaned up with traditional reporting tools like interviews and site visits. An academic analysis of the investigation, by sociologist Sylvain Parasie, found that the journalists involved came to think of building the database the same way they thought of writing a story: as a “reporting process” subject to the usual techniques and judgments of journalism. Once the team had confidence in the database, though, the reporters could use it to suggest new story angles and “drive” their investigation.4

Both investigations, then, turned on matching sensed data, from satellites and seismic stations, with other public databases. Crucially, in each case it was correlations across these datasets—buildings where wetlands used to be, and schools with poor safety records near a fault line—that produced the key reported findings. This suggests an interesting conclusion: The most important sensors, from the point of view of reporting work, were the algorithms that looked for patterns in the data. In both cases these algorithmic sensors required constant fine-tuning based on information supplied by traditional reporting tools. The journalists’ confidence in these stories ultimately came from the work that went into cleaning up messy data and calibrating the algorithms used to interpret it. “In the end we felt like we could be pretty declarative about it,” Waite says. “The night before a big exposé you’re usually worrying about whether your sources will hold up and what the reaction will be. The night before the wetlands story I slept like a baby.”
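To make Waite’s “glorified subtraction” concrete, here is a minimal sketch of the change-detection idea the wetlands analysis describes: classify each pixel of two co-registered satellite scenes as wetland or not, using moisture-sensitive bands, then difference the classifications. Everything in it is hypothetical—the band math, the threshold, the pixel size—an illustration of the technique, not the Times’s actual code.

```python
import numpy as np

# Hypothetical sketch of satellite change detection: classify pixels in
# two scenes as wetland or not, then count pixels that flipped. Band
# indices, threshold, and pixel size are invented for illustration.

ACRES_PER_PIXEL = 0.22  # a 30m-square pixel covers roughly 0.22 acres

def classify_wetland(scene, moisture_threshold=0.3):
    """Mark pixels as wetland using a moisture-sensitive band ratio.

    `scene` is an (H, W, 3) array holding three moisture-sensitive
    spectral bands. A real analysis would rely on hand-coded pixel
    groups and a calibrated classifier, not a single global threshold.
    """
    near_ir, shortwave_ir = scene[..., 0], scene[..., 1]
    # Normalized difference between two bands as a crude moisture proxy.
    moisture = (near_ir - shortwave_ir) / (near_ir + shortwave_ir + 1e-9)
    return moisture > moisture_threshold

def acres_lost(scene_1990, scene_2003):
    """Count pixels classified as wetland in 1990 but not in 2003."""
    lost = classify_wetland(scene_1990) & ~classify_wetland(scene_2003)
    return lost.sum() * ACRES_PER_PIXEL

# Synthetic stand-in scenes; real imagery would be loaded from GeoTIFFs.
rng = np.random.default_rng(0)
scene_1990 = rng.random((500, 500, 3))
scene_2003 = rng.random((500, 500, 3))
print(f"Estimated wetland acres lost: {acres_lost(scene_1990, scene_2003):,.0f}")
```

As Waite’s tide mistake shows, a naive difference like this conflates real land-cover change with measurement conditions; correcting for that is exactly the calibration work the story describes.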
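The “algorithmic sensor” in the CIR case can likewise be pictured as a set of joins across the three datasets. Below is a minimal sketch, with invented column names, invented sample records, and a crude proximity test standing in for a real geospatial match:

```python
import pandas as pd

# Hypothetical sketch of cross-dataset matching. All column names,
# records, and the lat/lon distance test are invented; CIR's actual
# work involved months of cleaning names, correcting locations, and
# reconciling shifting hazard definitions.

dsa = pd.DataFrame({          # DSA out-of-compliance records
    "school": ["Alder Elementary", "Birch High"],
    "uncertified_projects": [3, 1],
})
locations = pd.DataFrame({    # Department of Education building index
    "school": ["Alder Elementary", "Birch High"],
    "lat": [37.78, 36.70], "lon": [-122.42, -121.60],
})
faults = pd.DataFrame({       # California Geological Survey hazard points
    "fault": ["San Andreas (segment)"], "lat": [37.75], "lon": [-122.45],
})

# Join safety records to building locations (in practice, a fuzzy match
# on misspelled school names), then flag schools near a mapped fault.
schools = dsa.merge(locations, on="school")

def near_fault(row, radius_deg=0.1):
    # Crude proximity test in degrees; a real analysis would use a
    # proper geospatial join against hazard-zone polygons.
    dist_sq = (faults.lat - row.lat) ** 2 + (faults.lon - row.lon) ** 2
    return dist_sq.min() ** 0.5 < radius_deg

schools["in_hazard_zone"] = schools.apply(near_fault, axis=1)
print(schools[schools.in_hazard_zone & (schools.uncertified_projects > 0)])
```

The point of the sketch is that each join conceals reporting work: fuzzy-matching misspelled names, correcting coordinates, and deciding what counts as a hazard zone are judgment calls, not mechanical steps.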
Objectivity and Interpretation in the Age of Big Data

Seen in this light, sensor journalism falls into a long history of efforts to build sophisticated, computer-driven analysis into reporting work—what in the 1970s came to be called “computer-assisted reporting” (CAR). The lure of CAR and so-called “precision journalism” rests at least partly on the promise of a kind of news that depends less on subjective interpretation and unreliable human sources. As Philip Meyer argued in his 1973 guide, Precision Journalism, “We journalists would be wrong less often if we adapted to our own use some of the research tools of the social scientists.”5 The promise is that systematic methods and hard data will pull the reporter onto firmer evidentiary ground—something visible in the signal examples of this kind of journalism, and in the two cases described above.

It is hard not to hear echoes of that promise in the contemporary rhetoric of big data in journalism, just as in the fields of business, government, and academia. As advertising campaigns like IBM’s “Smarter Planet” make clear, the big data revolution is being fueled by sensors of every imaginable kind: medical, environmental, industrial, urban, and so on, from Fitbit bracelets to RFID tags on shipping containers. This ubiquitous real-world surveillance comprises the much-hyped “Internet of things,” erasing the barriers between offline and online networks. To some observers, the unprecedented volumes of data available from sensing threaten to make the scientific method itself obsolete: “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear,” former Wired editor Chris Anderson wrote in a 2008 article. “Out with every theory of human behavior, from linguistics to sociology... With enough data, the numbers speak for themselves.”6

This sort of breathless pronouncement is easy to lampoon, but it reflects a general sense that relentless measurement of the world around us has made new kinds of objective insights possible. In a 2012 article, media scholars danah boyd and Kate Crawford called this the “mythology” of big data: “the widespread belief that large datasets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.” They warn of the tendency for even academics to behave as if questions of interpretation belong strictly to the humanities and other “qualitative” domains, while “quantitative research is in the business of producing facts.”7

Some scientists share these worries. In a recent editorial, Mike Fienen, a research hydrologist with the U.S. Geological Survey, argues that “in an era of big data” it is all the more important for scientists to be frank about the subjective interpretations and judgments involved at each stage of research, from designing experiments to reading the results. “Should we ignore this veiled subjectivity and pretend that cold hard data provide all the information we need?” he asks. “Or is it time to recognize and embrace the value of subjectivity?… It would be foolish to ignore the data, but it may be equally foolish to assume there is only a single way for the data to ‘speak.’”8

Like other fields of environmental research, hydrology has made eager use of ever-cheaper sensor technology. Fienen models groundwater systems and is involved in a just-launched, five-year effort to track groundwater changes in Wisconsin’s Chippewa County with a network of monitoring wells and stream gauges.
These measurements are vital to understanding the potential environmental effects of frac-sand mining, and Fienen works closely with journalists covering the controversial topic. “It’s an important question and there’s a big vacuum of information,” Fienen explains in an interview. At the same time, he adds, “It can be really hard to interpret the data. I deal with a lot of uncertainty, with conveying uncertainty in my research, and it’s always a challenge.”

Journalists reporting on sensed data must appreciate three separate sources of uncertainty, according to Fienen. The most obvious, but generally least important, is the degree of accuracy attached to the sensing equipment. More difficult to grasp is what an accurate sensor reading actually represents. “Is it telling you what’s happening in one square centimeter of ground, or across the entire state?” Fienen asks. That depends on the number and distribution of sensors, but also on the scientist’s model, a third major source of uncertainty. Scientists interpret their data according to a model of the world or of society, even as they try to improve that model based on an intuitive sense of what the results should look like. (A toy example of how these three sources interact appears in the sketch at the end of this section.)

Subjective judgment always comes into play, whether reporters deploy their own sensors or rely on experts to provide data. What is vital to understand is that in practice, for reporters as for scientists, the subjective judgments and hands-on work involved in producing reliable data and making it meaningful become a source of confidence, rather than one of doubt. This suggests that the right way to think of sensor journalism is not as a source of self-evident facts and hard-coded mechanical objectivity, but rather as a continuation of mainstream reporting’s five-decade shift toward a more interpretive, analytical, “meaning-making” style of news.9
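Fienen’s three sources of uncertainty can be made concrete with a toy simulation, using entirely invented numbers: a handful of groundwater gauges with a stated instrument accuracy are used to estimate a region-wide average under two different, equally defensible spatial models.

```python
import numpy as np

# Toy illustration of three uncertainty sources, with invented numbers:
# (1) instrument accuracy, (2) what a point reading represents in space,
# (3) the model used to interpret the readings.

rng = np.random.default_rng(42)

# Five monitoring wells along a 10 km transect, each reporting a water
# level in meters. Readings carry instrument noise (source 1).
positions = np.array([1.0, 2.5, 4.0, 7.5, 9.0])
true_levels = np.array([10.2, 10.0, 9.7, 9.1, 8.8])
INSTRUMENT_SD = 0.05  # stated sensor accuracy, +/- 5 cm

def regional_estimate(readings, model):
    """Estimate the mean water level across the 0-10 km region.

    Source 2: five point sensors must stand in for the whole region.
    Source 3: the answer depends on the spatial model chosen.
    """
    if model == "simple_mean":    # treat wells as equally representative
        return readings.mean()
    if model == "linear_interp":  # assume levels vary smoothly in space
        grid = np.linspace(0, 10, 101)
        return np.interp(grid, positions, readings).mean()
    raise ValueError(model)

# Monte Carlo over instrument noise, under each model.
for model in ("simple_mean", "linear_interp"):
    estimates = [
        regional_estimate(true_levels + rng.normal(0, INSTRUMENT_SD, 5), model)
        for _ in range(10_000)
    ]
    print(f"{model}: {np.mean(estimates):.3f} m +/- {np.std(estimates):.3f} m")
```

In runs of this sketch, the spread due to instrument noise amounts to a few centimeters, while switching spatial models shifts the estimate by a comparable or larger amount—echoing Fienen’s point that the accuracy of the equipment is often the least important source of uncertainty.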