By Beth A. Stauffer

Introduction

The Internet of Things is changing how people interact with their surroundings, the environment, and each other. New technologies are allowing anyone with a question to attempt answering it using low-cost sensors, online datasets, and the power of cloud computing. The use of such sensors and information by journalists, however, comes with higher expectations for accuracy, reliability, and credibility. As a result, journalists must develop a sufficient understanding of how to design a data-collection campaign, choose appropriate tools for their questions, and assess data for quality and consistency. For scientists engaged in basic research, questions about experimental design, data quality, and reproducibility are typically dealt with in the peer-review process and through the copious use of statistical analyses. For nonprofessional participants in the scientific process, upfront planning and the development of quality assurance plans can help identify the relevant sources of error and highlight effective ways of controlling for data quality. This essay is meant to serve as a guide to the minimum requirements journalists should consider when pursuing data collection and/or including analysis and interpretation of existing datasets in their work. In addition to assuring that a campaign is well-designed or that an existing dataset is
of high quality, a small investment of time in the initial stages can save a significant amount of effort down the line in the interpretation and communication of credible results.

Sampling Design—How to Design for Answers

There are potential sources of uncertainty at every step in the collection and analysis of data. These must be considered both in the design of new sensor-driven data collection campaigns and in efforts to extract meaning from existing datasets. Sources of uncertainty can be encountered in the design of sampling campaigns and the choice of measurement technique, in the input of data to centralized repositories, and in the analysis and interpretation of data by both providers and end users. While minimizing uncertainty at all stages is the ideal approach, proper documentation can also provide context for data that might otherwise be scrutinized into oblivion.

There are a few basic principles that should be considered when designing a sensor journalism campaign. These principles largely provide direction for how often, and where in space, samples should be taken.1 On the temporal side, signal-processing theory suggests that you should sample at least twice as often as the timescale you are interested in understanding. For example, if you think air quality is worse in summer than in spring, you need to make measurements at least every one-and-a-half months to resolve this seasonal variability (an approximately three-month timescale). Recent criticism of an effort to map New Yorkers' running routes using publicly available RunKeeper data2 noted that the routes appear heavily biased toward the period around the New York Marathon and do not account for seasonal variability in outdoor recreation.3 Sampling at inappropriate timescales runs the risk of missing critical data and aliasing signals relative to other processes.

On the spatial scale, it is important to be systematic when determining sampling locations. The U.S. Environmental Protection Agency (EPA) suggests randomly selecting sites that are representative of the area you are trying to study (a probabilistic design).4 This approach is especially important when little is known about the study area before the campaign begins and when the primary interest is broad trends over space and/or time. However, to highlight disparities between locations, sampling schemes that combine sites selected beforehand with control sites can be powerful. For example, an investigation into how well green infrastructure projects contribute to overall improvements in water quality in a lake or stream should use probabilistic sampling, since a spatially distributed, long-term trend is the objective. Conversely, The Washington Post maps gun violence in the District of Columbia using data from the ShotSpotter program, in which gunfire-detection sensors are concentrated in the parts of the city with the highest rates of gun violence.5
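To make these two rules of thumb concrete, the following is a minimal sketch (not from the original essay) of how the "sample at least twice per timescale of interest" guideline and a simple probabilistic site selection might be expressed in Python. The timescale, grid of candidate locations, number of sites, and random seed are all illustrative assumptions.

```python
import random

def max_sampling_interval(timescale_of_interest_days: float) -> float:
    """Sample at least twice as often as the timescale you want to resolve."""
    return timescale_of_interest_days / 2.0

# Example: resolving a seasonal (~3-month) signal in air quality.
seasonal_timescale_days = 90.0
print(f"Sample at least every {max_sampling_interval(seasonal_timescale_days):.0f} days "
      "to resolve a ~3-month seasonal signal.")  # -> every 45 days (~1.5 months)

# Probabilistic (random) site selection: lay a grid of candidate locations
# over the study area and draw a random, representative subset.
random.seed(42)  # a fixed seed keeps the site draw reproducible and documentable
candidate_sites = [(row, col)
                   for row in range(10)    # hypothetical grid rows over the study area
                   for col in range(10)]   # hypothetical grid columns
selected_sites = random.sample(candidate_sites, k=12)  # 12 sites chosen at random
print("Randomly selected monitoring sites:", selected_sites)
```

Recording the seed and the candidate grid alongside the design would let someone else reconstruct exactly which sites were chosen and why.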
Limits of Detection, Accuracy, and Precision

Once you have decided on a sampling design that will generate the most easily interpreted data, you must consider how you will actually collect those data. From approved and established methods to innovative DIY sensor technologies, the possibilities are numerous. Most sensors and methods come with a set of technical specifications, which often include lower and upper limits of detection (LODs). These limits, however, are unfortunately not the straightforward edicts they might seem; rather, they are often based on statistical analyses of different types of measurement error.6 They also typically represent best-case scenarios obtained by technical personnel under highly controlled conditions with properly calibrated, maintained, and operated instrumentation. Taken together, these caveats underscore the need to choose an appropriately conservative tool for the ranges you expect to encounter and to validate its performance over time.7

Accuracy and precision are also typically presented as technical specifications for a given technique, methodology, or instrument. Accuracy is an estimate of how close a measurement comes to a known or standard value. This differs from precision, which indicates how close repeat measurements of the same sample come to one another. It is possible to be inaccurate but precise, or accurate but imprecise; the goal, however, is to be both.8 Accuracy is often expressed as an absolute value or as a percentage of the reading for the measured parameter. For example, a commercially available laboratory thermometer with a stated accuracy of 0.5 degrees Celsius across the range of -10 to 70 degrees Celsius should yield values between 14.5 and 15.5 degrees Celsius for a sample that is 15 degrees Celsius. Other instruments may use a percentage of the reading to communicate expected accuracy. Again, keep in mind that accuracy and precision specifications are only estimates, and they assume proper maintenance, calibration, and use of the instruments and methods of measurement.

The need for the highest accuracy and precision necessarily scales with the intended use of the data, a concept formalized in, for example, the Ohio EPA's "3 Levels of Credible Data."9 For data intended to be legally defensible, levels of accuracy and precision should be consistent with approved methods of measurement. The EPA-approved Hach Method 10360 for measurement of dissolved oxygen in water, for example, requires initial accuracy and precision of 5 percent and 2 percent, respectively.10 Measurements that fall short of these requirements will not ensure legal defensibility, but dissolved oxygen data with lower levels of accuracy and precision (e.g., 5–10 percent) may still be appropriate for other, less rigorous uses.

There have already been examples of good and bad tool choices in sensor- and data-driven journalism. In the RunKeeper example highlighted previously, an additional source of criticism was the "fuzziness" of the location information.11 While smartphone GPS units are surprisingly decent compared to more expensive GPS units, their ability to detect your location down to a few meters while running quickly diminishes with reduced access to clear sky,12 as in a dense city such as New York. On the other hand, journalists in Florida enlisted the help of a well-known GPS company to provide tools with the specifications necessary to accurately locate interstate tollbooths in their investigative report on speeding cops.13
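As a rough illustration of these definitions (a sketch, not part of the original essay), the short Python example below estimates accuracy as the offset of the mean of repeated readings from a known reference value and precision as the spread of those readings, then checks the result against a stated spec such as the ±0.5 degrees Celsius thermometer accuracy mentioned above. The readings themselves are made-up values.

```python
import statistics

def accuracy_and_precision(readings, reference_value):
    """Accuracy: how far the average reading sits from a known reference value.
    Precision: how tightly repeat readings of the same sample cluster together."""
    mean_reading = statistics.mean(readings)
    accuracy_error = abs(mean_reading - reference_value)   # closeness to the true value
    precision_spread = statistics.stdev(readings)          # closeness of repeats to each other
    return accuracy_error, precision_spread

# Hypothetical repeat readings of a 15.0 degree Celsius reference bath.
readings = [15.2, 15.3, 15.1, 15.4, 15.2]
acc, prec = accuracy_and_precision(readings, reference_value=15.0)

stated_accuracy_spec = 0.5   # e.g., a thermometer rated to +/- 0.5 degrees C
print(f"Accuracy error: {acc:.2f} C (spec allows {stated_accuracy_spec} C)")
print(f"Precision (std. dev. of repeats): {prec:.2f} C")
print("Within stated accuracy spec" if acc <= stated_accuracy_spec else "Outside stated accuracy spec")
```

A set of readings can pass this kind of check and still be precise but inaccurate (clustered tightly around the wrong value), which is why calibration against a known standard matters as much as repeatability.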
Metadata—The Benefits of Reconstructing the Who, When, Where, and How

When collecting your own data, or when considering whether to use data available from existing programs, it is always advisable to find out as much as possible about the techniques, equipment, and reporting standards used to collect the data. Often described as "data about data," this metadata is essential to understanding, locating, using, and managing data.14 Metadata can be as simple as recording who took a sample or measurement, the weather conditions at the time of sampling, and any interesting observations or oddities noticed at the time and place of sampling. Additional information about what equipment was used to make the measurement, how recently an instrument was calibrated, and any issues during the measurement/analysis process can also add value.

Collection of metadata is an integral part of many quality assurance project plans (QAPPs), which many nonprofessional participants in science develop with federal agency guidance and have approved by local agencies. This approval process, combined with the transparency of the protocols themselves, lends a higher degree of credibility to programs and the data they produce. Such plans help ensure a high level of data quality throughout the design, collection, measurement, and analysis processes. Methods for data quality assurance can range from personal knowledge of a participant's skill and expertise to algorithm- or statistics-based automatic filtering of data as they are reported.15 While more informal approaches to data quality assurance might be sufficient for volunteer programs, ethical considerations and concerns about defensibility in sensor journalism may require using datasets with more robust filtering approaches built into the data quality assurance process. Together, this metadata can help trace anomalous values in a dataset and allow them to be attributed either to real differences or to errors in the collection, measurement, or analysis of samples.
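To show what this can look like in practice (a sketch under assumed conventions, not a prescribed format), the Python snippet below attaches a minimal metadata record to each reading and applies a simple statistics-based filter that flags values outside a plausible range or far from the batch median. The field names, parameter, ranges, and thresholds are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from statistics import median

@dataclass
class Reading:
    value: float                      # the measurement itself
    parameter: str                    # e.g., "dissolved_oxygen_mg_per_L"
    collected_by: str                 # who took the sample
    instrument: str                   # which sensor or method was used
    last_calibrated: str              # date of most recent calibration
    timestamp: datetime = field(default_factory=datetime.utcnow)
    notes: str = ""                   # weather, oddities, problems during sampling

def flag_suspect(readings, low, high, spread_factor=3.0):
    """Flag values outside a plausible range or far from the batch median."""
    values = [r.value for r in readings]
    mid = median(values)
    typical_spread = median(abs(v - mid) for v in values) or 1.0
    flagged = []
    for r in readings:
        out_of_range = not (low <= r.value <= high)
        far_from_median = abs(r.value - mid) > spread_factor * typical_spread
        if out_of_range or far_from_median:
            flagged.append(r)
    return flagged

# Hypothetical batch of dissolved oxygen readings (mg/L).
batch = [
    Reading(8.1, "dissolved_oxygen_mg_per_L", "volunteer_07", "optical_probe_A", "2015-05-01"),
    Reading(8.4, "dissolved_oxygen_mg_per_L", "volunteer_07", "optical_probe_A", "2015-05-01"),
    Reading(0.2, "dissolved_oxygen_mg_per_L", "volunteer_12", "optical_probe_B", "2014-11-20",
            notes="probe briefly out of water"),
]
for r in flag_suspect(batch, low=0.0, high=20.0):
    print("Check:", r.value, r.collected_by, r.notes)
```

Because each flagged value still carries its who/what/when metadata, a reviewer can decide whether it reflects a real event or a problem with collection or equipment, rather than silently discarding it.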
Leveraging Experts—How to Collaborate for Maximum Results

Few journalists are scientists, statisticians, or theoreticians, yet over their careers they may investigate a range of stories that specialists have spent years researching. In most cases, they cannot be expected to obtain advanced degrees in order to use sensors and data responsibly and credibly in their work. Partnering with trusted experts or legacy institutions, which can include state, local, and federal agencies, universities, and colleges, can help identify important considerations when designing a sampling program and provide valuable validation of the techniques, instruments, or methods used in analyses. An excellent example of such a partnership is the Alliance for Aquatic Resource Monitoring (ALLARM) program between Dickinson College and Trout Unlimited, a nonprofit organization dedicated to conserving coldwater fisheries and their watersheds.16 ALLARM provides technical and analytical support for Marcellus Shale water monitoring programs run through Trout Unlimited's Coldwater Conservation Corps, and in turn the volunteer monitoring program receives higher quality data than would be possible through its limited in-house capabilities. While the relationship between scientists and journalists has not always been one of mutual respect and cooperation, scientists' increased focus on skills for communicating with the media and public, and journalists' interest in sensor- and data-driven work, may provide an opportunity to rewrite that relationship.
Conclusions

Journalists planning to collect their own data, or to incorporate existing data into their work, should make use of the wealth of resources available for other nonprofessional endeavors in scientific research. These resources include valuable guidance on choosing the most appropriate sampling designs and tools, on what to include or look for as metadata, and on verifying data quality throughout the process. For journalists, however, the ethical and legal implications of data use are likely higher than for many citizen-science or volunteer monitoring programs. Sensor journalists should therefore also consider requirements for legal defensibility when designing their campaigns or judging the quality of existing datasets. By keeping these considerations in mind, journalists should be able to confidently and ethically collect and incorporate data-driven material into their work.