Finding the story

Once you’ve got the input-output relationship of your black box mapped out, the next step is to search and filter for newsworthy insights. This comes back to expectations: they define whether the algorithm is missing the mark somehow, or is exhibiting some behavior that has implications for the audience. These expectations could be statistically based, built on an understanding of social and legal norms, or defined by comparing similar vendors of the technology, like Google and Bing autocompletions, or iPhone and Android autocorrections. It can be useful to look at the false positives and false negatives for ideas about how and where the algorithm is failing.

At the WSJ, the first filter used for narrowing in on e-commerce sites was a statistical one: the variance of prices returned from a site for a given item across a variety of geographies. If any non-random variance was observed, the site was marked for a more rigorous and in-depth analysis. Similarly, Rob Barry, who worked on the executive trading plans story for the WSJ, described to me a sophisticated data-mining technique involving clustering and Monte Carlo simulation, used to find newsworthy cases by identifying trading plans that fell outside the norms of expectation.
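A first-pass statistical filter of this kind can be sketched in a few lines of Python. This is only an illustration, not the WSJ's actual code: the function name `flag_sites`, the data layout, the use of a coefficient of variation, and the 1 percent threshold are all assumptions made for the example, and the site names and prices are invented. A simple threshold like this also does not by itself distinguish random from non-random variation.

```python
from statistics import mean, pvariance

def flag_sites(prices_by_site, threshold=0.01):
    """Return sites where any item's price varies noticeably across locations.

    prices_by_site maps site -> {item -> [prices seen from different locations]}.
    The spread is measured as standard deviation divided by mean price
    (an assumed measure; the threshold is arbitrary).
    """
    flagged = []
    for site, items in prices_by_site.items():
        for item, prices in items.items():
            if len(prices) < 2:
                continue  # cannot measure spread from a single observation
            avg = mean(prices)
            spread = (pvariance(prices) ** 0.5) / avg if avg else 0.0
            if spread > threshold:
                flagged.append(site)
                break  # one varying item is enough to flag the site
    return flagged

# Invented sample data: the same item priced from three locations.
sample = {
    "shop-a.example": {"stapler": [15.79, 15.79, 14.29]},
    "shop-b.example": {"stapler": [9.99, 9.99, 9.99]},
}
print(flag_sites(sample))
```

A flagged site is only a lead: as with the WSJ's process, it marks where closer analysis and reporting should begin.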

In my own projects I have used social and legal norms to help zero in on stories inside the collected data.39 In the case of the autocomplete algorithms, both Google and Bing had publicly expressed a desire to filter suggestions relating to pornography. Taking that a step further, child pornography is indeed a violation of the legal code, so searching for instances of that became a starting point for filtering the data I had collected. Cases where the algorithm violates the designers’ expectations (e.g., it lets through child pornography when the stated intent is not to do so) and cases where it has unintended side effects can both make for interesting stories.

Another editorial criterion that Google uses in its autocomplete results relates to blocking violence. As part of my analysis I also queried the algorithm using 348 words from the Random House “violent actions” list to see whether Google was steering users toward knowledge of how to act violently. Since violence becomes a more interesting story if it’s being suggested toward other people or living things, I filtered my results against man-, woman-, person-, and animal-related word lists, essentially creating a newsworthiness filter. This sped up my ability to go through the results. Rather than reading through 14,000 results, I was reviewing fewer than 1,000.
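The word-list filtering described above might look something like the following minimal sketch. It is not the code used in the original analysis: the function name `newsworthy`, the tokenizing rule, and the sample suggestions and target words are all hypothetical, chosen only to show the idea of keeping results that mention a person or animal.

```python
import re

def newsworthy(suggestions, target_words):
    """Keep only suggestions containing at least one word from target_words.

    Matching is done on whole lowercase words, so "man" does not
    match inside "woman" or "manual".
    """
    targets = {w.lower() for w in target_words}
    kept = []
    for suggestion in suggestions:
        tokens = set(re.findall(r"[a-z']+", suggestion.lower()))
        if tokens & targets:  # any overlap with the target list
            kept.append(suggestion)
    return kept

# Hypothetical autocomplete results and a tiny stand-in target list.
results = [
    "how to stab a person",
    "how to punch a wall",
    "how to kick a dog",
]
targets = ["man", "woman", "person", "dog"]
print(newsworthy(results, targets))
```

Matching on whole words rather than substrings is a design choice that keeps the reduced set cleaner, at the cost of missing plurals and other variants unless they are added to the list.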

Still, even with newsworthiness filters helping to identify possible stories, it’s absolutely essential to have reporters in the loop digging deeper. For every site that was flagged as a statistical hit, Singer-Vine’s team did a much more comprehensive analysis, writing custom code to analyze each. “There’s an incredible role for traditional reporting to play in a story like that,” said Singer-Vine. Knowing what makes something a story is perhaps less about a filter for statistical, social, or legal deviance than it is about understanding the context of the phenomenon, including historical, cultural, and social expectations related to the issue, all things with which traditional reporting and investigation can help. Sure, it can be hard to get the companies running these algorithms to open up in detail about how their algorithms work, but reaching out for interviews can still be valuable. Even a trickle of information about the larger goals and objectives of the algorithms can help you better situate your reverse-engineering analysis. Understanding intent and motives is an important piece of the puzzle. In covering the redistricting story last year, Scott Klein, the news applications editor at ProPublica, considered using some computational means to detect gerrymandering, but quickly decided that “it [gerrymandering] is a motive, not a shape,” which ultimately made traditional reporting techniques much more effective for investigating the story.
