After choosing an algorithm on which to focus, the challenge then becomes how to sample the input-output relationship of the algorithm in some meaningful way. As indicated in the last section, there are many scenarios with varying degrees of observability as related to algorithmic inputs and outputs. Sometimes everything is out in the open and there are APIs that can be sampled, whereas other times inputs are obfuscated. Figuring out how to observe or simulate those inputs is a key part of a practical investigation involving reverse engineering. Reporting techniques and talking to sources are two ways to try to understand what inputs are being fed into an algorithm, but when trade secrets obscure the process we’re often reduced to guessing (e.g., “Targeting Political Emails” and “Executive Stock Trading Plans”). Figuring out what the algorithm pays attention to input-wise becomes as intriguing a question as how the algorithm transforms input into output.
Given a potentially infinite sampling space, we must define what is interesting and important for us to feed into an algorithm. For my story on search-engine autocompletions I wanted to know which sex-related words were blocked by Google and Bing, as well as whether adding “child” to the query led to any difference in output; were child sex-related queries leading to child pornography as well? The sampling strategy followed from these questions. I constructed a list of 110 sex-related words, drawn from both the work of academic linguists and Urban Dictionary slang, to act as the basis for queries. Of course there are many other words and permutations of query templates that I might have used. The richness of language and diversity of expression mean that it will always be hard to come up with the “right” queries when working with algorithms that deal in human language.
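To make the sampling strategy concrete, here is a minimal sketch of how a word list and a small set of query templates might be expanded into the queries fed to an autocomplete service. It is an illustration under stated assumptions, not the script used for the story: the function name `build_queries`, the placeholder terms, and the two templates are all hypothetical, standing in for the 110-word list and the bare versus “child”-prefixed queries described above.

```python
from itertools import product


def build_queries(terms, templates):
    """Fill every template with every term, keeping first-seen order
    and dropping duplicate queries."""
    seen = set()
    queries = []
    for template, term in product(templates, terms):
        # Normalize so that trivially different strings are not sampled twice.
        query = template.format(term=term).strip().lower()
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


# Placeholder terms (hypothetical); the real list would hold the full vocabulary.
terms = ["term_a", "term_b", "term_c"]

# Two hypothetical templates: the bare term, and the term prefixed with "child".
templates = ["{term}", "child {term}"]

queries = build_queries(terms, templates)
for q in queries:
    print(q)
```

Keeping the templates separate from the word list makes the sampling decision explicit: adding one template multiplies the number of queries by the size of the vocabulary, which is one way to see how quickly the sampling space grows.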
Similarly, for Jeremy Singer-Vine working on the price discrimination story at the WSJ, an initial hurdle for the project was getting a representative sample from enough different and dispersed geographies. There are proxy servers that you can rent in different zip codes to do this, but they’re not available in every area, nor are they often in the same zip codes as residential neighborhoods. Deciding how to sample the input-output relationship of an algorithm is the first key challenge, and a difficult dance between what you can sample and what you would like to sample in order to answer your question.
Of course it’s not just about getting any valid sample either. You also have to make sure that the sample simulates the reality of importance to your audience. This was a key difficulty for Michael Keller’s project on iPhone autocorrections, which eventually demanded he simulate the iPhone with scripts that mimic how a human uses the phone. I had a similar experience using an API to do my analysis of Google and Bing autocompletions: the API results don’t perfectly line up with what the user experiences. For instance, the Google API returns 20 results, but the user interface (UI) shows only four or 10 of them, depending on how preferences are set. The Bing API returns 12 results, but the UI shows only eight. Data returned from the API that never appears in the UI is less significant since users will never encounter it in their daily usage, so I didn’t report on it even though I had collected it.
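The gap between API output and user experience can be handled explicitly when analyzing collected data. The sketch below separates suggestions a user could plausibly see from those that only the API returns. It rests on an assumption the text does not confirm: that the UI displays the first N items of the API list in the same order. The limits come from the figures above (Google taken at its 10-result setting, Bing at eight); the names `UI_VISIBLE` and `split_by_visibility` are hypothetical.

```python
# Number of suggestions shown in the UI, per the figures quoted in the text.
# Google shows four or 10 depending on preferences; 10 is used here.
UI_VISIBLE = {"google": 10, "bing": 8}


def split_by_visibility(engine, api_results, ui_limits=UI_VISIBLE):
    """Split API suggestions into (visible, hidden) lists.

    Assumption: the UI shows the first N API results, in API order.
    """
    limit = ui_limits[engine]
    visible = api_results[:limit]
    hidden = api_results[limit:]
    return visible, hidden


# Simulated API responses (placeholder strings, not real suggestions).
google_results = [f"suggestion {i}" for i in range(20)]
bing_results = [f"suggestion {i}" for i in range(12)]

g_visible, g_hidden = split_by_visibility("google", google_results)
b_visible, b_hidden = split_by_visibility("bing", bing_results)
print(len(g_visible), len(g_hidden))
print(len(b_visible), len(b_hidden))
```

Passing a different `ui_limits` mapping, for example `{"google": 4, "bing": 8}`, would model the other Google preference setting without changing the analysis code.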
In some cases, someone else has already done the sampling of an algorithm and left you with a large dataset that might represent some input-output relationship. Or you may not have any control of inputs because those inputs are actually individual people you’re unable or not ethically willing to simulate, such as in ProPublica’s Message Machine. Though observation of such data can still be useful, I would argue that an experimental methodology is more powerful as it allows you to directly sample the input-output relationship in ways that let you assess particular questions you may have about how the algorithm is functioning. Indeed, there is a strong connection between the reverse engineering I’m espousing here and the scientific method. Some computer scientists have even called computing “the fourth great scientific domain” (after physical, biological, and social domains) due to the sheer complexity of the artificial computing systems humankind has built, so big that their understanding demands study in the same ways as other natural disciplines.38