Classification
Classification decisions involve categorizing a particular entity as a member of a given class based on any number of that entity’s features. Classifications can be built on top of a prioritization step by setting a threshold (e.g., anyone with a GPA above X is classified as being on the honor roll), or through more sophisticated computational procedures involving machine learning or clustering.
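To make the thresholding idea concrete, here is a minimal sketch in Python; the 3.5 cutoff and the student records are hypothetical stand-ins for whatever threshold X an institution might choose.

```python
# Minimal sketch of a threshold-based classifier: anyone with a GPA
# above a chosen cutoff is classified as being on the honor roll.
# The 3.5 cutoff and the sample records are hypothetical illustrations.

HONOR_ROLL_THRESHOLD = 3.5

def classify_honor_roll(gpa: float) -> str:
    """Classify a student from a single feature (GPA) against a fixed threshold."""
    return "honor roll" if gpa > HONOR_ROLL_THRESHOLD else "not honor roll"

students = {"Avery": 3.8, "Blake": 3.2, "Casey": 3.6}
for name, gpa in students.items():
    print(name, "->", classify_honor_roll(gpa))
```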
Google’s Content ID is a good example of an algorithm that makes consequential classification decisions that feed into filtering decisions12. Content ID automatically scans every video uploaded to YouTube, identifying and classifying each one according to whether or not it has a bit of copyrighted music playing during the video. If the algorithm classifies your video as an infringer, it can automatically remove (i.e., filter) that video from the site, or it can initiate a dialogue with the content owner of that music to see if they want to enforce a copyright. Forget the idea of fair use, or a lawyer considering some nuanced and context-sensitive definition of infringement: the algorithm makes a cut-and-dried classification decision for you.
Classification algorithms can have biases and make mistakes, though; there can be uncertainty in the algorithm’s decision to classify one way or another13. Depending on how the classification algorithm is implemented, there may be different sources of error. For example, a supervised machine-learning algorithm uses training data to learn where to place a dividing line separating the classes; which side of that line an entity falls on determines the class to which it belongs. That training data is often gathered from people who manually inspect thousands of examples and tag each instance according to its category. The algorithm learns how to classify based on the definitions and criteria humans used to produce the training data, potentially introducing human bias into the classifier.
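As a rough sketch of how such a supervised classifier comes together, the following Python example uses scikit-learn’s LogisticRegression to learn a dividing line from labeled examples; the two video features, the training records, and the labels are hypothetical, standing in for the thousands of human-tagged instances described above.

```python
# Sketch of supervised classification, assuming scikit-learn is available.
# Each training example is a feature vector describing an uploaded video
# (the two features here are hypothetical), paired with a human-assigned
# label: 1 = "infringing", 0 = "fair use". The learned dividing line is
# only as good as the definitions the human taggers applied.
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [fraction of audio matched, length of match in seconds]
X_train = [[0.90, 120], [0.80, 95], [0.70, 60],   # tagged "infringing" by reviewers
           [0.10, 5],   [0.20, 8],  [0.05, 3]]    # tagged "fair use" by reviewers
y_train = [1, 1, 1, 0, 0, 0]

model = LogisticRegression()
model.fit(X_train, y_train)            # learn a dividing line from human labels

new_video = [[0.40, 30]]               # a borderline upload
print(model.predict(new_video))        # which side of the line it falls on
print(model.predict_proba(new_video))  # the uncertainty in that decision
```

Note that any bias in how the reviewers tagged the training examples is inherited directly by the model; the code has no way to know whether the labels were applied fairly.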
In general, there are two kinds of mistakes a classification algorithm can make, often referred to as false positives and false negatives. Suppose Google is trying to classify a video into one of two categories: “infringing” or “fair use.” A false positive is a video classified as “infringing” when it is actually “fair use.” A false negative, on the other hand, is a video classified as “fair use” when it is in fact “infringing.” A classification algorithm can be tuned to make fewer of either of those mistakes. However, as false positives are tuned down, false negatives will often increase, and vice versa. Tuned to tolerate false positives, the algorithm will mark a lot of fair use videos as infringing; tuned to tolerate false negatives, it will miss a lot of infringing videos altogether. You get the sense that tuning one way or the other can privilege different stakeholders in a decision, implying an essential value judgment by the designer of such an algorithm14. The consequences or risks of each kind of error may vary for different stakeholders depending on how the balance between false positives and false negatives is struck. To understand the power of classification algorithms we need to ask: Are there errors that may be acceptable to the algorithm creator, but do a disservice to the public? And if so, why was the algorithm tuned that way?
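The tradeoff can be seen by sweeping the decision threshold over a set of scored examples, as in this Python sketch; the scores and true labels are invented for illustration and are not real Content ID output.

```python
# Sketch of the false-positive/false-negative tradeoff. Each video has a
# hypothetical model score (likelihood it is "infringing") and a true
# label (1 = infringing, 0 = fair use). Sweeping the decision threshold
# trades one kind of error for the other; where to set it is the
# designer's value judgment.
videos = [  # (score, true_label) -- illustrative data only
    (0.95, 1), (0.80, 1), (0.65, 1), (0.55, 0),
    (0.45, 1), (0.35, 0), (0.20, 0), (0.05, 0),
]

for threshold in (0.3, 0.5, 0.7):
    false_positives = sum(1 for score, label in videos
                          if score >= threshold and label == 0)  # fair use flagged as infringing
    false_negatives = sum(1 for score, label in videos
                          if score < threshold and label == 1)   # infringing videos missed
    print(f"threshold={threshold:.1f}: "
          f"{false_positives} false positives, {false_negatives} false negatives")
```

Running this shows the tension directly: the low threshold produces two false positives and no false negatives, the high threshold produces no false positives and two false negatives, and the middle threshold splits the difference with one of each.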