By Julia Stoyanovich (Assistant Professor of Computer Science, Drexel University) and Ellen P. Goodman (Professor, Rutgers Law School). This post is derived from their recent Freedom to Tinker post.
ProPublica’s story on “machine bias” in an algorithm used for sentencing defendants amplified calls to make algorithms more transparent and accountable. It has never been clearer that algorithms are political (Gillespie) and embody contested choices (Crawford), and that these choices are largely obscured from public scrutiny (Pasquale and Citron). We see it in controversies over Facebook’s newsfeed, or Google’s search results, or Twitter’s trending topics. Policymakers are considering how to operationalize “algorithmic ethics” and scholars are calling for accountable algorithms (Kroll, et al.).
One kind of algorithm that is at once especially obscure, powerful, and common is the ranking algorithm (Diakopoulos). Algorithms rank individuals to determine credit worthiness, desirability for college admissions and employment, and compatibility as dating partners. They rank countries and companies for sustainability, human rights, transparency and freedom of expression. They encode norms for what counts as the best schools, neighborhoods, societies, businesses, and technologies. Despite their importance, we often know very little about why the top-ranked come out on top. Stakeholders are in the dark: those who are ranked, those who use the rankings, and the public whose world the rankings may shape.
Many rankers, such as Google’s PageRank, do not disclose what precisely they are seeking to measure or what methods they use to do it. Rankers justify this kind of intentional opacity as a defense against manipulation and gaming.
Some rankers partially reveal the logic of their algorithms by disclosing factors and relative weights. An example is the US News ranking of colleges. These rankers engage in what we might call syntactic transparency. But even with this degree of transparency, significant opacity remains, as explained below. Where syntactic transparency is impossible, and even where it exists, we advocate for an alternative goal, interpretability, which rests on making explicit the interactions between the program and the data on which it acts. An interpretable algorithm allows stakeholders to understand the outcomes, not merely the process by which outcomes were produced.
Opacity in Algorithmic Rankers
The simplest kind of ranker is a score-based ranker, which applies a scoring function independently to each item and then sorts the items by their scores. Even these simple rankers can produce opaque results, for the following reasons.
Source 1: The scoring formula alone does not indicate the relative rank of an item. Rankings are, by definition, relative, while scores are absolute. Knowing how the score of an item is computed says little about the outcome — the position of a particular item in the ranking, relative to other items. Is 10.5 a high score or a low score? That depends on how 10.5 compares to the scores of other items, for example to the highest attainable score and to the highest score of some actual item in the input.
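To make this concrete, here is a minimal score-based ranker; the item pools and scores below are invented for illustration. The same score of 10.5 comes out on top in one pool and mid-pack in another:

```python
# A minimal score-based ranker: score each item independently, sort by score.
def rank(items, score):
    """Return (item, position) pairs, best score first."""
    ranked = sorted(items, key=score, reverse=True)
    return [(item, pos + 1) for pos, item in enumerate(ranked)]

score = lambda x: x  # identity scoring, purely for illustration

pool_a = [10.5, 3.0, 2.0]         # here 10.5 is the clear leader...
pool_b = [42.0, 30.0, 10.5, 9.0]  # ...here it is mid-pack

pos_a = dict(rank(pool_a, score))[10.5]
pos_b = dict(rank(pool_b, score))[10.5]
print(pos_a, pos_b)  # 1 3
```

The scoring function is identical in both cases; only the company an item keeps determines its rank.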
Source 2: The weight of an attribute in the scoring formula does not determine its impact on the outcome. For example, consider a ranking of academic programs that weighs faculty size, average publication count, and GRE scores, among other factors. The algorithm might allocate the least weight to faculty size, and even disclose that weight, yet faculty size could end up being the deciding factor that separates top-ranked departments from those in lower ranks. This would happen if faculty size were the most variable factor, or if it were correlated with other factors in ways that amplify its effective weight. In other words, what actually turns out to be important may not be what syntactic transparency reveals.
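A small sketch illustrates the point; the departments, attribute values, and weights below are all hypothetical:

```python
# Hypothetical departments: publication counts and GRE scores barely vary.
departments = [
    {"name": "A", "pubs": 20.0, "gre": 160.0, "faculty": 50},
    {"name": "B", "pubs": 20.0, "gre": 160.0, "faculty": 30},
    {"name": "C", "pubs": 20.0, "gre": 160.0, "faculty": 10},
]

# Faculty size gets the smallest disclosed weight...
weights = {"pubs": 0.5, "gre": 0.4, "faculty": 0.1}

def score(d):
    return sum(w * d[attr] for attr, w in weights.items())

ranking = [d["name"] for d in sorted(departments, key=score, reverse=True)]
print(ranking)  # ['A', 'B', 'C']: faculty size alone decides the order
```

Because the heavily weighted attributes are nearly constant across departments, the nominally least important attribute determines the entire ranking.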
Source 3: The ranking output may be unstable. A ranking may be unstable because of the scores generated on a particular dataset. An example is tied scores, where the tie is broken arbitrarily and is not reflected in the ranking. Syntactic transparency combined with access to the data would let us see this instability, but such access is rare.
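A sketch of the tie problem, with invented scores: two items tie, but the published ranking must still place one above the other, and the choice can hinge on something as arbitrary as input order.

```python
items = [("X", 7.0), ("Y", 7.0), ("Z", 5.0)]  # X and Y tie on score

def positions(items):
    """List item names in rank order, highest score first."""
    return [name for name, _ in sorted(items, key=lambda t: t[1], reverse=True)]

# Python's sort is stable, so the tie is broken by input order alone:
print(positions(items))                  # ['X', 'Y', 'Z']
print(positions(list(reversed(items))))  # ['Y', 'X', 'Z']
```

Nothing in the published ranking signals that positions 1 and 2 were interchangeable.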
Source 4: The ranking methodology may be unstable. The scoring function may produce vastly different rankings with small changes in attribute weights. This is difficult to detect even with syntactic transparency, and even if the data is public. Malcolm Gladwell discusses this issue and gives compelling examples in his 2011 piece, The Order of Things.
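The phenomenon is easy to reproduce in a toy example (attribute values invented): shifting two percentage points of weight from one attribute to the other reverses the ranking.

```python
# Two items with complementary attribute profiles (values invented).
profiles = {"P": (9.0, 1.0), "Q": (1.0, 9.0)}

def ranking(w1):
    """Rank by w1 * attr1 + (1 - w1) * attr2, highest score first."""
    score = lambda name: w1 * profiles[name][0] + (1.0 - w1) * profiles[name][1]
    return sorted(profiles, key=score, reverse=True)

print(ranking(0.49))  # ['Q', 'P']
print(ranking(0.51))  # ['P', 'Q']: a two-point weight shift flips the order
```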
The opacity concerns described here are all due to the interaction between the scoring formula (or, more generally, an a priori postulated model) and the actual dataset being ranked. In a recent paper, one of us observed that structured datasets show rich correlations between item attributes in the presence of ranking, and that such correlations are often local (i.e., are present in some parts of the dataset but not in others). To be clear, this kind of opacity is present whether or not there is syntactic transparency.
Recent scholarship on algorithmic accountability has devalued transparency in favor of verification. The claim is that because algorithmic processes are protean and extremely complex (due to machine learning) or secret (due to trade secrets or privacy concerns), we need to rely on retrospective checks to ensure that the algorithm is performing as promised. Among these checks would be cryptographic techniques like zero-knowledge proofs (Kroll, et al.) to confirm particular features, audits (Sandvig) to assess performance, or reverse engineering with test cases (Perel & Elkin-Koren).
These are valid methods of interrogation, but we do not want to give up on disclosure. Retrospective testing puts a significant burden on users. Proofs are useful only when you know what you are looking for. Reverse engineering with test cases can lead to confirmation bias. All these techniques put the burden of inquiry exclusively on individuals for whom interrogation may be expensive and ultimately fruitless. The burden instead should fall more squarely on the least cost avoider, which will be the vendor who is in a better position to reveal how the algorithm works (even if only partially). What if food manufacturers resisted disclosing ingredients or nutritional values, and instead we were put to the trouble of testing their products or asking them to prove the absence of a substance? That kind of disclosure by verification is very different from having a nutritional label.
What would it take to provide the equivalent of a nutritional label for the process and the outputs of algorithmic rankers? What suffices as an appropriate and feasible explanation depends on the target audience.
For an individual being ranked, a useful description would explain his specific ranked outcome and suggest ways to improve the outcome. What attributes turned out to be most important to the individual’s ranking? When working with data that is not public (e.g., involving credit or medical information about individuals), an explanation mechanism must be mindful of any privacy considerations. Individually-responsive disclosures could be offered in a widget that allows ranked entities to experiment with the results by changing the inputs.
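The core of such a widget could be a simple what-if re-ranking; everything below (attribute names, weights, and data) is hypothetical:

```python
# Hypothetical what-if sketch: how does my position change if one input changes?
weights = {"pubs": 0.6, "gre": 0.4}
pool = {
    "me":    {"pubs": 10.0, "gre": 155.0},
    "dept2": {"pubs": 14.0, "gre": 150.0},
    "dept3": {"pubs": 8.0,  "gre": 165.0},
}

def position(pool, who):
    score = lambda d: sum(w * d[a] for a, w in weights.items())
    ranked = sorted(pool, key=lambda k: score(pool[k]), reverse=True)
    return ranked.index(who) + 1

def what_if(pool, who, attr, new_value):
    trial = {k: dict(v) for k, v in pool.items()}  # leave the real data intact
    trial[who][attr] = new_value
    return position(trial, who)

print(position(pool, "me"))              # 3: current position
print(what_if(pool, "me", "pubs", 13.0)) # 2: raising pubs to 13 moves me up
```

Letting ranked entities perturb their own inputs in this way shows them which attributes their outcome actually hinges on, without disclosing anyone else's private data.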
An individual consumer of a ranked output would benefit from a concise and intuitive description of the properties of the ranking. Based on this explanation, users will get a glimpse of, e.g., the diversity (or lack thereof) that the ranking exhibits in terms of attribute values. Both attributes that comprise the scoring function, if known (or, more generally, features that form part of the model), and attributes that co-occur or even correlate with the scoring attributes, can be described explicitly. Let’s again consider a ranking of academic departments in a field that places a huge emphasis on faculty size. We might want to understand how a ranking on average publication count will over-represent large departments (with large faculties) at the top of the list, while GRE scores do not strongly influence rank.
Figure 1: A hypothetical Ranking Facts label.
Figure 1 presents a hypothetical “nutritional label” for rankings. Inspired by Nutrition Facts, our Ranking Facts label is aimed at the consumer, such as a prospective program applicant, and addresses three of the four opacity sources described above: relativity, impact, and output stability. We do not address methodological stability in the label. How this dimension should be quantified and presented to the user is an open technical problem.
The Ranking Facts show how the properties of the 10 highest-ranked items compare to the entire dataset (Relativity), making explicit cases where the ranges of values, and the median value, are different at the top-10 vs. overall (median is marked with red triangles for faculty size and average publication count). The label lists the attributes that have most impact on the ranking (Impact), presents the scoring formula (if known), and explains which attributes correlate with the computed score. Finally, the label graphically shows the distribution of scores (Stability), explaining that scores differ significantly up to top-10 but are nearly indistinguishable in later positions.
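One way an entry like the Relativity panel might be computed, sketched on synthetic faculty-size data (the numbers and the top-10 cutoff are illustrative, not the label's actual specification):

```python
import statistics

# Compare an attribute's median and range among the top-k items
# against the full dataset, as a Relativity-style summary.
def relativity(values_in_rank_order, k=10):
    top = values_in_rank_order[:k]
    return {
        "top_k_median": statistics.median(top),
        "overall_median": statistics.median(values_in_rank_order),
        "top_k_range": (min(top), max(top)),
        "overall_range": (min(values_in_rank_order), max(values_in_rank_order)),
    }

faculty_size = [55, 50, 48, 47, 45, 44, 42, 40, 38, 36,  # top 10 programs
                30, 28, 25, 22, 20, 18, 15, 12, 10, 8]   # positions 11-20
facts = relativity(faculty_size)
print(facts["top_k_median"], facts["overall_median"])  # 44.5 33.0
```

The gap between the two medians is exactly the kind of fact the label would surface: the top of this ranking looks very different from the dataset as a whole.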
Something like the Ranking Facts label makes the process and outcome of algorithmic ranking interpretable for consumers, and reduces the likelihood of the opacity harms discussed above. Beyond Ranking Facts, it is important to develop interpretability tools that enable vendors to design fair, meaningful and stable ranking processes, and that support external auditing. Promising technical directions include, e.g., quantifying the influence of various features on the outcome under different assumptions about the availability of data and code, and investigating whether provenance techniques can be used to generate explanations.