Criminal Sentencing Algorithm No More Accurate Than Random People on the Internet

An “unbiased” computer algorithm used for informing judicial decisions appears to be no better than the assessments of a random group of people, according to a recent study. What’s more, the algorithm appears to issue racially biased recommendations.

COMPAS, the software that many judges use to inform their sentencing decisions, tends to classify black people as higher risk and white people as lower risk, despite not including explicit information about race. In practice, this translates into more lenient rehabilitation suggestions for white defendants and more rigorous programs for black defendants of the same recidivism risk.

A study of 1,000 defendants suggests that an algorithm used by judges in several states may be wrong a third of the time.

The results—and the bias—were statistically indistinguishable from judgement calls made by human volunteers randomly selected over the internet.

Here’s John Timmer writing at Ars Technica:

The significance of that discrepancy is still the subject of some debate, but two Dartmouth College researchers have asked a more fundamental question: is the software any good? The answer they came up with is “not especially,” as its performance could be matched by recruiting people on Mechanical Turk or performing a simple analysis that only took two factors into account.

The racial bias likely creeps into the COMPAS algorithm through data on arrest rates, which in some cities and counties are skewed. Equivant, the developer of COMPAS, says it relies on 137 different data points when determining rehabilitation programs, but only six when assessing whether an individual is at risk of reoffending.

The new study builds on an investigation by ProPublica, which analyzed COMPAS’s performance in Broward County, Florida, in 2013 and 2014. Researchers at Dartmouth College took data on age, sex, and criminal history for 1,000 defendants and handed it to volunteer “judges” who were recruited over the internet via Amazon’s Mechanical Turk service.

COMPAS was no better than the study’s participants at assessing the risk of a defendant reoffending. The study’s authors compared the predictions made by COMPAS and by the participants with real-world data on reoffenders. A larger proportion of white defendants were wrongly predicted not to reoffend (40.3% humans, 47.9% COMPAS) compared with black defendants (29.2% humans, 30.9% COMPAS). Moreover, a larger proportion of black defendants were wrongly predicted to reoffend (37.1% humans, 40.4% COMPAS) compared with white defendants (27.2% humans, 25.4% COMPAS).

While COMPAS and human judgements were similar, failed judgement calls tended to favor white defendants and disadvantage black defendants. False positives are instances where defendants were predicted to reoffend but did not. False negatives are instances where defendants were predicted not to reoffend but did.
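To make the two error types concrete, here is a minimal sketch of how false positive and false negative rates can be computed from a list of predictions and outcomes. It assumes the conventional definitions (false positives divided by everyone who did not reoffend, false negatives divided by everyone who did), which may not match the study's exact method. The function name and the toy data are hypothetical and are not taken from the study.

```python
def error_rates(predicted, actual):
    """Return (false_positive_rate, false_negative_rate).

    predicted[i] is True if defendant i was predicted to reoffend.
    actual[i] is True if defendant i did reoffend.
    """
    pairs = list(zip(predicted, actual))
    # Predicted to reoffend, but did not.
    false_pos = sum(1 for p, a in pairs if p and not a)
    # Predicted not to reoffend, but did.
    false_neg = sum(1 for p, a in pairs if not p and a)
    did_not_reoffend = sum(1 for _, a in pairs if not a)
    did_reoffend = sum(1 for _, a in pairs if a)
    fpr = false_pos / did_not_reoffend if did_not_reoffend else 0.0
    fnr = false_neg / did_reoffend if did_reoffend else 0.0
    return fpr, fnr


# Toy data, invented purely for illustration.
toy_predicted = [True, True, False, False, True, False]
toy_actual = [True, False, False, True, False, False]
print(error_rates(toy_predicted, toy_actual))
```

Running the same function separately on each group of defendants is what allows the kind of side-by-side comparison reported above.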

The Dartmouth researchers also managed to reproduce the software’s predictions by consulting only 5% of the information the algorithm purportedly considers.

“The widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise,” write the study’s authors. “A simple linear predictor provided with only two features is nearly equivalent to COMPAS with its 137 features.”
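For readers wondering what a two-feature linear predictor looks like, the sketch below shows the general shape. Everything specific in it is an assumption: the article does not name the two features here, so age and number of prior offenses are used as stand-ins, the logistic form is assumed, and the weights are made up for illustration rather than fitted to any data.

```python
import math

# Hypothetical weights, chosen only to illustrate the form of the model.
# They are NOT the study's fitted values.
W_AGE = -0.05
W_PRIORS = 0.25
BIAS = 1.0


def reoffend_probability(age, priors):
    """Weighted sum of two features, squashed into the range 0..1."""
    score = W_AGE * age + W_PRIORS * priors + BIAS
    return 1.0 / (1.0 + math.exp(-score))


def predict_reoffend(age, priors, threshold=0.5):
    """Predict reoffending when the estimated probability meets the threshold."""
    return reoffend_probability(age, priors) >= threshold


print(predict_reoffend(20, 5))
print(predict_reoffend(60, 0))
```

The whole model is two multiplications, two additions, and a threshold, which is the sense in which the authors call it simple.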