For centuries judges have had to make guesses about the people in front of them. Will this person commit a crime again? Or is this punishment enough to deter them? Do they have the support they need at home to stay safe and healthy and away from crime? Or will they be thrust back into a situation that drives them to their old ways? Ultimately, judges have to guess.
But recently, judges in states including California and Florida have been given a new piece of information to aid in that guesswork: a “risk assessment score” determined by an algorithm. These algorithms take a whole suite of variables into account and spit out a number (usually between 1 and 10) that estimates the risk that the person in question will wind up back in jail.
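COMPAS itself is a proprietary black box, so nobody outside Northpointe knows exactly what’s under the hood. But to make the idea concrete, here’s a rough sketch of how tools in this family generally work: a statistical model turns a defendant’s answers into a probability of re-arrest, which gets bucketed into that 1-to-10 score. Every feature name and weight below is invented for illustration; none of it comes from the real tool.

```python
# Hypothetical sketch of a recidivism risk score: a logistic model maps a
# defendant's answers to a probability, which is bucketed into a 1-10 score.
# COMPAS is proprietary; these features and weights are made up.
import math

WEIGHTS = {
    "age_at_first_arrest": -0.04,   # invented weight: older first arrest, lower risk
    "prior_convictions":    0.30,
    "unstable_housing":     0.50,
    "unemployed":           0.40,
}
INTERCEPT = -1.5

def recidivism_probability(defendant: dict) -> float:
    """Turn a defendant's answers into an estimated probability of re-arrest."""
    z = INTERCEPT + sum(w * defendant.get(name, 0) for name, w in WEIGHTS.items())
    return 1 / (1 + math.exp(-z))

def risk_score(prob: float) -> int:
    """Bucket the probability into the 1-10 score a judge actually sees."""
    return min(10, int(prob * 10) + 1)

defendant = {"age_at_first_arrest": 19, "prior_convictions": 2,
             "unstable_housing": 1, "unemployed": 1}
p = recidivism_probability(defendant)
print(f"estimated probability: {p:.2f}, risk score: {risk_score(p)}")
```

The point of the sketch is just this: the judge never sees the probability or the weights, only the final number.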
If you’ve read this column before, you probably know where this is going. Algorithms aren’t unbiased, and a recent ProPublica investigation suggests what researchers have long been worried about: that these algorithms might contain latent racial prejudice. According to ProPublica’s evaluation of a particular scoring method called the COMPAS system, which was created by a company called Northpointe, people of color are more likely to get higher scores than white people for essentially the same crimes.
“I am not aware if we have any research on the comparison of judges who do and don’t have access to the scores”
Bias against folks of color isn’t a new phenomenon in the judicial system. (This might be the understatement of the year.) There’s a huge body of research showing that judges, like all humans, are biased. Plenty of studies have shown that, for the same crime, judges tend to sentence a black person more harshly than a white person. It’s important to question biases of all kinds, both human and algorithmic, but it’s also important to question them in relation to one another. And nobody has done that.
I’ve been doing some research of my own into these recidivism algorithms, and when I read the ProPublica story, I came away with the same question I’ve had since I started looking into this: these algorithms are likely biased against people of color, but so are judges. So how do the two compare? How does the bias present in humans stack up against the bias programmed into algorithms?
This shouldn’t be hard to find out: ideally you would split the judges in a single county into two groups, give one group access to a scoring system, and have the other carry on as usual. If you don’t want to A/B test within a county (and there are questions about whether that would be ethical), then simply compare two counties with similar crime rates, one that uses the scores and one that doesn’t. In either case, the essential test is whether these algorithmic recidivism scores exacerbate, reduce, or otherwise change existing bias.
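Here’s a minimal sketch of the analysis either version of that study would boil down to, with invented data and column names: measure the black-white sentencing gap among judges who see the scores, measure it among judges who don’t, and compare. A real study would also control for offense severity, criminal history, and so on.

```python
# Hypothetical comparison: does the black-white sentencing gap differ between
# judges who see risk scores and judges who don't? Records are invented.
from statistics import mean

def sentencing_gap(records):
    """Average sentence for black defendants minus average for white defendants."""
    black = [r["sentence_months"] for r in records if r["race"] == "black"]
    white = [r["sentence_months"] for r in records if r["race"] == "white"]
    return mean(black) - mean(white)

def compare_judges(records):
    with_score = [r for r in records if r["judge_saw_score"]]
    without_score = [r for r in records if not r["judge_saw_score"]]
    return {
        "gap_with_score": sentencing_gap(with_score),
        "gap_without_score": sentencing_gap(without_score),
    }

# Toy records standing in for real sentencing data.
records = [
    {"judge_saw_score": True,  "race": "black", "sentence_months": 14},
    {"judge_saw_score": True,  "race": "white", "sentence_months": 10},
    {"judge_saw_score": False, "race": "black", "sentence_months": 16},
    {"judge_saw_score": False, "race": "white", "sentence_months": 9},
]
print(compare_judges(records))
```

If the gap shrinks when judges see the scores, the algorithm may be blunting human bias; if it grows, it may be amplifying it. That is the comparison nobody has run.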
Most of the stories I’ve read about these sentencing algorithms don’t mention any such studies. But I assumed they existed and just didn’t make the cut in editing.
I was wrong. As far as I can find, and according to everybody I’ve talked to in the field, nobody has done this work, or anything like it. These scores are being used by judges to help them sentence defendants and nobody knows whether the scores exacerbate existing racial bias or not. “I am not aware if we have any research on the comparison of judges who do and don’t have access to the scores,” Kris Hoy, the marketing director of Northpointe, told me.
I tried to reach Sharon Lansing, a researcher who worked on the validation study of COMPAS for New York State. I was told by the Director of Public Information for the New York State Division of Criminal Justice Services that Dr. Lansing wouldn’t be speaking with me, and that “New York State has not done any studies to compare the outcomes and sentences of judges who have seen the results of a COMPAS assessment.” The COMPAS validation report done by California in 2010 makes no mention of any such comparison either.
All the researchers I talked to who study sentencing, risk assessment, and these algorithms said they didn’t know of a single study that compared the sentencing patterns of judges who do and don’t use these scores. There are studies out there on a variety of risk-assessment tools that look at questions of accuracy and reliability, and plenty that compare the algorithms’ guesses about recidivism with who really did return to jail. But there’s nothing that compares judges with and without the scores. Which means that states are using these scores in a variety of contexts without any idea how they might change decisions that shape people’s lives.
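To be clear about what the existing studies do measure: a typical validation check compares the algorithm’s guesses with who actually returned to jail, sometimes broken out by race, which is the spirit of ProPublica’s false-positive analysis. Here’s a rough sketch of that kind of check, with invented records; notice that the judge appears nowhere in it.

```python
# Sketch of an existing-style validation check: among people who did NOT
# reoffend, how many were labeled high risk, broken out by race?
# Records are invented; "high_risk" stands in for a score above some cutoff.
def false_positive_rate(records, race):
    """Share of non-reoffenders of a given race who were labeled high risk."""
    did_not_reoffend = [r for r in records
                        if r["race"] == race and not r["reoffended"]]
    flagged = [r for r in did_not_reoffend if r["high_risk"]]
    return len(flagged) / len(did_not_reoffend)

records = [
    {"race": "black", "high_risk": True,  "reoffended": False},
    {"race": "black", "high_risk": False, "reoffended": False},
    {"race": "white", "high_risk": False, "reoffended": False},
    {"race": "white", "high_risk": True,  "reoffended": True},
    {"race": "white", "high_risk": False, "reoffended": False},
]
for race in ("black", "white"):
    print(race, false_positive_rate(records, race))
```

Studies in that vein exist. What doesn’t exist is the next step: measuring what judges do differently once the score is sitting in front of them.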
This lack of research is especially troublesome when you consider just how long risk assessment scores have been around. Today the focus is on algorithms, but statistical risk assessment has been around for a century. These scores have historically been used to help judges decide which prisoners should get parole, but they’re now creeping into sentencing decisions as well. And in all that time, it seems no one has tried to find out how having a mathematical score in hand affects the biases that already exist in judges.
Sure, there are some hurdles when it comes to setting up a study like this. “I’d imagine there’d be some concern about doing essentially A/B testing in real life, as it were,” said Suresh Venkatasubramanian, a computer scientist who works on a project called Algorithmic Fairness. And there are questions about how amenable judges might be to testing. “Judges don’t want to have their courtrooms turned upside down for the sake of social scientists trying to isolate variables,” said Kirk Heilbrun, who has compared risk assessment tools like COMPAS. But there are ways to study this that don’t involve either of those things, and those studies still haven’t been done.
(There are plenty of arguments against using algorithms to score defendants that have nothing to do with this question of judges with and without the scores. Some people argue, for example, that since the Northpointe system is proprietary, and the defendant cannot see how it works or what it’s doing, it’s unethical to use it in sentencing; other stories have dug into that quandary at length. Others say that the effectiveness of these scores hasn’t been vetted on enough populations yet. They’re all worthy questions, but we won’t get into them here.)
Despite being wary of A/B testing on real cases, Venkatasubramanian agreed that the lack of research on this is strange. “Yes, it’s a problem that no one is even asking the question,” he said. “I agree with you that it’s bewildering how no one has even tried.”
David Abrams, an economist who’s studied racial bias in courtrooms, agreed. “If I were making a decision whether to adopt it, I’d want to see some studies of this type already done,” he said.
So why haven’t these studies been done? There are several possible explanations. “I often wonder if it’s because part of the rationale for using automated methods is ease of use and speed, and having to do elaborate studies on their efficacy defeats the purpose,” said Venkatasubramanian.
It takes time and money to do studies like this, and Abrams argues that there’s little pushing states and law enforcement agencies to spend that time and money validating these things. “The incentives aren’t very powerful here to get things right,” he said. “It’s really rare that judges ever get called to account for individual cases or sentencing patterns. Judicial retention is 98 percent in Cook County, there’s just not a ton to fear.”
And when you think about who these technologies harm, it’s generally the most marginalized. Who is going to call out a biased sentencing pattern and call for more research to compare judges? It’s likely not the black folks being more harshly sentenced by judges.
So we know that these algorithms can be, and often are, biased. But we don’t know how that bias actually plays out in the sentencing decisions of judges who use the scores compared with those who don’t. And that’s a huge question that should be answered before these scores become a default part of the American courtroom.