On Monday, the Supreme Court narrowly upheld a Maryland law allowing the state to collect DNA samples from people arrested for violent crimes and burglary. The decision vastly expands law enforcement’s power to collect Americans’ genetic data, to the openly expressed horror of Antonin Scalia and the other dissenters.
It’s nice to imagine a world in which cracking a case means grabbing a fabric swatch from the crime scene, scanning it with the help of CheekSwab.gov, and then getting a report with the criminal’s name, address, photo, and last 10 tweets. But it’s not going to be that easy.
Simple example: You get DNA from a hair found at the scene of the crime and find six usable places in the genome to test. The chance that any given person is a genetic match at those six places is pretty small, say 1 in 5 million. Now you run the sample through your database and you’re a happy detective because you find just one match. We got him! And when you try the case, the number “1 in 5 million” is going to be front and center. When the DA rips open his dress shirt at the culminating moment of his closing statement, “1 in 5 million” is what’s printed on the tank top underneath.
That’s how I imagine it, anyway.
But that number, impressive as it is, isn’t the right one. What the DA is telling the jury is that there’s a 1 in 5 million chance that an innocent person would have DNA that matched the sample. In other words:
1. If a person has nothing to do with the crime, what’s the chance that person’s genes match the ones in the sample?
But that’s not what we want to know, is it? We want to know the chance that the defendant before us, the guy who matched the sample, is innocent. And that’s a different question:
2. If a person’s genes match the ones in the sample, what’s the chance that person has nothing to do with the crime?
Flipping a probability question like this is apt to change the answer. For instance, if a person is from China, the chance they’re from Yunnan Province is pretty small. But if a person is from Yunnan Province, the chance they’re from China is 100 percent.
The formal way to traverse the gap between these two questions is Bayes’ theorem. But I want to do this a bit more informally.
Remember that our hypothetical DNA database is pretty big; say it includes genetic material from 10 million people. That any individual will match the DNA sample is fantastically improbable, but given 10 million chances, the odds that somebody in the database matches the sample are pretty good. In fact, on average, there should be two matches (10 million people, each with a 1 in 5 million chance of matching), at least one of whom is definitely innocent of the crime! The bigger the database, the more poor innocent saps are likely to get fingered by the matching algorithm. That means the answer to question 2 can be big (like 1 in 2) even when the answer to question 1 is really, really small (like 1 in 5 million).
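For anyone who wants to check the arithmetic, here is a rough sketch in Python using the numbers from the example. The function names are made up for illustration, and the second calculation assumes that each person in the database matches or doesn't independently of everyone else, which is a simplification (relatives share DNA, for one thing).

```python
def expected_matches(database_size: int, match_prob: float) -> float:
    """Average number of chance matches among unrelated people in the database."""
    return database_size * match_prob


def prob_at_least_one_match(database_size: int, match_prob: float) -> float:
    """Chance that at least one unrelated person matches.

    Assumes each person matches independently (an assumption, not a fact
    about real DNA databases).
    """
    return 1 - (1 - match_prob) ** database_size


if __name__ == "__main__":
    match_prob = 1 / 5_000_000    # answer to question 1 in the example
    database_size = 10_000_000    # size of the hypothetical database

    print(f"expected chance matches: {expected_matches(database_size, match_prob):.1f}")
    print(f"chance of at least one:  {prob_at_least_one_match(database_size, match_prob):.2f}")
```

Under those assumptions, the expected number of chance matches comes out to about two, and the chance that at least one person with no connection to the crime matches is roughly 86 percent. Try a database of 100 million and both numbers get worse for the innocent.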