# On the accuracy of testing

Jan 31 JDN 2459246

One of the most important tools we have for controlling the spread of a pandemic is testing to see who is infected. But no test is perfectly reliable. Currently we have tests that are about 80% accurate. But what does it mean to say that a test is “80% accurate”? Many people get this wrong.

First of all, it certainly does not mean that if you have a positive result, you have an 80% chance of having the virus. Yet this is probably what most people think when they hear “80% accurate”.

So I thought it was worthwhile to demystify this a little bit, an explain just what we are talking about when we discuss the accuracy of a test—which turns out to have deep implications not only for pandemics, but for knowledge in general.

There are really two key measures of a test’s accuracy, called sensitivity and specificity, The sensitivity is the probability that, if the true answer is positive (you have the virus), the test result will be positive. This is the sense in which our tests are 80% accurate. The specificity is the probability that, if the true answer is negative (you don’t have the virus), the test result is negative. The terms make sense: A test is sensitive if it always picks up what’s there, and specific if it doesn’t pick up what isn’t there.

These two measures need not be the same, and typically are quite different. In fact, there is often a tradeoff between them: Increasing the sensitivity will often decrease the specificity.

This is easiest to see with an extreme example: I can create a COVID test that has “100% accuracy” in the sense of sensitivity. How do I accomplish this miracle? I simply assume that everyone in the world has COVID. Then it is absolutely guaranteed that I will have zero false negatives.

I will of course have many false positives—indeed the vast majority of my “positive results” will be me assuming that COVID is present without any evidence. But I can guarantee a 100% true positive rate, so long as I am prepared to accept a 0% true negative rate.

It’s possible to combine tests in ways that make them more than the sum of their parts. You can first run a test with a high specificity, and then re-test with a test that has a high sensitivity. The result will have both rates higher than either test alone.

For example, suppose test A has a sensitivity of 70% and a specificity of 90%, while test B has the reverse.

Then, if the true answer is positive, test A will return true 70% of the time, while test B will return true 90% of the time. So there is a 70% + (30%)(90%) = 97% chance of getting a positive result on the combined test.

If the true answer is negative, test A will return false 90% of the time, while test B will return false 70% of the time. So there is a 90% + (10%)(70%) = 97% chance of getting a negative result on the combined test.

Actually if we are going to specify the accuracy of a test in a single number, I think it would be better to use a much more obscure term, the informedness. Informedness is sensitivity plus specificity, minus one. It ranges between -1 and 1, where 1 is a perfect test, and 0 is a test that tells you absolutely nothing. -1 isn’t the worst possible test; it’s a test that’s simply calibrated backwards! Re-label it, and you’ve got a perfect test. So really maybe we should talk about the absolute value of the informedness.

It’s much harder to play tricks with informedness: My “miracle test” that just assumes everyone has the virus actually has an informedness of zero. This makes sense: The “test” actually provides no information you didn’t already have.

Surprisingly, I was not able to quickly find any references to this really neat mathematical result for informedness, but I find it unlikely that I am the only one who came up with it: The informedness of a test is the non-unit eigenvalue of a Markov matrix representing the test. (If you don’t know what all that means, don’t worry about it; it’s not important for this post. I just found it a rather satisfying mathematical result that I couldn’t find anyone else talking about.)

But there’s another problem as well: Even if we know everything about the accuracy of a test, we still can’t infer the probability of actually having the virus from the test result. For that, we need to know the baseline prevalence. Failing to account for that is the very common base rate fallacy.

Here’s a quick example to help you see what the problem is. Suppose that 1% of the population has the virus. And suppose that the tests have 90% sensitivity and 95% specificity. If I get a positive result, what is the probability I have the virus?

If you guessed something like 90%, you have committed the base rate fallacy. It’s actually much smaller than that. In fact, the true probability you have the virus is only 15%.

In a population of 10000 people, 100 (1%) will have the virus while 9900 (99%) will not. Of the 100 who have the virus, 90 (90%) will test positive and 10 (10%) will test negative. Of the 9900 who do not have the virus, 495 (5%) will test positive and 9405 (95%) will test negative.

This means that out of 585 positive test results, only 90 will actually be true positives!

If we wanted to improve the test so that we could say that someone who tests positive is probably actually positive, would it be better to increase sensitivity or specificity? Well, let’s see.

If we increased the sensitivity to 95% and left the specificity at 95%, we’d get 95 true positives and 495 false positives. This raises the probability to only 16%.

But if we increased the specificity to 97% and left the sensitivity at 90%, we’d get 90 true positives and 297 false positives. This raises the probability all the way to 23%.

But suppose instead we care about the probability that you don’t have the virus, given that you test negative. Our original test had 9900 true negatives and 10 false negatives, so it was quite good in this regard; if you test negative, you only have a 0.1% chance of having the virus.

Which approach is better really depends on what we care about. When dealing with a pandemic, false negatives are much worse than false positives, so we care most about sensitivity. (Though my example should show why specificity also matters.) But there are other contexts in which false positives are more harmful—such as convicting a defendant in a court of law—and then we want to choose a test which has a high true negative rate, even if it means accepting a low true positive rate.

In science in general, we seem to care a lot about false positives; a p-value is simply one minus the specificity of the statistical test, and as we all know, low p-values are highly sought after. But the sensitivity of statistical tests is often quite unclear. This means that we can be reasonably confident of our positive results (provided the baseline probability wasn’t too low, the statistics weren’t p-hacked, etc.); but we really don’t know how confident to be in our negative results. Personally I think negative results are undervalued, and part of how we got a replication crisis and p-hacking was by undervaluing those negative results. I think it would be better in general for us to report 95% confidence intervals (or better yet, 95% Bayesian prediction intervals) for all of our effects, rather than worrying about whether they meet some arbitrary threshold probability of not being exactly zero. Nobody really cares whether the effect is exactly zero (and it almost never is!); we care how big the effect is. I think the long-run trend has been toward this kind of analysis, but it’s still far from the norm in the social sciences. We’ve become utterly obsessed with specificity, and basically forgot that sensitivity exists.

Above all, be careful when you encounter a statement like “the test is 80% accurate”; what does that mean? 80% sensitivity? 80% specificity? 80% informedness? 80% probability that an observed positive is true? These are all different things, and the difference can matter a great deal.