# On the accuracy of testing

Jan 31 JDN 2459246

One of the most important tools we have for controlling the spread of a pandemic is testing to see who is infected. But no test is perfectly reliable. Currently we have tests that are about 80% accurate. But what does it mean to say that a test is “80% accurate”? Many people get this wrong.

First of all, it certainly does not mean that if you have a positive result, you have an 80% chance of having the virus. Yet this is probably what most people think when they hear “80% accurate”.

So I thought it was worthwhile to demystify this a little bit, and explain just what we are talking about when we discuss the accuracy of a test, which turns out to have deep implications not only for pandemics, but for knowledge in general.

There are really two key measures of a test’s accuracy, called sensitivity and specificity. The sensitivity is the probability that, if the true answer is positive (you have the virus), the test result will be positive. This is the sense in which our tests are 80% accurate. The specificity is the probability that, if the true answer is negative (you don’t have the virus), the test result is negative. The terms make sense: A test is sensitive if it always picks up what’s there, and specific if it doesn’t pick up what isn’t there.
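To make the two definitions concrete, here is a minimal sketch that computes them from counts of true and false results. The function name `sensitivity_specificity` and the counts are hypothetical, chosen only for illustration, and it assumes we somehow know everyone's true status.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp/fn: infected people who tested positive/negative.
    tn/fp: uninfected people who tested negative/positive.
    """
    sensitivity = tp / (tp + fn)  # P(test positive | infected)
    specificity = tn / (tn + fp)  # P(test negative | not infected)
    return sensitivity, specificity


# Hypothetical counts: 100 infected people and 1000 uninfected people.
sens, spec = sensitivity_specificity(tp=80, fn=20, tn=950, fp=50)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

Note that each rate is computed within one group only: sensitivity never looks at uninfected people, and specificity never looks at infected ones.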

These two measures need not be the same, and typically are quite different. In fact, there is often a tradeoff between them: Increasing the sensitivity will often decrease the specificity.

This is easiest to see with an extreme example: I can create a COVID test that has “100% accuracy” in the sense of sensitivity. How do I accomplish this miracle? I simply assume that everyone in the world has COVID. Then it is absolutely guaranteed that I will have zero false negatives.

I will of course have many false positives—indeed the vast majority of my “positive results” will be me assuming that COVID is present without any evidence. But I can guarantee a 100% true positive rate, so long as I am prepared to accept a 0% true negative rate.

It’s possible to combine tests to push one of these rates well above what either test achieves alone, though a single rule for reading the pair of results cannot raise both at once. You can run two different tests and then decide how to interpret the combination.

For example, suppose test A has a sensitivity of 70% and a specificity of 90%, while test B has the reverse, and suppose their errors are independent.

If we call the combined result positive whenever either test is positive, then when the true answer is positive there is a 70% + (30%)(90%) = 97% chance of getting a positive result on the combined test. But when the true answer is negative, both tests must come back negative, which happens only (90%)(70%) = 63% of the time.

If we instead call the combined result positive only when both tests are positive, the numbers flip: when the true answer is negative, there is a 90% + (10%)(70%) = 97% chance of getting a negative result on the combined test, but when the true answer is positive there is only a (70%)(90%) = 63% chance of getting a positive result.

Actually if we are going to specify the accuracy of a test in a single number, I think it would be better to use a much more obscure term, the informedness. Informedness is sensitivity plus specificity, minus one. It ranges between -1 and 1, where 1 is a perfect test, and 0 is a test that tells you absolutely nothing. -1 isn’t the worst possible test; it’s a test that’s simply calibrated backwards! Re-label it, and you’ve got a perfect test. So really maybe we should talk about the absolute value of the informedness.

It’s much harder to play tricks with informedness: My “miracle test” that just assumes everyone has the virus actually has an informedness of zero. This makes sense: The “test” actually provides no information you didn’t already have.
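A minimal sketch of the definition, applied to a few extreme cases. The function name `informedness` and the example values are my own illustrative choices.

```python
def informedness(sensitivity, specificity):
    """Informedness: sensitivity plus specificity, minus one."""
    return sensitivity + specificity - 1


cases = {
    "perfect test": (1.0, 1.0),
    "assume everyone is infected": (1.0, 0.0),
    "perfect test with labels swapped": (0.0, 0.0),
    "coin flip": (0.5, 0.5),
}

scores = {name: informedness(sens, spec) for name, (sens, spec) in cases.items()}
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

The "assume everyone is infected" test and the coin flip both score zero, even though one of them has 100% sensitivity.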

Surprisingly, I was not able to quickly find any references to this really neat mathematical result for informedness, but I find it unlikely that I am the only one who came up with it: The informedness of a test is the non-unit eigenvalue of a Markov matrix representing the test. (If you don’t know what all that means, don’t worry about it; it’s not important for this post. I just found it a rather satisfying mathematical result that I couldn’t find anyone else talking about.)
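For readers who do want to check it, here is a small numerical sketch. It assumes the Markov matrix is laid out with one row per true state and one column per test result, so each row sums to 1; that layout is my assumption, as is the helper name `eigenvalues_2x2`.

```python
import math


def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial."""
    trace = a + d
    det = a * d - b * c
    # Guard against a tiny negative value caused by floating-point rounding.
    root = math.sqrt(max(trace * trace - 4 * det, 0.0))
    return (trace + root) / 2, (trace - root) / 2


sens, spec = 0.90, 0.95

# Assumed layout: row 1 is "truly positive", row 2 is "truly negative";
# column 1 is "tests positive", column 2 is "tests negative".
larger, smaller = eigenvalues_2x2(sens, 1 - sens,
                                  1 - spec, spec)

print(f"eigenvalues: {larger:.4f} and {smaller:.4f}")
print(f"informedness: {sens + spec - 1:.4f}")
```

One eigenvalue is always 1 because the rows sum to 1. The two eigenvalues sum to the trace, which is sensitivity plus specificity, so the other one must be sensitivity plus specificity minus one.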

But there’s another problem as well: Even if we know everything about the accuracy of a test, we still can’t infer the probability of actually having the virus from the test result. For that, we need to know the baseline prevalence. Failing to account for that is the very common base rate fallacy.

Here’s a quick example to help you see what the problem is. Suppose that 1% of the population has the virus. And suppose that the tests have 90% sensitivity and 95% specificity. If I get a positive result, what is the probability I have the virus?

If you guessed something like 90%, you have committed the base rate fallacy. It’s actually much smaller than that. In fact, the true probability you have the virus is only 15%.

In a population of 10000 people, 100 (1%) will have the virus while 9900 (99%) will not. Of the 100 who have the virus, 90 (90%) will test positive and 10 (10%) will test negative. Of the 9900 who do not have the virus, 495 (5%) will test positive and 9405 (95%) will test negative.

This means that out of 585 positive test results, only 90 will actually be true positives, and 90/585 is about 15%!
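The same head count can be reproduced in a few lines. This is just a sketch of the arithmetic above; the function name `positive_test_breakdown` is hypothetical.

```python
def positive_test_breakdown(population, prevalence, sensitivity, specificity):
    """Return (true positives, false positives) as expected head counts."""
    infected = population * prevalence
    uninfected = population - infected
    true_pos = infected * sensitivity            # infected and caught
    false_pos = uninfected * (1 - specificity)   # healthy but flagged
    return true_pos, false_pos


true_pos, false_pos = positive_test_breakdown(
    population=10_000, prevalence=0.01, sensitivity=0.90, specificity=0.95
)
ppv = true_pos / (true_pos + false_pos)

print(f"{true_pos:.0f} true positives, {false_pos:.0f} false positives")
print(f"P(infected | positive) = {ppv:.1%}")
```

The false positives dominate simply because there are 99 times as many uninfected people to generate them.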

If we wanted to improve the test so that we could say that someone who tests positive is probably actually positive, would it be better to increase sensitivity or specificity? Well, let’s see.

If we increased the sensitivity to 95% and left the specificity at 95%, we’d get 95 true positives and 495 false positives. This raises the probability to only 16%.

But if we increased the specificity to 97% and left the sensitivity at 90%, we’d get 90 true positives and 297 false positives. This raises the probability all the way to 23%.
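Here is a short sketch comparing the three versions of the test side by side using Bayes' rule directly. The function name and scenario labels are my own; the prevalence, sensitivity, and specificity values are the ones from the example.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(infected | positive test), by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


PREVALENCE = 0.01  # 1% of the population is infected

scenarios = {
    "original (90% / 95%)": (0.90, 0.95),
    "higher sensitivity (95% / 95%)": (0.95, 0.95),
    "higher specificity (90% / 97%)": (0.90, 0.97),
}

results = {
    name: positive_predictive_value(PREVALENCE, sens, spec)
    for name, (sens, spec) in scenarios.items()
}
for name, ppv in results.items():
    print(f"{name}: {ppv:.1%}")
```

When the condition is rare, the false positive count is what drives this probability, which is why two points of specificity buy more than five points of sensitivity here.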

But suppose instead we care about the probability that you don’t have the virus, given that you test negative. Our original test had 9405 true negatives and 10 false negatives, so it was quite good in this regard; if you test negative, you only have a 0.1% chance of having the virus.
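The mirror-image calculation for negative results, as a sketch using the same 1% prevalence, 90% sensitivity, and 95% specificity. The function name `negative_predictive_value` is my own label for it.

```python
def negative_predictive_value(prevalence, sensitivity, specificity):
    """P(not infected | negative test), by Bayes' rule."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)


npv = negative_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(f"P(infected | negative) = {1 - npv:.2%}")
```

With the head counts from the 10000-person example, this is 10 false negatives out of 9415 negative results.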

Which approach is better really depends on what we care about. When dealing with a pandemic, false negatives are much worse than false positives, so we care most about sensitivity. (Though my example should show why specificity also matters.) But there are other contexts in which false positives are more harmful—such as convicting a defendant in a court of law—and then we want to choose a test which has a high true negative rate, even if it means accepting a low true positive rate.

In science in general, we seem to care a lot about false positives; the threshold we hold a p-value to is simply one minus the specificity of the statistical test, and as we all know, low p-values are highly sought after. But the sensitivity of statistical tests (their power) is often quite unclear. This means that we can be reasonably confident of our positive results (provided the baseline probability wasn’t too low, the statistics weren’t p-hacked, etc.); but we really don’t know how confident to be in our negative results. Personally I think negative results are undervalued, and part of how we got a replication crisis and p-hacking was by undervaluing those negative results. I think it would be better in general for us to report 95% confidence intervals (or better yet, 95% Bayesian prediction intervals) for all of our effects, rather than worrying about whether they meet some arbitrary threshold probability of not being exactly zero. Nobody really cares whether the effect is exactly zero (and it almost never is!); we care how big the effect is. I think the long-run trend has been toward this kind of analysis, but it’s still far from the norm in the social sciences. We’ve become utterly obsessed with specificity, and basically forgotten that sensitivity exists.

Above all, be careful when you encounter a statement like “the test is 80% accurate”; what does that mean? 80% sensitivity? 80% specificity? 80% informedness? 80% probability that an observed positive is true? These are all different things, and the difference can matter a great deal.

# Are humans rational?

JDN 2456928 PDT 11:21.

The central point of contention between cognitive economists and neoclassical economists hinges upon the word “rational”: Are humans rational? What do we mean by “rational”?

Neoclassicists are very keen to insist that they think humans are rational, and often characterize the cognitivist view as saying that humans are irrational. (Dan Ariely has a habit of feeding this view, titling books things like Predictably Irrational and The Upside of Irrationality.) But I really don’t think this is the right way to characterize the difference.

Daniel Kahneman has a somewhat better formulation (from Thinking, Fast and Slow): “I often cringe when my work is credited as demonstrating that human choices are irrational, when in fact our research only shows that Humans are not well described by the rational-agent model.” (Yes, he capitalizes the word “Humans” throughout, which is annoying; but in general it is a great book.)

The problem is that saying “humans are irrational” has the connotation of a universal statement; it seems to be saying that everything we do, all the time, is always and everywhere utterly irrational. And this of course could hardly be further from the truth; we would not have even survived in the savannah, let alone invented the Internet, if we were that irrational. If we simply lurched about randomly without any concept of goals or response to information in the environment, we would have starved to death millions of years ago.

But at the same time, the neoclassical definition of “rational” obviously does not describe human beings. We aren’t infinite identical psychopaths. Particularly bizarre (and frustrating) is the continued insistence that rationality entails selfishness; apparently economists are getting all their philosophy from Ayn Rand (who barely even qualifies as such), rather than the greats such as Immanuel Kant and John Stuart Mill or even the best contemporary philosophers such as Thomas Pogge and John Rawls. All of these latter would be baffled by the notion that selfless compassion is irrational.

Indeed, Kant argued that rationality implies altruism, that a truly coherent worldview requires assent to universal principles that are morally binding on yourself and every other rational being in the universe. (I am not entirely sure he is correct on this point, and in any case it is clear to me that neither you nor I are anywhere near advanced enough beings to seriously attempt such a worldview. Where neoclassicists envision infinite identical psychopaths, Kant envisions infinite identical altruists. In reality we are finite diverse tribalists.)

But even if you drop selfishness, the requirements of perfect information and expected utility maximization are still far too strong to apply to real human beings. If that’s your standard for rationality, then indeed humans—like all beings in the real world—are irrational.

The confusion, I think, comes from the huge gap between ideal rationality and total irrationality. Our behavior is neither perfectly optimal nor hopelessly random, but somewhere in between.

In fact, we are much closer to the side of perfect rationality! Our brains are limited, so they operate according to heuristics: simplified, approximate rules that are correct most of the time. Clever experiments—or complex environments very different from how we evolved—can cause those heuristics to fail, but we must not forget that the reason we have them is that they work extremely well in most cases in the environment in which we evolved. We are about 90% rational—but woe betide that other 10%.

The most obvious example is phobias: Why are people all over the world afraid of snakes, spiders, falling, and drowning? Because those used to be leading causes of death. In the African savannah 200,000 years ago, you weren’t going to be hit by a car, shot with a rifle bullet or poisoned by carbon monoxide. (You’d probably die of malaria, actually; for that one, instead of evolving to be afraid of mosquitoes we evolved a biological defense mechanism—sickle-cell red blood cells.) Death in general was actually much more likely then, particularly for children.

A similar case can be made for other heuristics we use: We are tribal because the proper functioning of our 100-person tribe used to be the most important factor in our survival. We are racist because people physically different from us were usually part of rival tribes and hence potential enemies. We hoard resources even when our technology allows abundance, because a million years ago no such abundance was possible and every meal might be our last.

When asked how common something is, we don’t calculate a posterior probability based upon Bayesian inference; that’s hard. Instead we try to think of examples; that’s easy. That’s the availability heuristic. And if we didn’t have mass media constantly giving us examples of rare events we wouldn’t otherwise have known about, the availability heuristic would actually be quite accurate. Right now, people think of terrorism as common (even though it’s astoundingly rare) because it’s always all over the news; but if you imagine living in an ancient tribe (or even a medieval village!), anything you heard about that often would almost certainly be something actually worth worrying about. Our level of panic over Ebola is totally disproportionate; but in the 14th century that same level of panic about the Black Death would be entirely justified.

When we want to know whether something is a member of a category, again we don’t try to calculate the actual probability; instead we think about how well it seems to fit a model we have of the paradigmatic example of that category. This is the representativeness heuristic. You see a Black man on a street corner in New York City at night; how likely is it that he will mug you? Not very likely, actually, because there were fewer than 200,000 crimes in all of New York City last year in a city of 8,000,000 people, meaning the probability any given person committed a crime in the previous year was only 2.5%; the probability on any given day would then be less than 0.01%. Maybe having those attributes raises the probability somewhat, but you can still be about 99% sure that this guy isn’t going to mug you tonight. But since he seemed representative of your mental category of “criminals”, your mind didn’t bother asking how many criminals there are in the first place, an effect called base rate neglect. Even 200 years ago (let alone 1 million) you didn’t have these sorts of reliable statistics, so what else would you use? You basically had no choice but to assess based upon representative traits.

As you probably know, people have trouble dealing with big numbers, and this is a problem in our modern economy where we actually need to keep track of millions or billions or even trillions of dollars moving around. And really I shouldn’t say it that way, because \$1 million (\$1,000,000) is an amount of money an upper-middle class person could have in a retirement fund, while \$1 billion (\$1,000,000,000) would put you among the 1000 richest people in the world, and \$1 trillion (\$1,000,000,000,000) is enough to end world hunger for at least the next 15 years (it would only take about \$1.5 trillion to do it forever, by paying only the interest on the endowment). It’s important to keep this in mind, because otherwise the natural tendency of the human mind is to say “big number” and ignore these enormous differences; this is called scope neglect. But how often do you really deal with numbers that big? In ancient times, never. Even in the 21st century, not very often. You’ll probably never have \$1 billion, and even \$1 million is a stretch, so it seems a bit odd to say that you’re irrational if you can’t tell the difference. I guess technically you are, but it’s an error that is unlikely to come up in your daily life.