# On the accuracy of testing

Jan 31 JDN 2459246

One of the most important tools we have for controlling the spread of a pandemic is testing to see who is infected. But no test is perfectly reliable. Currently we have tests that are about 80% accurate. But what does it mean to say that a test is “80% accurate”? Many people get this wrong.

First of all, it certainly does not mean that if you have a positive result, you have an 80% chance of having the virus. Yet this is probably what most people think when they hear “80% accurate”.

So I thought it was worthwhile to demystify this a little bit, an explain just what we are talking about when we discuss the accuracy of a test—which turns out to have deep implications not only for pandemics, but for knowledge in general.

There are really two key measures of a test’s accuracy, called sensitivity and specificity, The sensitivity is the probability that, if the true answer is positive (you have the virus), the test result will be positive. This is the sense in which our tests are 80% accurate. The specificity is the probability that, if the true answer is negative (you don’t have the virus), the test result is negative. The terms make sense: A test is sensitive if it always picks up what’s there, and specific if it doesn’t pick up what isn’t there.

These two measures need not be the same, and typically are quite different. In fact, there is often a tradeoff between them: Increasing the sensitivity will often decrease the specificity.

This is easiest to see with an extreme example: I can create a COVID test that has “100% accuracy” in the sense of sensitivity. How do I accomplish this miracle? I simply assume that everyone in the world has COVID. Then it is absolutely guaranteed that I will have zero false negatives.

I will of course have many false positives—indeed the vast majority of my “positive results” will be me assuming that COVID is present without any evidence. But I can guarantee a 100% true positive rate, so long as I am prepared to accept a 0% true negative rate.

It’s possible to combine tests in ways that make them more than the sum of their parts. You can first run a test with a high specificity, and then re-test with a test that has a high sensitivity. The result will have both rates higher than either test alone.

For example, suppose test A has a sensitivity of 70% and a specificity of 90%, while test B has the reverse.

Then, if the true answer is positive, test A will return true 70% of the time, while test B will return true 90% of the time. So there is a 70% + (30%)(90%) = 97% chance of getting a positive result on the combined test.

If the true answer is negative, test A will return false 90% of the time, while test B will return false 70% of the time. So there is a 90% + (10%)(70%) = 97% chance of getting a negative result on the combined test.

Actually if we are going to specify the accuracy of a test in a single number, I think it would be better to use a much more obscure term, the informedness. Informedness is sensitivity plus specificity, minus one. It ranges between -1 and 1, where 1 is a perfect test, and 0 is a test that tells you absolutely nothing. -1 isn’t the worst possible test; it’s a test that’s simply calibrated backwards! Re-label it, and you’ve got a perfect test. So really maybe we should talk about the absolute value of the informedness.

It’s much harder to play tricks with informedness: My “miracle test” that just assumes everyone has the virus actually has an informedness of zero. This makes sense: The “test” actually provides no information you didn’t already have.

Surprisingly, I was not able to quickly find any references to this really neat mathematical result for informedness, but I find it unlikely that I am the only one who came up with it: The informedness of a test is the non-unit eigenvalue of a Markov matrix representing the test. (If you don’t know what all that means, don’t worry about it; it’s not important for this post. I just found it a rather satisfying mathematical result that I couldn’t find anyone else talking about.)

But there’s another problem as well: Even if we know everything about the accuracy of a test, we still can’t infer the probability of actually having the virus from the test result. For that, we need to know the baseline prevalence. Failing to account for that is the very common base rate fallacy.

Here’s a quick example to help you see what the problem is. Suppose that 1% of the population has the virus. And suppose that the tests have 90% sensitivity and 95% specificity. If I get a positive result, what is the probability I have the virus?

If you guessed something like 90%, you have committed the base rate fallacy. It’s actually much smaller than that. In fact, the true probability you have the virus is only 15%.

In a population of 10000 people, 100 (1%) will have the virus while 9900 (99%) will not. Of the 100 who have the virus, 90 (90%) will test positive and 10 (10%) will test negative. Of the 9900 who do not have the virus, 495 (5%) will test positive and 9405 (95%) will test negative.

This means that out of 585 positive test results, only 90 will actually be true positives!

If we wanted to improve the test so that we could say that someone who tests positive is probably actually positive, would it be better to increase sensitivity or specificity? Well, let’s see.

If we increased the sensitivity to 95% and left the specificity at 95%, we’d get 95 true positives and 495 false positives. This raises the probability to only 16%.

But if we increased the specificity to 97% and left the sensitivity at 90%, we’d get 90 true positives and 297 false positives. This raises the probability all the way to 23%.

But suppose instead we care about the probability that you don’t have the virus, given that you test negative. Our original test had 9900 true negatives and 10 false negatives, so it was quite good in this regard; if you test negative, you only have a 0.1% chance of having the virus.

Which approach is better really depends on what we care about. When dealing with a pandemic, false negatives are much worse than false positives, so we care most about sensitivity. (Though my example should show why specificity also matters.) But there are other contexts in which false positives are more harmful—such as convicting a defendant in a court of law—and then we want to choose a test which has a high true negative rate, even if it means accepting a low true positive rate.

In science in general, we seem to care a lot about false positives; a p-value is simply one minus the specificity of the statistical test, and as we all know, low p-values are highly sought after. But the sensitivity of statistical tests is often quite unclear. This means that we can be reasonably confident of our positive results (provided the baseline probability wasn’t too low, the statistics weren’t p-hacked, etc.); but we really don’t know how confident to be in our negative results. Personally I think negative results are undervalued, and part of how we got a replication crisis and p-hacking was by undervaluing those negative results. I think it would be better in general for us to report 95% confidence intervals (or better yet, 95% Bayesian prediction intervals) for all of our effects, rather than worrying about whether they meet some arbitrary threshold probability of not being exactly zero. Nobody really cares whether the effect is exactly zero (and it almost never is!); we care how big the effect is. I think the long-run trend has been toward this kind of analysis, but it’s still far from the norm in the social sciences. We’ve become utterly obsessed with specificity, and basically forgot that sensitivity exists.

Above all, be careful when you encounter a statement like “the test is 80% accurate”; what does that mean? 80% sensitivity? 80% specificity? 80% informedness? 80% probability that an observed positive is true? These are all different things, and the difference can matter a great deal.

# Information theory proves that multiple-choice is stupid

Mar 19, JDN 2457832

This post is a bit of a departure from my usual topics, but it’s something that has bothered me for a long time, and I think it fits broadly into the scope of uniting economics with the broader realm of human knowledge.

Multiple-choice questions are inherently and objectively poor methods of assessing learning.

Consider the following question, which is adapted from actual tests I have been required to administer and grade as a teaching assistant (that is, the style of question is the same; I’ve changed the details so that it wouldn’t be possible to just memorize the response—though in a moment I’ll get to why all this paranoia about students seeing test questions beforehand would also be defused if we stopped using multiple-choice):

The demand for apples follows the equation Q = 100 – 5 P.
The supply of apples follows the equation Q = 10 P.
If a tax of \$2 per apple is imposed, what is the equilibrium price, quantity, tax revenue, consumer surplus, and producer surplus?

A. Price = \$5, Quantity = 10, Tax revenue = \$50, Consumer Surplus = \$360, Producer Surplus = \$100

B. Price = \$6, Quantity = 20, Tax revenue = \$40, Consumer Surplus = \$200, Producer Surplus = \$300

C. Price = \$6, Quantity = 60, Tax revenue = \$120, Consumer Surplus = \$360, Producer Surplus = \$300

D. Price = \$5, Quantity = 60, Tax revenue = \$120, Consumer Surplus = \$280, Producer Surplus = \$500

You could try solving this properly, setting supply equal to demand, adjusting for the tax, finding the equilibrium, and calculating the surplus, but don’t bother. If I were tutoring a student in preparing for this test, I’d tell them not to bother. You can get the right answer in only two steps, because of the multiple-choice format.

Step 1: Does tax revenue equal \$2 times quantity? We said the tax was \$2 per apple.
So that rules out everything except C and D. Welp, quantity must be 60 then.

Step 2: Is quantity 10 times price as the supply curve says? For C they are, for D they aren’t; guess it must be C then.

Now, to do that, you need to have at least a basic understanding of the economics underlying the question (How is tax revenue calculated? What does the supply curve equation mean?). But there’s an even easier technique you can use that doesn’t even require that; it’s called Answer Splicing.

Here’s how it works: You look for repeated values in the answer choices, and you choose the one that has the most repeated values. Prices \$5 and \$6 are repeated equally, so that’s not helpful (maybe the test designer planned at least that far). Quantity 60 is repeated, other quantities aren’t, so it’s probably that. Likewise with tax revenue \$120. Consumer surplus \$360 and Producer Surplus \$300 are both repeated, so those are probably it. Oh, look, we’ve selected a unique answer choice C, the correct answer!

You could have done answer splicing even if the question were about 18th century German philosophy, or even if the question were written in Arabic or Japanese. In fact you even do it if it were written in a cipher, as long as the cipher was a consistent substitution cipher.

But of course, all of this could be completely avoided if I had just presented the question as an open-ended free-response. Then you’d actually have to write down the equations, show me some algebra solving them, and then interpret your results in a coherent way to answer the question I asked. What’s more, if you made a minor mistake somewhere (carried a minus sign over wrong, forgot to divide by 2 when calculating the area of the consumer surplus triangle), I can take off a few points for that error, rather than all the points just because you didn’t get the right answer. At the other extreme, if you just randomly guess, your odds of getting the right answer are miniscule, but even if you did—or copied from someone else—if you don’t show me the algebra you won’t get credit.

So the free-response question is telling me a lot more about what the student actually knows, in a much more reliable way, that is much harder to cheat or strategize against.

Moreover, this isn’t a matter of opinion. This is a theorem of information theory.

The information that is carried over a message channel can be quantitatively measured as its Shannon entropy. It is usually measured in bits, which you may already be familiar with as a unit of data storage and transmission rate in computers—and yes, those are all fundamentally the same thing. A proper formal treatment of information theory would be way too complicated for this blog, but the basic concepts are fairly straightforward: think in terms of how long a sequence of 1s and 0s it would take to convey the message. That is, roughly speaking, the Shannon entropy of that message.

How many bits are conveyed by a multiple-choice response with four choices? 2. Always. At maximum. No exceptions. It is fundamentally, provably, mathematically impossible to convey more than 2 bits of information via a channel that only has 4 possible states. Any multiple-choice response—any multiple-choice response—of four choices can be reduced to the sequence 00, 01, 10, 11.

True-false questions are a bit worse—literally, they convey 1 bit instead of 2. It’s possible to fully encode the entire response to a true-false question as simply 0 or 1.

For comparison, how many bits can I get from the free-response question? Well, in principle the answer to any mathematical question has the cardinality of the real numbers, which is infinite (in some sense beyond infinite, in fact—more infinite than mere “ordinary” infinity); but in reality you can only write down a small number of possible symbols on a page. I can’t actually write down the infinite diversity of numbers between 3.14159 and the true value of pi; in 10 digits or less, I can only (“only”) write down a few billion of them. So let’s suppose that handwritten text has about the same information density as typing, which in ASCII or Unicode has 8 bits—one byte—per character. If the response to this free-response question is 300 characters (note that this paragraph itself is over 800 characters), then the total number of bits conveyed is about 2400.

That is to say, one free-response question conveys six hundred times as much information as a multiple-choice question. Of course, a lot of that information is redundant; there are many possible correct ways to write the answer to a problem (if the answer is 1.5 you could say 3/2 or 6/4 or 1.500, etc.), and many problems have multiple valid approaches to them, and it’s often safe to skip certain steps of algebra when they are very basic, and so on. But it’s really not at all unrealistic to say that I am getting between 10 and 100 times as much useful information about a student from reading one free response than I would from one multiple-choice question.

Indeed, it’s actually a bigger difference than it appears, because when evaluating a student’s performance I’m not actually interested in the information density of the message itself; I’m interested in the product of that information density and its correlation with the true latent variable I’m trying to measure, namely the student’s actual understanding of the content. (A sequence of 500 random symbols would have a very high information density, but would be quite useless in evaluating a student!) Free-response questions aren’t just more information, they are also better information, because they are closer to the real-world problems we are training for, harder to cheat, harder to strategize, nearly impossible to guess, and provided detailed feedback about exactly what the student is struggling with (for instance, maybe they could solve the equilibrium just fine, but got hung up on calculating the consumer surplus).

As I alluded to earlier, free-response questions would also remove most of the danger of students seeing your tests beforehand. If they saw it beforehand, learned how to solve it, memorized the steps, and then were able to carry them out on the test… well, that’s actually pretty close to what you were trying to teach them. It would be better for them to learn a whole class of related problems and then be able to solve any problem from that broader class—but the first step in learning to solve a whole class of problems is in fact learning to solve one problem from that class. Just change a few details each year so that the questions aren’t identical, and you will find that any student who tried to “cheat” by seeing last year’s exam would inadvertently be studying properly for this year’s exam. And then perhaps we could stop making students literally sign nondisclosure agreements when they take college entrance exams. Listen to this Orwellian line from the SAT nondisclosure agreement:

Misconduct includes,but is not limited to:

Taking any test questions or essay topics from the testing room, including through memorization, giving them to anyone else, or discussing them with anyone else through anymeans, including, but not limited to, email, text messages or the Internet

Including through memorization. You are not allowed to memorize SAT questions, because God forbid you actually learn something when we are here to make money off evaluating you.

Multiple-choice tests fail in another way as well; by definition they cannot possibly test generation or recall of knowledge, they can only test recognition. You don’t need to come up with an answer; you know for a fact that the correct answer must be in front of you, and all you need to do is recognize it. Recall and recognition are fundamentally different memory processes, and recall is both more difficult and more important.

Indeed, the real mystery here is why we use multiple-choice exams at all.
There are a few types of very basic questions where multiple-choice is forgivable, because there are just aren’t that many possible valid answers. If I ask whether demand for apples has increased, you can pretty much say “it increased”, “it decreased”, “it stayed the same”, or “it’s impossible to determine”. So a multiple-choice format isn’t losing too much in such a case. But most really interesting and meaningful questions aren’t going to work in this format.

I don’t think it’s even particularly controversial among educators that multiple-choice questions are awful. (Though I do recall an “educational training” seminar a few weeks back that was basically an apologia for multiple choice, claiming that it is totally possible to test “higher-order cognitive skills” using multiple-choice, for reals, believe me.) So why do we still keep using them?

Well, the obvious reason is grading time. The one thing multiple-choice does have over a true free response is that it can be graded efficiently and reliably by machines, which really does make a big difference when you have 300 students in a class. But there are a couple reasons why even this isn’t a sufficient argument.

First of all, why do we have classes that big? It’s absurd. At that point you should just email the students video lectures. You’ve already foreclosed any possibility of genuine student-teacher interaction, so why are you bothering with having an actual teacher? It seems to be that universities have tried to work out what is the absolute maximum rent they can extract by structuring a class so that it is just good enough that students won’t revolt against the tuition, but they can still spend as little as possible by hiring only one adjunct or lecturer when they should have been paying 10 professors.

And don’t tell me they can’t afford to spend more on faculty—first of all, supporting faculty is why you exist. If you can’t afford to spend enough providing the primary service that you exist as an institution to provide, then you don’t deserve to exist as an institution. Moreover, they clearly can afford it—they simply prefer to spend on hiring more and more administrators and raising the pay of athletic coaches. PhD comics visualized it quite well; the average pay for administrators is three times that of even tenured faculty, and athletic coaches make ten times as much as faculty. (And here I think the mean is the relevant figure, as the mean income is what can be redistributed. Firing one administrator making \$300,000 does actually free up enough to hire three faculty making \$100,000 or ten grad students making \$30,000.)

But even supposing that the institutional incentives here are just too strong, and we will continue to have ludicrously-huge lecture classes into the foreseeable future, there are still alternatives to multiple-choice testing.

Ironically, the College Board appears to have stumbled upon one themselves! About half the SAT math exam is organized into a format where instead of bubbling in one circle to give your 2 bits of answer, you bubble in numbers and symbols corresponding to a more complicated mathematical answer, such as entering “3/4” as “0”, “3”, “/”, “4” or “1.28” as “1”, “.”, “2”, “8”. This could easily be generalized to things like “e^2” as “e”, “^”, “2” and “sin(3pi/2)” as “sin”, “3” “pi”, “/”, “2”. There are 12 possible symbols currently allowed by the SAT, and each response is up to 4 characters, so we have already increased our possible responses from 4 to over 20,000—which is to say from 2 bits to 14. If we generalize it to include symbols like “pi” and “e” and “sin”, and allow a few more characters per response, we could easily get it over 20 bits—10 times as much information as a multiple-choice question.

But we can do better still! Even if we insist upon automation, high-end text-recognition software (of the sort any university could surely afford) is now getting to the point where it could realistically recognize a properly-formatted algebraic formula, so you’d at least know if the student remembered the formula correctly. Sentences could be transcribed into typed text, checked for grammar, and sorted for keywords—which is not nearly as good as a proper reading by an expert professor, but is still orders of magnitude better than filling circle “C”. Eventually AI will make even more detailed grading possible, though at that point we may have AIs just taking over the whole process of teaching. (Leaving professors entirely for research, presumably. Not sure if this would be good or bad.)

Automation isn’t the only answer either. You could hire more graders and teaching assistants—say one for every 30 or 40 students instead of one for every 100 students. (And then the TAs might actually be able to get to know their students! What a concept!) You could give fewer tests, or shorter ones—because a small, reliable sample is actually better than a large, unreliable one. A bonus there would be reducing students’ feelings of test anxiety. You could give project-based assignments, which would still take a long time to grade, but would also be a lot more interesting and fulfilling for both the students and the graders.

Or, and perhaps this is the most radical answer of all: You could stop worrying so much about evaluating student performance.

I get it, you want to know whether students are doing well, both so that you can improve your teaching and so that you can rank the students and decide who deserves various awards and merits. But do you really need to be constantly evaluating everything that students do? Did it ever occur to you that perhaps that is why so many students suffer from anxiety—because they are literally being formally evaluated with long-term consequences every single day they go to school?

If we eased up on all this evaluation, I think the fear is that students would just detach entirely; all teachers know students who only seem to show up in class because they’re being graded on attendance. But there are a couple of reasons to think that maybe this fear isn’t so well-founded after all.

If you give up on constant evaluation, you can open up opportunities to make your classes a lot more creative and interesting—and even fun. You can make students want to come to class, because they get to engage in creative exploration and collaboration instead of memorizing what you drone on at them for hours on end. Most of the reason we don’t do creative, exploratory activities is simply that we don’t know how to evaluate them reliably—so what if we just stopped worrying about that?

Moreover, are those students who only show up for the grade really getting anything out of it anyway? Maybe it would be better if they didn’t show up—indeed, if they just dropped out of college entirely and did something else with their lives until they get their heads on straight. Maybe all this effort that we are currently expending trying to force students to learn who clearly don’t appreciate the value of learning could instead be spent enriching the students who do appreciate learning and came here to do as much of it as possible. Because, ultimately, you can lead a student to algebra, but you can’t make them think. (Let me be clear, I do not mean students with less innate ability or prior preparation; I mean students who aren’t interested in learning and are only showing up because they feel compelled to. I admire students with less innate ability who nonetheless succeed because they work their butts off, and wish I were quite so motivated myself.)
There’s a downside to that, of course. Compulsory education does actually seem to have significant benefits in making people into better citizens. Maybe if we let those students just leave college, they’d never come back, and they would squander their potential. Maybe we need to force them to show up until something clicks in their brains and they finally realize why we’re doing it. In fact, we’re really not forcing them; they could drop out in most cases and simply don’t, probably because their parents are forcing them. Maybe the signaling problem is too fundamental, and the only way we can get unmotivated students to accept not getting prestigious degrees is by going through this whole process of forcing them to show up for years and evaluating everything they do until we can formally justify ultimately failing them. (Of course, almost by construction, a student who does the absolute bare minimum to pass will pass.) But college admission is competitive, and I can’t shake this feeling there are thousands of students out there who got rejected from the school they most wanted to go to, the school they were really passionate about and willing to commit their lives to, because some other student got in ahead of them—and that other student is now sitting in the back of the room playing with an iPhone, grumbling about having to show up for class every day. What about that squandered potential? Perhaps competitive admission and compulsory attendance just don’t mix, and we should stop compelling students once they get their high school diploma.