The sausage of statistics being made

 

Nov 11 JDN 2458434

“Laws, like sausages, cease to inspire respect in proportion as we know how they are made.”

~ John Godfrey Saxe, not Otto von Bismarck

Statistics are a bit like laws and sausages. There are a lot of things in statistical practice that don’t align with statistical theory. The most obvious example is that many results in statistics are asymptotic: they strictly apply only to infinitely large samples, and in any finite sample they are some sort of approximation (often we don’t even know how good an approximation).

But the problem runs deeper than this: The whole idea of a p-value was originally supposed to be used to assess one single hypothesis that is the only one you test in your entire study.

That’s frankly a ludicrous expectation: Why would you write a whole paper just to test one parameter?

This is why I don’t actually think this so-called multiple comparisons problem is a problem with researchers doing too many hypothesis tests; I think it’s a problem with statisticians being fundamentally unreasonable about what statistics is useful for. We have to do multiple comparisons, so you should be telling us how to do it correctly.
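To make the scale of the problem concrete, here is a minimal sketch of how fast false positives pile up when a study runs more than one test. It assumes, unrealistically, that the tests are independent and that every null hypothesis is true; the function name is purely illustrative.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which real studies rarely satisfy.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Compare a single test with the handful or dozens a real paper contains.
for m in (1, 5, 20):
    print(m, round(familywise_error_rate(0.05, m), 3))
```

Under those assumptions, one test at the 0.05 level has a 5% chance of a spurious hit, five tests have about a 23% chance of at least one, and twenty tests have about a 64% chance.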

Statisticians have this beautiful pure mathematics that generates all these lovely asymptotic results… and then they stop, as if they were done. But we aren’t dealing with infinite or even “sufficiently large” samples; we need to know what happens when your sample is 100, not when your sample is 10^29. We can’t assume that our variables are independent and identically distributed; we don’t know their distribution, and we’re pretty sure they’re going to be somewhat dependent.

Even in an experimental context where we can randomly and independently assign some treatments, we can’t do that with lots of variables that are likely to matter, like age, gender, nationality, or field of study. And applied econometricians are in an even tighter bind; they often can’t randomize anything. They have to rely upon “instrumental variables” that they hope are “close enough to randomized” relative to whatever they want to study.

In practice what we tend to do is… fudge it. We use the formal statistical methods, and then we step back and apply a series of informal norms to see if the result actually makes sense to us. This is why almost no psychologists were actually convinced by Daryl Bem’s precognition experiments, despite his standard experimental methodology and perfect p < 0.05 results; he couldn’t pass any of the informal tests, particularly the most basic one of not violating any known fundamental laws of physics. We knew he had somehow cherry-picked the data, even before looking at it; nothing else was possible.

This is actually part of where the “hierarchy of sciences” notion is useful: One of the norms is that you’re not allowed to break the rules of the sciences above you, but you can break the rules of the sciences below you. So psychology has to obey physics, but physics doesn’t have to obey psychology. I think this is also part of why there’s so much enmity between economists and anthropologists; really we should be on the same level, cognizant of each other’s rules, but economists want to be above anthropologists so we can ignore culture, and anthropologists want to be above economists so they can ignore incentives.

Another informal norm is the “robustness check”, in which the researcher runs a dozen different regressions approaching the same basic question from different angles. “What if we control for this? What if we interact those two variables? What if we use a different instrument?” In terms of statistical theory, this doesn’t actually make a lot of sense; the probability distributions f(y|x) of y conditional on x and f(y|x, z) of y conditional on x and z are not the same thing, and wouldn’t in general be closely tied, depending on the distribution f(x|z) of x conditional on z. But in practice, most real-world phenomena are going to continue to show up even as you run a bunch of different regressions, and so we can be more confident that something is a real phenomenon insofar as that happens. If an effect drops out when you switch out a couple of control variables, it may have been a statistical artifact. But if it keeps appearing no matter what you do to try to make it go away, then it’s probably a real thing.

Because of the powerful career incentives toward publication and the strange obsession among journals with a p-value less than 0.05, another norm has emerged: Don’t actually trust p-values that are close to 0.05. The vast majority of the time, a p-value of 0.047 was the result of publication bias. Now if you see a p-value of 0.001, maybe then you can trust it—but you’re still relying on a lot of assumptions even then. I’ve seen some researchers argue that because of this, we should tighten our standards for publication to something like p < 0.01, but that’s missing the point; what we need to do is stop publishing based on p-values. If you tighten the threshold, you’re just going to get more rejected papers and then the few papers that do get published will now have even smaller p-values that are still utterly meaningless.

These informal norms protect us from the worst outcomes of bad research. But they are almost certainly not optimal. It’s all very vague and informal, and different researchers will often disagree vehemently over whether a given interpretation is valid. What we need are formal methods for solving these problems, so that we can have the objectivity and replicability that formal methods provide. Right now, our existing formal tools simply are not up to that task.

There are some things we may never be able to formalize: If we had a formal algorithm for coming up with good ideas, the AIs would already rule the world, and this would be either Terminator or The Culture depending on whether we designed the AIs correctly. But I think we should at least be able to formalize the basic question of “Is this statement likely to be true?” that is the fundamental motivation behind statistical hypothesis testing.

I think the answer is likely to be in a broad sense Bayesian, but Bayesians still have a lot of work left to do in order to give us really flexible, reliable statistical methods we can actually apply to the messy world of real data. In particular, tell us how to choose priors please! Prior selection is a fundamental make-or-break problem in Bayesian inference that has nonetheless been greatly neglected by most Bayesian statisticians. So, what do we do? We fall back on informal norms: Try maximum likelihood, which is like using a very flat prior. Try a normally-distributed prior. See if you can construct a prior from past data. If all those give the same thing, that’s a “robustness check” (see previous informal norm).

Informal norms are also inherently harder to teach and learn. I’ve seen a lot of other grad students flail wildly at statistics, not because they don’t know what a p-value means (though maybe that’s also sometimes true), but because they don’t really quite grok the informal underpinnings of good statistical inference. This can be very hard to explain to someone: They feel like they followed all the rules correctly, but you are saying their results are wrong, and now you can’t explain why.

In fact, some of the informal norms that are in wide use are clearly detrimental. In economics, norms have emerged that certain types of models are better simply because they are “more standard”, such as the dynamic stochastic general equilibrium models that can basically be fit to everything and have never actually usefully predicted anything. In fact, the best ones just predict what we already knew from Keynesian models. But without a formal norm for testing the validity of models, it’s been “DSGE or GTFO”. At present, it is considered “nonstandard” (read: “bad”) not to assume that your agents are either a single unitary “representative agent” or a continuum of infinitely-many agents—modeling the actual fact of finitely-many agents is just not done. Yet it’s hard for me to imagine any formal criterion that wouldn’t at least give you some points for correctly including the fact that there is more than one but less than infinity people in the world (obviously your model could still be bad in other ways).

I don’t know what these new statistical methods would look like. Maybe it’s as simple as formally justifying some of the norms we already use; maybe it’s as complicated as taking a fundamentally new approach to statistical inference. But we have to start somewhere.

If you really want grad students to have better mental health, remove all the high-stakes checkpoints

Post 260: Oct 14 JDN 2458406

A study was recently published in Nature Biotechnology showing clear evidence of a mental health crisis among graduate students (no, I don’t know why they picked the biotechnology imprint—I guess it wasn’t good enough for Nature proper?). This is only the most recent of several studies showing exceptionally high rates of mental health issues among graduate students.

I’ve seen universities do a lot of public hand-wringing and lip service about this issue—but I haven’t seen any that were seriously willing to do what it takes to actually solve the problem.

I think this fact became clearest to me when I was required to fill out an official “Individual Development Plan” form as a prerequisite for my advancement to candidacy, which included one question about “What are you doing to support your own mental health and work/life balance?”

The irony here is absolutely excruciating, because advancement to candidacy has been overwhelmingly my leading source of mental health stress for at least the last six months. And it is only one of several different high-stakes checkpoints that grad students are expected to complete, always threatened with defunding or outright expulsion from the graduate program if the checkpoint is not met by a certain arbitrary deadline.

The first of these was the qualifying exams. Then comes advancement to candidacy. Then I have to complete and defend a second-year paper, then a third-year paper. Finally I have to complete and defend a dissertation, and then go onto the job market and go through a gauntlet of applications and interviews. I can’t think of any other time in my life when I was under this much academic and career pressure this consistently—even finishing high school and applying to college wasn’t like this.

If universities really wanted to improve my mental health, they would find a way to get rid of all that.

Granted, a single university does not have total control over all this: There are coordination problems between universities regarding qualifying exams, advancement, and dissertation requirements. One university that unilaterally tried to remove all these would rapidly lose prestige, as it would not be regarded as “rigorous” to reduce the pressure on your grad students. But that itself is precisely the problem—we have equated “rigor” with pressuring grad students until they are on the verge of emotional collapse. Universities don’t seem to know how to make graduate school difficult in the ways that would actually encourage excellence in research and teaching; they simply know how to make it difficult in ways that destroy their students psychologically.

The job market is even more complicated; in the current funding environment, it would be prohibitively expensive to open up enough faculty positions to actually accept even half of all graduating PhDs to tenure-track jobs. Probably the best answer here is to refocus graduate programs on supporting employment outside academia, recognizing both that PhD-level skills are valuable in many workplaces and that not every grad student really wants to become a professor.

But there are clearly ways that universities could mitigate these effects, and they don’t seem genuinely interested in doing so. They could remove the advancement exam, for example; you could simply advance to candidacy as a formality when your advisor decides you are ready, never needing to actually perform a high-stakes presentation before a committee—because what the hell does that accomplish anyway? Speaking of advisors, they could have a formalized matching process that starts with interviewing several different professors and being matched to the one that best fits your goals and interests, instead of expecting you to reach out on your own and hope for the best. They could have you write a dissertation, but not perform a “dissertation defense”—because, again, what can they possibly learn from forcing you to present in a high-stakes environment that they couldn’t have learned from reading your paper and talking with you about it over several months?

They could adjust or even remove funding deadlines—especially for international students. Here at UCI at least, once you are accepted to the program, you are ostensibly guaranteed funding for as long as you maintain reasonable academic progress—but then they define “reasonable progress” in such a way that you have to form an advancement committee, fill out forms, write a paper, and present before a committee all by a certain date or your funding is in jeopardy. Residents of California (which includes all US students who successfully established residency after a full year) are given more time if we need it—but international students aren’t. How is that fair?

The unwillingness of universities to take such actions clearly shows that their commitment to improving students’ mental health is paper-thin. They are only willing to help their students improve their work-life balance as long as it doesn’t require changing anything about the graduate program. They will provide us with counseling services and free yoga classes, but they won’t seriously reduce the pressure they put on us at every step of the way.

I understand that universities are concerned about protecting their prestige, but I ask them this: Does this really improve the quality of your research or teaching output? Do you actually graduate better students by selecting only the ones who can survive being emotionally crushed? Do all these arbitrary high-stakes performances actually result in greater advancement of human knowledge?

Or is it perhaps that you yourselves were put through such hazing rituals years ago, and now your cognitive dissonance won’t let you admit that it was all for naught? “This must be worth doing, or else they wouldn’t have put me through so much suffering!” Are you trying to transfer your own psychological pain onto your students, lest you be forced to face it yourself?

Is grade inflation a real problem?

Mar 4 JDN 2458182

You can’t spend much time teaching at the university level and not hear someone complain about “grade inflation”. Almost every professor seems to believe in it, and yet they must all be participating in it, if it’s really such a widespread problem.

This could be explained as a collective action problem, a Tragedy of the Commons: If the incentives are always to have the students with the highest grades—perhaps because of administrative pressure, or in order to get better reviews from students—then even if all professors would prefer a harsher grading scheme, no individual professor can afford to deviate from the prevailing norms.

But in fact I think there is a much simpler explanation: Grade inflation doesn’t exist.

In economic growth theory, economists make a sharp distinction between inflation (an increase in prices without any change in underlying fundamentals) and growth (an increase in the real value of output). I contend that there is no such thing as grade inflation; what we are in fact observing is grade growth.

Am I saying that students are actually smarter now than they were 30 years ago?

Yes. That’s exactly what I’m saying.

But don’t take it from me. Take it from the decades of research on the Flynn Effect: IQ scores have been rising worldwide at a rate of about 0.3 IQ points per year for as long as we’ve been keeping good records. Students today are about 10 IQ points smarter than students 30 years ago—a 2018 IQ score of 95 is equivalent to a 1988 score of 105, which is equivalent to a 1958 score of 115. There is reason to think this trend won’t continue indefinitely, since the effect is mainly concentrated at the bottom end of the distribution; but it has continued for quite some time already.

This by itself would probably be enough to explain the observed increase in grades, but there’s more: College students are also a self-selected sample, admitted precisely because they were believed to be the smartest individuals in the application pool. Rising grades at top institutions are easily explained by rising selectivity at top schools: Harvard now accepts 5.6% of applicants. In 1942, Harvard accepted 92% of applicants. The odds of getting in have fallen from more than 11:1 in favor to about 17:1 against. Today, you need a 4.0 GPA, a 36 ACT in every category, glowing letters of recommendation, and hundreds of hours of extracurricular activities (or a family member who donated millions of dollars, of course) to get into Harvard. In the 1940s, you needed a high school diploma and a B average.
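As a quick check on the conversion from acceptance rates to odds, here is a minimal sketch using the two rates quoted above. The helper names are illustrative only.

```python
def odds_in_favor(probability: float) -> float:
    """Odds of success, expressed as successes per failure."""
    return probability / (1.0 - probability)


def odds_against(probability: float) -> float:
    """Odds of failure, expressed as failures per success."""
    return (1.0 - probability) / probability


# Acceptance rates quoted in the text: 92% in 1942, 5.6% now.
print(round(odds_in_favor(0.92), 1))
print(round(odds_against(0.056), 1))
```

A 92% acceptance rate works out to about 11.5:1 in favor, and a 5.6% rate to about 16.9:1 against.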

In fact, when educational researchers have tried to quantitatively study the phenomenon of “grade inflation”, they usually come back with the result that they simply can’t find it. The US Department of Education conducted a study in 1995 showing that average university grades had declined since 1965. Given that the Flynn effect raised IQ by almost 10 points during that time, maybe we should be panicking about grade deflation.

It really wouldn’t be hard to make that case: “Back in my day, you could get an A just by knowing basic algebra! Now they want these kids to take partial derivatives?” “We used to just memorize facts to ace the exam; but now teachers keep asking for reasoning and critical thinking?”

More recently, a study in 2013 found that grades rose at the high school level, but fell at the college level, and showed no evidence of losing any informativeness as a signaling mechanism. The only recent study I could find showing genuinely compelling evidence for grade inflation was a 2017 study of UK students estimating that grades are growing about twice as fast as the Flynn effect alone would predict. Most studies don’t even consider the possibility that students are smarter than they used to be—they just take it for granted that any increase in average grades constitutes grade inflation. Many of them don’t even control for the increase in selectivity—here’s one using the fact that Harvard’s average rose from 2.7 to 3.4 from 1960 to 2000 as evidence of “grade inflation” when Harvard’s acceptance rate fell from almost 30% to only 10% during that period.

Indeed, the real mystery is why so many professors believe in grade inflation, when the evidence for it is so astonishingly weak.

I think it’s the availability heuristic. Who are professors? They are the cream of the crop. They aced their way through high school, college, and graduate school, then got hired and earned tenure; each was one of a handful of individuals who won a fierce competition with hundreds of competitors at each stage. There are over 320 million people in the US, and only 1.3 million college faculty. This means that college professors represent about the top 0.4% of the population by academic achievement.

Combine that with the fact that human beings assort positively (we like to spend time with people who are similar to us) and use the availability heuristic (we judge how likely something is based on how many times we have seen it).

Thus, when a professor compares her students to her own experience of college, she is remembering her fellow top-scoring students at elite educational institutions. She is recalling the extreme intellectual demands she had to meet to get where she is today, and erroneously assuming that these are representative of most of the population of her generation. She probably went to school at one of a handful of elite institutions, even if she now teaches at a mid-level community college: three quarters of college faculty come from the top one quarter of graduate schools.

And now she compares that memory to the students she has to teach, most of whom would not be able to meet such demands; but of course most people in her generation couldn’t either. She frets for the future of humanity only because not everyone is a genius like her.

Throw in the Curse of Knowledge: The professor doesn’t remember how hard it was to learn what she has learned so far, and so the fact that it seems easy now makes her think it was easy all along. “How can they not know how to take partial derivatives!?” Well, let’s see… were you born knowing how to take partial derivatives?

Giving a student an A for work far inferior to what you’d have done in their place isn’t unfair. Indeed, it would clearly be unfair to do anything less. You are years if not decades of education ahead of them, and you are from a self-selected elite sample of highly intelligent individuals. Expecting everyone to perform as well as you would is simply setting up most of the population for failure.

There are potential incentives for grade inflation that do concern me: In particular, a lot of international student visas and scholarship programs insist upon maintaining a B or even A- average to continue. Professors are understandably loath to condemn a student to having to drop out or return to their home country just because they scored 81% instead of 84% on the final exam. If we really intend to make C the average score, then students shouldn’t lose funding or visas just for scoring a B-. Indeed, I have trouble defending any threshold stricter than simply not failing, which is to say, a minimum score of D-. If you pass your classes, that should be good enough to keep your funding.

Yet apparently even this isn’t creating too much upward bias, as students who are 10 IQ points smarter are still getting about the same scores as their forebears. We should be celebrating that our population is getting smarter, but instead we’re panicking over “easy grading”.

But kids these days, am I right?

Stop telling people they need to vote. Tell them they need to cast informed votes.

Feb 11 JDN 2458161

I just spent last week’s post imploring you to defend the norms of democracy. This week, I want to talk about a norm of democracy that I actually think needs an adjustment.

Right now, there is a very strong norm that simply says: VOTE.

“It is our civic duty to vote.” “You are unpatriotic if you don’t vote.” “Voting is a moral obligation.” Etc.

The goal here is laudable: We want people to express the altruistic motivation that will drive them to escape the so-called Downs Paradox and actually go vote to make democracy work.

But the norm is missing something quite important. It’s not actually such a great thing if everyone just goes out and votes, because most people are seriously, disturbingly uninformed about politics.

The norm shouldn’t be that you must vote. The norm should be that you must cast an informed vote.

Best if you vote informed, but if you won’t get informed, then better if you don’t vote at all. Adding random noise or bias toward physical attractiveness and height does not improve electoral outcomes.

How uninformed are voters?

Most voters don’t understand even basic facts about the federal budget, like the fact that Medicare and Social Security spending are more than defense spending, or the fact that federal aid and earmarks are tiny portions of the budget. A couple years ago I had to debunk a meme that was claiming that we spend a vastly larger portion of the budget on defense than we actually do.

It gets worse: Only a quarter of Americans can even name all three branches of government. Almost half couldn’t identify the Bill of Rights. We literally required them to learn this in high school. By law they were supposed to know this.

But of course I’m not one of the ignorant ones, right? In a classic case of the Dunning-Kruger effect, nobody ever thinks they are. When asked to predict if they would pass the civics exam required to obtain citizenship, 89% of voters surveyed predicted they would. When they took it, only 17% actually passed it. (For the record, I took it and got a perfect score. You can try it yourself here.)

More informed voters already tend to be more politically engaged. But they are almost evenly divided between Democrats and Republicans, which means (especially with the way the Electoral College works) that elections are primarily determined by low-information voters. Low-information voters were decisive for Trump in a way that is unprecedented for as far back as we have data on voter knowledge (which, sadly, is not all that far back).

To be fair, more information is no panacea; humans are very good at rationalizing beliefs that they hold for tribal reasons. People who follow political news heavily typically have more distorted views on some political issues, because they only hear one side and they think they know but they don’t. To truly be more informed voters we must seek out information from reliable, nonpartisan sources, and listen to a variety of sources with differing views. Get your ideas about climate change from NPR or the IPCC, not from Huffington Post—and certainly not from Fox News. But still, maybe it’s worth reading National Review or Reason on occasion. Even when they are usually wrong, it is good for you to expose yourself to views from the other side—because sometimes they can be right. (Reason recently published an excellent article on the huge waste of government funds on building stadiums, for example, and National Review made some really good points against the New Mexico proposal to mandate college applications for high school graduates.)

And of course even those of us who are well-informed obviously have lots of other things we don’t know. Given my expertise in economics and my level of political engagement, I probably know more about politics than 99% of American voters; but I still can’t name more than a handful of members of Congress or really any state legislators aside from the ones who ran for my own district. I can’t even off the top of my head recall who heads the Orange County Water District, even though they literally decide whether I get to drink and take a shower. I’m not asking voters to know everything there is to know about politics, as no human being could possibly do such a thing. I’m merely asking that they know enough basic information to make an informed decision about who to vote for.

Moreover, I think this is a unique time in history where changing this norm has really become viable. We are living in a golden age of information access: almost literally anything you could care to know about politics, you could find in a few minutes of Google searching. I didn’t know who ran my water district, but I looked it up, and I do now: apparently Stephen R. Sheldon. I can’t name that many members of Congress, but I don’t vote for that many members of Congress, and I do carefully research each candidate running in my district when it comes time to vote. (In the next Congressional election, Mimi Walters has got to go; she has consistently failed to stand against Trump, choosing her party over her constituency.)

This means that if you are uninformed about politics and yet still vote, you chose to do that. You aren’t living in a world where it’s extremely expensive or time-consuming to learn about politics. It is spectacularly easy to learn about politics if you actually want to; if you didn’t learn, it was because you chose not to learn. And if even this tiny cost is too much for you, then how about this? If you don’t have time to get informed, you don’t have time to vote.

Voting electronically would also help with this. People could, in the privacy of their own homes, look up information on candidates while their ballots are right there in front of them. While mail-in voter fraud actually does exist (unlike in-person voter fraud, which basically doesn’t), there are safeguards already in widespread use in Internet-based commerce that we could institute on electronic voting to provide sufficient protection. Basically, all we need to do is public-key signing: issue every voter a private key to sign their votes, which are then verified at the county office using a database of public keys. If public keys were stolen, that could compromise secret-ballot anonymity, but it would not allow anyone to actually change votes. Voters could come in person to collect their private keys when they register to vote, at their convenience weeks or months before the election. Of course, we’d have to make it user-friendly enough that people who aren’t very good with computers would understand the system. We could always leave open the option of in-person voting for anyone who prefers that.
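The signing idea can be sketched in a few lines. This is a toy illustration only, assuming textbook RSA with no padding and a key built from two well-known Mersenne primes; it is nowhere near secure, and a real system would rely on a vetted cryptography library. All names and the sample ballot text are hypothetical.

```python
import hashlib

# Toy key material (assumption for illustration): two known Mersenne primes.
# Far too small and too structured for real use.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent, held by the county office
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, held by the voter (Python 3.8+)


def digest(ballot: str) -> int:
    """Hash the ballot text and reduce it into the range of the modulus."""
    raw = hashlib.sha256(ballot.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def sign(ballot: str, private_exponent: int) -> int:
    """The voter signs the ballot's digest with the private key."""
    return pow(digest(ballot), private_exponent, N)


def verify(ballot: str, signature: int, public_exponent: int) -> bool:
    """The office checks the signature using only the public key."""
    return pow(signature, public_exponent, N) == digest(ballot)


signature = sign("Measure A: yes", D)
print(verify("Measure A: yes", signature, E))  # the ballot as cast
print(verify("Measure A: no", signature, E))   # an altered ballot
```

The point of the design is that checking a signature needs only the public key, so someone who steals the office's database can check votes but cannot forge or alter them without the voter's private key.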

Of course, establishing this norm would most likely reduce voter turnout, even if it did successfully increase voter knowledge. But we don’t actually need everyone to vote. We need everyone’s interests accurately represented. If you aren’t willing to get informed, then casting your vote isn’t representing your interests anyway, so why bother?

Information theory proves that multiple-choice is stupid

Mar 19, JDN 2457832

This post is a bit of a departure from my usual topics, but it’s something that has bothered me for a long time, and I think it fits broadly into the scope of uniting economics with the broader realm of human knowledge.

Multiple-choice questions are inherently and objectively poor methods of assessing learning.

Consider the following question, which is adapted from actual tests I have been required to administer and grade as a teaching assistant (that is, the style of question is the same; I’ve changed the details so that it wouldn’t be possible to just memorize the response—though in a moment I’ll get to why all this paranoia about students seeing test questions beforehand would also be defused if we stopped using multiple-choice):

The demand for apples follows the equation Q = 100 – 5 P.
The supply of apples follows the equation Q = 10 P.
If a tax of $2 per apple is imposed, what is the equilibrium price, quantity, tax revenue, consumer surplus, and producer surplus?

A. Price = $5, Quantity = 10, Tax revenue = $50, Consumer Surplus = $360, Producer Surplus = $100

B. Price = $6, Quantity = 20, Tax revenue = $40, Consumer Surplus = $200, Producer Surplus = $300

C. Price = $6, Quantity = 60, Tax revenue = $120, Consumer Surplus = $360, Producer Surplus = $300

D. Price = $5, Quantity = 60, Tax revenue = $120, Consumer Surplus = $280, Producer Surplus = $500

You could try solving this properly, setting supply equal to demand, adjusting for the tax, finding the equilibrium, and calculating the surplus, but don’t bother. If I were tutoring a student in preparing for this test, I’d tell them not to bother. You can get the right answer in only two steps, because of the multiple-choice format.

Step 1: Does tax revenue equal $2 times quantity? We said the tax was $2 per apple.
So that rules out everything except C and D. Welp, quantity must be 60 then.

Step 2: Is quantity 10 times price as the supply curve says? For C they are, for D they aren’t; guess it must be C then.
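For comparison, here is a minimal sketch of what solving it properly involves. It assumes buyers pay the seller's price plus the $2 tax, and that surplus is measured as the usual triangle areas under linear curves; the function name and dictionary keys are illustrative.

```python
from fractions import Fraction


def solve_with_tax(a, b, c, tax):
    """Equilibrium for demand Q = a - b*P_buyer and supply Q = c*P_seller,
    where P_buyer = P_seller + tax (assumed form of the per-unit tax)."""
    a, b, c, tax = map(Fraction, (a, b, c, tax))
    # Set supply equal to demand: a - b*(P_seller + tax) = c*P_seller.
    seller_price = (a - b * tax) / (b + c)
    buyer_price = seller_price + tax
    quantity = c * seller_price
    return {
        "seller_price": seller_price,
        "buyer_price": buyer_price,
        "quantity": quantity,
        "tax_revenue": tax * quantity,
        # Triangle between the demand curve's intercept (a/b) and the buyer's price.
        "consumer_surplus": (a / b - buyer_price) * quantity / 2,
        # Triangle between the seller's price and a supply curve through the origin.
        "producer_surplus": seller_price * quantity / 2,
    }


for name, value in solve_with_tax(100, 5, 10, 2).items():
    print(name, value)
```

Under these assumptions the seller's price is $6 (buyers pay $8), quantity is 60, tax revenue is $120 and consumer surplus is $360, all matching choice C. The triangle formula gives a producer surplus of $180, so the $300 listed in the choices does not follow from these equations, presumably because the details of the question were changed.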

Now, to do that, you need to have at least a basic understanding of the economics underlying the question (How is tax revenue calculated? What does the supply curve equation mean?). But there’s an even easier technique you can use that doesn’t even require that; it’s called Answer Splicing.

Here’s how it works: You look for repeated values in the answer choices, and you choose the one that has the most repeated values. Prices $5 and $6 are repeated equally, so that’s not helpful (maybe the test designer planned at least that far). Quantity 60 is repeated, other quantities aren’t, so it’s probably that. Likewise with tax revenue $120. Consumer surplus $360 and Producer Surplus $300 are both repeated, so those are probably it. Oh, look, we’ve selected a unique answer choice C, the correct answer!

You could have done answer splicing even if the question were about 18th century German philosophy, or even if the question were written in Arabic or Japanese. In fact you could even do it if it were written in a cipher, as long as the cipher was a consistent substitution cipher.

Could the question have been designed to better avoid answer splicing? Probably. But this is actually quite difficult to do, because there is a fundamental tradeoff between two types of “distractors” (as they are known in the test design industry). You want the answer choices to contain correct pieces and resemble the true answer, so that students who basically understand the question but make a mistake in the process still get it wrong. But you also want the answer choices to be distinct enough in a random enough pattern that answer splicing is unreliable. These two goals are inherently contradictory, and the result will always be a compromise between them. Professional test-designers usually lean pretty heavily against answer-splicing, which I think is probably optimal so far as it goes; but I’ve seen many a professor err too far on the side of similar choices and end up making answer splicing quite effective.

But of course, all of this could be completely avoided if I had just presented the question as an open-ended free-response. Then you’d actually have to write down the equations, show me some algebra solving them, and then interpret your results in a coherent way to answer the question I asked. What’s more, if you make a minor mistake somewhere (carried a minus sign over wrong, forgot to divide by 2 when calculating the area of the consumer surplus triangle), I can take off a few points for that error, rather than all the points just because you didn’t get the right answer. At the other extreme, if you just randomly guess, your odds of getting the right answer are minuscule, but even if you did—or copied from someone else—if you don’t show me the algebra you won’t get credit.

So the free-response question is telling me a lot more about what the student actually knows, in a much more reliable way, that is much harder to cheat or strategize against.

Moreover, this isn’t a matter of opinion. This is a theorem of information theory.

The information that is carried over a message channel can be quantitatively measured as its Shannon entropy. It is usually measured in bits, which you may already be familiar with as a unit of data storage and transmission rate in computers—and yes, those are all fundamentally the same thing. A proper formal treatment of information theory would be way too complicated for this blog, but the basic concepts are fairly straightforward: think in terms of how long a sequence of 1s and 0s it would take to convey the message. That is, roughly speaking, the Shannon entropy of that message.

How many bits are conveyed by a multiple-choice response with four choices? 2. Always. At maximum. No exceptions. It is fundamentally, provably, mathematically impossible to convey more than 2 bits of information via a channel that only has 4 possible states. Any multiple-choice response—any multiple-choice response—of four choices can be reduced to the sequence 00, 01, 10, 11.

True-false questions are a bit worse—literally, they convey 1 bit instead of 2. It’s possible to fully encode the entire response to a true-false question as simply 0 or 1.
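
The bound here is just the base-2 logarithm of the number of possible responses. A tiny Python sketch, with `max_bits` as a made-up helper name, shows the arithmetic. The maximum is only reached when every option is equally likely, so a real question usually conveys less.

```python
import math


def max_bits(n_options):
    """Upper bound, in bits, on what one pick among n_options can convey."""
    return math.log2(n_options)


print(max_bits(4))  # 2.0  (four-choice question)
print(max_bits(2))  # 1.0  (true-false question)
```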

For comparison, how many bits can I get from the free-response question? Well, in principle the answer to any mathematical question has the cardinality of the real numbers, which is infinite (in some sense beyond infinite, in fact—more infinite than mere “ordinary” infinity); but in reality you can only write down a small number of possible symbols on a page. I can’t actually write down the infinite diversity of numbers between 3.14159 and the true value of pi; in 10 digits or less, I can only (“only”) write down a few billion of them. So let’s suppose that handwritten text has about the same information density as typing, which in ASCII or Unicode has 8 bits—one byte—per character. If the response to this free-response question is 300 characters (note that this paragraph itself is over 800 characters), then the total number of bits conveyed is about 2400.
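
The same back-of-envelope figure in Python, as a sketch under the assumption stated above of 8 bits per character (`free_response_bits` is a hypothetical helper, and this is a raw upper bound that ignores redundancy):

```python
BITS_PER_CHARACTER = 8  # the assumption above: one byte per character


def free_response_bits(n_characters, bits_per_character=BITS_PER_CHARACTER):
    """Raw upper bound on the bits carried by a written response."""
    return n_characters * bits_per_character


print(free_response_bits(300))  # 2400
```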

That is to say, one free-response question conveys twelve hundred times as much information as a multiple-choice question. Of course, a lot of that information is redundant; there are many possible correct ways to write the answer to a problem (if the answer is 1.5 you could say 3/2 or 6/4 or 1.500, etc.), and many problems have multiple valid approaches to them, and it’s often safe to skip certain steps of algebra when they are very basic, and so on. But it’s really not at all unrealistic to say that I am getting between 10 and 100 times as much useful information about a student from reading one free response than I would from one multiple-choice question.

Indeed, it’s actually a bigger difference than it appears, because when evaluating a student’s performance I’m not actually interested in the information density of the message itself; I’m interested in the product of that information density and its correlation with the true latent variable I’m trying to measure, namely the student’s actual understanding of the content. (A sequence of 500 random symbols would have a very high information density, but would be quite useless in evaluating a student!) Free-response questions aren’t just more information, they are also better information, because they are closer to the real-world problems we are training for, harder to cheat, harder to strategize, nearly impossible to guess, and provide detailed feedback about exactly what the student is struggling with (for instance, maybe they could solve the equilibrium just fine, but got hung up on calculating the consumer surplus).

As I alluded to earlier, free-response questions would also remove most of the danger of students seeing your tests beforehand. If they saw it beforehand, learned how to solve it, memorized the steps, and then were able to carry them out on the test… well, that’s actually pretty close to what you were trying to teach them. It would be better for them to learn a whole class of related problems and then be able to solve any problem from that broader class—but the first step in learning to solve a whole class of problems is in fact learning to solve one problem from that class. Just change a few details each year so that the questions aren’t identical, and you will find that any student who tried to “cheat” by seeing last year’s exam would inadvertently be studying properly for this year’s exam. And then perhaps we could stop making students literally sign nondisclosure agreements when they take college entrance exams. Listen to this Orwellian line from the SAT nondisclosure agreement:

Misconduct includes, but is not limited to:

Taking any test questions or essay topics from the testing room, including through memorization, giving them to anyone else, or discussing them with anyone else through any means, including, but not limited to, email, text messages or the Internet

Including through memorization. You are not allowed to memorize SAT questions, because God forbid you actually learn something when we are here to make money off evaluating you.

Multiple-choice tests fail in another way as well; by definition they cannot possibly test generation or recall of knowledge, they can only test recognition. You don’t need to come up with an answer; you know for a fact that the correct answer must be in front of you, and all you need to do is recognize it. Recall and recognition are fundamentally different memory processes, and recall is both more difficult and more important.

Indeed, the real mystery here is why we use multiple-choice exams at all.

There are a few types of very basic questions where multiple-choice is forgivable, because there just aren’t that many possible valid answers. If I ask whether demand for apples has increased, you can pretty much say “it increased”, “it decreased”, “it stayed the same”, or “it’s impossible to determine”. So a multiple-choice format isn’t losing too much in such a case. But most really interesting and meaningful questions aren’t going to work in this format.

I don’t think it’s even particularly controversial among educators that multiple-choice questions are awful. (Though I do recall an “educational training” seminar a few weeks back that was basically an apologia for multiple choice, claiming that it is totally possible to test “higher-order cognitive skills” using multiple-choice, for reals, believe me.) So why do we still keep using them?

Well, the obvious reason is grading time. The one thing multiple-choice does have over a true free response is that it can be graded efficiently and reliably by machines, which really does make a big difference when you have 300 students in a class. But there are a couple reasons why even this isn’t a sufficient argument.

First of all, why do we have classes that big? It’s absurd. At that point you should just email the students video lectures. You’ve already foreclosed any possibility of genuine student-teacher interaction, so why are you bothering with having an actual teacher? It seems to me that universities have tried to work out what is the absolute maximum rent they can extract by structuring a class so that it is just good enough that students won’t revolt against the tuition, but they can still spend as little as possible by hiring only one adjunct or lecturer when they should have been paying 10 professors.

And don’t tell me they can’t afford to spend more on faculty—first of all, supporting faculty is why you exist. If you can’t afford to spend enough providing the primary service that you exist as an institution to provide, then you don’t deserve to exist as an institution. Moreover, they clearly can afford it—they simply prefer to spend on hiring more and more administrators and raising the pay of athletic coaches. PhD comics visualized it quite well; the average pay for administrators is three times that of even tenured faculty, and athletic coaches make ten times as much as faculty. (And here I think the mean is the relevant figure, as the mean income is what can be redistributed. Firing one administrator making $300,000 does actually free up enough to hire three faculty making $100,000 or ten grad students making $30,000.)

But even supposing that the institutional incentives here are just too strong, and we will continue to have ludicrously-huge lecture classes into the foreseeable future, there are still alternatives to multiple-choice testing.

Ironically, the College Board appears to have stumbled upon one themselves! About half the SAT math exam is organized into a format where instead of bubbling in one circle to give your 2 bits of answer, you bubble in numbers and symbols corresponding to a more complicated mathematical answer, such as entering “3/4” as “0”, “3”, “/”, “4” or “1.28” as “1”, “.”, “2”, “8”. This could easily be generalized to things like “e^2” as “e”, “^”, “2” and “sin(3pi/2)” as “sin”, “3”, “pi”, “/”, “2”. There are 12 possible symbols currently allowed by the SAT, and each response is up to 4 characters, so we have already increased our possible responses from 4 to over 20,000—which is to say from 2 bits to 14. If we generalize it to include symbols like “pi” and “e” and “sin”, and allow a few more characters per response, we could easily get it over 20 bits—10 times as much information as a multiple-choice question.
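
Here is the counting behind those numbers as a short Python sketch. It treats a response as exactly 4 positions with 12 symbols each, which is the simplification that gives the "over 20,000" figure. The extended grid of 15 symbols and 6 positions is purely a hypothetical of mine to show how quickly the count grows, not an actual SAT format, and `grid_bits` is an illustrative name.

```python
import math


def grid_bits(n_symbols, n_positions):
    """Return (number of possible fixed-length responses, bits they can carry)."""
    n_responses = n_symbols ** n_positions
    return n_responses, math.log2(n_responses)


# Grid as described above: 12 symbols, 4 positions.
responses, bits = grid_bits(12, 4)
print(responses, round(bits, 1))  # 20736 14.3

# Hypothetical extension (an assumption): 15 symbols, 6 positions.
ext_responses, ext_bits = grid_bits(15, 6)
print(ext_responses, round(ext_bits, 1))  # 11390625 23.4
```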

But we can do better still! Even if we insist upon automation, high-end text-recognition software (of the sort any university could surely afford) is now getting to the point where it could realistically recognize a properly-formatted algebraic formula, so you’d at least know if the student remembered the formula correctly. Sentences could be transcribed into typed text, checked for grammar, and sorted for keywords—which is not nearly as good as a proper reading by an expert professor, but is still orders of magnitude better than filling circle “C”. Eventually AI will make even more detailed grading possible, though at that point we may have AIs just taking over the whole process of teaching. (Leaving professors entirely for research, presumably. Not sure if this would be good or bad.)

Automation isn’t the only answer either. You could hire more graders and teaching assistants—say one for every 30 or 40 students instead of one for every 100 students. (And then the TAs might actually be able to get to know their students! What a concept!) You could give fewer tests, or shorter ones—because a small, reliable sample is actually better than a large, unreliable one. A bonus there would be reducing students’ feelings of test anxiety. You could give project-based assignments, which would still take a long time to grade, but would also be a lot more interesting and fulfilling for both the students and the graders.

Or, and perhaps this is the most radical answer of all: You could stop worrying so much about evaluating student performance.

I get it, you want to know whether students are doing well, both so that you can improve your teaching and so that you can rank the students and decide who deserves various awards and merits. But do you really need to be constantly evaluating everything that students do? Did it ever occur to you that perhaps that is why so many students suffer from anxiety—because they are literally being formally evaluated with long-term consequences every single day they go to school?

If we eased up on all this evaluation, I think the fear is that students would just detach entirely; all teachers know students who only seem to show up in class because they’re being graded on attendance. But there are a couple of reasons to think that maybe this fear isn’t so well-founded after all.

If you give up on constant evaluation, you can open up opportunities to make your classes a lot more creative and interesting—and even fun. You can make students want to come to class, because they get to engage in creative exploration and collaboration instead of memorizing what you drone on at them for hours on end. Most of the reason we don’t do creative, exploratory activities is simply that we don’t know how to evaluate them reliably—so what if we just stopped worrying about that?

Moreover, are those students who only show up for the grade really getting anything out of it anyway? Maybe it would be better if they didn’t show up—indeed, if they just dropped out of college entirely and did something else with their lives until they get their heads on straight. Maybe all this effort that we are currently expending trying to force students to learn who clearly don’t appreciate the value of learning could instead be spent enriching the students who do appreciate learning and came here to do as much of it as possible. Because, ultimately, you can lead a student to algebra, but you can’t make them think. (Let me be clear, I do not mean students with less innate ability or prior preparation; I mean students who aren’t interested in learning and are only showing up because they feel compelled to. I admire students with less innate ability who nonetheless succeed because they work their butts off, and wish I were quite so motivated myself.)

There’s a downside to that, of course. Compulsory education does actually seem to have significant benefits in making people into better citizens. Maybe if we let those students just leave college, they’d never come back, and they would squander their potential. Maybe we need to force them to show up until something clicks in their brains and they finally realize why we’re doing it. In fact, we’re really not forcing them; they could drop out in most cases and simply don’t, probably because their parents are forcing them. Maybe the signaling problem is too fundamental, and the only way we can get unmotivated students to accept not getting prestigious degrees is by going through this whole process of forcing them to show up for years and evaluating everything they do until we can formally justify ultimately failing them. (Of course, almost by construction, a student who does the absolute bare minimum to pass will pass.) But college admission is competitive, and I can’t shake this feeling there are thousands of students out there who got rejected from the school they most wanted to go to, the school they were really passionate about and willing to commit their lives to, because some other student got in ahead of them—and that other student is now sitting in the back of the room playing with an iPhone, grumbling about having to show up for class every day. What about that squandered potential? Perhaps competitive admission and compulsory attendance just don’t mix, and we should stop compelling students once they get their high school diploma.