How do we get rid of gerrymandering?

Nov 18 JDN 2458441

I don’t mean in a technical sense; there is a large literature in political science on better voting mechanisms, and this is basically a solved problem. Proportional representation, algorithmic redistricting, or (my personal favorite) reweighted range voting would eradicate gerrymandering forever.

No, I mean strategically and politically—how do we actually make this happen?

Let’s set aside the Senate. (No, really. Set it aside. Get rid of it. “Take my wife… please.”) The Senate should not exist. It is fundamentally anathema to the most basic principle of democracy, “one person, one vote”; and even its most ardent supporters at the time admitted it had absolutely no principled justification for existing. Smaller states are wildly overrepresented (Wyoming, 580,000 people, gets the same number of Senators as California, 39 million), and non-states are not represented (DC has more people than Wyoming, and Puerto Rico has more people than Iowa). The “Senate popular vote” thus doesn’t really make sense as a concept. But this is not “gerrymandering”, as there is no redistricting process that can be used strategically to tilt voting results in favor of one party or another.

It is in the House of Representatives that gerrymandering is a problem.
North Carolina is a particularly extreme example. Republicans won 50.3% of the popular vote in this year’s House election; North Carolina has 13 seats; so, any reasonable person would think that the Republicans should get 7 of the 13 seats. Under algorithmic redistricting, they would have received 8 of 13 seats. Under proportional representation, they would have received, you guessed it, exactly 7. And under reweighted range voting? Well, that depends on how much people like each party. Assuming that Democrats and Republicans are about equally strong in their preferences, we would also expect the Republicans to win about 7. They in fact received 10 of 13 seats.
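For concreteness, here is a minimal sketch of how proportional representation turns vote shares into seats, using the largest-remainder method (my choice for illustration; the post doesn't specify which allocation rule FiveThirtyEight modeled, and the Democratic vote share below is approximate):

```python
# Largest-remainder seat allocation: each party first gets the integer
# part of its quota (share * seats), then leftover seats go to the
# parties with the largest fractional remainders.
def largest_remainder(vote_shares, seats):
    quotas = [share * seats for share in vote_shares]
    allocation = [int(q) for q in quotas]
    leftover = seats - sum(allocation)
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - allocation[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation

# North Carolina 2018: 13 seats, Republicans ~50.3% of the two-party vote,
# Democrats ~48.3% (approximate).
print(largest_remainder([0.503, 0.483], 13))  # [7, 6]
```

With these shares the Republicans get 7 of 13 seats, matching the figure in the text.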

Indeed, as FiveThirtyEight found, this is almost the best the Republicans could possibly have done, if they had applied the optimal gerrymandering configuration. There are a couple of districts on the real map that occasionally swing which wouldn’t under the truly optimal gerrymandering; but none of these would flip Democrat more than 20% of the time.

Most states are not as gerrymandered as North Carolina. But there is a pattern you’ll notice among the highly-gerrymandered states.

Alabama is close to optimally gerrymandered for Republicans.

Arkansas is close to optimally gerrymandered for Republicans.

Idaho is close to optimally gerrymandered for Republicans.

Mississippi is close to optimally gerrymandered for Republicans.

As discussed, North Carolina is close to optimally gerrymandered for Republicans.

South Carolina is close to optimally gerrymandered for Republicans.

Texas is close to optimally gerrymandered for Republicans.

Wisconsin is close to optimally gerrymandered for Republicans.

Tennessee is close to optimally gerrymandered for Democrats.

Arizona is close to algorithmic redistricting.

California is close to algorithmic redistricting.

Connecticut is close to algorithmic redistricting.

Michigan is close to algorithmic redistricting.

Missouri is close to algorithmic redistricting.

Ohio is close to algorithmic redistricting.

Oregon is close to algorithmic redistricting.

Illinois is close to algorithmic redistricting, with some bias toward Democrats.

Kentucky is close to algorithmic redistricting, with some bias toward Democrats.

Louisiana is close to algorithmic redistricting, with some bias toward Democrats.

Maryland is close to algorithmic redistricting, with some bias toward Democrats.

Minnesota is close to algorithmic redistricting, with some bias toward Republicans.

New Jersey is close to algorithmic redistricting, with some bias toward Republicans.

Pennsylvania is close to algorithmic redistricting, with some bias toward Republicans.

Colorado is close to proportional representation.

Florida is close to proportional representation.

Iowa is close to proportional representation.

Maine is close to proportional representation.

Nebraska is close to proportional representation.

Nevada is close to proportional representation.

New Hampshire is close to proportional representation.

New Mexico is close to proportional representation.

Washington is close to proportional representation.

Georgia is somewhere between proportional representation and algorithmic redistricting.

Indiana is somewhere between proportional representation and algorithmic redistricting.

New York is somewhere between proportional representation and algorithmic redistricting.

Virginia is somewhere between proportional representation and algorithmic redistricting.

Hawaii is so overwhelmingly Democrat it’s impossible to gerrymander.

Rhode Island is so overwhelmingly Democrat it’s impossible to gerrymander.

Kansas is so overwhelmingly Republican it’s impossible to gerrymander.

Oklahoma is so overwhelmingly Republican it’s impossible to gerrymander.

Utah is so overwhelmingly Republican it’s impossible to gerrymander.

West Virginia is so overwhelmingly Republican it’s impossible to gerrymander.

You may have noticed the pattern. Most states are either close to algorithmic redistricting (14), close to proportional representation (9), or somewhere in between those (4). Of these, 4 are slightly biased toward Democrats and 3 are slightly biased toward Republicans.

6 states are so partisan that gerrymandering isn’t really possible there.

8 states are missing from the FiveThirtyEight analysis; most of them have only a single at-large House seat, so there are no districts to gerrymander, and I think they couldn't get good data on the rest.

Of the remaining 9 states, 1 is strongly gerrymandered toward Democrats (gaining a whopping 1 seat, by the way), and 8 are strongly gerrymandered toward Republicans.

If we look at the nation as a whole, switching from the current system to proportional representation would increase the number of Democrat seats from 168 to 174 (+6), decrease the number of Republican seats from 195 to 179 (-16), and increase the number of competitive seats from 72 to 82 (+10).

Going to algorithmic redistricting instead would reduce the number of Democrat seats from 168 to 151 (-17), decrease the number of Republican seats from 195 to 180 (-15), and increase the number of competitive seats from 72 to a whopping 104 (+32).

Proportional representation minimizes wasted votes and best represents public opinion (with the possible exception of reweighted range voting, which we can’t really forecast because it uses more expressive information than what polls currently provide). It is thus to be preferred. Relative to the current system, proportional representation would decrease the representation of Republicans relative to Democrats by 22 seats (+6 Democratic, -16 Republican)—over 5% of the entire House.

Thus, let us not speak of gerrymandering as a “both sides” sort of problem. There is a very clear pattern here: Gerrymandering systematically favors Republicans.

Yet this does not answer the question I posed: How do we actually fix this?

The answer is going to sound a bit paradoxical: We must motivate voters to vote more so that voters will be better represented.

I have an acquaintance who has complained about this apparently paradoxical assertion: How can we vote to make our votes matter? (He advocates using violence instead.)

But the key thing to understand here is that it isn’t that our votes don’t matter at all—it is merely that they don’t matter enough.

If we were living in an authoritarian regime with sham elections (as some far-left people I’ve spoken to actually seem to believe), then indeed voting would be pointless. You couldn’t vote out Saddam Hussein or Benito Mussolini, even though they both did hold “elections” to make you think you had some voice. At that point, yes, obviously the only remaining choices are revolution or foreign invasion. (It does seem worth noting that both regimes fell by the latter, not the former.)

The US has not fallen that far just yet.

Votes in the US do not count evenly—but they do still count.

We have to work harder than our opponents for the same level of success, but we can still succeed.

Our legs may be shackled to weights, but they are not yet chained to posts in the ground.

Indeed, several states in this very election passed referenda to create independent redistricting commissions, and Democrats have gained at least 32 seats in the House—“at least” because some states are still counting mail-in ballots or undergoing recounts.

The one that has me on the edge of my seat is right here in Orange County, where several outlets (including the New York Times) have made preliminary projections in favor of Mimi Walters (R), but Nate Silver is forecasting a higher probability for Katie Porter (D). It says “100% of precincts reporting”, but there are still as many ballots uncounted as counted, because California now has almost twice as many voters who vote by mail as vote in person.

Unfortunately, some of the states that are most highly gerrymandered don’t allow citizen-sponsored ballot initiatives (North Carolina, for instance). This is likely no coincidence. But this still doesn’t make us powerless. If your state is highly gerrymandered, make noise about it. Join or even organize protests. Write letters to legislators. Post on social media. Create memes.

Even most Republican voters don’t approve of gerrymandering; they want to win fair and square. Even if you can’t get them to vote for the candidates you want, reach out to them to get them to complain to their legislators about the injustice of the gerrymandering itself. Appeal to their patriotic values; election manipulation is clearly not what America stands for.

If your state is not highly gerrymandered, think bigger. We should be pushing for a Constitutional amendment implementing either proportional representation or algorithmic redistricting. The majority of states already have reasonably fair districts; if we can get 2/3 of the House and 2/3 of the Senate to agree on such an amendment, we don’t need to win North Carolina or Mississippi.

The sausage of statistics being made


Nov 11 JDN 2458434

“Laws, like sausages, cease to inspire respect in proportion as we know how they are made.”

~ John Godfrey Saxe, not Otto von Bismarck

Statistics are a bit like laws and sausages. There are a lot of things in statistical practice that don’t align with statistical theory. The most obvious example is that many results in statistics are asymptotic: they only strictly apply for infinitely large samples, and in any finite sample they will be some sort of approximation (we often don’t even know how good an approximation).

But the problem runs deeper than this: The whole idea of a p-value was that it would be used to assess one single hypothesis, the only one you test in your entire study.

That’s frankly a ludicrous expectation: Why would you write a whole paper just to test one parameter?

This is why I don’t actually think this so-called multiple comparisons problem is a problem with researchers doing too many hypothesis tests; I think it’s a problem with statisticians being fundamentally unreasonable about what statistics is useful for. We have to do multiple comparisons, so you should be telling us how to do it correctly.
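To make the stakes concrete, here is a toy simulation of my own (standard library only, all numbers invented): run batches of hypothesis tests on pure noise, and "significant" results appear at roughly the advertised 5% rate even though no real effects exist anywhere.

```python
# Run batches of t-tests on pure noise (true effect = 0) and count how
# many clear |t| > 2.045, the two-sided 5% critical value for df = 29.
import random
import statistics

random.seed(42)

def spurious_hits(num_tests=20, n=30, t_crit=2.045):
    hits = 0
    for _ in range(num_tests):
        sample = [random.gauss(0, 1) for _ in range(n)]
        t = statistics.mean(sample) / (statistics.stdev(sample) / n ** 0.5)
        if abs(t) > t_crit:
            hits += 1
    return hits

# Averaged over many batches of 20 tests, roughly 1 false positive per batch.
trials = [spurious_hits() for _ in range(200)]
print(statistics.mean(trials))
```

A paper that tests twenty parameters should therefore expect about one spurious "finding" by construction, which is exactly why we need formal methods for handling multiple comparisons rather than a prohibition on making them.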

Statisticians have this beautiful pure mathematics that generates all these lovely asymptotic results… and then they stop, as if they were done. But we aren’t dealing with infinite or even “sufficiently large” samples; we need to know what happens when your sample is 100, not when your sample is 10^29. We can’t assume that our variables are independently identically distributed; we don’t know their distribution, and we’re pretty sure they’re going to be somewhat dependent.

Even in an experimental context where we can randomly and independently assign some treatments, we can’t do that with lots of variables that are likely to matter, like age, gender, nationality, or field of study. And applied econometricians are in an even tighter bind; they often can’t randomize anything. They have to rely upon “instrumental variables” that they hope are “close enough to randomized” relative to whatever they want to study.

In practice what we tend to do is… fudge it. We use the formal statistical methods, and then we step back and apply a series of informal norms to see if the result actually makes sense to us. This is why almost no psychologists were actually convinced by Daryl Bem’s precognition experiments, despite his standard experimental methodology and perfect p < 0.05 results; he couldn’t pass any of the informal tests, particularly the most basic one of not violating any known fundamental laws of physics. We knew he had somehow cherry-picked the data, even before looking at it; nothing else was possible.

This is actually part of where the “hierarchy of sciences” notion is useful: One of the norms is that you’re not allowed to break the rules of the sciences above you, but you can break the rules of the sciences below you. So psychology has to obey physics, but physics doesn’t have to obey psychology. I think this is also part of why there’s so much enmity between economists and anthropologists; really we should be on the same level, cognizant of each other’s rules, but economists want to be above anthropologists so we can ignore culture, and anthropologists want to be above economists so they can ignore incentives.

Another informal norm is the “robustness check”, in which the researcher runs a dozen different regressions approaching the same basic question from different angles. “What if we control for this? What if we interact those two variables? What if we use a different instrument?” In terms of statistical theory, this doesn’t actually make a lot of sense; the probability distributions f(y|x) of y conditional on x and f(y|x, z) of y conditional on x and z are not the same thing, and wouldn’t in general be closely tied, depending on the distribution f(x|z) of x conditional on z. But in practice, most real-world phenomena are going to continue to show up even as you run a bunch of different regressions, and so we can be more confident that something is a real phenomenon insofar as that happens. If an effect drops out when you switch out a couple of control variables, it may have been a statistical artifact. But if it keeps appearing no matter what you do to try to make it go away, then it’s probably a real thing.
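As a toy illustration of both points (the data-generating process, coefficients, and seed below are all made up): adding a control z genuinely changes the coefficient on x, because f(y|x) and f(y|x, z) are different objects, yet a real underlying effect keeps showing up either way.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)            # x is correlated with the control
y = 2.0 * x + 1.0 * z + rng.normal(size=n)  # true effect of x on y is 2.0

def coef_on_x(controls):
    """OLS coefficient on x, with an intercept and the given controls."""
    X = np.column_stack([np.ones(n), x] + controls)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(coef_on_x([]))   # without z: shifted upward (about 2.4 in population)
print(coef_on_x([z]))  # with z: close to the true 2.0
```

The coefficient moves when the control set changes, but the effect never drops out or flips sign; that persistence is what the robustness check is actually probing.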

Because of the powerful career incentives toward publication and the strange obsession among journals with a p-value less than 0.05, another norm has emerged: Don’t actually trust p-values that are close to 0.05. The vast majority of the time, a p-value of 0.047 was the result of publication bias. Now if you see a p-value of 0.001, maybe then you can trust it—but you’re still relying on a lot of assumptions even then. I’ve seen some researchers argue that because of this, we should tighten our standards for publication to something like p < 0.01, but that’s missing the point; what we need to do is stop publishing based on p-values. If you tighten the threshold, you’re just going to get more rejected papers and then the few papers that do get published will now have even smaller p-values that are still utterly meaningless.

These informal norms protect us from the worst outcomes of bad research. But they are almost certainly not optimal. It’s all very vague and informal, and different researchers will often disagree vehemently over whether a given interpretation is valid. What we need are formal methods for solving these problems, so that we can have the objectivity and replicability that formal methods provide. Right now, our existing formal tools simply are not up to that task.

There are some things we may never be able to formalize: If we had a formal algorithm for coming up with good ideas, the AIs would already rule the world, and this would be either Terminator or The Culture depending on whether we designed the AIs correctly. But I think we should at least be able to formalize the basic question of “Is this statement likely to be true?” that is the fundamental motivation behind statistical hypothesis testing.

I think the answer is likely to be in a broad sense Bayesian, but Bayesians still have a lot of work left to do in order to give us really flexible, reliable statistical methods we can actually apply to the messy world of real data. In particular, tell us how to choose priors please! Prior selection is a fundamental make-or-break problem in Bayesian inference that has nonetheless been greatly neglected by most Bayesian statisticians. So, what do we do? We fall back on informal norms: Try maximum likelihood, which is like using a very flat prior. Try a normally-distributed prior. See if you can construct a prior from past data. If all those give the same thing, that’s a “robustness check” (see previous informal norm).
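Here is that informal norm in miniature, using a conjugate Beta-Bernoulli model (data and priors invented for illustration): if the flat prior, the weak symmetric prior, and the past-data prior all land on nearly the same posterior mean, that is the Bayesian version of a robustness check.

```python
# Beta(a, b) prior + (heads, tails) Bernoulli data
# -> Beta(a + heads, b + tails) posterior, by conjugacy.
def posterior_mean(heads, tails, prior_a, prior_b):
    return (prior_a + heads) / (prior_a + heads + prior_b + tails)

heads, tails = 70, 30
priors = {
    "flat (roughly maximum likelihood)": (1, 1),
    "weak symmetric": (2, 2),
    "constructed from past data": (7, 3),
}
for name, (a, b) in priors.items():
    print(f"{name}: {posterior_mean(heads, tails, a, b):.3f}")
# All three land near 0.70, so the conclusion doesn't hinge on the prior.
```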

Informal norms are also inherently harder to teach and learn. I’ve seen a lot of other grad students flail wildly at statistics, not because they don’t know what a p-value means (though maybe that’s also sometimes true), but because they don’t really quite grok the informal underpinnings of good statistical inference. This can be very hard to explain to someone: They feel like they followed all the rules correctly, but you are saying their results are wrong, and now you can’t explain why.

In fact, some of the informal norms that are in wide use are clearly detrimental. In economics, norms have emerged that certain types of models are better simply because they are “more standard”, such as the dynamic stochastic general equilibrium models that can basically be fit to everything and have never actually usefully predicted anything. In fact, the best ones just predict what we already knew from Keynesian models. But without a formal norm for testing the validity of models, it’s been “DSGE or GTFO”. At present, it is considered “nonstandard” (read: “bad”) not to assume that your agents are either a single unitary “representative agent” or a continuum of infinitely-many agents—modeling the actual fact of finitely-many agents is just not done. Yet it’s hard for me to imagine any formal criterion that wouldn’t at least give you some points for correctly including the fact that there is more than one but less than infinity people in the world (obviously your model could still be bad in other ways).

I don’t know what these new statistical methods would look like. Maybe it’s as simple as formally justifying some of the norms we already use; maybe it’s as complicated as taking a fundamentally new approach to statistical inference. But we have to start somewhere.

Why are humans so bad with probability?

Apr 29 JDN 2458238

In previous posts on deviations from expected utility and cumulative prospect theory, I’ve detailed some of the myriad ways in which human beings deviate from optimal rational behavior when it comes to probability.

This post is going to be a bit different: Yes, we behave irrationally when it comes to probability. Why?

Why aren’t we optimal expected utility maximizers?
This question is not as simple as it sounds. Some of the ways that human beings deviate from neoclassical behavior are simply because neoclassical theory requires levels of knowledge and intelligence far beyond what human beings are capable of; basically anything requiring “perfect information” qualifies, as does any game theory prediction that involves solving extensive-form games with infinite strategy spaces by backward induction. (Don’t feel bad if you have no idea what that means; that’s kind of my point. Solving infinite extensive-form games by backward induction is an unsolved problem in game theory; just this past week I saw a new paper presented that offered a partial potential solution, and yet we expect people to do it optimally every time?)

I’m also not going to include questions of fundamental uncertainty, like “Will Apple stock rise or fall tomorrow?” or “Will the US go to war with North Korea in the next ten years?” where it isn’t even clear how we would assign a probability. (Though I will get back to them, for reasons that will become clear.)

No, let’s just look at the absolute simplest cases, where the probabilities are all well-defined and completely transparent: Lotteries and casino games. Why are we so bad at that?

Lotteries are not a computationally complex problem. You figure out how much the prize is worth to you, multiply it by the probability of winning—which is clearly spelled out for you—and compare that to how much the ticket price is worth to you. The most challenging part lies in specifying your marginal utility of wealth—the “how much it’s worth to you” part—but that’s something you basically had to do anyway, to make any kind of trade-offs on how to spend your time and money. Maybe you didn’t need to compute it quite so precisely over that particular range of parameters, but you need at least some idea how much $1 versus $10,000 is worth to you in order to get by in a market economy.
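Here is that calculation as a few lines of code, with made-up numbers (a $2 ticket, a $10,000 prize at 1-in-10,000 odds) and log utility standing in for the "how much it's worth to you" part:

```python
import math

def eu_buy(wealth, price, prize, p_win):
    """Expected log-utility of wealth after buying one ticket."""
    # Log utility is an assumption; any concave (risk-averse) utility
    # gives the same verdict here, since the ticket's expected value
    # is negative ($1 expected prize for a $2 price).
    u = math.log
    return p_win * u(wealth - price + prize) + (1 - p_win) * u(wealth - price)

wealth = 20_000
eu_skip = math.log(wealth)  # utility of just keeping the $2
eu_ticket = eu_buy(wealth, price=2, prize=10_000, p_win=1e-4)
print(eu_ticket < eu_skip)  # True: the ticket loses in expected utility
```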

Casino games are a bit more complicated, but not much, and most of the work has been done for you; you can look on the Internet and find tables of probability calculations for poker, blackjack, roulette, craps and more. Memorizing all those probabilities might take some doing, but human memory is astonishingly capacious, and part of being an expert card player, especially in blackjack, seems to involve memorizing a lot of those probabilities.

Furthermore, by any plausible expected utility calculation, lotteries and casino games are a bad deal. Unless you’re an expert poker player or blackjack card-counter, your expected income from playing at a casino is always negative—and the casino set it up that way on purpose.

Why, then, can lotteries and casinos stay in business? Why are we so bad at such a simple problem?

Clearly we are using some sort of heuristic judgment in order to save computing power, and the people who make lotteries and casinos have designed formal models that can exploit those heuristics to pump money from us. (Shame on them, really; I don’t fully understand why this sort of thing is legal.)

In another previous post I proposed what I call “categorical prospect theory”, which I think is a decently accurate description of the heuristics people use when assessing probability (though I’ve not yet had the chance to test it experimentally).

But why use this particular heuristic? Indeed, why use a heuristic at all for such a simple problem?

I think it’s helpful to keep in mind that these simple problems are weird; they are absolutely not the sort of thing a tribe of hunter-gatherers is likely to encounter on the savannah. It doesn’t make sense for our brains to be optimized to solve poker or roulette.

The sort of problems that our ancestors encountered—indeed, the sort of problems that we encounter, most of the time—were not problems of risk with calculable probabilities; they were problems of fundamental uncertainty. And they were frequently matters of life or death (which is why we’d expect them to be highly evolutionarily optimized): “Was that sound a lion, or just the wind?” “Is this mushroom safe to eat?” “Is that meat spoiled?”

In fact, many of the uncertainties most important to our ancestors are still important today: “Will these new strangers be friendly, or dangerous?” “Is that person attracted to me, or am I just projecting my own feelings?” “Can I trust you to keep your promise?” These sorts of social uncertainties are even deeper; it’s not clear that any finite being could ever totally resolve its uncertainty surrounding the behavior of other beings with the same level of intelligence, as the cognitive arms race continues indefinitely. The better I understand you, the better you understand me—and if you’re trying to deceive me, as I get better at detecting deception, you’ll get better at deceiving.

Personally, I think that it was precisely this sort of feedback loop that resulted in human beings getting such ridiculously huge brains in the first place. Chimpanzees are pretty good at dealing with the natural environment, maybe even better than we are; but even young children can outsmart them in social tasks any day. And once you start evolving for social cognition, it’s very hard to stop; basically you need to be constrained by something very fundamental, like, say, maximum caloric intake or the shape of the birth canal. Where chimpanzees look like their brains were what we call an “interior solution”, where evolution optimized toward a particular balance between cost and benefit, human brains look more like a “corner solution”, where the evolutionary pressure was entirely in one direction until we hit up against a hard constraint. That’s exactly what one would expect to happen if we were caught in a cognitive arms race.

What sort of heuristic makes sense for dealing with fundamental uncertainty—as opposed to precisely calculable probability? Well, you don’t want to compute a utility function and multiply it by a probability, because that adds all sorts of extra computation and you have no idea what probability to assign. But you’ve got to do something like that in some sense, because that really is the optimal way to respond.

So here’s a heuristic you might try: Separate events into some broad categories based on how frequently they seem to occur, and what sort of response would be necessary.

Some things, like the sun rising each morning, seem to always happen. So you should act as if those things are going to happen pretty much always, because they do happen… pretty much always.

Other things, like rain, seem to happen frequently but not always. So you should look for signs that those things might happen, and prepare for them when the signs point in that direction.

Still other things, like being attacked by lions, happen very rarely, but are a really big deal when they do. You can’t go around expecting those to happen all the time, that would be crazy; but you need to be vigilant, and if you see any sign that they might be happening, even if you’re pretty sure they’re not, you may need to respond as if they were actually happening, just in case. The cost of a false positive is much lower than the cost of a false negative.

And still other things, like people sprouting wings and flying, never seem to happen. So you should act as if those things are never going to happen, and you don’t have to worry about them.

This heuristic is quite simple to apply once set up: It can simply slot in memories of when things did and didn’t happen in order to decide which category they go in—i.e. the availability heuristic. If you can remember a lot of examples of “almost never”, maybe you should move it to “unlikely” instead. If you get a really big number of examples, you might even want to move it all the way to “likely”.
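A sketch of what that slotting might look like in code; the category names, thresholds, and counts are all my own illustrative assumptions, not part of any tested model:

```python
def categorize(occurrences, opportunities):
    """Slot an event into a frequency category from remembered examples."""
    freq = occurrences / opportunities
    if freq > 0.95:
        return "always"    # act as if it will happen
    if freq > 0.05:
        return "frequent"  # look for signs and prepare
    if freq > 0.0:
        return "rare"      # stay vigilant; treat any sign as serious
    return "never"         # safe to ignore

print(categorize(365, 365))  # sunrise: "always"
print(categorize(40, 365))   # rain: "frequent"
print(categorize(1, 365))    # lion attack: "rare"
print(categorize(0, 365))    # people sprouting wings: "never"
```

Moving an event from "rare" to "frequent" is then just a matter of accumulating enough remembered examples to push its frequency over a threshold.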

Another large advantage of this heuristic is that by combining utility and probability into one metric—we might call it “importance”, though Bayesian econometricians might complain about that—we can save on memory space and computing power. I don’t need to separately compute a utility and a probability; I just need to figure out how much effort I should put into dealing with this situation. A high probability of a small cost and a low probability of a large cost may be equally worth my time.

How might these heuristics go wrong? Well, if your environment changes sufficiently, the probabilities could shift and what seemed certain no longer is. For most of human history, “people walking on the Moon” would seem about as plausible as sprouting wings and flying away, and yet it has happened. Being attacked by lions is now exceedingly rare except in very specific places, but we still harbor a certain awe and fear before lions. And of course availability heuristic can be greatly distorted by mass media, which makes people feel like terrorist attacks and nuclear meltdowns are common and deaths by car accidents and influenza are rare—when exactly the opposite is true.

How many categories should you set, and what frequencies should they be associated with? This part I’m still struggling with, and it’s an important piece of the puzzle I will need before I can take this theory to experiment. There is probably a trade-off: more categories give you more precision in tailoring your optimal behavior, but cost more cognitive resources to maintain. Is the optimal number 3? 4? 7? 10? I really don’t know. Even if I could specify the number of categories, I’d still need to figure out precisely what categories to assign.

Demystifying dummy variables

Nov 5, JDN 2458062

Continuing my series of blog posts on basic statistical concepts, today I’m going to talk about dummy variables. Dummy variables are quite simple, but for some reason a lot of people—even people with extensive statistical training—often have trouble understanding them. Perhaps people are simply overthinking matters, or making subtle errors that end up having large consequences.

A dummy variable (more formally a binary variable) is a variable that has only two states: “No”, usually represented 0, and “Yes”, usually represented 1. A dummy variable answers a single “Yes or no” question. Dummy variables are most commonly used for categorical variables, answering questions like “Is the person’s race White?” and “Is the state California?”; but in fact almost any kind of data can be represented this way: We could represent income using a series of dummy variables like “Is your income greater than $50,000?” “Is your income greater than $51,000?” and so on. As long as the number of possible outcomes is finite—which, in practice, it always is—the data can be represented by some (possibly large) set of dummy variables. In fact, if your data set is large enough, representing numerical data with dummy variables can be a very good thing to do, as it allows you to account for nonlinear effects without assuming some specific functional form.
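For instance, expanding a categorical variable into dummies takes only a few lines (a standard-library sketch; in practice pandas users would reach for pd.get_dummies with drop_first=True). Note that the baseline category gets no column of its own:

```python
def make_dummies(values, baseline):
    """One 0/1 column per category, leaving out the baseline category."""
    categories = sorted(set(values) - {baseline})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

race = ["White", "Black", "Other", "White", "Black"]
print(make_dummies(race, baseline="Other"))
# {'Black': [0, 1, 0, 0, 1], 'White': [1, 0, 0, 1, 0]}
```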
Most of the misunderstanding regarding dummy variables involves applying them in regressions and interpreting the results.
Probably the most common confusion is about which dummy variables to include. When you have a set of categories represented in your data (e.g. one for each US state), you want to include dummy variables for all but one of them. The most common mistake is to try to include all of them, which yields a regression that doesn’t make sense; another, if you have a catchall category like “Other” (e.g. race is coded as “White/Black/Other”), is to leave out that one and get results with a nonsensical baseline.

You don’t have to leave one out if you only have one set of categories and you don’t include a constant in your regression; then the baseline will emerge automatically from the regression. But this is dangerous, as the interpretation of the coefficients is no longer quite so simple.

The thing to keep in mind is that a coefficient on a dummy variable is an effect of a change—so the coefficient on “White” is the effect of being White. In order to be an effect of a change, that change must be measured against some baseline. The dummy variable you exclude from the regression is the baseline—because the effect of changing to the baseline from the baseline is by definition zero.
Here’s a very simple example where all the regressions can be done by hand. Suppose you have a household with 1 human and 1 cat, and you want to know the effect of species on number of legs. (I mean, hopefully this is something you already know; but that makes it a good illustration.) In what follows, you can safely skip the matrix algebra; but I included it for any readers who want to see how these concepts play out mechanically in the math.
Your outcome variable Y is legs: The human has 2 and the cat has 4. We can write this as a matrix:

\[ Y = \begin{bmatrix} 2 \\ 4 \end{bmatrix} \]

What dummy variables should we choose? There are actually several options.

 

The simplest option is to include both a human variable and a cat variable, and no constant. Let’s put the human variable first. Then our human subject has a value of X1 = [1 0] (“Yes” to human and “No” to cat) and our cat subject has a value of X2 = [0 1].

This is very nice in this case, as it makes our matrix of independent variables simply an identity matrix:

\[ X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \]

This makes the calculations extremely nice, because transposing, multiplying, and inverting an identity matrix all just give us back an identity matrix. The standard OLS coefficient vector is B = (X’X)^(-1) X’Y, which in this case just becomes Y itself.

\[ B = (X’X)^{-1} X’Y = Y = \begin{bmatrix} 2 \\ 4 \end{bmatrix} \]

Our coefficients are 2 and 4. How would we interpret this? Pretty much what you’d think: The effect of being human is having 2 legs, while the effect of being a cat is having 4 legs. This amounts to choosing a baseline of nothing—the effect is compared to a hypothetical entity with no legs at all. And indeed this is what will happen more generally if you do a regression with a dummy for each category and no constant: The baseline will be a hypothetical entity with an outcome of zero on whatever your outcome variable is.
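Here’s a quick check of this regression in Python with NumPy (the variable names are mine; the numbers match the matrices above):

```python
import numpy as np

# The human/cat regression with a dummy for each species and no constant.
Y = np.array([2.0, 4.0])           # legs: human, cat
X = np.array([[1.0, 0.0],          # human: yes-human, no-cat
              [0.0, 1.0]])         # cat:   no-human, yes-cat

# OLS: B = (X'X)^(-1) X'Y. Since X is the identity, B is just Y.
B = np.linalg.solve(X.T @ X, X.T @ Y)
print(B)  # [2. 4.]
```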
So far, so good.

But what if we had additional variables to include? Say we have both cats and humans with black hair and brown hair (and no other colors). If we now include the variables human, cat, black hair, brown hair, we won’t get the results we expect—in fact, we’ll get no result at all. The regression is mathematically impossible, regardless of how large a sample we have.

This is why it’s much safer to choose one of the categories as a baseline, and include that as a constant. We could pick either one; we just need to be clear about which one we chose.

Say we take human as the baseline. Then our variables are constant and cat. The variable constant is just 1 for every single individual. The variable cat is 0 for humans and 1 for cats.

Now our independent variable matrix looks like this:

\[ X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \]

The matrix algebra isn’t quite so nice this time:

\[ X’X = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \]

\[ (X’X)^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \]

\[ X’Y = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 6 \\ 4 \end{bmatrix} \]

\[ B = (X’X)^{-1} X’Y = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 6 \\ 4 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} \]

Our coefficients are now 2 and 2. Now, how do we interpret that result? We took human as the baseline, so what we are saying here is that the default is to have 2 legs, and then the effect of being a cat is to get 2 extra legs.
That sounds a bit anthropocentric—most animals are quadrupeds, after all—so let’s try taking cat as the baseline instead. Now our variables are constant and human, and our independent variable matrix looks like this:

\[ X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \]

\[ X’X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \]

\[ (X’X)^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \]

\[ X’Y = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \end{bmatrix} \]

\[ B = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 6 \\ 2 \end{bmatrix} = \begin{bmatrix} 4 \\ -2 \end{bmatrix} \]

Our coefficients are 4 and -2. This seems much more phylogenetically correct: The default number of legs is 4, and the effect of being human is to lose 2 legs.
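Both baseline choices can be verified the same way; a sketch in NumPy (names are mine):

```python
import numpy as np

Y = np.array([2.0, 4.0])  # legs: human, cat

# Baseline human: columns are constant, cat.
X_human_base = np.array([[1.0, 0.0],
                         [1.0, 1.0]])
# Baseline cat: columns are constant, human.
X_cat_base = np.array([[1.0, 1.0],
                       [1.0, 0.0]])

B_human_base = np.linalg.solve(X_human_base.T @ X_human_base, X_human_base.T @ Y)
B_cat_base = np.linalg.solve(X_cat_base.T @ X_cat_base, X_cat_base.T @ Y)
print(B_human_base)  # [2. 2.]  default 2 legs; being a cat adds 2
print(B_cat_base)    # [ 4. -2.]  default 4 legs; being human removes 2
```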
All these regressions are really saying the same thing: Humans have 2 legs, cats have 4. And in this particular case, it’s simple and obvious. But once things start getting more complicated, people tend to make mistakes even on these very simple questions.

A common mistake would be to try to include a constant and both dummy variables: constant human cat. What happens if we try that? The matrix algebra gets particularly nasty:

\[ X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \]

\[ X’X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \]

Our covariance matrix X’X is now 3×3, first of all. That means we have more coefficients than we have data points. But we could throw in another human and another cat to fix that problem.

 

More importantly, the covariance matrix is not invertible: rows 2 and 3 add up to row 1, so we have a singular matrix.

If you tried to run this regression, you’d get an error message about “perfect multicollinearity”. What this really means is you haven’t chosen a valid baseline. Your baseline isn’t human and it isn’t cat; and since you included a constant, it isn’t a baseline of nothing either. It’s… unspecified.

You actually can choose whatever baseline you want for this regression, by setting the constant term to whatever number you want. Set a constant of 0 and your baseline is nothing: you’ll get back the coefficients 0, 2 and 4. Set a constant of 2 and your baseline is human: you’ll get 2, 0 and 2. Set a constant of 4 and your baseline is cat: you’ll get 4, -2, 0. You can even choose something weird like 3 (you’ll get 3, -1, 1) or 7 (you’ll get 7, -5, -3) or -4 (you’ll get -4, 6, 8). You don’t even have to choose integers; you could pick -0.9 or 3.14159. As long as the constant plus the coefficient on human add to 2 and the constant plus the coefficient on cat add to 4, you’ll get a valid regression.
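Here is what actually happens if you hand the constant-plus-both-dummies design to a solver; this sketch uses NumPy, with two humans and two cats so that sample size itself isn’t the problem:

```python
import numpy as np

# constant, human, cat: the three columns are linearly dependent
# (human + cat = constant), so X'X is singular.
Y = np.array([2.0, 4.0, 2.0, 4.0])
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])

singular = False
try:
    np.linalg.solve(X.T @ X, X.T @ Y)
except np.linalg.LinAlgError:
    # A stats package would report this as "perfect multicollinearity".
    singular = True
print(singular)  # True
```

Note that `np.linalg.lstsq` would instead silently return one particular solution (the minimum-norm one) out of the infinitely many valid ones, which is arguably worse than getting an error.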
Again, this example seems pretty simple. But it’s an easy trap to fall into if you don’t think carefully about what variables you are including. If you are looking at effects on income and you have dummy variables on race, gender, schooling (e.g. no high school, high school diploma, some college, Bachelor’s, master’s, PhD), and what state a person lives in, it would be very tempting to just throw all those variables into a regression and see what comes out. But nothing is going to come out, because you haven’t specified a baseline. Your baseline isn’t even some hypothetical person with $0 income (which already doesn’t sound like a great choice); it’s just not a coherent baseline at all.

Generally the best thing to do (for the most precise estimates) is to choose the most common category in each set as the baseline. So for the US a good choice would be to set the baseline as White, female, high school diploma, California. Another common strategy when looking at discrimination specifically is to make the most privileged category the baseline, so we’d instead have White, male, PhD, and… Maryland, it turns out. Then we expect all our coefficients to be negative: Your income is generally lower if you are not White, not male, have less than a PhD, or live outside Maryland.

This is also important if you are interested in interactions: For example, the effect on your income of being Black in California is probably not the same as the effect of being Black in Mississippi. Then you’ll want to include terms like Black and Mississippi, which for dummy variables is the same thing as taking the Black variable and multiplying by the Mississippi variable.
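In code, that multiplication is all there is to it (a sketch with made-up rows for four hypothetical people):

```python
# Interaction dummies are literally products of the component dummies.
black       = [1, 0, 1, 0]
mississippi = [1, 1, 0, 0]
black_x_ms  = [b * m for b, m in zip(black, mississippi)]
print(black_x_ms)  # [1, 0, 0, 0]  only the first person is both
```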

But now you need to be especially clear about what your baseline is: If being White in California is your baseline, then the coefficient on Black is the effect of being Black in California, while the coefficient on Mississippi is the effect of being in Mississippi if you are White. The coefficient on Black and Mississippi is the effect of being Black in Mississippi, over and above the sum of the effects of being Black and the effect of being in Mississippi. If we saw a positive coefficient there, it wouldn’t mean that it’s good to be Black in Mississippi; it would simply mean that it’s not as bad as we might expect if we just summed the downsides of being Black with the downsides of being in Mississippi. And if we saw a negative coefficient there, it would mean that being Black in Mississippi is even worse than you would expect just from summing up the effects of being Black with the effects of being in Mississippi.

As long as you choose your baseline carefully and stick to it, interpreting regressions with dummy variables isn’t very hard. But so many people forget this step that they get very confused by the end, looking at a term like Black female Mississippi and seeing a positive coefficient, and thinking that must mean that life is good for Black women in Mississippi, when really all it means is the small mercy that being a Black woman in Mississippi isn’t quite as bad as you might think if you just added up the effect of being Black, plus the effect of being a woman, plus the effect of being Black and a woman, plus the effect of living in Mississippi, plus the effect of being Black in Mississippi, plus the effect of being a woman in Mississippi.

 

Statistics you should have been taught in high school, but probably weren’t

Oct 15, JDN 2458042

Today I’m trying something a little different. This post will assume a lot less background knowledge than most of the others. For some of my readers, this post will probably seem too basic, obvious, even boring. For others, it might feel like a breath of fresh air, relief at last from the overly-dense posts I am generally inclined to write out of Curse of Knowledge. Hopefully I can balance these two effects well enough to gain rather than lose readers.

Here are four core statistical concepts that I think all adults should know, necessary for functional literacy in understanding the never-ending stream of news stories about “A new study shows…” and more generally in applying social science to political decisions. In theory these should all be taught as part of a core high school curriculum, but typically they either aren’t taught or aren’t retained once students graduate. (Really, I think we should replace one year of algebra with one semester of statistics and one semester of logic. Most people don’t actually need algebra, but they absolutely do need logic and statistics.)

  1. Mean and median

The mean and the median are quite simple concepts, and you’ve probably at least heard of them before, yet confusion between them has caused a great many misunderstandings.

Part of the problem is the word “average”. Normally, the word “average” applies to the mean—for example, a batting average, or an average speed. But in common usage the word “average” can also mean “typical” or “representative”—an average person, an average family. And in many cases, particularly when it comes to economics, the mean is in no way typical or representative.

The mean of a sample of values is just the sum of all those values, divided by the number of values. The mean of the sample {1,2,3,10,1000} is (1+2+3+10+1000)/5 = 203.2.

The median of a sample of values is the middle one—order the values, choose the one in the exact center. If you have an even number, take the mean of the two values on either side. So the median of the sample {1,2,3,10,1000} is 3.
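Both are one-liners in Python’s standard library:

```python
from statistics import mean, median

sample = [1, 2, 3, 10, 1000]
print(mean(sample))    # 203.2
print(median(sample))  # 3
```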

I intentionally chose an extreme example: The mean and median of these samples are completely different. But this is something that can happen in real life.

This is vital for understanding the distribution of income, because for almost all countries (and certainly for the world as a whole), the mean income is substantially higher (usually between 50% and 100% higher) than the median income. Yet it is the mean income that gets reported as “per capita GDP”, even though the median income is a much better measure of actual standard of living.

As for the word “average”, it’s probably best to just remove it from your vocabulary. Say “mean” if that’s what you intend, or “median” if that’s what you’re using.

  2. Standard deviation and mean absolute deviation

Standard deviation is another one you’ve probably seen before.

Standard deviation is kind of a weird concept, honestly. It’s so entrenched in statistics that we’re probably stuck with it, but it’s really not a very good measure of anything intuitively interesting.

Mean absolute deviation is a much more intuitive concept, and much more robust to weird distributions (such as those of incomes and financial markets), but it isn’t as widely used by statisticians for some reason.

The standard deviation is defined as the square root of the mean of the squared differences between the individual values in a sample and the mean of that sample. So for my {1,2,3,10,1000} example, the standard deviation is sqrt(((1-203.2)^2 + (2-203.2)^2 + (3-203.2)^2 + (10-203.2)^2 + (1000-203.2)^2)/5) = 398.4.

What can you infer from that figure? Not a lot, honestly. The standard deviation is bigger than the mean, so we have some sense that there’s a lot of variation in our sample. But interpreting exactly what that means is not easy.

The mean absolute deviation is much simpler: It’s the mean of the absolute value of differences between the individual values in a sample and the mean of that sample. In this case it is ((203.2-1) + (203.2-2) + (203.2-3) + (203.2-10) + (1000-203.2))/5 = 318.7.

This has a much simpler interpretation: The mean distance between each value and the mean is 318.7. On average (if we still use that word), each value is about 318.7 away from the mean of 203.2.
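A sketch of both calculations (using the population versions, matching the formulas above):

```python
from statistics import mean

sample = [1, 2, 3, 10, 1000]
m = mean(sample)  # 203.2

# (Population) standard deviation: root of the mean squared deviation.
sd = mean([(x - m) ** 2 for x in sample]) ** 0.5
# Mean absolute deviation: the mean distance from the mean.
mad = mean([abs(x - m) for x in sample])

print(round(sd, 1))   # 398.4
print(round(mad, 1))  # 318.7
```

(The standard library’s `statistics.pstdev` computes the same standard deviation directly.)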

When you ask people to interpret a standard deviation, most of them actually reply as if you had asked them about the mean absolute deviation. They say things like “the average distance from the mean”. Only people who know statistics very well and are being very careful would actually say the true answer, “the square root of the mean of the squared distances from the mean”.

But there is an even more fundamental reason to prefer the mean absolute deviation, and that is that sometimes the standard deviation doesn’t exist!

For very fat-tailed distributions, the sum that would give you the standard deviation simply fails to converge. You could say the standard deviation is infinite, or that it’s simply undefined. Either way we know it’s fat-tailed, but that’s about all. Any finite sample would have a well-defined standard deviation, but that will keep changing as your sample grows, and never converge toward anything in particular.
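A rough simulation of this, using a Pareto distribution with shape 1.5 (which has a finite mean but infinite variance); the specific sample sizes here are just illustrative:

```python
import random

# Sample standard deviations from a fat-tailed distribution never
# settle down as the sample grows; they keep jumping around.
random.seed(1)

def sample_sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

draws = [random.paretovariate(1.5) for _ in range(100_000)]
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(sample_sd(draws[:n]), 2))
# Typical runs show the SD drifting upward or lurching after big draws,
# rather than converging the way it would for a normal distribution.
```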

But usually the mean still exists, and if the mean exists, then the mean absolute deviation also exists. (In some rare cases even they fail, such as the Cauchy distribution—but actually even then there is usually a way to recover what the mean and mean absolute deviation “should have been” even though they don’t technically exist.)

  3. Standard error

The standard error is even more important for statistical inference than the standard deviation, and frankly even harder to intuitively understand.

The actual definition of the standard error is this: The standard deviation of the distribution of sample means, provided that the null hypothesis is true and the distribution is a normal distribution.

How it is usually used is something more like this: “A good guess of the margin of error on my estimates, such that I’m probably not off by more than 2 standard errors in either direction.”

You may notice that those two things aren’t the same, and don’t even seem particularly closely related. You are correct in noticing this, and I hope that you never forget it. One thing that extensive training in statistics (especially frequentist statistics) seems to do to people is to make them forget that.

In particular, the standard error strictly only applies if the value you are trying to estimate is zero, which usually means that your results aren’t interesting. (To be fair, not always; finding zero effect of minimum wage on unemployment was a big deal.) Using it as a margin of error on your actual nonzero estimates is deeply dubious, even though almost everyone does it for lack of an uncontroversial alternative.
Application of standard errors typically also relies heavily on the assumption of a normal distribution, even though plenty of real-world distributions aren’t normal and don’t even approach a normal distribution in quite large samples. The Central Limit Theorem says that the sampling distribution of the mean of any non-fat-tailed distribution will approach a normal distribution eventually as sample size increases, but it doesn’t say how large a sample needs to be to do that, nor does it apply to fat-tailed distributions.

Therefore, the standard error is really a very conservative estimate of your margin of error; it assumes essentially that the only kind of error you had was random sampling error from a normal distribution in an otherwise perfect randomized controlled experiment. All sorts of other forms of error and bias could have occurred at various stages—and typically, did—making your error estimate inherently too small.

This is why you should never believe a claim that comes from only a single study or a handful of studies. There are simply too many things that could have gone wrong. Only when there are a large number of studies, with varying methodologies, all pointing to the same core conclusion, do we really have good empirical evidence of that conclusion. This is part of why the journalistic model of “A new study shows…” is so terrible; if you really want to know what’s true, you look at large meta-analyses of dozens or hundreds of studies, not a single study that could be completely wrong.

  4. Linear regression and its limits

Finally, I come to linear regression, the workhorse of statistical social science. Almost everything in applied social science ultimately comes down to variations on linear regression.

There is the simplest kind, ordinary least-squares (OLS); but then there is two-stage least-squares (2SLS), fixed-effects regression, clustered regression, random-effects regression, heterogeneous treatment effects, and so on.
The basic idea of all regressions is extremely simple: We have an outcome Y, a variable we are interested in D, and some other variables X.

This might be an effect of education D on earnings Y, or minimum wage D on unemployment Y, or eating strawberries D on getting cancer Y. In our X variables we might include age, gender, race, or whatever seems relevant to Y but can’t be affected by D.

We then make the incredibly bold (and typically unjustifiable) assumption that all the effects are linear, and say that:

Y = A + B*D + C*X + E

A, B, and C are coefficients we estimate by fitting a straight line through the data. The last bit, E, is a random error that we allow to fill in any gaps. Then, if the standard error of B is less than half the size of B itself, we declare that our result is “statistically significant”, and we publish our paper “proving” that D has an effect on Y that is proportional to B.

No, really, that’s pretty much it. Most of the work in econometrics involves trying to find good choices of X that will make our estimates of B better. A few of the more sophisticated techniques involve breaking up this single regression into a few pieces that are regressed separately, in the hopes of removing unwanted correlations between our variable of interest D and our error term E.
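As a sketch of the whole procedure, here is a toy regression in NumPy with coefficients I made up (A=1, B=2, C=0.5), recovered by ordinary least squares:

```python
import numpy as np

# A toy version of Y = A + B*D + C*X + E with made-up coefficients.
rng = np.random.default_rng(0)
n = 1_000
D = rng.normal(size=n)                # variable of interest
X = rng.normal(size=n)                # control variable
E = rng.normal(scale=0.1, size=n)     # random error
Y = 1.0 + 2.0 * D + 0.5 * X + E

# Fit a straight line (well, plane) through the data.
M = np.column_stack([np.ones(n), D, X])
coeffs, *_ = np.linalg.lstsq(M, Y, rcond=None)
print(coeffs)  # close to [1.0, 2.0, 0.5]
```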

What about nonlinear effects, you ask? Yeah, we don’t much talk about those.

Occasionally we might include a term for D^2:

Y = A + B1*D + B2*D^2 + C*X + E

Then, if the coefficient B2 is small enough, which is usually what happens, we say “we found no evidence of a nonlinear effect”.

Those who are a bit more sophisticated will instead report (correctly) that they have found the linear projection of the effect, rather than the effect itself; but if the effect was nonlinear enough, the linear projection might be almost meaningless. Also, if you’re too careful about the caveats on your research, nobody publishes your work, because there are plenty of other people competing with you who are willing to upsell their research as far more reliable than it actually is.

If this process seems rather underwhelming to you, that’s good. I think people being too easily impressed by linear regression is a much more widespread problem than people not having enough trust in linear regression.

Yes, it is possible to go too far the other way, and dismiss even dozens of brilliant experiments as totally useless because they used linear regression; but I don’t actually hear people doing that very often. (Maybe occasionally: The evidence that gun ownership increases suicide and homicide and that corporal punishment harms children is largely based on linear regression, but it’s also quite strong at this point, and I do still hear people denying it.)

Far more often I see people point to a single study using linear regression to prove that blueberries cure cancer or eating aspartame will kill you or yoga cures back pain or reading Harry Potter makes you hate Donald Trump or olive oil prevents Alzheimer’s or psychopaths are more likely to enjoy rap music. The more exciting and surprising a new study is, the more dubious you should be of its conclusions. If a very surprising result is unsupported by many other studies and just uses linear regression, you can probably safely ignore it.

A really good scientific study might use linear regression, but it would also be based on detailed, well-founded theory and apply a proper experimental (or at least quasi-experimental) design. It would check for confounding influences, look for nonlinear effects, and be honest that standard errors are a conservative estimate of the margin of error. Most scientific studies probably should end by saying “We don’t actually know whether this is true; we need other people to check it.” Yet sadly few do, because the publishers that have a stranglehold on the industry prefer sexy, exciting, “significant” findings to actual careful, honest research. They’d rather you find something that isn’t there than not find anything, which goes against everything science stands for. Until that changes, all I can really tell you is to be skeptical when you read about linear regressions.

The replication crisis, and the future of science

Aug 27, JDN 2457628 [Sat]

After settling in a little bit in Irvine, I’m now ready to resume blogging, but for now it will be on a reduced schedule. I’ll release a new post every Saturday, at least for the time being.

Today’s post was chosen by Patreon vote, though only one person voted (this whole Patreon voting thing has not been as successful as I’d hoped). It’s about something we scientists really don’t like to talk about, but definitely need to: We are in the middle of a major crisis of scientific replication.

Whenever large studies are conducted attempting to replicate published scientific results, their ability to do so is almost always dismal.

Psychology is the one everyone likes to pick on, because their record is particularly bad. Only 39% of studies were really replicated with the published effect size, though a further 36% were at least qualitatively but not quantitatively similar. Yet economics has its own replication problem, and even medical research is not immune to replication failure.

It’s important not to overstate the crisis; the majority of scientific studies do at least qualitatively replicate. We are doing better than flipping a coin, which is better than one can say of financial forecasters.

There are three kinds of replication, and only one of them should be expected to give near-100% results. That kind is reanalysis—when you take the same data and use the same methods, you absolutely should get the exact same results. I favor making reanalysis a routine requirement of publication; if we can’t get your results by applying your statistical methods to your data, then your paper needs revision before we can entrust it to publication. A number of papers have failed on reanalysis, which is absurd and embarrassing; the worst offender was probably Reinhart-Rogoff, which was used in public policy decisions around the world despite having spreadsheet errors.

The second kind is direct replication—when you do the exact same experiment again and see if you get the same result within error bounds. This kind of replication should work something like 90% of the time, but in fact works more like 60% of the time.

The third kind is conceptual replication—when you do a similar experiment designed to test the same phenomenon from a different perspective. This kind of replication should work something like 60% of the time, but actually only works about 20% of the time.

Economists are well equipped to understand and solve this crisis, because it’s not actually about science. It’s about incentives. I facepalm every time I see another article by an aggrieved statistician about the “misunderstanding” of p-values; no, scientists aren’t misunderstanding anything. They know damn well how p-values are supposed to work. So why do they keep using them wrong? Because their jobs depend on doing so.

The first key point to understand here is “publish or perish”; academics in an increasingly competitive system are required to publish their research in order to get tenure, and frequently required to get tenure in order to keep their jobs at all. (Or they could become adjuncts, who are paid one-fifth as much.)

The second is the fundamentally defective way our research journals are run (as I have discussed in a previous post). As private for-profit corporations whose primary interest is in raising more revenue, our research journals aren’t trying to publish what will genuinely advance scientific knowledge. They are trying to publish what will draw attention to themselves. It’s a similar flaw to what has arisen in our news media; they aren’t trying to convey the truth, they are trying to get ratings to draw advertisers. This is how you get hours of meaningless fluff about a missing airliner and then a single chyron scroll about a war in Congo or a flood in Indonesia. Research journals haven’t fallen quite so far because they have reputations to uphold in order to attract scientists to read them and publish in them; but still, their fundamental goal is and has always been to raise attention in order to raise revenue.

The best way to do that is to publish things that are interesting. But if a scientific finding is interesting, that means it is surprising. It has to be unexpected or unusual in some way. And above all, it has to be positive; you have to have actually found an effect. Except in very rare circumstances, the null result is never considered interesting. This adds up to making journals publish what is improbable.

In particular, it creates a perfect storm for the abuse of p-values. A p-value, roughly speaking, is the probability you would get the observed result if there were no effect at all—for instance, the probability that you’d observe this wage gap between men and women in your sample if in the real world men and women were paid the exact same wages. The standard heuristic is a p-value of 0.05; indeed, it has become so enshrined that it is almost an explicit condition of publication now. Your result must be less than 5% likely to happen if there is no real difference. But if you will only publish results that show a p-value of 0.05, then the papers that get published and read will only be the ones that found such p-values—which renders the p-values meaningless.

It was never particularly meaningful anyway; as we Bayesians have been trying to explain since time immemorial, it matters how likely your hypothesis was in the first place. For something like wage gaps where we’re reasonably sure, but maybe could be wrong, the p-value is not too unreasonable. But if the theory is almost certainly true (“does gravity fall off as the inverse square of distance?”), even a high p-value like 0.35 is still supportive, while if the theory is almost certainly false (“are human beings capable of precognition?”—actual study), even a tiny p-value like 0.001 is still basically irrelevant. We really should be using much more sophisticated inference techniques, but those are harder to do, and don’t provide the nice simple threshold of “Is it below 0.05?”

But okay, p-values can be useful in many cases—if they are used correctly and you see all the results. If you have effect X with p-values 0.03, 0.07, 0.01, 0.06, and 0.09, effect X is probably a real thing. If you have effect Y with p-values 0.04, 0.02, 0.29, 0.35, and 0.74, effect Y is probably not a real thing. But I’ve just set it up so that these would be published exactly the same. They each have two published papers with “statistically significant” results. The other papers never get published and therefore never get seen, so we throw away vital information. This is called the file drawer problem.
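This is easy to see in simulation; a sketch (the study design and numbers here are made up):

```python
import random

# The file drawer problem: run 1,000 studies of a true null effect,
# "publish" only those with p < 0.05, and note that every published
# effect size is spuriously large even though the truth is zero.
random.seed(42)

def study(n=30):
    draws = [random.gauss(0, 1) for _ in range(n)]
    m = sum(draws) / n
    se = 1 / n ** 0.5                # sigma known to be 1, for simplicity
    return m, abs(m / se) > 1.96     # rough two-sided test at p < 0.05

results = [study() for _ in range(1_000)]
published = [m for m, significant in results if significant]
print(f"{len(published)} of 1000 null studies came out 'significant'")
print(f"smallest published |effect|: {min(abs(m) for m in published):.2f}")
```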

Researchers often have a lot of flexibility in designing their experiments. If their only goal were to find truth, they would use this flexibility to test a variety of scenarios and publish all the results, so they can be compared holistically. But that isn’t their only goal; they also care about keeping their jobs so they can pay rent and feed their families. And under our current system, the only way to ensure that you can do that is by publishing things, which basically means only including the parts that showed up as statistically significant—otherwise, journals aren’t interested. And so we get huge numbers of papers published that tell us basically nothing, because we set up such strong incentives for researchers to give misleading results.

The saddest part is that this could be easily fixed.

First, reduce the incentives to publish by finding other ways to evaluate the skill of academics—like teaching for goodness’ sake. Working papers are another good approach. Journals already get far more submissions than they know what to do with, and most of these papers will never be read by more than a handful of people. We don’t need more published findings, we need better published findings—so stop incentivizing mere publication and start finding ways to incentivize research quality.

Second, eliminate private for-profit research journals. Science should be done by government agencies and nonprofits, not for-profit corporations. (And yes, I would apply this to pharmaceutical companies as well, which should really be pharmaceutical manufacturers who make cheap drugs based off of academic research and carry small profit margins.) Why? Again, it’s all about incentives. Corporations have no reason to want to find truth and every reason to want to tilt it in their favor.

Third, increase the number of tenured faculty positions. Instead of building so many new grand edifices to please your plutocratic donors, use your (skyrocketing) tuition money to hire more professors so that you can teach more students better. You can find even more funds if you cut the salaries of your administrators and football coaches. Come on, universities; you are the one industry in the world where labor demand and labor supply are the same people a few years later. You have no excuse for not having the smoothest market clearing in the world. You should never have gluts or shortages.

Fourth, require pre-registration of research studies (as some branches of medicine already do). If the study is sound, an optimal rational agent shouldn’t care in the slightest whether it had a positive or negative result, and if our ape brains won’t let us think that way, we need to establish institutions to force it to happen. They shouldn’t even see the effect size and p-value before they make the decision to publish it; all they should care about is that the experiment makes sense and the proper procedure was conducted.
If we did all that, the replication crisis could be almost completely resolved, as the incentives would be realigned to more closely match the genuine search for truth.

Alas, I don’t see universities or governments or research journals having the political will to actually make such changes, which is very sad indeed.

Actually, our economic growth has been fairly ecologically sustainable lately!

JDN 2457538

Environmentalists have a reputation for being pessimists, and it is not entirely undeserved. While, as Paul Samuelson quipped, Wall Street indexes have predicted nine of the last five recessions, environmentalists have predicted more like twenty of the last zero ecological collapses.

Some fairly serious scientists have endorsed predictions of imminent collapse that haven’t panned out, and many continue to do so. This Guardian article should be hilarious to statisticians, as it literally takes trends that are going one direction, maps them onto a theory that arbitrarily decides they’ll suddenly reverse, and then says “the theory fits the data”. This should be taught in statistics courses as a lesson in how not to fit models. More data distortion occurs in this Scientific American article, which contains the phrase “food per capita is decreasing”; well, that’s true if you just look at the last couple of years, but according to FAOSTAT, food production per capita in 2012 (the most recent data in FAOSTAT) was higher than literally every other year on record except 2011. So if you allow for even the slightest amount of random fluctuation, it’s very clear that food per capita is increasing, not decreasing.

[Figure: global food production per capita, from FAOSTAT]
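The statistical point here is worth making concrete: a short recent dip tells you almost nothing about the long-run trend. Here is a minimal sketch with made-up numbers (not real FAOSTAT figures) showing how a series can be clearly rising by a least-squares trend even though the last two observations happen to decline:

```python
# Illustrative sketch (invented numbers, not real FAOSTAT data): an upward
# trend with ordinary year-to-year noise can easily end on a down-tick.

def ols_slope(ys):
    """Least-squares slope of ys regressed on 0, 1, 2, ... (per-year trend)."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

# A noisy but clearly rising "food per capita" index:
index = [100, 102, 101, 104, 106, 105, 109, 111, 114, 112]

print(index[-1] - index[-2])   # change over the last two years: negative
print(ols_slope(index))        # full-sample trend: positive
```

Cherry-picking the last two points gives "decreasing"; fitting the whole sample gives a solidly positive slope, which is the sense in which food per capita is increasing.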

So many people are predicting the imminent collapse of human civilization. And yet, for some reason, all of them go about their lives as if it weren’t happening! Why, it’s almost as if they don’t really believe it, and just say it to get attention. Nobody gets on the news by saying “Civilization is doing fine; things are mostly getting better.”

There’s a long history of these sorts of gloom and doom predictions; perhaps the paradigm example is Thomas Malthus in 1798 predicting the imminent destruction of civilization by inevitable famine—just in time for global infant mortality rates to start plummeting and economic output to surge beyond anyone’s wildest dreams.

Still, when I sat down to study this it was remarkable to me just how good the outlook is for future sustainability. The Index of Sustainable Economic Welfare was created essentially in an attempt to show how our economic growth is largely an illusion driven by our rapacious natural resource consumption, but it has since been discontinued, perhaps because it didn’t show that. Using the US as an example, I reconstructed the index as best I could from World Bank data, and here’s what came out for the period since 1990:

[Figure: US GDP and reconstructed ISEW, 1990 to present, from World Bank data]

The top line is US GDP as normally measured. The bottom line is the ISEW. The gap between those lines expands on a linear scale, but not on a logarithmic scale; that is to say, GDP and ISEW grow at almost exactly the same rate, so ISEW is always a constant (and large) proportion of GDP. By construction it is necessarily smaller (it essentially starts from GDP and subtracts costs out of it), but the fact that it is growing at the same rate shows that our economic growth is not being driven by depletion of natural resources or the military-industrial complex; it’s being driven by real improvements in education and technology.
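The linear-versus-log point can be checked directly. A toy sketch (the 3% growth rate and the 70% ratio are made up for illustration, not the actual reconstructed figures): if ISEW is always a fixed fraction of GDP, the absolute gap widens every year while the log-scale gap stays exactly constant.

```python
import math

# Hypothetical series: GDP grows 3%/yr; ISEW is always a fixed 70% of GDP.
gdp = [100 * 1.03 ** t for t in range(25)]
isew = [0.70 * g for g in gdp]

# Gap on a linear scale vs. a logarithmic scale:
linear_gaps = [g - i for g, i in zip(gdp, isew)]
log_gaps = [math.log(g) - math.log(i) for g, i in zip(gdp, isew)]

print(linear_gaps[0], linear_gaps[-1])  # widens over time
print(log_gaps[0], log_gaps[-1])        # constant: both equal -log(0.70)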

The Human Development Index has grown in almost every country (albeit at quite different rates) since 1990. Global poverty is the lowest it has ever been. We are living in a golden age of prosperity. It’s such a golden age for our civilization that our happiness rating has maxed out, and now we’re getting +20% production and extra gold from every source. (Sorry, gamer in-joke.)

Now, it is said that pride cometh before a fall; so perhaps our current mind-boggling improvements in human welfare have only been purchased on borrowed time as we further drain our natural resources.

There is some cause for alarm: We’re literally running out of fish, and groundwater tables are falling rapidly. Due to poor land use, deserts are expanding. Huge quantities of garbage now float in our oceans. And of course, climate change is poised to kill millions of people. Arctic sea ice is on track to melt completely every summer within the next few years.

And yet, global carbon emissions have not been increasing the last few years, despite strong global economic growth. We need to be reducing emissions, not just keeping them flat (in a previous post I talked about some policies to do that); but even keeping them flat while still raising standard of living is something a lot of environmentalists kept telling us we couldn’t possibly do. Despite constant talk of “overpopulation” and a “population bomb”, population growth rates are declining and world population is projected to level off around 9 billion. Total solar power production in the US expanded by a factor of 40 in just the last 10 years.
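To put that last figure in perspective, a factor-of-40 expansion over 10 years implies a compound growth rate of 40^(1/10) − 1, roughly 45% per year. A one-liner makes the arithmetic explicit:

```python
# Average annual growth implied by "a factor of 40 in 10 years":
factor, years = 40, 10
annual_rate = factor ** (1 / years) - 1
print(round(annual_rate, 3))  # roughly 0.446, i.e. ~45% per year
```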

Of course, I don’t deny that there are serious environmental problems, and we need to make policies to combat them; but we are doing that. Humanity is not mindlessly plunging headlong into an abyss; we are taking steps to improve our future.

And in fact I think environmentalists deserve a lot of credit for that! Raising awareness of environmental problems has made most Americans recognize that climate change is a serious problem. Further pressure might make them realize it should be one of our top priorities (presently most Americans do not).

And who knows, maybe the extremist doomsayers are necessary to set the Overton Window for the rest of us. I think we of the center-left (toward which reality has a well-known bias) often underestimate how much we rely upon the radical left to pull the discussion away from the radical right and make us seem more reasonable by comparison. It could well be that “climate change will kill tens of millions of people unless we act now to institute a carbon tax and build hundreds of nuclear power plants” is easier to swallow after hearing “climate change will destroy humanity unless we act now to transform global capitalism to agrarian anarcho-socialism.” Ultimately I wish people could be persuaded simply by the overwhelming scientific evidence in favor of the carbon tax/nuclear power argument, but alas, humans are simply not rational enough for that; and you must go to policy with the public you have. So maybe irrational levels of pessimism are a worthwhile corrective to the irrational levels of optimism coming from the other side, like the execrable sophistry of “in praise of fossil fuels”. (Yes, we know our economy was built on coal and oil—that’s the problem. We’re “rolling drunk on petroleum”; when we’re trying to quit drinking, reminding us how much we enjoy drinking is not helpful.)

But I worry that this sort of irrational pessimism carries its own risks. First there is the risk of simply giving up, succumbing to learned helplessness and deciding there’s nothing we can possibly do to save ourselves. Second is the risk that we will do something needlessly drastic (like a radical socialist revolution) that impoverishes or even kills millions of people for no reason. The extreme fear that we are on the verge of ecological collapse could lead people to take a “by any means necessary” stance and end up with a cure worse than the disease. So far the word “ecoterrorism” has mainly been applied to what was really ecovandalism; but if we were in fact on the verge of total civilizational collapse, I can understand why someone would think quite literal terrorism was justified (actually the main reason I don’t is that I just don’t see how it could actually help). Just about anything is worth it to save humanity from destruction.