Demystifying dummy variables

Nov 5, JDN 2458062

Continuing my series of blog posts on basic statistical concepts, today I’m going to talk about dummy variables. Dummy variables are quite simple, but for some reason a lot of people—even people with extensive statistical training—often have trouble understanding them. Perhaps people are simply overthinking matters, or making subtle errors that end up having large consequences.

A dummy variable (more formally a binary variable) is a variable that has only two states: “No”, usually represented 0, and “Yes”, usually represented 1. A dummy variable answers a single “Yes or no” question. They are most commonly used for categorical variables, answering questions like “Is the person’s race White?” and “Is the state California?”; but in fact almost any kind of data can be represented this way: We could represent income using a series of dummy variables like “Is your income greater than $50,000?” “Is your income greater than $51,000?” and so on. As long as the number of possible outcomes is finite—which, in practice, it always is—the data can be represented by some (possibly large) set of dummy variables. In fact, if your data set is large enough, representing numerical data with dummy variables can be a very good thing to do, as it allows you to account for nonlinear effects without assuming some specific functional form.
Most of the misunderstanding regarding dummy variables involves applying them in regressions and interpreting the results.
Probably the most common confusion is about which dummy variables to include. When you have a set of categories represented in your data (e.g. one for each US state), you want to include dummy variables for all but one of them. The most common mistake is to include all of them and end up with a regression that doesn’t make sense; another is, when you have a catchall category like “Other” (e.g. race coded as “White/Black/Other”), to leave out that one and get results with a nonsensical baseline.

You don’t have to leave one out if you only have one set of categories and you don’t include a constant in your regression; in that case the baseline is implicitly an outcome of zero. But this is dangerous, as the interpretation of the coefficients is no longer quite so simple.

The thing to keep in mind is that a coefficient on a dummy variable is an effect of a change—so the coefficient on “White” is the effect of being White. In order to be an effect of a change, that change must be measured against some baseline. The dummy variable you exclude from the regression is the baseline—because the effect of changing to the baseline from the baseline is by definition zero.
Here’s a very simple example where all the regressions can be done by hand. Suppose you have a household with 1 human and 1 cat, and you want to know the effect of species on number of legs. (I mean, hopefully this is something you already know; but that makes it a good illustration.) In what follows, you can safely skip the matrix algebra; but I included it for any readers who want to see how these concepts play out mechanically in the math.
Your outcome variable Y is legs: The human has 2 and the cat has 4. We can write this as a matrix:

\[ Y = \begin{bmatrix} 2 \\ 4 \end{bmatrix} \]


What dummy variables should we choose? There are actually several options.

 

The simplest option is to include both a human variable and a cat variable, and no constant. Let’s put the human variable first. Then our human subject has a value of X1 = [1 0] (“Yes” to human and “No” to cat) and our cat subject has a value of X2 = [0 1].

This is very nice in this case, as it makes our matrix of independent variables simply an identity matrix:

\[ X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \]


This makes the calculations extremely nice, because transposing, multiplying, and inverting an identity matrix all just give us back an identity matrix. The standard OLS regression coefficient is B = (X’X)^{-1} X’Y, which in this case just becomes Y itself.

\[ B = (X’X)^{-1} X’Y = Y = \begin{bmatrix} 2 \\ 4 \end{bmatrix} \]


Our coefficients are 2 and 4. How would we interpret this? Pretty much what you’d think: The effect of being human is having 2 legs, while the effect of being a cat is having 4 legs. This amounts to choosing a baseline of nothing—the effect is compared to a hypothetical entity with no legs at all. And indeed this is what will happen more generally if you do a regression with a dummy for each category and no constant: The baseline will be a hypothetical entity with an outcome of zero on whatever your outcome variable is.
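
If you want to check the arithmetic yourself, here is a minimal numpy sketch (the library choice and variable names are mine, not part of the original example) that reproduces this first regression:

```python
import numpy as np

# Outcome: number of legs. Row 1 is the human, row 2 is the cat.
Y = np.array([2.0, 4.0])

# Dummy variables [human, cat], with no constant term.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# OLS coefficients: B = (X'X)^{-1} X'Y
B = np.linalg.solve(X.T @ X, X.T @ Y)
print(B)  # [2. 4.] -- the "effect" of being human is 2 legs, of being a cat is 4
```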
So far, so good.

But what if we had additional variables to include? Say we have both cats and humans with black hair and brown hair (and no other colors). If we now include the variables human, cat, black hair, brown hair, we won’t get the results we expect—in fact, we’ll get no result at all, because the human and cat columns add up to exactly the same thing as the black hair and brown hair columns (a column of all 1s). The regression is mathematically impossible, regardless of how large a sample we have.

This is why it’s much safer to choose one of the categories as a baseline, and include that as a constant. We could pick either one; we just need to be clear about which one we chose.

Say we take human as the baseline. Then our variables are constant and cat. The variable constant is just 1 for every single individual. The variable cat is 0 for humans and 1 for cats.

Now our independent variable matrix looks like this:

\[ X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \]

The matrix algebra isn’t quite so nice this time:

\[ X’X = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \]

\[ (X’X)^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \]

\[ X’Y = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 6 \\ 4 \end{bmatrix} \]

\[ B = (X’X)^{-1} X’Y = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 6 \\ 4 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} \]


Our coefficients are now 2 and 2. Now, how do we interpret that result? We took human as the baseline, so what we are saying here is that the default is to have 2 legs, and then the effect of being a cat is to get 2 extra legs.
That sounds a bit anthropocentric—most animals are quadrupeds, after all—so let’s try taking cat as the baseline instead. Now our variables are constant and human, and our independent variable matrix looks like this:

\[ X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \]

\[ X’X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \]

\[ (X’X)^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \]

\[ X’Y = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \end{bmatrix} \]

\[ B = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 6 \\ 2 \end{bmatrix} = \begin{bmatrix} 4 \\ -2 \end{bmatrix} \]


Our coefficients are 4 and -2. This seems much more phylogenetically correct: The default number of legs is 4, and the effect of being human is to lose 2 legs.
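
The same numpy sketch, applied to the two baseline parameterizations (again, this is just an illustration of the math above):

```python
import numpy as np

Y = np.array([2.0, 4.0])  # legs: human, then cat

# Baseline = human; columns are [constant, cat]
X_human_baseline = np.array([[1.0, 0.0],
                             [1.0, 1.0]])
print(np.linalg.solve(X_human_baseline.T @ X_human_baseline, X_human_baseline.T @ Y))
# [2. 2.] -- default of 2 legs, being a cat adds 2

# Baseline = cat; columns are [constant, human]
X_cat_baseline = np.array([[1.0, 1.0],
                           [1.0, 0.0]])
print(np.linalg.solve(X_cat_baseline.T @ X_cat_baseline, X_cat_baseline.T @ Y))
# [ 4. -2.] -- default of 4 legs, being human subtracts 2
```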
All these regressions are really saying the same thing: Humans have 2 legs, cats have 4. And in this particular case, it’s simple and obvious. But once things start getting more complicated, people tend to make mistakes even on these very simple questions.

A common mistake would be to try to include a constant and both dummy variables: constant human cat. What happens if we try that? The matrix algebra gets particularly nasty, first of all:

\[ X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \]

\[ X’X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \]


Our covariance matrix X’X is now 3×3, which means we have more coefficients than we have data points. But we could throw in another human and another cat to fix that problem.

 

More importantly, the covariance matrix is not invertible. Rows 2 and 3 sum to row 1, so the matrix is singular.

If you tried to run this regression, you’d get an error message about “perfect multicollinearity”. What this really means is you haven’t chosen a valid baseline. Your baseline isn’t human and it isn’t cat; and since you included a constant, it isn’t a baseline of nothing either. It’s… unspecified.
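
You can see the breakdown directly in code. This is only a sketch; a real statistics package would either throw a multicollinearity error or silently drop one of the columns for you.

```python
import numpy as np

Y = np.array([2.0, 4.0])

# Columns [constant, human, cat]: the constant equals human + cat,
# so the matrix has rank 2, not 3.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2 -- one column is redundant

try:
    np.linalg.inv(XtX)
except np.linalg.LinAlgError as err:
    print("Cannot invert X'X:", err)  # perfect multicollinearity
```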

You actually can choose whatever baseline you want for this regression, by setting the constant term to whatever number you want. Set a constant of 0 and your baseline is nothing: you’ll get back the coefficients 0, 2 and 4. Set a constant of 2 and your baseline is human: you’ll get 2, 0 and 2. Set a constant of 4 and your baseline is cat: you’ll get 4, -2, 0. You can even choose something weird like 3 (you’ll get 3, -1, 1) or 7 (you’ll get 7, -5, -3) or -4 (you’ll get -4, 6, 8). You don’t even have to choose integers; you could pick -0.9 or 3.14159. As long as the constant plus the coefficient on human add to 2 and the constant plus the coefficient on cat add to 4, you’ll get a valid regression.
Again, this example seems pretty simple. But it’s an easy trap to fall into if you don’t think carefully about what variables you are including. If you are looking at effects on income and you have dummy variables on race, gender, schooling (e.g. no high school, high school diploma, some college, Bachelor’s, master’s, PhD), and what state a person lives in, it would be very tempting to just throw all those variables into a regression and see what comes out. But nothing is going to come out, because you haven’t specified a baseline. Your baseline isn’t even some hypothetical person with $0 income (which already doesn’t sound like a great choice); it’s just not a coherent baseline at all.

Generally the best thing to do (for the most precise estimates) is to choose the most common category in each set as the baseline. So for the US a good choice would be to set the baseline as White, female, high school diploma, California. Another common strategy when looking at discrimination specifically is to make the most privileged category the baseline, so we’d instead have White, male, PhD, and… Maryland, it turns out. Then we expect all our coefficients to be negative: Your income is generally lower if you are not White, not male, have less than a PhD, or live outside Maryland.

This is also important if you are interested in interactions: For example, the effect on your income of being Black in California is probably not the same as the effect of being Black in Mississippi. Then you’ll want to include an interaction term like Black-and-Mississippi, which for dummy variables is just the Black variable multiplied by the Mississippi variable.

But now you need to be especially clear about what your baseline is: If being White in California is your baseline, then the coefficient on Black is the effect of being Black in California, while the coefficient on Mississippi is the effect of being in Mississippi if you are White. The coefficient on Black and Mississippi is the effect of being Black in Mississippi, over and above the sum of the effects of being Black and the effect of being in Mississippi. If we saw a positive coefficient there, it wouldn’t mean that it’s good to be Black in Mississippi; it would simply mean that it’s not as bad as we might expect if we just summed the downsides of being Black with the downsides of being in Mississippi. And if we saw a negative coefficient there, it would mean that being Black in Mississippi is even worse than you would expect just from summing up the effects of being Black with the effects of being in Mississippi.
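
As a quick sketch of how such an interaction dummy is typically constructed (the data and column names here are hypothetical, purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "black":       [1, 0, 1, 0],
    "mississippi": [1, 1, 0, 0],
})

# The "Black and Mississippi" dummy is literally the product of the two dummies:
# it equals 1 only for Black respondents living in Mississippi.
df["black_x_mississippi"] = df["black"] * df["mississippi"]
print(df)
```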

As long as you choose your baseline carefully and stick to it, interpreting regressions with dummy variables isn’t very hard. But so many people forget this step that they get very confused by the end, looking at a term like Black female Mississippi and seeing a positive coefficient, and thinking that must mean that life is good for Black women in Mississippi, when really all it means is the small mercy that being a Black woman in Mississippi isn’t quite as bad as you might think if you just added up the effect of being Black, plus the effect of being a woman, plus the effect of being Black and a woman, plus the effect of living in Mississippi, plus the effect of being Black in Mississippi, plus the effect of being a woman in Mississippi.

 

Statistics you should have been taught in high school, but probably weren’t

Oct 15, JDN 2458042

Today I’m trying something a little different. This post will assume a lot less background knowledge than most of the others. For some of my readers, this post will probably seem too basic, obvious, even boring. For others, it might feel like a breath of fresh air, relief at last from the overly-dense posts I am generally inclined to write out of Curse of Knowledge. Hopefully I can balance these two effects well enough to gain rather than lose readers.

Here are four core statistical concepts that I think all adults should know, necessary for functional literacy in understanding the never-ending stream of news stories about “A new study shows…” and more generally in applying social science to political decisions. In theory these should all be taught as part of a core high school curriculum, but typically they either aren’t taught or aren’t retained once students graduate. (Really, I think we should replace one year of algebra with one semester of statistics and one semester of logic. Most people don’t actually need algebra, but they absolutely do need logic and statistics.)

  1. Mean and median

The mean and the median are quite simple concepts, and you’ve probably at least heard of them before, yet confusion between them has caused a great many misunderstandings.

Part of the problem is the word “average”. Normally, the word “average” applies to the mean—for example, a batting average, or an average speed. But in common usage the word “average” can also mean “typical” or “representative”—an average person, an average family. And in many cases, particularly when it comes to economics, the mean is in no way typical or representative.

The mean of a sample of values is just the sum of all those values, divided by the number of values. The mean of the sample {1,2,3,10,1000} is (1+2+3+10+1000)/5 = 203.2

The median of a sample of values is the middle one—order the values, choose the one in the exact center. If you have an even number, take the mean of the two values on either side. So the median of the sample {1,2,3,10,1000} is 3.

I intentionally chose an extreme example: The mean and median of this sample are completely different. But this is something that can happen in real life.
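
If you want to verify this yourself, here is a two-line Python check (using only the standard library):

```python
import statistics

sample = [1, 2, 3, 10, 1000]
print(statistics.mean(sample))    # 203.2
print(statistics.median(sample))  # 3
```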

This is vital for understanding the distribution of income, because for almost all countries (and certainly for the world as a whole), the mean income is substantially higher (usually between 50% and 100% higher) than the median income. Yet it is the mean income that gets reported as “per capita GDP”, even though the median income is a much better measure of actual standard of living.

As for the word “average”, it’s probably best to just remove it from your vocabulary. Say “mean” if that’s what you intend, or “median” if that’s what you’re using.

  2. Standard deviation and mean absolute deviation

Standard deviation is another one you’ve probably seen before.

Standard deviation is kind of a weird concept, honestly. It’s so entrenched in statistics that we’re probably stuck with it, but it’s really not a very good measure of anything intuitively interesting.

Mean absolute deviation is a much more intuitive concept, and much more robust to weird distributions (such as those of incomes and financial markets), but it isn’t as widely used by statisticians for some reason.

The standard deviation is defined as the square root of the mean of the squared differences between the individual values in a sample and the mean of that sample. So for my {1,2,3,10,1000} example, the standard deviation is sqrt(((1-203.2)^2 + (2-203.2)^2 + (3-203.2)^2 + (10-203.2)^2 + (1000-203.2)^2)/5) = 398.4.

What can you infer from that figure? Not a lot, honestly. The standard deviation is bigger than the mean, so we have some sense that there’s a lot of variation in our sample. But interpreting exactly what that means is not easy.

The mean absolute deviation is much simpler: It’s the mean of the absolute value of differences between the individual values in a sample and the mean of that sample. In this case it is ((203.2-1) + (203.2-2) + (203.2-3) + (203.2-10) + (1000-203.2))/5 = 318.7.

This has a much simpler interpretation: The mean distance between each value and the mean is 318.7. On average (if we still use that word), each value is about 318.7 away from the mean of 203.2.
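
Here is a short sketch reproducing both numbers (dividing by n, as in the definitions above):

```python
import numpy as np

sample = np.array([1, 2, 3, 10, 1000], dtype=float)
mean = sample.mean()  # 203.2

# Standard deviation: square root of the mean of the squared deviations
std_dev = np.sqrt(((sample - mean) ** 2).mean())
print(std_dev)  # about 398.4

# Mean absolute deviation: mean of the absolute deviations
mad = np.abs(sample - mean).mean()
print(mad)      # about 318.7
```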

When you ask people to interpret a standard deviation, most of them actually reply as if you had asked them about the mean absolute deviation. They say things like “the average distance from the mean”. Only people who know statistics very well and are being very careful would actually give the true answer, “the square root of the mean of the squared distances from the mean”.

But there is an even more fundamental reason to prefer the mean absolute deviation, and that is that sometimes the standard deviation doesn’t exist!

For very fat-tailed distributions, the sum (or integral) that would give you the standard deviation simply fails to converge. You could say the standard deviation is infinite, or that it’s simply undefined. Either way we know it’s fat-tailed, but that’s about all. Any finite sample would have a well-defined standard deviation, but that will keep changing as your sample grows, and never converge toward anything in particular.

But usually the mean still exists, and if the mean exists, then the mean absolute deviation also exists. (In some rare cases even they fail, such as the Cauchy distribution—but actually even then there is usually a way to recover what the mean and mean absolute deviation “should have been” even though they don’t technically exist.)
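
Here is a rough simulation of that phenomenon (the distribution and parameters are my own choices for illustration): a classical Pareto with tail index 1.5 has a finite mean but infinite variance, so as the sample grows the sample standard deviation never settles down, while the mean absolute deviation stays in a stable range.

```python
import numpy as np

rng = np.random.default_rng(0)
# Classical Pareto with x_min = 1 and tail index 1.5: finite mean, infinite variance
data = 1.0 + rng.pareto(1.5, size=1_000_000)

for n in (10**3, 10**4, 10**5, 10**6):
    x = data[:n]
    sd = x.std()
    mad = np.abs(x - x.mean()).mean()
    print(f"n={n:>8}  sd={sd:10.1f}  mad={mad:6.2f}")
# The sd column keeps jumping around as n grows (it would never converge);
# the mad column fluctuates far less and stays in the same neighborhood.
```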

  3. Standard error

The standard error is even more important for statistical inference than the standard deviation, and frankly even harder to intuitively understand.

The actual definition of the standard error is this: The standard deviation of the distribution of sample means, provided that the null hypothesis is true and the distribution is a normal distribution.

How it is usually used is something more like this: “A good guess of the margin of error on my estimates, such that I’m probably not off by more than 2 standard errors in either direction.”

You may notice that those two things aren’t the same, and don’t even seem particularly closely related. You are correct in noticing this, and I hope that you never forget it. One thing that extensive training in statistics (especially frequentist statistics) seems to do to people is to make them forget that.

In particular, the standard error strictly only applies if the value you are trying to estimate is zero, which usually means that your results aren’t interesting. (To be fair, not always; finding zero effect of minimum wage on unemployment was a big deal.) Using it as a margin of error on your actual nonzero estimates is deeply dubious, even though almost everyone does it for lack of an uncontroversial alternative.
Application of standard errors typically also relies heavily on the assumption of a normal distribution, even though plenty of real-world distributions aren’t normal and don’t even approach a normal distribution in quite large samples. The Central Limit Theorem says that the sampling distribution of the mean of any non-fat-tailed distribution will approach a normal distribution eventually as sample size increases, but it doesn’t say how large a sample needs to be to do that, nor does it apply to fat-tailed distributions.

Therefore, the standard error is really a best-case estimate of your margin of error; it assumes essentially that the only kind of error you had was random sampling error from a normal distribution in an otherwise perfect randomized controlled experiment. All sorts of other forms of error and bias could have occurred at various stages—and typically, did—making your error estimate inherently too small.

This is why you should never believe a claim that comes from only a single study or a handful of studies. There are simply too many things that could have gone wrong. Only when there are a large number of studies, with varying methodologies, all pointing to the same core conclusion, do we really have good empirical evidence of that conclusion. This is part of why the journalistic model of “A new study shows…” is so terrible; if you really want to know what’s true, you look at large meta-analyses of dozens or hundreds of studies, not a single study that could be completely wrong.

  4. Linear regression and its limits

Finally, I come to linear regression, the workhorse of statistical social science. Almost everything in applied social science ultimately comes down to variations on linear regression.

There is the simplest kind, ordinary least-squares or OLS; but then there is two-stage least-squares (2SLS), fixed-effects regression, clustered regression, random-effects regression, heterogeneous treatment effects, and so on.
The basic idea of all regressions is extremely simple: We have an outcome Y, a variable we are interested in D, and some other variables X.

This might be an effect of education D on earnings Y, or minimum wage D on unemployment Y, or eating strawberries D on getting cancer Y. In our X variables we might include age, gender, race, or whatever seems relevant to Y but can’t be affected by D.

We then make the incredibly bold (and typically unjustifiable) assumption that all the effects are linear, and say that:

Y = A + B*D + C*X + E

A, B, and C are coefficients we estimate by fitting a straight line through the data. The last bit, E, is a random error that we allow to fill in any gaps. Then, if the standard error of B is less than half the size of B itself, we declare that our result is “statistically significant”, and we publish our paper “proving” that D has an effect on Y that is proportional to B.

No, really, that’s pretty much it. Most of the work in econometrics involves trying to find good choices of X that will make our estimates of B better. A few of the more sophisticated techniques involve breaking up this single regression into a few pieces that are regressed separately, in the hopes of removing unwanted correlations between our variable of interest D and our error term E.
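
Here is a sketch of that entire workflow on made-up data (the true coefficients, sample size, and use of statsmodels are my own choices for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=n)            # a control variable
D = rng.normal(size=n)            # the variable of interest
E = rng.normal(size=n)            # random error
Y = 1.0 + 0.5 * D + 0.3 * X + E   # "true" model: A = 1, B = 0.5, C = 0.3

exog = sm.add_constant(np.column_stack([D, X]))  # columns: constant, D, X
fit = sm.OLS(Y, exog).fit()

print(fit.params)  # estimates of A, B, C
print(fit.bse)     # their standard errors
print(fit.params[1] / fit.bse[1])  # t-statistic on D: "significant" if roughly > 2
```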

What about nonlinear effects, you ask? Yeah, we don’t much talk about those.

Occasionally we might include a term for D^2:

Y = A + B1*D + B2*D^2 + C*X + E

Then, if the coefficient B2 is small enough, which is usually what happens, we say “we found no evidence of a nonlinear effect”.

Those who are a bit more sophisticated will instead report (correctly) that they have found the linear projection of the effect, rather than the effect itself; but if the effect was nonlinear enough, the linear projection might be almost meaningless. Also, if you’re too careful about the caveats on your research, nobody publishes your work, because there are plenty of other people competing with you who are willing to upsell their research as far more reliable than it actually is.

If this process seems rather underwhelming to you, that’s good. I think people being too easily impressed by linear regression is a much more widespread problem than people not having enough trust in linear regression.

Yes, it is possible to go too far the other way, and dismiss even dozens of brilliant experiments as totally useless because they used linear regression; but I don’t actually hear people doing that very often. (Maybe occasionally: The evidence that gun ownership increases suicide and homicide and that corporal punishment harms children is largely based on linear regression, but it’s also quite strong at this point, and I do still hear people denying it.)

Far more often I see people point to a single study using linear regression to prove that blueberries cure cancer or eating aspartame will kill you or yoga cures back pain or reading Harry Potter makes you hate Donald Trump or olive oil prevents Alzheimer’s or psychopaths are more likely to enjoy rap music. The more exciting and surprising a new study is, the more dubious you should be of its conclusions. If a very surprising result is unsupported by many other studies and just uses linear regression, you can probably safely ignore it.

A really good scientific study might use linear regression, but it would also be based on detailed, well-founded theory and apply a proper experimental (or at least quasi-experimental) design. It would check for confounding influences, look for nonlinear effects, and be honest that standard errors understate the true margin of error. Most scientific studies probably should end by saying “We don’t actually know whether this is true; we need other people to check it.” Yet sadly few do, because the publishers that have a stranglehold on the industry prefer sexy, exciting, “significant” findings to actual careful, honest research. They’d rather you find something that isn’t there than not find anything, which goes against everything science stands for. Until that changes, all I can really tell you is to be skeptical when you read about linear regressions.

The replication crisis, and the future of science

Aug 27, JDN 2457628 [Sat]

After settling in a little bit in Irvine, I’m now ready to resume blogging, but for now it will be on a reduced schedule. I’ll release a new post every Saturday, at least for the time being.

Today’s post was chosen by Patreon vote, though only one person voted (this whole Patreon voting thing has not been as successful as I’d hoped). It’s about something we scientists really don’t like to talk about, but definitely need to: We are in the middle of a major crisis of scientific replication.

Whenever large studies are conducted attempting to replicate published scientific results, their ability to do so is almost always dismal.

Psychology is the one everyone likes to pick on, because their record is particularly bad. Only 39% of studies were really replicated with the published effect size, though a further 36% were at least qualitatively but not quantitatively similar. Yet economics has its own replication problem, and even medical research is not immune to replication failure.

It’s important not to overstate the crisis; the majority of scientific studies do at least qualitatively replicate. We are doing better than flipping a coin, which is better than one can say of financial forecasters.
There are three kinds of replication, and only one of them should be expected to give near-100% results. That kind is reanalysis: when you take the same data and use the same methods, you absolutely should get the exact same results. I favor making reanalysis a routine requirement of publication; if we can’t get your results by applying your statistical methods to your data, then your paper needs revision before we can entrust it to publication. A number of papers have failed on reanalysis, which is absurd and embarrassing; the worst offender was probably Reinhart-Rogoff, which was used in public policy decisions around the world despite having spreadsheet errors.

The second kind is direct replication—when you do the exact same experiment again and see if you get the same result within error bounds. This kind of replication should work something like 90% of the time, but in fact works more like 60% of the time.

The third kind is conceptual replication—when you do a similar experiment designed to test the same phenomenon from a different perspective. This kind of replication should work something like 60% of the time, but actually only works about 20% of the time.

Economists are well equipped to understand and solve this crisis, because it’s not actually about science. It’s about incentives. I facepalm every time I see another article by an aggrieved statistician about the “misunderstanding” of p-values; no, scientists aren’t misunderstanding anything. They know damn well how p-values are supposed to work. So why do they keep using them wrong? Because their jobs depend on doing so.

The first key point to understand here is “publish or perish”; academics in an increasingly competitive system are required to publish their research in order to get tenure, and frequently required to get tenure in order to keep their jobs at all. (Or they could become adjuncts, who are paid one-fifth as much.)

The second is the fundamentally defective way our research journals are run (as I have discussed in a previous post). As private for-profit corporations whose primary interest is in raising more revenue, our research journals aren’t trying to publish what will genuinely advance scientific knowledge. They are trying to publish what will draw attention to themselves. It’s a similar flaw to what has arisen in our news media; they aren’t trying to convey the truth, they are trying to get ratings to draw advertisers. This is how you get hours of meaningless fluff about a missing airliner and then a single chyron scroll about a war in Congo or a flood in Indonesia. Research journals haven’t fallen quite so far because they have reputations to uphold in order to attract scientists to read them and publish in them; but still, their fundamental goal is and has always been to raise attention in order to raise revenue.

The best way to do that is to publish things that are interesting. But if a scientific finding is interesting, that means it is surprising. It has to be unexpected or unusual in some way. And above all, it has to be positive; you have to have actually found an effect. Except in very rare circumstances, the null result is never considered interesting. This adds up to making journals publish what is improbable.

In particular, it creates a perfect storm for the abuse of p-values. A p-value, roughly speaking, is the probability you would get the observed result if there were no effect at all—for instance, the probability that you’d observe this wage gap between men and women in your sample if in the real world men and women were paid the exact same wages. The standard heuristic is a p-value of 0.05; indeed, it has become so enshrined that it is almost an explicit condition of publication now. Your result must be less than 5% likely to happen if there is no real difference. But if you will only publish results that show a p-value of 0.05, then the papers that get published and read will only be the ones that found such p-values—which renders the p-values meaningless.

It was never particularly meaningful anyway; as we Bayesians have been trying to explain since time immemorial, it matters how likely your hypothesis was in the first place. For something like wage gaps where we’re reasonably sure, but maybe could be wrong, the p-value is not too unreasonable. But if the theory is almost certainly true (“does gravity fall off as the inverse square of distance?”), even a high p-value like 0.35 is still supportive, while if the theory is almost certainly false (“are human beings capable of precognition?”—actual study), even a tiny p-value like 0.001 is still basically irrelevant. We really should be using much more sophisticated inference techniques, but those are harder to do, and don’t provide the nice simple threshold of “Is it below 0.05?”

But okay, p-values can be useful in many cases—if they are used correctly and you see all the results. If you have effect X with p-values 0.03, 0.07, 0.01, 0.06, and 0.09, effect X is probably a real thing. If you have effect Y with p-values 0.04, 0.02, 0.29, 0.35, and 0.74, effect Y is probably not a real thing. But I’ve just set it up so that these would be published exactly the same. They each have two published papers with “statistically significant” results. The other papers never get published and therefore never get seen, so we throw away vital information. This is called the file drawer problem.
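
Here is a toy simulation of the file drawer problem (a sketch; the effect size, sample sizes, and use of a scipy t-test are my own choices): run many small studies, “publish” only the ones with p < 0.05, and look at what the published record would suggest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def published_estimates(true_effect, n_studies=1000, n=50):
    """Run n_studies two-group studies; keep only the 'significant' ones."""
    kept = []
    for _ in range(n_studies):
        treatment = rng.normal(true_effect, 1.0, n)
        control = rng.normal(0.0, 1.0, n)
        result = stats.ttest_ind(treatment, control)
        if result.pvalue < 0.05:  # the file drawer swallows everything else
            kept.append(treatment.mean() - control.mean())
    return kept

for effect in (0.3, 0.0):
    pub = published_estimates(effect)
    print(f"true effect {effect}: {len(pub)}/1000 published, "
          f"mean |published estimate| = {np.mean(np.abs(pub)):.2f}")
# Even when the true effect is zero, about 5% of studies get "published",
# and every one of them reports an effect far from zero.
```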

Researchers often have a lot of flexibility in designing their experiments. If their only goal were to find truth, they would use this flexibility to test a variety of scenarios and publish all the results, so they can be compared holistically. But that isn’t their only goal; they also care about keeping their jobs so they can pay rent and feed their families. And under our current system, the only way to ensure that you can do that is by publishing things, which basically means only including the parts that showed up as statistically significant—otherwise, journals aren’t interested. And so we get huge numbers of papers published that tell us basically nothing, because we set up such strong incentives for researchers to give misleading results.

The saddest part is that this could be easily fixed.

First, reduce the incentives to publish by finding other ways to evaluate the skill of academics—like teaching for goodness’ sake. Working papers are another good approach. Journals already get far more submissions than they know what to do with, and most of these papers will never be read by more than a handful of people. We don’t need more published findings, we need better published findings—so stop incentivizing mere publication and start finding ways to incentivize research quality.

Second, eliminate private for-profit research journals. Science should be done by government agencies and nonprofits, not for-profit corporations. (And yes, I would apply this to pharmaceutical companies as well, which should really be pharmaceutical manufacturers who make cheap drugs based off of academic research and carry small profit margins.) Why? Again, it’s all about incentives. Corporations have no reason to want to find truth and every reason to want to tilt it in their favor.

Third, increase the number of tenured faculty positions. Instead of building so many new grand edifices to please your plutocratic donors, use your (skyrocketing) tuition money to hire more professors so that you can teach more students better. You can find even more funds if you cut the salaries of your administrators and football coaches. Come on, universities; you are the one industry in the world where labor demand and labor supply are the same people a few years later. You have no excuse for not having the smoothest market clearing in the world. You should never have gluts or shortages.

Fourth, require pre-registration of research studies (as some branches of medicine already do). If the study is sound, an optimal rational agent shouldn’t care in the slightest whether it had a positive or negative result, and if our ape brains won’t let us think that way, we need to establish institutions to force it to happen. Journals shouldn’t even see the effect size and p-value before they make the decision to publish it; all they should care about is that the experiment makes sense and the proper procedure was conducted.
If we did all that, the replication crisis could be almost completely resolved, as the incentives would be realigned to more closely match the genuine search for truth.

Alas, I don’t see universities or governments or research journals having the political will to actually make such changes, which is very sad indeed.

Actually, our economic growth has been fairly ecologically sustainable lately!

JDN 2457538

Environmentalists have a reputation for being pessimists, and it is not entirely undeserved. While, as Paul Samuelson said, Wall Street indexes have predicted nine out of the last five recessions, environmentalists have predicted more like twenty out of the last zero ecological collapses.

Some fairly serious scientists have endorsed predictions of imminent collapse that haven’t panned out, and many continue to do so. This Guardian article should be hilarious to statisticians, as it literally takes trends that are going one direction, maps them onto a theory that arbitrarily decides they’ll suddenly reverse, and then says “the theory fits the data”. This should be taught in statistics courses as a lesson in how not to fit models. More data distortion occurs in this Scientific American article, which contains the phrase “food per capita is decreasing”; well, that’s true if you just look at the last couple of years, but according to FAOSTAT, food production per capita in 2012 (the most recent data in FAOSTAT) was higher than literally every other year on record except 2011. So if you allow for even the slightest amount of random fluctuation, it’s very clear that food per capita is increasing, not decreasing.

[Figure: global food production per capita, FAOSTAT]

So many people predicting imminent collapse of human civilization. And yet, for some reason, all the people predicting this go about their lives as if it weren’t happening! Why, it’s almost as if they don’t really believe it, and just say it to get attention. Nobody gets on the news by saying “Civilization is doing fine; things are mostly getting better.”

There’s a long history of these sorts of gloom and doom predictions; perhaps the paradigm example is Thomas Malthus in 1798 predicting the imminent destruction of civilization by inevitable famine—just in time for global infant mortality rates to start plummeting and economic output to surge beyond anyone’s wildest dreams.

Still, when I sat down to study this it was remarkable to me just how good the outlook is for future sustainability. The Index of Sustainable Economic Welfare was created essentially in an attempt to show how our economic growth is largely an illusion driven by our rapacious natural resource consumption, but it has since been discontinued, perhaps because it didn’t show that. Using the US as an example, I reconstructed the index as best I could from World Bank data, and here’s what came out for the period since 1990:

[Figure: US GDP and reconstructed ISEW, 1990 to present]

The top line is US GDP as normally measured. The bottom line is the ISEW. The gap between those lines expands on a linear scale, but not on a logarithmic scale; that is to say, GDP and ISEW grow at almost exactly the same rate, so ISEW is always a constant (and large) proportion of GDP. By construction it is necessarily smaller (it basically starts from GDP and subtracts various costs from it), but the fact that it is growing at the same rate shows that our economic growth is not being driven by depletion of natural resources or the military-industrial complex; it’s being driven by real improvements in education and technology.

The Human Development Index has grown in almost every country (albeit at quite different rates) since 1990. Global poverty is the lowest it has ever been. We are living in a golden age of prosperity. This is such a golden age for our civilization, our happiness rating maxed out and now we’re getting +20% production and extra gold from every source. (Sorry, gamer in-joke.)

Now, it is said that pride cometh before a fall; so perhaps our current mind-boggling improvements in human welfare have only been purchased on borrowed time as we further drain our natural resources.

There is some cause for alarm: We’re literally running out of fish, and groundwater tables are falling rapidly. Due to poor land use deserts are expanding. Huge quantities of garbage now float in our oceans. And of course, climate change is poised to kill millions of people. Arctic ice will melt every summer starting in the next few years.

And yet, global carbon emissions have not been increasing the last few years, despite strong global economic growth. We need to be reducing emissions, not just keeping them flat (in a previous post I talked about some policies to do that); but even keeping them flat while still raising standard of living is something a lot of environmentalists kept telling us we couldn’t possibly do. Despite constant talk of “overpopulation” and a “population bomb”, population growth rates are declining and world population is projected to level off around 9 billion. Total solar power production in the US expanded by a factor of 40 in just the last 10 years.

Of course, I don’t deny that there are serious environmental problems, and we need to make policies to combat them; but we are doing that. Humanity is not mindlessly plunging headlong into an abyss; we are taking steps to improve our future.

And in fact I think environmentalists deserve a lot of credit for that! Raising awareness of environmental problems has made most Americans recognize that climate change is a serious problem. Further pressure might make them realize it should be one of our top priorities (presently most Americans do not).

And who knows, maybe the extremist doomsayers are necessary to set the Overton Window for the rest of us. I think we of the center-left (toward which reality has a well-known bias) often underestimate how much we rely upon the radical left to pull the discussion away from the radical right and make us seem more reasonable by comparison. It could well be that “climate change will kill tens of millions of people unless we act now to institute a carbon tax and build hundreds of nuclear power plants” is easier to swallow after hearing “climate change will destroy humanity unless we act now to transform global capitalism to agrarian anarcho-socialism.” Ultimately I wish people could be persuaded simply by the overwhelming scientific evidence in favor of the carbon tax/nuclear power argument, but alas, humans are simply not rational enough for that; and you must go to policy with the public you have. So maybe irrational levels of pessimism are a worthwhile corrective to the irrational levels of optimism coming from the other side, like the execrable sophistry of “in praise of fossil fuels” (yes, we know our economy was built on coal and oil—that’s the problem. We’re “rolling drunk on petroleum”; when we’re trying to quit drinking, reminding us how much we enjoy drinking is not helpful.).

But I worry that this sort of irrational pessimism carries its own risks. First there is the risk of simply giving up, succumbing to learned helplessness and deciding there’s nothing we can possibly do to save ourselves. Second is the risk that we will do something needlessly drastic (like a radical socialist revolution) that impoverishes or even kills millions of people for no reason. The extreme fear that we are on the verge of ecological collapse could lead people to take a “by any means necessary” stance and end up with a cure worse than the disease. So far the word “ecoterrorism” has mainly been applied to what was really ecovandalism; but if we were in fact on the verge of total civilizational collapse, I can understand why someone would think quite literal terrorism was justified (actually the main reason I don’t is that I just don’t see how it could actually help). Just about anything is worth it to save humanity from destruction.

How I wish we measured percentage change

JDN 2457415

For today’s post I’m taking a break from issues of global policy to discuss a bit of a mathematical pet peeve. It is an opinion I share with many economists—for instance Miles Kimball has a very nice post about it, complete with some clever analogies to music.

I hate when we talk about percentages in asymmetric terms.

What do I mean by this? Well, here are a few examples.

If my stock portfolio loses 10% one year and then gains 11% the following year, have I gained or lost money? I’ve lost money. Only a little bit—I’m down 0.1%—but still, a loss.

In 2003, Venezuela suffered a depression of -26.7% growth one year, and then an economic boom of 36.1% growth the following year. What was their new GDP, relative to what it was before the depression? Very slightly less than before. (99.8% of its pre-recession value, to be precise.) You would think that falling 27% and rising 36% would leave you about 9% ahead; in fact it leaves you behind.

Would you rather live in a country with 11% inflation and have constant nominal pay, or live in a country with no inflation and take a 10% pay cut? You should prefer the inflation; in that case your real income only falls by 9.9%, instead of 10%.

We often say that the real interest rate is simply the nominal interest rate minus the rate of inflation, but that’s actually only an approximation. If you have 7% inflation and a nominal interest rate of 11%, your real interest rate is not actually 4%; it is 3.74%. If you have 2% inflation and a nominal interest rate of 0%, your real interest rate is not actually -2%; it is -1.96%.

This is what I mean by asymmetric:

Rising 10% and falling 10% do not cancel each other out. To cancel out a fall of 10%, you must actually rise 11.1%.

Gaining 20% and losing 20% do not cancel each other out. To cancel out a loss of 20%, you need a gain of 25%.

Is it starting to bother you yet? It sure bothers me.

Worst of all is the fact that the way we usually measure percentages, losses are bounded at 100% while gains are unbounded. To cancel a loss of 100%, you’d need a gain of infinity.

There are two basic ways of solving this problem: The simple way, and the good way.

The simple way is to just start measuring percentages symmetrically, by including both the starting and ending values in the calculation and averaging them.
That is, instead of using this formula:

% change = 100% * (new – old)/(old)

You use this one:

% change = 100% * (new – old)/((new + old)/2)

In this new system, percentage changes are symmetric.

Suppose a country’s GDP rises from $5 trillion to $6 trillion.

In the old system we’d say it has risen 20%:

100% * ($6 T – $5 T)/($5 T) = 20%

In the symmetric system, we’d say it has risen 18.2%:

100% * ($6 T – $5 T)/($5.5 T) = 18.2%

Suppose it falls back to $5 trillion the next year.

In the old system we’d say it has only fallen 16.7%:

100% * ($5 T – $6 T)/($6 T) = -16.7%

But in the symmetric system, we’d say it has fallen 18.2%.

100% * ($5 T – $6 T)/($5.5 T) = -18.2%

In the old system, the gain of 20% was somehow canceled by a loss of 16.7%. In the symmetric system, the gain of 18.2% was canceled by a loss of 18.2%, just as you’d expect.
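
Here is a minimal sketch of the two formulas, reproducing the $5 trillion to $6 trillion example (values in trillions):

```python
def pct_change(old, new):
    """The usual, asymmetric percentage change."""
    return 100.0 * (new - old) / old

def sym_pct_change(old, new):
    """Symmetric percentage change: divide by the average of old and new."""
    return 100.0 * (new - old) / ((new + old) / 2.0)

print(pct_change(5, 6), pct_change(6, 5))          # about 20.0 and -16.7
print(sym_pct_change(5, 6), sym_pct_change(6, 5))  # about 18.2 and -18.2
```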

This also removes the problem of losses being bounded but gains being unbounded. Now both losses and gains are bounded, at the rather surprising value of 200%.

Formally, that’s because of these limits:
\[ \lim_{x \to \infty} \frac{x - 1}{(x + 1)/2} = 2 \]

\[ \lim_{x \to \infty} \frac{0 - x}{(x + 0)/2} = -2 \]

It might be easier to intuit these limits with an example. Suppose something explodes from a value of 1 to a value of 10,000,000. In the old system, this means it rose 1,000,000,000%. In the symmetric system, it rose 199.9999%. Like the speed of light, you can approach 200%, but never quite get there.

100% * (10^7 – 1)/(5*10^6 + 0.5) = 199.9999%

Gaining 200% in the symmetric system is gaining an infinite amount. That’s… weird, to say the least. Also, losing everything is now losing… 200%?

This is simple to explain and compute, but it’s ultimately not the best way.

The best way is to use logarithms.

As you may vaguely recall from math classes past, logarithms are the inverse of exponents.

Since 2^4 = 16, log_2 (16) = 4.

The natural logarithm ln() is the most fundamental for deep mathematical reasons I don’t have room to explain right now. It uses the base e, a transcendental number that starts 2.718281828459045…

To the uninitiated, this probably seems like an odd choice—no rational number has a natural logarithm that is itself a rational number (well, other than 1, since ln(1) = 0).

But perhaps it will seem a bit more comfortable once I show you that natural logarithms are remarkably close to percentages, particularly for the small changes in which percentages make sense.

We define something called log points such that the change in log points is 100 times the natural logarithm of the ratio of the two:

log points = 100 * ln(new / old)

This is symmetric because of the following property of logarithms:

ln(a/b) = – ln(b/a)

Let’s return to the country that saw its GDP rise from $5 trillion to $6 trillion.

The logarithmic change is 18.2 log points:

100 * ln($6 T / $5 T) = 100 * ln(1.2) = 18.2

If it falls back to $5 T, the change is -18.2 log points:

100 * ln($5 T / $6 T) = 100 * ln(0.833) = -18.2

Notice how in the symmetric percentage system, it rose and fell 18.2%; and in the logarithmic system, it rose and fell 18.2 log points. They are almost interchangeable, for small percentages.
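
And the corresponding sketch for log points:

```python
import math

def log_points(old, new):
    """Change in log points: 100 times the natural log of the ratio new/old."""
    return 100.0 * math.log(new / old)

print(log_points(5, 6))  # about  18.2
print(log_points(6, 5))  # about -18.2  (exactly the negative of the line above)
```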

In this graph, the old value is assumed to be 1. The horizontal axis is the new value, and the vertical axis is the percentage change we would report by each method.

[Figure: percentage change reported by each method, for small changes]

The green line is the usual way we measure percentages.

The red curve is the symmetric percentage method.

The blue curve is the logarithmic method.

For percentages within +/- 10%, all three methods are about the same. Then both new methods give about the same answer all the way up to changes of +/- 40%. Since most real changes in economics are within that range, the symmetric method and the logarithmic method are basically interchangeable.

However, for very large changes, even these two methods diverge, and in my opinion the logarithm is to be preferred.

[Figure: percentage change reported by each method, for large changes]

The symmetric percentage never gets above 200% or below -200%, while the logarithm is unbounded in both directions.

If you lose everything, the old system would say you have lost 100%. The symmetric system would say you have lost 200%. The logarithmic system would say you have lost infinity log points. If infinity seems a bit too extreme, think of it this way: You have in fact lost everything. No finite proportional gain can ever bring it back. A loss that requires a gain of infinity percent seems like it should be called a loss of infinity percent, doesn’t it? Under the logarithmic system it is.

If you gain an infinite amount, the old system would say you have gained infinity percent. The logarithmic system would also say that you have gained infinity log points. But the symmetric percentage system would say that you have gained 200%. 200%? Counter-intuitive, to say the least.

Log points also have another very nice property that neither the usual system nor the symmetric percentage system have: You can add them.

If you gain 25 log points, lose 15 log points, then gain 10 log points, you have gained 20 log points.

25 – 15 + 10 = 20

Just as you’d expect!

But if you gain 25%, then lose 15%, and then gain 10%, you have gained… 16.9%.

(1 + 0.25)*(1 – 0.15)*(1 + 0.10) = 1.169

If you gain 25% symmetric, lose 15% symmetric, then gain 10% symmetric, that calculation is really a pain. To find the value y that is p symmetric percentage points from the starting value x, you end up needing to solve this equation:

p = 100 * (y – x)/((x+y)/2)

This can be done; it comes out like this:

y = (200 + p)/(200 – p) * x

(This also gives a bit of insight into why it is that the bounds are +/- 200%.)

So by chaining those, we can in fact find out what happens after gaining 25%, losing 15%, then gaining 10% in the symmetric system:

(200 + 25)/(200 – 25)*(200 – 15)/(200 + 15)*(200 + 10)/(200 – 10) = 1.2228

Then we can put that back into the symmetric system:

100% * (1.2228 – 1)/((1+1.2228)/2) = 20.0%

So after all that work, we find out that you have gained 20.0% symmetric. We could almost just add them—because they are so similar to log points—but we can’t quite.
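
A short sketch comparing the three systems on this gain-25, lose-15, gain-10 example:

```python
# Log points: just add them.
print(25 - 15 + 10)  # 20

# Usual percentages: multiply the growth factors.
factor = 1.25 * 0.85 * 1.10
print(100 * (factor - 1))  # about 16.9

# Symmetric percentages: chain y = (200 + p)/(200 - p) * x, then convert back.
x = 1.0
for p in (25, -15, 10):
    x *= (200 + p) / (200 - p)
print(100 * (x - 1) / ((x + 1) / 2))  # about 20.0 -- close to 20, but not exactly
```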

Log points actually turn out to be really convenient, once you get the hang of them. The problem is that there’s a conceptual leap for most people to grasp what a logarithm is in the first place.

In particular, the hardest part to grasp is probably that a doubling is not 100 log points.

It is in fact 69 log points, because ln(2) = 0.69.

(Doubling in the symmetric percentage system is gaining 67%—much closer to the log points than to the usual percentage system.)

Calculation of the new value is a bit more difficult than in the usual system, but not as difficult as in the symmetric percentage system.

If you have a change of p log points from a starting point of x, the ending point y is:

y = e^{p/100} * x

The fact that you can add log points ultimately comes from the way exponents add:

e^{p1/100} * e^{p2/100} = e^{(p1+p2)/100}

Suppose US GDP grew 2% in 2007, then 0% in 2008, then fell 8% in 2009 and rose 4% in 2010 (this is approximately true). Where was it in 2010 relative to 2006? Who knows, right? It turns out to be a net loss of 2.4%; so if it was $15 T before it’s now $14.64 T. If you had just added, you’d think it was only down 2%; you’d have underestimated the loss by $60 billion.

But if it had grown 2 log points, then 0 log points, then fell 8 log points, then rose 4 log points, the answer is easy: It’s down 2 log points. If it was $15 T before, it’s now $14.70 T. Adding gives the correct answer this time.

Thus, instead of saying that the stock market fell 4.3%, we should say it fell 4.4 log points. Instead of saying that GDP is up 1.9%, we should say it is up 1.8 log points. For small changes it won’t even matter; if inflation is 1.4%, it is in fact also 1.4 log points. Log points are a bit harder to conceptualize; but they are symmetric and additive, which other methods are not.

Is this a matter of life and death on a global scale? No.

But I can’t write about those every day, now can I?

The power of exponential growth

JDN 2457390

There’s a famous riddle: If the water in a lakebed doubles in volume every day, and the lakebed started filling on January 1, and is half full on June 17, when will it be full?

The answer is of course June 18—if it doubles every day, it will go from half full to full in a single day.

But most people assume that half the work takes about half the time, so they usually give answers in December. Others try to correct, but don’t go far enough, and say something like October.

Human brains are programmed to understand linear processes. We expect things to come in direct proportion: If you work twice as hard, you expect to get twice as much done. If you study twice as long, you expect to learn twice as much. If you pay twice as much, you expect to get twice as much stuff.

We tend to apply this same intuition to situations where it does not belong, processes that are not actually linear but exponential. As a result, when we extrapolate the slow growth early in the process, we wildly underestimate the total growth in the long run.

For example, suppose we have two countries. Arcadia has a GDP of $100 billion per year, and they grow at 4% per year. Berkland has a GDP of $200 billion, and they grow at 2% per year. Assuming that they maintain these growth rates, how long will it take for Arcadia’s GDP to exceed Berkland’s?

If we do this intuitively, we might sort of guess that at 4% you’d add 100% in 25 years, and at 2% you’d add 100% in 50 years; so it should be something like 75 years, because then Arcadia will have added $300 billion while Berkland added $200 billion. You might even just fudge the numbers in your head and say “about a century”.

In fact, it is only 35 years. You could solve this exactly by setting (100)(1.04^x) = (200)(1.02^x); but I have an intuitive method that I think may help you to estimate exponential processes in the future.

Divide the percentage into 69. (For some numbers it’s easier to use 70 or 72; these are just approximations. The exact figure is 100*ln(2) = 69.3147…, and strictly speaking you should divide it not by the percentage p but by 100*ln(1+p/100); plot those two against each other and you’ll see why just using p works for small growth rates.) This is the time it will take to double.

So at 4%, Arcadia will double in about 17.5 years, quadrupling in 35 years. At 2%, Berkland will double in about 35 years. Thus, in 35 years, Arcadia will quadruple and Berkland will double, so their GDPs will be equal.
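
Here is a sketch of both the rule-of-69 shortcut and the exact calculation (the function name is mine):

```python
import math

def doubling_time(pct_growth):
    """Approximate doubling time in years, by the rule of 69."""
    return 69.0 / pct_growth

print(doubling_time(4), doubling_time(2))  # 17.25 and 34.5 years (roughly 17.5 and 35, as above)

# Exact crossover: solve 100 * 1.04**t = 200 * 1.02**t for t
t = math.log(2) / math.log(1.04 / 1.02)
print(t)  # about 35.7 years
```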

Economics is full of exponential processes: Compound interest is exponential, and over moderately long periods GDP and population both tend to grow exponentially. (In fact they grow logistically, which is similar to exponential until it gets very large and begins to slow down. If you smooth out our recessions, you can get a sense that since the 1940s, US GDP growth has slowed down from about 4% per year to about 2% per year.) It is therefore quite important to understand how exponential growth works.

Let’s try another one. If one account has $1 million, growing at 5% per year, and another has $1,000, growing at 10% per year, how long will it take for the second account to have more money in it?

69/5 is about 14, so the first account doubles in 14 years. 69/10 is about 7, so the second account doubles in 7 years. A factor of 1000 is about 10 doublings (2^10 = 1024), so the second account needs to have doubled 10 times more than the first account. Since it doubles twice as often, this means that it must have doubled 20 times while the other doubled 10 times. Therefore, it will take about 140 years.

In fact, the exact answer is about 148 years with annual compounding (or about 138 if the growth compounds continuously), so our quick approximation lands right in the neighborhood.
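
Here is the same kind of check for the two accounts (again my own sketch, not from the original post), under both annual and continuous compounding:

```python
# When does $1,000 growing at 10%/year overtake $1,000,000 growing at 5%/year?
import math

rule_of_69 = 20 * round(69 / 10)                                     # 20 doublings of ~7 years each
exact_annual = math.log(1_000_000 / 1_000) / math.log(1.10 / 1.05)   # annual compounding
exact_continuous = math.log(1_000_000 / 1_000) / (0.10 - 0.05)       # continuous compounding

print(rule_of_69)               # 140
print(round(exact_annual))      # about 148
print(round(exact_continuous))  # about 138
```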

This example is instructive in another way; nearly 150 years is a pretty long time, isn’t it? You can’t just assume that exponential growth is “as fast as you want it to be”. Once people realize that exponential growth is very fast, they often overcorrect, assuming that exponential growth automatically means growth that is absurdly—or arbitrarily—fast. (XKCD made a similar point in this comic.)

I think the worst examples of this mistake are among Singularitarians. They—correctly—note that computing power has become exponentially greater and cheaper over time, doubling about every 18 months, which has been dubbed Moore’s Law. They assume that this will continue into the indefinite future (this is already problematic; the growth rate seems to be already slowing down). And therefore they conclude there will be a sudden moment, a technological singularity, at which computers will suddenly outstrip humans in every way and bring about a new world order of artificial intelligence basically overnight. They call it a “hard takeoff”; here’s a direct quote:

But many thinkers in this field including Nick Bostrom and Eliezer Yudkowsky worry that AI won’t work like this at all. Instead there could be a “hard takeoff”, a huge subjective discontinuity in the function mapping AI research progress to intelligence as measured in ability-to-get-things-done. If on January 1 you have a toy AI as smart as a cow, one which can identify certain objects in pictures and navigate a complex environment, and on February 1 it’s proved the Riemann hypothesis and started building a ring around the sun, that was a hard takeoff.

Wait… what? For someone like me who understands exponential growth, the last part is a baffling non sequitur. If computers start half as smart as us and double every 18 months, in 18 months, they will be as smart as us. In 36 months, they will be twice as smart as us. Twice as smart as us literally means that two people working together perfectly can match them—certainly a few dozen working realistically can. We’re not in danger of total AI domination from that. With millions of people working against the AI, we should be able to keep up with it for at least another 30 years. So are you assuming that this trend is continuing or not? (Oh, and by the way, we’ve had AIs that can identify objects and navigate complex environments for a couple years now, and so far, no ringworld around the Sun.)

That same essay makes a biological argument, which misunderstands human evolution in a way that is surprisingly subtle yet ultimately fundamental:

If you were to come up with a sort of objective zoological IQ based on amount of evolutionary work required to reach a certain level, complexity of brain structures, etc, you might put nematodes at 1, cows at 90, chimps at 99, homo erectus at 99.9, and modern humans at 100. The difference between 99.9 and 100 is the difference between “frequently eaten by lions” and “has to pass anti-poaching laws to prevent all lions from being wiped out”.

No, actually, what makes humans what we are is not that we are 1% smarter than chimpanzees.

First of all, we’re actually more like 200% smarter than chimpanzees, measured by encephalization quotient; they clock in at 2.49 while we hit 7.44. If you simply measure by raw volume, they have about 400 mL to our 1300 mL, so again roughly 3 times as big. But that’s relatively unimportant; with Moore’s Law, tripling only takes about 2.5 years.

But even having triple the brain power is not what makes humans different. It was a necessary condition, but not a sufficient one. Indeed, it was so insufficient that for about 200,000 years we had brains just as powerful as we do now and yet we did basically nothing in technological or economic terms—total, complete stagnation on a global scale. (And 200,000 years is a conservative estimate of how long we have had brains of the same size and structure as we have today.)

What makes humans what we are? Cooperation. We are what we are because we are together.

The capacity of human intelligence today is not 1300 mL of brain. It’s more like 1.3 gigaliters of brain, where a gigaliter, a billion liters, is about the volume of the Empire State Building. We have the intellectual capacity we do not because we are individually geniuses, but because we have built institutions of research and education that combine, synthesize, and share the knowledge of billions of people who came before us. Isaac Newton didn’t understand the world as well as the average 21st-century third-grader does today. Does the third-grader have more brain? Of course not. But they absolutely do have more knowledge.

(I recently finished my first playthrough of Legacy of the Void, in which a central point concerns whether the Protoss should detach themselves from the Khala, a psychic union which combines all their knowledge and experience into one. I won’t spoil the ending, but let me say this: I can understand their hesitation, for it is basically our equivalent of the Khala—first literacy, and now the Internet—that has made us what we are. It would no doubt be the Khala that made them what they are as well.)

Is AI still dangerous? Absolutely. There are all sorts of damaging effects AI could have, culturally, economically, militarily—and some of them are already beginning to happen. I even agree with the basic conclusion of that essay that OpenAI is a bad idea, because the cost of making AI available to people who will abuse it, or who will build a version that is dangerous, outweighs the benefit of making AI available to everyone. But exponential growth not only isn’t the same thing as instantaneous takeoff, it isn’t even compatible with it.

The next time you encounter an example of exponential growth, try this. Don’t just fudge it in your head, don’t overcorrect and assume everything will be fast—just divide the percentage into 69 to see how long it will take to double.

What does correlation have to do with causation?

JDN 2457345

I’ve been thinking of expanding the topics of this blog into some basic statistics and econometrics. It has been said that there are “Lies, damn lies, and statistics”; but in fact it’s almost the opposite—there are truths, whole truths, and statistics. Almost everything in the world that we know—not merely guess, or suppose, or intuit, or believe, but actually know, with a quantifiable level of certainty—is done by means of statistics. All sciences are based on them, from physics (when they say the Higgs discovery is a “5-sigma event”, that’s a statistic) to psychology, ecology to economics. Far from being something we cannot trust, they are in a sense the only thing we can trust.

The reason it sometimes feels like we cannot trust statistics is that most people do not understand statistics very well; this creates opportunities for both accidental confusion and willful distortion. My hope is therefore to provide you with some of the basic statistical knowledge you need to combat the worst distortions and correct the worst confusions.

I wasn’t quite sure where to start on this quest, but I suppose I have to start somewhere. I figured I may as well start with an adage about statistics that I hear commonly abused: “Correlation does not imply causation.”

Taken at its original meaning, this is definitely true. Unfortunately, it can be easily abused or misunderstood.

In its original meaning, where “imply” is used in the formal sense of logical implication, to “imply” something is an extremely strong statement. It means that the premise logically entails the result: if the antecedent is true, the consequent must be true, on pain of logical contradiction. Logical implication is for most practical purposes synonymous with mathematical proof. (Unfortunately, it’s not quite synonymous, because of things like Gödel’s incompleteness theorems and Löb’s theorem.)

And indeed, correlation does not logically entail causation; it’s quite possible to have correlations without any causal connection whatsoever, simply by chance. One of my former professors liked to brag that from 1990 to 2010 whether or not she ate breakfast had a statistically significant positive correlation with that day’s closing price for the Dow Jones Industrial Average.

How is this possible? Did my professor actually somehow influence the stock market by eating breakfast? Of course not; if she could do that, she’d be a billionaire by now. And obviously the Dow’s price at 17:00 couldn’t influence whether she ate breakfast at 09:00. Could there be some common cause driving both of them, like the weather? I guess it’s possible; maybe in good weather she gets up earlier and people are in better moods so they buy more stocks. But the most likely reason for this correlation is much simpler than that: She tried a whole bunch of different combinations until she found two things that correlated. At the usual significance level of 0.05, on average you need to try about 20 combinations of totally unrelated things before one pair shows up as correlated. (My guess is she used a number of different stock indexes and varied the starting and ending year. That’s a way to generate a surprisingly large number of degrees of freedom without it seeming like you’re doing anything particularly nefarious.)
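
If you want to see just how easy it is to “find” such a correlation, here is a minimal simulation (my own illustration; the breakfast and Dow figures below are just random numbers) of running about 20 unrelated comparisons:

```python
# Simulating why ~20 unrelated comparisons tend to yield one "significant" result.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2458062)
n_days = 250        # hypothetical number of trading days
n_tries = 20        # number of unrelated pairings we test

hits = 0
for _ in range(n_tries):
    ate_breakfast = rng.integers(0, 2, size=n_days)       # random 0/1 "breakfast" data
    closing_price = rng.normal(17000, 200, size=n_days)   # random "index" prices
    r, p = pearsonr(ate_breakfast, closing_price)
    if p < 0.05:
        hits += 1

print(f"{hits} of {n_tries} unrelated pairs looked 'significant' at p < 0.05")
# On average this prints about 1, purely by chance.
```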

But how do we know they aren’t actually causally related? Well, I suppose we don’t. Especially if the universe is ultimately deterministic and nonlocal (as I’ve become increasingly convinced by the results of recent quantum experiments), any two data sets could be causally related somehow. But the point is they don’t have to be; you can pick any randomly-generated datasets, pair them up in 20 different ways, and odds are, one of those ways will show a statistically significant correlation.

All of that is true, and important to understand. Finding a correlation between eating grapefruit and getting breast cancer, or between liking bitter foods and being a psychopath, does not necessarily mean that there is any real causal link between the two. If we can replicate these results in a bunch of other studies, that would suggest that the link is real; but typically, such findings cannot be replicated. There is something deeply wrong with the way science journalists operate; they like to publish the new and exciting findings, which 9 times out of 10 turn out to be completely wrong. They never want to talk about the really important and fascinating things that we know are true because we’ve been confirming them over hundreds of different experiments, because that’s “old news”. The journalistic desire to be new and first fundamentally contradicts the scientific requirement of being replicated and confirmed.

So, yes, it’s quite possible to have a correlation that tells you absolutely nothing about causation.

But this is exceptional. In most cases, correlation actually tells you quite a bit about causation.

And this is why I don’t like the adage; “imply” has a very different meaning in common speech, meaning merely to suggest or evoke. Almost everything you say implies all sorts of things in this broader sense, some more strongly than others, even though it may logically entail none of them.

Correlation does in fact suggest causation. Like any suggestion, it can be overridden. If we know that 20 different combinations were tried until one finally yielded a correlation, we have reason to distrust that correlation. If we find a correlation between A and B but there is no logical way they can be connected, we infer that it is simply an odd coincidence.

But when we encounter any given correlation, there are three other scenarios which are far more likely than mere coincidence: A causes B, B causes A, or some other factor C causes A and B. These are also not mutually exclusive; they can all be true to some extent, and in many cases are.

A great deal of work in science, and particularly in economics, is based upon using correlation to infer causation, and has to be—because there is simply no alternative means of approaching the problem.

Yes, sometimes you can do randomized controlled experiments, and some really important new findings in behavioral economics and development economics have been made this way. Indeed, much of the work that I hope to do over the course of my career is based on randomized controlled experiments, because they truly are the foundation of scientific knowledge. But sometimes, that’s just not an option.

Let’s consider an example: In my master’s thesis I found a strong correlation between the level of corruption in a country (as estimated by the World Bank) and the proportion of that country’s income which goes to the top 0.01% of the population. Countries that have higher levels of corruption also tend to have a larger proportion of income that accrues to the top 0.01%. That correlation is a fact; it’s there. There’s no denying it. But where does it come from? That’s the real question.

Could it be pure coincidence? Well, maybe; but when it keeps showing up in several different models with different variables included, that becomes unlikely. A single p < 0.05 will happen about 1 in 20 times by chance; but five in a row should happen only about 1 in 3 million times (assuming they’re independent, which, to be fair, they usually aren’t).

Could it be some artifact of the measurement methods? It’s possible. In particular, I was concerned about the possibility of a halo effect, in which people tend to assume that something which is better (or worse) in one way is automatically better (or worse) in other ways as well. People might think of their country as more corrupt simply because it has higher inequality, even if there is no real connection. But it would have taken a very large halo bias to explain this effect.

So, does corruption cause income inequality? It’s not hard to see how that might happen: More corrupt individuals could bribe leaders or exploit loopholes to make themselves extremely rich, and thereby increase inequality.

Does inequality cause corruption? This also makes some sense, since it’s a lot easier to bribe leaders and manipulate regulations when you have a lot of money to work with in the first place.

Does something else cause both corruption and inequality? Also quite plausible. Maybe some general cultural factors are involved, or certain economic policies lead to both corruption and inequality. I did try to control for such things, but I obviously couldn’t include all possible variables.

So, which way does the causation run? Unfortunately, I don’t know. I tried some clever statistical techniques to try to figure this out; in particular, I looked at which tends to come first—the corruption or the inequality—and whether they could be used to predict each other, a method called Granger causality. Those results were inconclusive, however. I could neither verify nor exclude a causal connection in either direction. But is there a causal connection? I think so. It’s too robust to just be coincidence. I simply don’t know whether A causes B, B causes A, or C causes A and B.
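
For readers curious what such a test looks like in practice, here is a hedged sketch (emphatically not my thesis code; the series names and data below are made up) using the Granger-causality test in statsmodels on synthetic data:

```python
# A purely illustrative Granger-causality check on synthetic time series.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
T = 200
corruption = np.cumsum(rng.normal(size=T))                  # fake "corruption" index
inequality = np.empty(T)                                    # fake "top income share"
inequality[0] = 0.0
inequality[1:] = 0.5 * corruption[:-1] + rng.normal(size=T - 1)  # built-in lagged link

# grangercausalitytests checks whether lags of the SECOND column help predict
# the FIRST column beyond its own lags. Here: does corruption Granger-cause
# inequality? It prints F-tests for each lag and returns a dict of results,
# e.g. results[1][0]["ssr_ftest"] for the lag-1 F statistic and p-value.
data = np.column_stack([inequality, corruption])
results = grangercausalitytests(data, maxlag=2)
```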

Imagine trying to do this same study as a randomized controlled experiment. Are we supposed to create two societies and flip a coin to decide which one we make more corrupt? Or which one we give more income inequality? Perhaps you could do some sort of experiment with a proxy for corruption (cheating on a test or something like that), and then have unequal payoffs in the experiment—but that is very far removed from how corruption actually works in the real world, and worse, it’s prohibitively expensive to make really life-altering income inequality within an experimental context. Sure, we can give one participant $1 and the other $1,000; but we can’t give one participant $10,000 and the other $10 million, and it’s the latter that we’re really talking about when we deal with real-world income inequality. I’m not opposed to doing such an experiment, but it can only tell us so much. At some point you need to actually test the validity of your theory in the real world, and for that we need to use statistical correlations.

Or think about macroeconomics; how exactly are you supposed to test a theory of the business cycle experimentally? I guess theoretically you could subject an entire country to a new monetary policy selected at random, but the consequences of being put into the wrong experimental group would be disastrous. Moreover, nobody is going to accept a random monetary policy democratically, so you’d have to introduce it against the will of the population, by some sort of tyranny or at least technocracy. Even if this is theoretically possible, it’s mind-bogglingly unethical.

Now, you might be thinking: But we do change real-world policies, right? Couldn’t we use those changes as a sort of “experiment”? Yes, absolutely; that’s called a quasi-experiment or a natural experiment. They are tremendously useful. But since they are not truly randomized, they aren’t quite experiments. Ultimately, everything you get out of a quasi-experiment is based on statistical correlations.

Thus, abuse of the adage “Correlation does not imply causation” can lead to ignoring whole subfields of science, because there is no realistic way of running experiments in those subfields. Sometimes, statistics are all we have to work with.

This is why I like to say it a little differently:

Correlation does not prove causation. But correlation definitely can suggest causation.