On foxes and hedgehogs, part I

Aug 3 JDN 2460891

Today I finally got around to reading Expert Political Judgment by Philip E. Tetlock, more or less in a single sitting because I’ve been sick the last week with some pretty tight limits on what activities I can do. (It’s mostly been reading, watching TV, or playing video games that don’t require intense focus.)

It’s really an excellent book. I now understand why it came so highly recommended to me, and I pass that recommendation on to you: Read it.

The central thesis of the book really boils down to three propositions:

  1. Human beings, even experts, are very bad at predicting political outcomes.
  2. Some people, who use an open-minded strategy (called “foxes”), perform substantially better than other people, who use a more dogmatic strategy (called “hedgehogs”).
  3. When rewarding predictors with money, power, fame, prestige, and status, human beings systematically favor (over)confident “hedgehogs” over (correctly) humble “foxes”.

I decided I didn’t want to make this post about current events, but I think you’ll probably agree with me when I say:

That explains a lot.

How did Tetlock determine this?

Well, he studies the issue several different ways, but the core experiment that drives his account is actually a rather simple one:

  1. He gathered a large group of subject-matter experts: Economists, political scientists, historians, and area-studies professors.
  2. He came up with a large set of questions about politics, economics, and similar topics, which could all be formulated as a set of probabilities: “How likely is this to get better/get worse/stay the same?” (For example, this was in the 1980s, so he asked about the fate of the Soviet Union: “By 1990, will they become democratic, remain as they are, or collapse and fragment?”)
  3. Each respondent answered a subset of the questions, some about their own particular field, some about another, more distant field; they assigned probabilities on an 11-point scale, from 0% to 100% in increments of 10%.
  4. A few years later, he compared the predictions to the actual results, scoring them using a Brier score, which penalizes you for assigning high probability to things that didn’t happen or low probability to things that did happen.
  5. He compared the resulting scores between people with different backgrounds, on different topics, with different thinking styles, and a variety of other variables. He also benchmarked them using some automated algorithms like “always say 33%” and “always give ‘stay the same’ 100%”.
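To make the scoring step concrete, here is a minimal Python sketch of one common form of the Brier score for a three-outcome question. The outcome labels, the five sample outcomes, and the "hedged" forecaster are made up for illustration; they are not Tetlock’s data, and his exact scoring variant may differ in scaling.

```python
def brier_score(forecast, outcome_index):
    """Sum of squared errors between the forecast and what happened.

    0 is a perfect score; 2 is the worst possible (100% on a wrong outcome).
    """
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


def mean_brier(forecasts, outcomes):
    """Average Brier score over a set of questions."""
    scores = [brier_score(f, o) for f, o in zip(forecasts, outcomes)]
    return sum(scores) / len(scores)


# Hypothetical results for five questions: 0 = worse, 1 = same, 2 = better.
outcomes = [1, 1, 2, 1, 0]

# Two "mindless" benchmarks and one hypothetical cautious forecaster.
always_third = [[1 / 3, 1 / 3, 1 / 3]] * len(outcomes)
always_same = [[0.0, 1.0, 0.0]] * len(outcomes)
hedged = [[0.2, 0.6, 0.2]] * len(outcomes)

score_third = mean_brier(always_third, outcomes)
score_same = mean_brier(always_same, outcomes)
score_hedged = mean_brier(hedged, outcomes)
```

Lower is better here. On this toy data the overconfident "always stay the same" rule scores worse than the uniform guess, because each confident miss costs the full 2 points.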

I’ll show you the key results of that analysis momentarily, but to help it make more sense to you, let me elaborate a bit more on the “foxes” and “hedgehogs”. The notion was first popularized by Isaiah Berlin in an essay called, simply, The Hedgehog and the Fox.

“The fox knows many things, but the hedgehog knows one very big thing.”

That is, someone who reasons as a “fox” combines ideas from many different sources and perspectives, and tries to weigh them all together into some sort of synthesis that then yields a final answer. This process is messy and complicated, and rarely yields high confidence about anything.

Whereas, someone who reasons as a “hedgehog” has a comprehensive theory of the world, an ideology, that provides clear answers to almost any possible question, with the surely minor, insubstantial flaw that those answers are not particularly likely to be correct.

He also considered “hedge-foxes” (people who are mostly fox but also a little bit hedgehog) and “fox-hogs” (people who are mostly hedgehog but also a little bit fox).

Tetlock has decomposed the scores into two components: calibration and discrimination. (Both very overloaded words, but they are standard in the literature.)

Calibration is how well your stated probabilities matched up with the actual probabilities; that is, if you predicted 10% probability on 20 different events, you have very good calibration if precisely 2 of those events occurred, and very poor calibration if 18 of those events occurred.

Discrimination more or less describes how useful your predictions are, what information they contain above and beyond the simple base rate. If you just assign equal probability to all events, you probably will have reasonably good calibration, but you’ll have zero discrimination; whereas if you somehow managed to assign 100% to everything that happened and 0% to everything that didn’t, your discrimination would be perfect (and we would have to find out how you cheated, or else declare you clairvoyant).

For both measures, higher is better. The ideal for each is 100%, but it’s virtually impossible to get 100% discrimination and actually not that hard to get 100% calibration if you just use the base rates for everything.
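Here is a minimal Python sketch of these two components, using the standard Murphy decomposition of the Brier score for yes/no events. This is an assumption on my part about the general technique; the indices Tetlock plots may be scaled differently. In this form the calibration term is a penalty (0 is perfect), while the discrimination term is a reward (larger is better). The function name and sample data are illustrative.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Split a binary Brier score into (calibration, discrimination, uncertainty).

    probs: stated probabilities; outcomes: 1 if the event happened, else 0.
    Brier score = calibration - discrimination + uncertainty.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n

    # Group outcomes by the probability that was stated for them.
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)

    calibration = sum(
        len(os) * (p - sum(os) / len(os)) ** 2 for p, os in groups.items()
    ) / n
    discrimination = sum(
        len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return calibration, discrimination, uncertainty


# The example from the text: 10% stated on 20 events.
good = murphy_decomposition([0.1] * 20, [1] * 2 + [0] * 18)   # 2 occurred
bad = murphy_decomposition([0.1] * 20, [1] * 18 + [0] * 2)    # 18 occurred

# A "clairvoyant" who puts 100% on what happens and 0% on what doesn't.
clairvoyant = murphy_decomposition([1.0, 1.0, 0.0, 0.0, 0.0], [1, 1, 0, 0, 0])
```

Both of the 10% forecasters have zero discrimination, since they say the same thing about every event; only the clairvoyant’s discrimination reaches its ceiling, which is the uncertainty term.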


There is a bit of a tradeoff between these two: It’s not too hard to get reasonably good calibration if you just never go out on a limb, but then your predictions aren’t as useful; we could have mostly just guessed them from the base rates.

On the graph, you’ll see downward-sloping lines that are meant to represent this tradeoff: Two prediction methods that would yield the same overall score but different levels of calibration and discrimination will be on the same line. In a sense, two points on the same line are equally good methods that prioritize usefulness over accuracy differently.

All right, let’s see the graph at last:

The pattern is quite clear: The more foxy you are, the better you do, and the more hedgehoggy you are, the worse you do.

I’d also like to point out the other two regions here: “Mindless competition” and “Formal models”.

The former includes really simple algorithms like “always return 33%” or “always give ‘stay the same’ 100%”. These perform shockingly well. The most sophisticated of these, “case-specific extrapolation” (35 and 36 on the graph, which basically assumes that each country will continue doing what it’s been doing) actually performs as well as, if not better than, even the foxes.

And what’s that at the upper-right corner, absolutely dominating the graph? That’s “Formal models”. This describes basically taking all the variables you can find and shoving them into a gigantic logit model, and then outputting the result. It’s computationally intensive and requires a lot of data (hence why he didn’t feel like it deserved to be called “mindless”), but it’s really not very complicated, and it’s the best prediction method, in every way, by far.

This has made me feel quite vindicated about a weird nerd thing I do: When I have a big decision to make (especially a financial decision), I create a spreadsheet and assemble a linear utility model to determine which choice will maximize my utility, under different parameterizations based on my past experiences. Whichever result seems to win the most robustly, I choose. This is fundamentally similar to the “formal models” prediction method, where the thing I’m trying to predict is my own happiness. (It’s a bit less formal, actually, since I don’t have detailed happiness data to feed into the regression.) And it has worked for me, astonishingly well. It definitely beats going by my own gut. I highly recommend it.
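For anyone who wants to try this outside a spreadsheet, here is a minimal Python sketch of the general idea: score each option with a linear utility under several sets of weights, then see which option wins most often. The options, attribute scores, and weights are entirely hypothetical placeholders, not my actual numbers.

```python
from collections import Counter

# Hypothetical options, each scored 0-10 on a few attributes.
options = {
    "option_a": {"cost": 8, "comfort": 4, "location": 6},
    "option_b": {"cost": 5, "comfort": 8, "location": 7},
}

# Hypothetical parameterizations: different guesses at how much each
# attribute matters. Each set of weights sums to 1.
scenarios = [
    {"cost": 0.5, "comfort": 0.3, "location": 0.2},
    {"cost": 0.3, "comfort": 0.5, "location": 0.2},
    {"cost": 0.2, "comfort": 0.3, "location": 0.5},
]


def linear_utility(scores, weights):
    """Weighted sum of attribute scores."""
    return sum(weights[name] * value for name, value in scores.items())


def winner(weights):
    """The option with the highest utility under one set of weights."""
    return max(options, key=lambda name: linear_utility(options[name], weights))


wins = Counter(winner(weights) for weights in scenarios)
most_robust = wins.most_common(1)[0][0]
```

The point of running several parameterizations is robustness: an option that wins only under one particular set of weights is a much shakier choice than one that wins under most of them.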

What does this mean?

Well first of all, it means humans suck at predicting things. At least for this data set, even our experts don’t perform substantially better than mindless models like “always assume the base rate”.

Nor do experts perform much better in their own fields than in other fields, though they do all perform better than undergrads or random people (who somehow perform worse than the “mindless” models).

But Tetlock also investigates further, trying to better understand this “fox/hedgehog” distinction and why it yields different performance. He really bends over backwards to try to redeem the hedgehogs, in the following ways:

  1. He allows them to make post-hoc corrections to their scores, based on “value adjustments” (assigning higher probability to events that would be really important) and “difficulty adjustments” (assigning higher scores to questions where the three outcomes were close to equally probable) and “fuzzy sets” (giving some leeway on things that almost happened or things that might still happen later).
  2. He demonstrates a different, related experiment, in which certain manipulations can cause foxes to perform a lot worse than they normally would, and even yield really crazy results like probabilities that add up to 200%.
  3. He has a whole chapter that is a Socratic dialogue (seriously!) between four voices: A “hardline neopositivist”, a “moderate neopositivist”, a “reasonable relativist”, and an “unrelenting relativist”; and all but the “hardline neopositivist” agree that there is some legitimate place for the sort of post hoc corrections that the hedgehogs make to keep themselves from looking so bad.

This post is already getting a bit long, so that will conclude part I. Stay tuned for part II, next week!

A new theoretical model of co-ops

Mar 30 JDN 2460765

A lot of economists seem puzzled by the fact that co-ops are just as efficient as corporate firms, since they have this idea that profit-sharing inevitably results in lower efficiency due to perverse incentives.

I think they’ve been modeling co-ops wrong. Here I present a new model, a very simple one, with linear supply and demand curves. Of course one could make a more sophisticated model, but this should be enough to make the point (and this is just a blog post, not a research paper, after all).

Demand curve is p = a – b q

Marginal cost is f q

There are n workers, who would hold equal shares of the co-op.

Competitive market

First, let’s start with the traditional corporate firm in a competitive market.

Since the market is competitive, price would equal marginal cost would equal wage:

a – b q = f q

q = a/(b+f)

w = f (a/(b+f)) = (a f)/(b+f)

Total profit will be

(p – w)q = 0.

Monopoly firm

In a monopoly, marginal revenue would equal marginal cost:

d[pq]/dq = a – 2 b q

If they are also a monopsonist in the labor market, this marginal cost would be marginal cost of labor, not wage:

d[f q^2]/dq = 2 f q

a – 2 b q = 2 f q

q = a/(2b + 2f)

p = a – b q = a (1 – b/(2b + 2f)) = (a (b + 2f))/(2b + 2f)

w = f q = (a f)/(2b + 2f)

Total profit will be

(p – w) q = ((a (b + 2f))/(2b + 2f) – (a f)/(2b + 2f)) a/(2b + 2f) = a^2/(4b + 4f)
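As a sanity check on the algebra for the competitive and monopoly cases, here is a minimal Python sketch that plugs numbers into the model, with demand p = a – b q and marginal cost f q. The parameter values are arbitrary illustrations, and exact fractions are used to avoid rounding. Simplifying the profit expression by hand gives a^2/(4(b + f)), which the code compares against.

```python
from fractions import Fraction


def competitive_firm(a, b, f):
    """Price = marginal cost = wage, on the demand curve p = a - b q."""
    q = Fraction(a) / (b + f)
    p = a - b * q
    w = f * q
    return {"q": q, "p": p, "w": w, "profit": (p - w) * q}


def monopoly_monopsony_firm(a, b, f):
    """Marginal revenue a - 2bq equals marginal cost of labor 2fq."""
    q = Fraction(a) / (2 * b + 2 * f)
    p = a - b * q
    w = f * q
    return {"q": q, "p": p, "w": w, "profit": (p - w) * q}


# Arbitrary illustrative parameters.
a, b, f = 12, 1, 2

comp = competitive_firm(a, b, f)
mono = monopoly_monopsony_firm(a, b, f)

# Closed form for the monopoly profit, simplified by hand.
closed_form_profit = Fraction(a * a, 4 * (b + f))
```

With these numbers the competitive firm produces 4 units at a price of 8 and earns zero profit, while the monopolist-monopsonist produces only 2 units at a price of 10, pays a wage of 4, and earns a profit of 12.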

Now consider the co-op.

First, suppose that instead of working for a wage, I work for profit sharing.

If our product market is competitive, we’ll be price-takers, and we will produce until price equals marginal cost:

p = f q

a – b q = f q

q = a/(b+f)

But will we, really? I only get 1/n share of the profits. So let’s see here. My marginal cost of production is still f q, but the marginal benefit I get from more sales may only be p/n.

In that case I would work until:

p/n = f q

(a – b q)/n = fq

a – b q = n f q

q = (a/(b+nf))

Thus I would under-produce. This is the usual argument against co-ops and similar shared ownership.
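To see how quickly this under-production would grow with the number of worker-owners, here is a minimal Python sketch of the formula q = a/(b + n f). The parameter values are arbitrary illustrations.

```python
from fractions import Fraction


def profit_sharing_output(a, b, f, n):
    """Output when each of n worker-owners works until p/n = f q,
    on the demand curve p = a - b q."""
    return Fraction(a) / (b + n * f)


# Arbitrary illustrative parameters.
a, b, f = 12, 1, 2

competitive_q = Fraction(a, b + f)
outputs = {n: profit_sharing_output(a, b, f, n) for n in (1, 2, 5, 10)}
```

With a single owner (n = 1) the output matches the competitive level, and it shrinks steadily as n grows. That shrinkage is the whole force of the usual argument.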

Co-ops with wages

But that’s not actually how co-ops work. They pay wages. Why do they do that? Well, consider what happens if I am offered a wage as a worker-owner of the co-op.

Is there any reason for the co-op to vote on a wage that is less than the competitive market? No, because owners are workers, so any additional profit from a lower wage would simply be taken from their own wages.

Is there any reason for the co-op to vote on a wage that is more than the competitive market? No, because workers are owners, and any surplus lost by paying higher wages would simply be taken from their own profits.

So if the product market is competitive, the co-op will produce the same amount and charge the same price as a firm in perfect competition, even if they have market power over their own wages.

Monopoly co-ops

The argument above doesn’t assume that the co-op has no market power in the labor market. Thus if they are a monopoly in the product market and a monopsony in the labor market, they still pay a competitive wage.

Thus they would set marginal revenue equal to marginal cost:

a – 2 b q = f q

q = a/(2b + f)

The co-op will produce more than the monopoly firm.

This is the new price:

p = a – b q = a(1 – b/(2b+f)) = a(b+f)/(2b + f)

It’s not obvious that this is lower than the price charged by the monopoly firm, but it is.

(a (b + 2f))/(2b + 2f) – a(b+f)/(2b + f) = (a (2b + f)(b + 2f) – 2 a(b+f)^2)/(2(b+f)(2b+f))

This is proportional to:

(2b + f)(b + 2f) – 2(b+f)^2

2b^2 + 5bf + 2f^2 – (2b^2 + 4bf + 2f^2) = bf

So it’s not a large difference, but it’s there. In the presence of market power in the labor market, the co-op is better for consumers, because they get more goods and pay a lower price.
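Here is a minimal Python sketch that checks this comparison numerically, using the same demand curve p = a – b q and marginal cost f q. The parameter values are arbitrary illustrations. Putting the constant factors back into the algebra gives a price gap of a b f/(2(b + f)(2b + f)), which the code compares against.

```python
from fractions import Fraction


def corporate_monopoly(a, b, f):
    """Monopolist-monopsonist: a - 2bq = 2fq."""
    q = Fraction(a) / (2 * b + 2 * f)
    return q, a - b * q


def coop_monopoly(a, b, f):
    """Co-op monopoly paying the competitive wage: a - 2bq = fq."""
    q = Fraction(a) / (2 * b + f)
    return q, a - b * q


# Arbitrary illustrative parameters.
a, b, f = 12, 1, 2

q_corp, p_corp = corporate_monopoly(a, b, f)
q_coop, p_coop = coop_monopoly(a, b, f)

price_gap = p_corp - p_coop
closed_form_gap = Fraction(a * b * f, 2 * (b + f) * (2 * b + f))
```

With these numbers the co-op sells 3 units at a price of 9, against the corporate monopoly’s 2 units at a price of 10.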

Thus, there is actually no lost efficiency from being a co-op. There is simply much lower inequality, and potentially higher efficiency.

But that’s just in theory.

What do we see in practice?

Exactly that.

Co-ops have the same productivity and efficiency as corporate firms, but they pay higher wages, provide better benefits, and offer collateral benefits to their communities. In fact, they are sometimes more efficient than corporate firms.

Since they’re just as efficient—if not more so—and produce much lower inequality, switching more firms over to co-ops would clearly be a good thing.

Why, then, aren’t co-ops more common?

Because the people who have the money don’t like them.

The biggest barrier facing co-ops is their inability to get financing, because they don’t pay shareholders (so no IPOs) and banks don’t like to lend to them. They tend to make less profit than corporate firms, which offers investors a lower return—instead that money goes to the worker-owners. This lower return isn’t due to inefficiency; it’s just a different distribution of income, more to labor and less to capital.

We will need new financial institutions to support co-ops, such as the Cooperative Fund of New England. And general redistribution of wealth would also help, because if middle class people had more wealth they could afford to finance co-ops. (It would also be good for many other reasons, of course.)

How to detect discrimination, empirically

Aug 25 JDN 2460548

For concreteness, I’ll use men and women as my example, though the same principles would apply for race, sexual orientation, and so on. Suppose we find that there are more men than women in a given profession; does this mean that women are being discriminated against?

Not necessarily. Maybe women are less interested in that kind of work, or innately less qualified. Is there a way we can determine empirically that it really is discrimination?

It turns out that there is. All we need is a reliable measure of performance in that profession. Then, we compare performance between men and women, and that comparison can tell us whether discrimination is happening or not. The key insight is that workers in a job are not a random sample; they are a selected sample. The results of that selection can tell us whether discrimination is happening.

Here’s a simple model to show how this works.

Suppose there are five different skill levels in the job, from 1 to 5 where 5 is the most skilled. And suppose there are 5 women and 5 men in the population.

1. Baseline

The baseline case to consider is when innate talents are equal and there is no discrimination. In that case, we should expect men and women to be equally represented in the profession.

For the simplest case, let’s say that there is one person at each skill level:

Men  Women
1    1
2    2
3    3
4    4
5    5

Now suppose that everyone above a certain skill threshold gets hired. Since we’re assuming no discrimination, the threshold should be the same for men and women. Let’s say it’s 3; then these are the people who get hired:

Hired Men  Hired Women
3          3
4          4
5          5

The result is that not only are there the same number of men and women in the job, their skill levels are also the same. There are just as many highly-competent men as highly-competent women.

2. Innate Differences

Now, suppose there is some innate difference in talent between men and women for this job. For most jobs this seems suspicious, but consider pro sports: Men really are better at basketball, in general, than women, and this is pretty clearly genetic. So it’s not absurd to suppose that for at least some jobs, there might be some innate differences. What would that look like?


Again suppose a population of 5 men and 5 women, but now the women are a bit less qualified: There are two 1s and no 5s among the women.

Men  Women
1    1
2    1
3    2
4    3
5    4

Then, this is the group that will get hired:

Hired Men  Hired Women
3          3
4          4
5          -

The result will be fewer women who are on average less qualified. The most highly-qualified individuals at that job will be almost entirely men. (In this simple model, entirely men; but you can easily extend it so that there are a few top-qualified women.)

This is in fact what we see for a lot of pro sports; in a head-to-head match, even the best WNBA teams would generally lose against most NBA teams. That’s what it looks like when there are real innate differences.

But it’s hard to find clear examples outside of sports. The genuine, large differences in size and physical strength between the sexes just don’t seem to be associated with similar differences in mental capabilities or even personality. You can find some subtler effects, but nothing very large—and certainly nothing large enough to explain the huge gender gaps in various industries.

3. Discrimination

What does it look like when there is discrimination?

Now assume that men and women are equally qualified, but it’s harder for women to get hired, because of discrimination. The key insight here is that this amounts to women facing a higher threshold. Where men only need to have level 3 competence to get hired, women need level 4.

So if the population looks like this:

Men  Women
1    1
2    2
3    3
4    4
5    5

The hired employees will look like this:

Hired Men  Hired Women
3          -
4          4
5          5

Once again we’ll have fewer women in the profession, but they will be on average more qualified. The top-performing individuals will be as likely to be women as they are to be men, while the lowest-performing individuals will be almost entirely men.
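The three cases so far can be reproduced with a minimal Python sketch of the threshold model. The function names are illustrative, and the populations are the toy ones from the tables.

```python
def hire(population, threshold):
    """Everyone at or above the skill threshold gets hired."""
    return [skill for skill in population if skill >= threshold]


def mean(values):
    return sum(values) / len(values)


equal_talent = [1, 2, 3, 4, 5]
less_talent = [1, 1, 2, 3, 4]   # the women in the innate-difference case

men = hire(equal_talent, 3)     # the same in every case

cases = {
    "baseline": hire(equal_talent, 3),
    "innate_difference": hire(less_talent, 3),
    "discrimination": hire(equal_talent, 4),   # higher bar for women
}

# For each case: (number of women hired, women's mean skill minus men's).
summary = {
    name: (len(women), mean(women) - mean(men))
    for name, women in cases.items()
}
```

Innate difference and discrimination both leave two women against three men, so headcount alone cannot tell them apart. The sign of the skill gap can: it is negative under innate difference and positive under discrimination.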

This is the kind of pattern we observe when there is discrimination. Do we see it in real life?

Yes, we see it all the time.

Corporations with women CEOs are more profitable.

Women doctors have better patient outcomes.

Startups led by women are more likely to succeed.

This shows that there is some discrimination happening, somewhere in the process. Does it mean that individual firms are actively discriminating in their hiring process? No, it doesn’t. The discrimination could be happening somewhere else; maybe it happens during education, or once women get hired. Maybe it’s a product of sexism in society as a whole, that isn’t directly under the control of employers. But it must be in there somewhere. If women are both rarer and more competent, there must be some discrimination going on.

What if there is also innate difference? We can detect that too!

4. Both

Suppose now that men are on average more talented, but there is also discrimination against women. Then the population might look like this:

Men  Women
1    1
2    1
3    2
4    3
5    4

And the hired employees might look like this:

Hired Men  Hired Women
3          -
4          -
5          4

In such a scenario, you’ll see a large gender imbalance, but there may not be a clear difference in competence. The tiny fraction of women who get hired will perform about as well as the men, on average.

Of course, this assumes that the two effects are of equal strength. In reality, we might see a whole spectrum of possibilities, from very strong discrimination with no innate differences, all the way to very large innate differences with no discrimination. The outcomes will then be similarly along a spectrum: When discrimination is much larger than innate difference, women will be rare but more competent. When innate difference is much larger than discrimination, women will be rare and less competent. And when there is a mix of both, women will be rare but won’t show as much difference in competence.

Moreover, if you look closer at the distribution of performance, you can still detect the two effects independently. If the lowest-performing workers are almost all men, that’s evidence of discrimination against women; while if the highest-performing workers are almost all men, that’s evidence of innate difference. And if you look at the table above, that’s exactly what we see: Both the 3 and the 5 are men, indicating the presence of both effects.
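Here is a minimal Python sketch of that tail-based diagnosis, applied to the hired groups from all four toy cases. The function name and the all-or-nothing rule are simplifying assumptions; with real data you would look at proportions in the top and bottom of the distribution, not a single best and worst score.

```python
def diagnose(hired_men, hired_women):
    """Read the two tails of the performance distribution.

    No women at the bottom suggests a higher bar for women (discrimination);
    no women at the top suggests a difference in the underlying talent pool.
    """
    everyone = hired_men + hired_women
    lowest, highest = min(everyone), max(everyone)
    return {
        "discrimination": lowest not in hired_women,
        "innate_difference": highest not in hired_women,
    }


hired_men = [3, 4, 5]

results = {
    "baseline": diagnose(hired_men, [3, 4, 5]),
    "innate_difference": diagnose(hired_men, [3, 4]),
    "discrimination": diagnose(hired_men, [4, 5]),
    "both": diagnose(hired_men, [4]),
}
```

Each of the four cases produces a different pair of flags, which is why the two effects can be separated even when average competence looks the same.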

What does affirmative action do?

Effectively, affirmative action lowers the threshold for hiring women (or minorities) in order to equalize representation in the workplace. In the presence of discrimination raising that threshold, this is exactly what we need! It can take us from case 3 (discrimination) to case 1 (equality), or from case 4 (both discrimination and innate difference) to case 2 (innate difference only).

Of course, it’s possible for us to overshoot, using more affirmative action than we should have. If we achieve better representation of women, but the lowest performers at the job are women, then we have overshot, effectively now discriminating against men. Fortunately, there is very little evidence of this in practice. In general, even with affirmative action programs in place, we tend to find that the lowest performers are still men—so there is still discrimination against women that we’ve failed to compensate for.

What if we can’t measure competence?

Of course, it’s possible that we don’t have good measures of competence in a given industry. (One must wonder how firms decide who to hire, but frankly I’m prepared to believe they’re just really bad at it.) Then we can’t observe discrimination statistically in this way. What do we do then?

Well, there is at least one avenue left for us to detect discrimination: We can do direct experiments comparing resumes with male names versus female names. These sorts of experiments typically don’t find very much, though—at least for women. For different races, they absolutely do find strong results. They also find evidence of discrimination against people with disabilities, older people, and people who are physically unattractive. There’s also evidence of intersectional effects, where women of particular ethnic groups get discriminated against even when women in general don’t.

But this will only pick up discrimination if it occurs during the hiring process. The advantage of having a competence measure is that it can detect discrimination that occurs anywhere—even outside employer control. Of course, if we don’t know where the discrimination is happening, that makes it very hard to fix; so the two approaches are complementary.

And there is room for new methods too; right now we don’t have a good way to detect discrimination in promotion decisions, for example. Many of us suspect that it occurs, but unless you have a good measure of competence, you can’t really distinguish promotion discrimination from innate differences in talent. We don’t have a good method for testing that in a direct experiment, either, because unlike hiring, we can’t just use fake resumes with masculine or feminine names on them.

What behavioral economics needs

Apr 16 JDN 2460049

The transition from neoclassical to behavioral economics has been a vital step forward in science. But lately we seem to have reached a plateau, with no major advances in the paradigm in quite some time.

It could be that there is work already being done which will, in hindsight, turn out to be significant enough to make that next step forward. But my fear is that we are getting bogged down by our own methodological limitations.

Neoclassical economics shared with us its obsession with mathematical sophistication. To some extent this was inevitable; in order to impress neoclassical economists enough to convert some of them, we had to use fancy math. We had to show that we could do it their way in order to convince them why we shouldn’t—otherwise, they’d just have dismissed us the way they had dismissed psychologists for decades, as too “fuzzy-headed” to do the “hard work” of putting everything into equations.

But the truth is, putting everything into equations was never the right approach. Because human beings clearly don’t think in equations. Once we write down a utility function and get ready to take its derivative and set it equal to zero, we have already distanced ourselves from how human thought actually works.

When dealing with a simple physical system, like an atom, equations make sense. Nobody thinks that the electron knows the equation and is following it intentionally. That equation simply describes how the forces of the universe operate, and the electron is subject to those forces.

But human beings do actually know things and do things intentionally. And while an equation could be useful for analyzing human behavior in the aggregate—I’m certainly not objecting to statistical analysis—it really never made sense to say that people make their decisions by optimizing the value of some function. Most people barely even know what a function is, much less remember calculus well enough to optimize one.

Yet right now, behavioral economics is still all based in that utility-maximization paradigm. We don’t use the same simplistic utility functions as neoclassical economists; we make them more sophisticated and realistic. Yet in that very sophistication we make things more complicated, more difficult—and thus in at least that respect, even further removed from how actual human thought must operate.

The worst offender here is surely Prospect Theory. I recognize that Prospect Theory predicts human behavior better than conventional expected utility theory; nevertheless, it makes absolutely no sense to suppose that human beings actually do some kind of probability-weighting calculation in their heads when they make judgments. Most of my students, who are well-trained in mathematics and economics, can’t even do that probability-weighting calculation on paper, with a calculator, on an exam. (There’s also absolutely no reason to do it! All it does is make your decisions worse!) This is a totally unrealistic model of human thought.

This is not to say that human beings are stupid. We are still smarter than any other entity in the known universe—computers are rapidly catching up, but they haven’t caught up yet. It is just that whatever makes us smart must not be easily expressible as an equation that maximizes a function. Our thoughts are bundles of heuristics, each of which may be individually quite simple, but all of which together make us capable of not only intelligence, but something computers still sorely, pathetically lack: wisdom. Computers optimize functions better than we ever will, but we still make better decisions than they do.

I think that what behavioral economics needs now is a new unifying theory of these heuristics, which accounts for not only how they work, but how we select which one to use in a given situation, and perhaps even where they come from in the first place. This new theory will of course be complex; there’s a lot of things to explain, and human behavior is a very complex phenomenon. But it shouldn’t be—mustn’t be—reliant on sophisticated advanced mathematics, because most people can’t do advanced mathematics (almost by construction—we would call it something different otherwise). If your model assumes that people are taking derivatives in their heads, your model is already broken. 90% of the world’s people can’t take a derivative.

I guess it could be that our cognitive processes in some sense operate as if they are optimizing some function. This is commonly posited for the human motor system, for instance; clearly baseball players aren’t actually solving differential equations when they throw and catch balls, but the trajectories that balls follow do in fact obey such equations, and the reliability with which baseball players can catch and throw suggests that they are in some sense acting as if they can solve them.

But I think that a careful analysis of even this classic example reveals some deeper insights that should call this whole notion into question. How do baseball players actually do what they do? They don’t seem to be calculating at all—in fact, if you asked them to try to calculate while they were playing, it would destroy their ability to play. They learn. They engage in practiced motions, acquire skills, and notice patterns. I don’t think there is anywhere in their brains that is actually doing anything like solving a differential equation. It’s all a process of throwing and catching, throwing and catching, over and over again, watching and remembering and subtly adjusting.

One thing that is particularly interesting to me about that process is that it is astonishingly flexible. It doesn’t really seem to matter what physical process you are interacting with; as long as it is sufficiently orderly, such a method will allow you to predict and ultimately control that process. You don’t need to know anything about differential equations in order to learn in this way, and indeed (I really can’t emphasize this enough) baseball players typically don’t.

In fact, learning is so flexible that it can even perform better than calculation. The usual differential equations most people would think to use to predict the throw of a ball would assume ballistic motion in a vacuum, which is absolutely not what a curveball is. In order to throw a curveball, the ball must interact with the air, and it must be launched with spin; curving a baseball relies very heavily on the Magnus Effect. I think it’s probably possible to construct an equation that would fully predict the motion of a curveball, but it would be a tremendously complicated one, and might not even have an exact closed-form solution. In fact, I think it would require solving the Navier-Stokes equations, for which there is an outstanding Millennium Prize. Since the viscosity of air is very low, maybe you could get away with approximating using the Euler fluid equations.

To be fair, a learning process that is adapting to a system that obeys an equation will yield results that become an ever-closer approximation of that equation. And it is in that sense that a baseball player can be said to be acting as if solving a differential equation. But this relies heavily on the system in question being one that obeys an equation—and when it comes to economic systems, is that even true?
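Here is a toy Python sketch of that idea. The "world" simulates a throw with air drag, and the "player" never sees the physics; it just throws, watches where the ball lands, and adjusts. Everything here is an illustrative assumption: the drag coefficient, the fixed 45-degree angle, the learning rate, and the crude update rule are not a model of how real players learn.

```python
import math

G = 9.81                    # gravity, m/s^2
DRAG = 0.005                # hypothetical quadratic drag coefficient, 1/m
ANGLE = math.radians(45)    # fixed launch angle


def landing_distance(speed, dt=0.002):
    """The 'world': step a throw through the air and return where it lands."""
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(ANGLE), speed * math.sin(ANGLE)
    while True:
        v = math.hypot(vx, vy)
        ax, ay = -DRAG * v * vx, -G - DRAG * v * vy
        next_x, next_y = x + vx * dt, y + vy * dt
        if next_y < 0:
            # Crossed the ground during this step: interpolate the landing point.
            return x + (next_x - x) * y / (y - next_y)
        x, y = next_x, next_y
        vx, vy = vx + ax * dt, vy + ay * dt


def learn_throw(target, speed=10.0, rate=0.1, throws=200):
    """The 'player': nudge the throw speed by a fraction of each miss."""
    for _ in range(throws):
        miss = target - landing_distance(speed)
        speed += rate * miss
    return speed


target = 25.0
learned_speed = learn_throw(target)

# The textbook vacuum formula (range = v^2 / g at 45 degrees) ignores the air.
vacuum_speed = math.sqrt(G * target)
```

In this toy world the vacuum formula gives a throw that lands short, because it ignores drag, while the trial-and-error learner ends up on target without ever representing an equation. It is acting as if it had solved the harder problem.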

What if the reason we can’t find a simple set of equations that accurately describe the economy (as opposed to equations of ever-escalating complexity that still utterly fail to describe the economy) is that there isn’t one? What if the reason we can’t find the utility function people are maximizing is that they aren’t maximizing anything?

What behavioral economics needs now is a new approach, something less constrained by the norms of neoclassical economics and more aligned with psychology and cognitive science. We should be modeling human beings based on how they actually think, not some weird mathematical construct that bears no resemblance to human reasoning but is designed to impress people who are obsessed with math.

I’m of course not the first person to have suggested this. I probably won’t be the last, or even the one who most gets listened to. But I hope that I might get at least a few more people to listen to it, because I have gone through the mathematical gauntlet and earned my bona fides. It is too easy to dismiss this kind of reasoning from people who don’t actually understand advanced mathematics. But I do understand differential equations—and I’m telling you, that’s not how people think.

Implications of stochastic overload

Apr 2 JDN 2460037

A couple weeks ago I presented my stochastic overload model, which posits a neurological mechanism for the Yerkes-Dodson effect: Stress increases sympathetic activation, and this increases performance, up to the point where it starts to risk causing neural pathways to overload and shut down.

This week I thought I’d try to get into some of the implications of this model, how it might be applied to make predictions or guide policy.

One thing I often struggle with when it comes to applying theory is what actual benefits we get from a quantitative mathematical model as opposed to simply a basic qualitative idea. In many ways I think these benefits are overrated; people seem to think that putting something into an equation automatically makes it true and useful. I am sometimes tempted to try to take advantage of this, to put things into equations even though I know there is no good reason to put them into equations, simply because so many people seem to find equations so persuasive for some reason. (Studies have even shown that, particularly in disciplines that don’t use a lot of math, inserting a totally irrelevant equation into a paper makes it more likely to be accepted.)

The basic implications of the Yerkes-Dodson effect are already widely known, and utterly ignored in our society. We know that excessive stress is harmful to health and performance, and yet our entire economy seems to be based around maximizing the amount of stress that workers experience. I actually think neoclassical economics bears a lot of the blame for this, as neoclassical economists are constantly talking about “increasing work incentives”—which is to say, making work life more and more stressful. (And let me remind you that there has never been any shortage of people willing to work in my lifetime, except possibly briefly during the COVID pandemic. The shortage has always been employers willing to hire them.)

I don’t know if my model can do anything to change that. Maybe by putting it into an equation I can make people pay more attention to it, precisely because equations have this weird persuasive power over most people.

As far as scientific benefits, I think that the chief advantage of a mathematical model lies in its ability to make quantitative predictions. It’s one thing to say that performance increases with low levels of stress then decreases with high levels; but it would be a lot more useful if we could actually precisely quantify how much stress is optimal for a given person and how they are likely to perform at different levels of stress.

Unfortunately, the stochastic overload model can only make detailed predictions if you have fully specified the probability distribution of innate activation, which requires a lot of free parameters. This is especially problematic if you don’t even know what type of distribution to use, which we really don’t; I picked three classes of distribution because they were plausible and tractable, not because I had any particular evidence for them.

Also, we don’t even have standard units of measurement for stress; we have a vague notion of what more or less stressed looks like, but we don’t have the sort of quantitative measure that could be plugged into a mathematical model. Probably the best units to use would be something like blood cortisol levels, but then we’d need to go measure those all the time, which raises its own issues. And maybe people don’t even respond to cortisol in the same ways? But at least we could measure your baseline cortisol for a while to get a prior distribution, and then see how different incentives increase your cortisol levels; and then the model should give relatively precise predictions about how this will affect your overall performance. (This is a very neuroeconomic approach.)

So, for now, I’m not really sure how useful the stochastic overload model is. This is honestly something I feel about a lot of the theoretical ideas I have come up with; they often seem too abstract to be usefully applicable to anything.

Maybe that’s how all theory begins, and applications only appear later? But that doesn’t seem to be how people expect me to talk about it whenever I have to present my work or submit it for publication. They seem to want to know what it’s good for, right now, and I never have a good answer to give them. Do other researchers have such answers? Do they simply pretend to?

Along similar lines, I recently had one of my students ask about a theory paper I wrote on international conflict for my dissertation, and after sending him a copy, I re-read the paper. There are so many pages of equations, and while I am confident that the mathematical logic is valid, I honestly don’t know if most of them are really useful for anything. (I don’t think I really believe that GDP is produced by a Cobb-Douglas production function, and we don’t even really know how to measure capital precisely enough to say.) The central insight of the paper, which I think is really important but other people don’t seem to care about, is a qualitative one: International treaties and norms provide an equilibrium selection mechanism in iterated games. The realists are right that this is cheap talk. The liberals are right that it works. Because when there are many equilibria, cheap talk works.

I know that in truth, science proceeds in tiny steps, building a wall brick by brick, never sure exactly how many bricks it will take to finish the edifice. It’s impossible to see whether your work will be an irrelevant footnote or the linchpin for a major discovery. But that isn’t how the institutions of science are set up. That isn’t how the incentives of academia work. You’re not supposed to say that this may or may not be correct and is probably some small incremental progress the ultimate impact of which no one can possibly foresee. You’re supposed to sell your work—justify how it’s definitely true and why it’s important and how it has impact. You’re supposed to convince other people why they should care about it and not all the dozens of other probably equally-valid projects being done by other researchers.

I don’t know how to do that, and it is agonizing to even try. It feels like lying. It feels like betraying my identity. Being good at selling isn’t just orthogonal to doing good science—I think it’s opposite. I think the better you are at selling your work, the worse you are at cultivating the intellectual humility necessary to do good science. If you think you know all the answers, you’re just bad at admitting when you don’t know things. It feels like in order to succeed in academia, I have to act like an unscientific charlatan.

Honestly, why do we even need to convince you that our work is more important than someone else’s? Are there only so many science points to go around? Maybe the whole problem is this scarcity mindset. Yes, grant funding is limited; but why does publishing my work prevent you from publishing someone else’s? Why do you have to reject 95% of the papers that get sent to you? Don’t tell me you’re limited by space; the journals are digital and searchable and nobody reads the whole thing anyway. Editorial time isn’t infinite, but most of the work has already been done by the time you get a paper back from peer review. Of course, I know the real reason: Excluding people is the main source of prestige.

The role of innate activation in stochastic overload

Mar 26 JDN 2460030

Two posts ago I introduced my stochastic overload model, which offers an explanation for the Yerkes-Dodson effect by positing that additional stress increases sympathetic activation, which is useful up until the point where it starts risking an overload that forces systems to shut down and rest.

The central equation of the model is actually quite simple, expressed either as an expectation or as an integral:

Y = E[x + s | x + s < 1] P[x + s < 1]

Y = \int_{0}^{1-s} (x+s) dF(x)

The amount of output produced is the expected value of innate activation plus stress activation, times the probability that there is no overload. Increased stress raises this expectation value (the incentive effect), but also increases the probability of overload (the overload effect).
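As a rough numerical sketch of this equation, the integral can be approximated for any density of innate activation. The function names, the midpoint rule, and the uniform density on [0, 0.5] used in the demo are all assumptions made for illustration, not part of the model itself.

```python
def expected_output(pdf, s, steps=10_000):
    """Approximate Y(s) = integral over [0, 1 - s] of (x + s) * pdf(x) dx (midpoint rule)."""
    upper = 1.0 - s
    if upper <= 0.0:
        return 0.0  # stress alone already reaches the overload threshold
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width          # midpoint of the i-th slice
        total += (x + s) * pdf(x)      # activation, weighted by its density
    return total * width


def uniform_pdf(a, b):
    """Density of a uniform distribution on [a, b] (an illustrative choice)."""
    return lambda x: 1.0 / (b - a) if a <= x <= b else 0.0


if __name__ == "__main__":
    pdf = uniform_pdf(0.0, 0.5)
    for s in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"s = {s:.2f}  Y = {expected_output(pdf, s):.4f}")
```

Any other density can be passed in the same way, as long as it is a function of x alone.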

The model relies upon assuming that the brain starts with some innate level of activation that is partially random. Exactly what sort of Yerkes-Dodson curve you get from this model depends very much on what distribution this innate activation takes.

I’ve so far solved it for three types of distribution.

The simplest is a uniform distribution, where within a certain range, any level of activation is equally probable. The probability density function looks like this:

Assume the distribution has support between a and b, where a < b.

When b+s < 1, then overload is impossible, and only the incentive effect occurs; productivity increases linearly with stress.

The expected output is simply the expected value of a uniform distribution from a+s to b+s, which is:

E[x + s] = (a+b)/2+s

Then, once b+s > 1, overload risk begins to increase.

In this range, the probability of avoiding overload is:

P[x + s < 1] = F(1-s) = (1-s-a)/(b-a)

(Note that at b+s=1, this is exactly 1.)

Conditional on avoiding overload, x is uniform between a and 1-s, so the expected value of x+s in this range is:

E[x + s | x + s < 1] = (1+s+a)/2

Multiplying these two together:

Y = [(1-s-a)(1+s+a)]/[2(b-a)]

Here is what that looks like for a=0, b=1/2:

It does have the right qualitative features: increasing, then decreasing. But it sure looks weird, doesn’t it? It has this strange kinked shape.
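The kink can be located with a small sketch that evaluates the defining integral of (x+s)/(b-a) from a to 1-s piece by piece. The function name and the default a = 0, b = 1/2 are illustrative assumptions; the middle branch is simply that integral worked out by hand.

```python
def uniform_output(s, a=0.0, b=0.5):
    """Y(s) for innate activation uniform on [a, b], assuming 0 <= a < b <= 1."""
    if s >= 1.0 - a:
        return 0.0                    # every pathway overloads
    if s <= 1.0 - b:
        return (a + b) / 2.0 + s      # no overload risk: pure incentive effect
    # Partial overload: integral of (x + s) / (b - a) over [a, 1 - s].
    return (1.0 - s - a) * (1.0 + s + a) / (2.0 * (b - a))


if __name__ == "__main__":
    for i in range(11):
        s = i / 10
        print(f"s = {s:.1f}  Y = {uniform_output(s):.4f}")
```

Under these assumptions the output rises linearly until s = 1 - b and falls afterwards, so the peak sits exactly at the kink.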

So let’s consider some other distributions.

The next one I was able to solve it for is an exponential distribution, where the most probable activation is zero, and then higher activation always has lower probability than lower activation in an exponential decay:

For this it was actually easiest to do the integral directly (I did it by integrating by parts, but I’m sure you don’t care about all the mathematical steps):

Y = \int_{0}^{1-s} (x+s) dF(x)

Y = (1/λ+s) – (1/ λ + 1)e^(-λ(1-s))

The parameter λ decides how steeply your activation probability decays. Someone with low λ is relatively highly activated all the time, while someone with high λ is usually not highly activated; this seems like it might be related to the personality trait neuroticism.

Here are graphs of what the resulting Yerkes-Dodson curve looks like for several different values of λ:

λ = 0.5:

λ = 1:

λ = 2:

λ = 4:

λ = 8:

The λ = 0.5 person has high activation a lot of the time. They are actually fairly productive even without stress, but stress quickly overwhelms them. The λ = 8 person has low activation most of the time. They are not very productive without stress, but can also bear relatively high amounts of stress without overloading.

(The low-λ people also have overall lower peak productivity in this model, but that might not be true in reality, if λ is inversely correlated with some other attributes that are related to productivity.)
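These comparisons can be tabulated with a short sketch built on the closed-form exponential result. The optimal-stress formula is not stated in the derivation; it comes from setting the derivative of Y with respect to s to zero, and should be read as an illustrative consequence of the closed form. Function names are hypothetical.

```python
import math


def exponential_output(s, lam):
    """Closed-form Y(s) for exponentially distributed innate activation with rate lam."""
    return (1.0 / lam + s) - (1.0 / lam + 1.0) * math.exp(-lam * (1.0 - s))


def optimal_stress(lam):
    """Stress level where dY/ds = 0, i.e. (1 + lam) * exp(-lam * (1 - s)) = 1."""
    return 1.0 - math.log(1.0 + lam) / lam


if __name__ == "__main__":
    for lam in (0.5, 1, 2, 4, 8):
        s_star = optimal_stress(lam)
        print(f"lambda = {lam:>3}: Y(0) = {exponential_output(0.0, lam):.3f}, "
              f"optimal s = {s_star:.3f}, peak Y = {exponential_output(s_star, lam):.3f}")
```

In this sketch the peak output works out equal to the optimal stress level itself, so both rise together as λ increases.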

Neither uniform nor exponential has the nice bell-curve shape for innate activation we might have hoped for. There is another class of distributions, beta distributions, which do have this shape, and they are sort of tractable—you need something called an incomplete beta function, which isn’t an elementary function but it’s useful enough that most statistical packages include it.

Beta distributions have two parameters, α and β. They look like this:

Beta distributions are quite useful in Bayesian statistics; if you’re trying to estimate the probability of a random event that either succeeds or fails with a fixed probability (a Bernoulli process), and so far you have observed a successes and b failures, your best guess of its probability at each trial is a beta distribution with α = a+1 and β = b+1.
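A minimal sketch of that updating rule, assuming a uniform Beta(1, 1) prior (which is what the "+1" terms correspond to). The function names and the example counts are hypothetical.

```python
def beta_posterior(successes, failures):
    """Posterior (alpha, beta) after the observations, starting from a uniform prior."""
    return successes + 1, failures + 1


def beta_mean(alpha, beta):
    """Expected value of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Most probable value; this formula only applies when alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


if __name__ == "__main__":
    alpha, beta = beta_posterior(successes=7, failures=3)
    print(f"posterior: Beta({alpha}, {beta})")
    print(f"mean = {beta_mean(alpha, beta):.3f}, mode = {beta_mode(alpha, beta):.3f}")
```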

For beta distributions with parameters α and β, the result comes out to (I is that incomplete beta function I mentioned earlier, in its regularized form):

Y = [α/(α+β)] I(1-s, α+1, β) + s I(1-s, α, β)

For whole number values of α and β, the incomplete beta function can be computed by hand (though it is more work the larger they are); here’s an example with α = β = 2.

The innate activation probability looks like this:

And the result comes out like this:

Y = 2(1-s)^3 – 3/2(1-s)^4 + 3s(1-s)^2 – 2s(1-s)^3

This person has pretty high innate activation most of the time, so stress very quickly overwhelms them. If I had chosen a much higher β, I could change that, making them less likely to be innately so activated.
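As a sanity check on the α = β = 2 case, the polynomial can be compared against a direct numerical integration of (x+s) times the Beta(2, 2) density 6x(1-x). This is only a sketch; the function names, the midpoint rule, and the grid search for the peak are illustrative choices.

```python
def beta22_pdf(x):
    """Density of a Beta(2, 2) distribution: 6 x (1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def beta22_output(s):
    """The polynomial for Y(s) in the alpha = beta = 2 case, written with t = 1 - s."""
    t = 1.0 - s
    return 2 * t**3 - 1.5 * t**4 + 3 * s * t**2 - 2 * s * t**3


def integrate_output(pdf, s, steps=20_000):
    """Midpoint-rule approximation of the integral of (x + s) * pdf(x) over [0, 1 - s]."""
    upper = 1.0 - s
    if upper <= 0.0:
        return 0.0
    width = upper / steps
    return width * sum((x + s) * pdf(x)
                       for x in ((i + 0.5) * width for i in range(steps)))


if __name__ == "__main__":
    for s in (0.0, 0.1, 0.2, 0.3, 0.5, 0.8):
        print(f"s = {s:.1f}  polynomial = {beta22_output(s):.5f}  "
              f"numeric = {integrate_output(beta22_pdf, s):.5f}")
    # Coarse grid search for the stress level with the highest output.
    best = max((i / 1000 for i in range(1001)), key=beta22_output)
    print(f"peak near s = {best:.3f}")
```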

These are the cases I’ve found to be relatively tractable so far. They all have the right qualitative pattern: Increasing stress increases productivity for a while, then begins decreasing it once overload risk becomes too high. They also show a general pattern where people who are innately highly activated (neurotic?) are much more likely to overload and thus much more sensitive to stress.

The stochastic overload model

Mar 12 JDN 2460016

The next few posts are going to be a bit different, a bit more advanced and technical than usual. This is because, for the first time in several months at least, I am actually working on what could be reasonably considered something like theoretical research.

I am writing it up in the form of blog posts, because actually writing a paper is still too stressful for me right now. This also forces me to articulate my ideas in a clearer and more readable way, rather than dive directly into a morass of equations. It also means that even if I do never actually get around to finishing a paper, the idea is out there, and maybe someone else could make use of it (and hopefully give me some of the credit).

I’ve written previously about the Yerkes-Dodson effect: On cognitively-demanding tasks, increased stress increases performance, but only to a point, after which it begins decreasing it again. The effect is well-documented, but the mechanism is poorly understood.

I am currently on the wrong side of the Yerkes-Dodson curve, which is why I’m too stressed to write this as a formal paper right now. But that also gave me some ideas about how it may work.

I have come up with a simple but powerful mathematical model that may provide a mechanism for the Yerkes-Dodson effect.

This model is clearly well within the realm of a behavioral economic model, but it is also closely tied to neuroscience and cognitive science.

I call it the stochastic overload model.

First, a metaphor: Consider an engine, which can run faster or slower. If you increase its RPMs, it will output more power, and provide more torque—but only up to a certain point. Eventually it hits a threshold where it will break down, or even break apart. In real engines, we often include safety systems that force the engine to shut down as it approaches such a threshold.

I believe that human brains function on a similar principle. Stress increases arousal, which activates a variety of processes via the sympathetic nervous system. This activation improves performance on both physical and cognitive tasks. But it has a downside; especially on cognitively demanding tasks which require sustained effort, I hypothesize that too much sympathetic activation can result in a kind of system overload, where your brain can no longer handle the stress and processes are forced to shut down.

This shutdown could be brief—a few seconds, or even a fraction of a second—or it could be prolonged—hours or days. That might depend on just how severe the stress is, or how much of your brain it requires, or how prolonged it is. For purposes of the model, this isn’t vital. It’s probably easiest to imagine it being a relatively brief, localized shutdown of a particular neural pathway. Then, your performance in a task is summed up over many such pathways over a longer period of time, and by the law of large numbers your overall performance is essentially the average performance of all your brain systems.

That’s the “overload” part of the model. Now for the “stochastic” part.

Let’s say that, in the absence of stress, your brain has a certain innate level of sympathetic activation, which varies over time in an essentially chaotic, unpredictable—stochastic—sort of way. It is never really completely deactivated, and may even have some chance of randomly overloading itself even without outside input. (Actually, a potential role in the model for the personality trait neuroticism is an innate tendency toward higher levels of sympathetic activation in the absence of outside stress.)

Let’s say that this innate activation is x, which follows some kind of known random distribution F(x).

For simplicity, let’s also say that added stress s adds linearly to your level of sympathetic activation, so your overall level of activation is x + s.

For simplicity, let’s say that activation ranges between 0 and 1, where 0 is no activation at all and 1 is the maximum possible activation and triggers overload.

I’m assuming that if a pathway shuts down from overload, it doesn’t contribute at all to performance on the task. (You can assume it’s only reduced performance, but this adds complexity without any qualitative change.)

Since sympathetic activation improves performance, but can result in overload, your overall expected performance in a given task can be computed as the product of two terms:

[expected value of x + s, provided overload does not occur] * [probability overload does not occur]

E[x + s | x + s < 1] P[x + s < 1]

The first term can be thought of as the incentive effect: Higher stress promotes more activation and thus better performance.

The second term can be thought of as the overload effect: Higher stress also increases the risk that activation will exceed the threshold and force shutdown.

This equation actually turns out to have a remarkably elegant form as an integral (and here’s where I get especially technical and mathematical):

\int_{0}^{1-s} (x+s) dF(x)

The integral subsumes both the incentive effect and the overload effect into one term; you can also think of the +s in the integrand as the incentive effect and the 1-s in the limit of integration as the overload effect.
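The pathway picture behind this integral can also be simulated directly. Here is a minimal Monte Carlo sketch: each pathway draws an innate activation, adds stress, and contributes nothing if it overloads. The uniform distribution on [0, 0.5], the number of pathways, and all function names are assumptions chosen purely for illustration.

```python
import random


def uniform_activation(rng):
    """Draw innate activation x; uniform on [0, 0.5] is an illustrative assumption."""
    return rng.uniform(0.0, 0.5)


def simulate_output(s, draw_activation, pathways=100_000, seed=0):
    """Average output over many pathways; overloaded pathways contribute zero."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pathways):
        activation = draw_activation(rng) + s
        if activation < 1.0:           # below the overload threshold
            total += activation
    return total / pathways


if __name__ == "__main__":
    for s in (0.0, 0.25, 0.5, 0.75, 0.9):
        print(f"s = {s:.2f}  simulated Y = {simulate_output(s, uniform_activation):.3f}")
```

By the law of large numbers, the simulated average should approach the value of the integral as the number of pathways grows.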

For the uninitiated, this is probably just Greek. So let me show you some pictures to help with your intuition. These are all freehand sketches, so let me apologize in advance for my limited drawing skills. Think of this as like Arthur Laffer’s famous cocktail napkin.

Suppose that, in the absence of outside stress, your innate activation follows a distribution like this (this could be a normal or logit PDF; as I’ll talk about next week, logit is far more tractable):

As I start adding stress, this shifts the distribution upward, toward increased activation:

Initially, this will improve average performance.

But at some point, increased stress actually becomes harmful, as it increases the probability of overload.

And eventually, the probability of overload becomes so high that performance becomes worse than it was with no stress at all:

The result is that overall performance, as a function of stress, looks like an inverted U-shaped curve—the Yerkes-Dodson curve:

The precise shape of this curve depends on the distribution that we use for the innate activation, which I will save for next week’s post.

The injustice of talent

Sep 4 JDN 2459827

Consider the following two principles of distributive justice.

A: People deserve to be rewarded in proportion to what they accomplish.

B: People deserve to be rewarded in proportion to the effort they put in.

Both principles sound pretty reasonable, don’t they? They both seem like sensible notions of fairness, and I think most people would broadly agree with both of them.

This is a problem, because they are mutually contradictory. We cannot possibly follow them both.

For, as much as our society would like to pretend otherwise—and I think this contradiction is precisely why our society would like to pretend otherwise—what you accomplish is not simply a function of the effort you put in.

Don’t get me wrong; it is partly a function of the effort you put in. Hard work does contribute to success. But it is neither sufficient, nor strictly necessary.

Rather, success is a function of three factors: Effort, Environment, and Talent.

Effort is the work you yourself put in, and basically everyone agrees you deserve to be rewarded for that.

Environment includes all the outside factors that affect you—including both natural and social environment. Inheritance, illness, and just plain luck are all in here, and there is general, if not universal, agreement that society should make at least some efforts to minimize inequality created by such causes.

And then, there is talent. Talent includes whatever capacities you innately have. It could be strictly genetic, or it could be acquired in childhood or even in the womb. But by the time you are an adult and responsible for your own life, these factors are largely fixed and immutable. This includes things like intelligence, disability, even height. The trillion-dollar question is: How much should we reward talent?

For talent clearly does matter. I will never swim like Michael Phelps, run like Usain Bolt, or shoot hoops like Steph Curry. It doesn’t matter how much effort I put in, how many hours I spend training—I will never reach their level of capability. Never. It’s impossible. I could certainly improve from my current condition; perhaps it would even be good for me to do so. But there are certain hard fundamental constraints imposed by biology that give them more potential in these skills than I will ever have.

Conversely, there are likely things I can do that they will never be able to do, though this is less obvious. Could Michael Phelps never be as good a programmer or as skilled a mathematician as I am? He certainly isn’t now. Maybe, with enough time, enough training, he could be; I honestly don’t know. But I can tell you this: I’m sure it would be harder for him than it was for me. He couldn’t breeze through college-level courses in differential equations and quantum mechanics the way I did. There is something I have that he doesn’t, and I’m pretty sure I was born with it. Call it spatial working memory, or mathematical intuition, or just plain IQ. Whatever it is, math comes easy to me in not so different a way from how swimming comes easy to Michael Phelps. I have talent for math; he has talent for swimming.

Moreover, these are not small differences. It’s not like we all come with basically the same capabilities with a little bit of variation that can be easily washed out by effort. We’d like to believe that—we have all sorts of cultural tropes that try to inculcate that belief in us—but it’s obviously not true. The vast majority of quantum physicists are people born with high IQ. The vast majority of pro athletes are people born with physical prowess. The vast majority of movie stars are people born with pretty faces. For many types of jobs, the determining factor seems to be talent.

This isn’t too surprising, actually—even if effort matters a lot, we would still expect talent to show up as the determining factor much of the time.

Let’s go back to that contest function model I used to analyze the job market awhile back (the one that suggests we spend way too much time and money in the hiring process). This time let’s focus on the perspective of the employees themselves.

Each employee has a level of talent, h. Employee X has talent h_x and exerts effort x, producing output of a quality that is the product of these: h_x x. Similarly, employee Z has talent h_z and exerts effort z, producing output h_z z.

Then, there’s a certain amount of luck that factors in. The most successful output isn’t necessarily the best, or maybe what should have been the best wasn’t because some random circumstance prevailed. But we’ll say that the probability an individual succeeds is proportional to the quality of their output.

So the probability that employee X succeeds is: h_x x / (h_x x + h_z z)

I’ll skip the algebra this time (if you’re interested you can look back at that previous post), but to make a long story short, in Nash equilibrium the two employees will exert exactly the same amount of effort.

Then, which one succeeds will be entirely determined by talent; because x = z, the probability that X succeeds is h_x / (h_x + h_z).

It’s not that effort doesn’t matter—it absolutely does matter, and in fact in this model, with zero effort you get zero output (which isn’t necessarily the case in real life). It’s that in equilibrium, everyone is exerting the same amount of effort; so what determines who wins is innate talent. And I gotta say, that sounds an awful lot like how professional sports works. It’s less clear whether it applies to quantum physicists.

But maybe we don’t really exert the same amount of effort! This is true. Indeed, it seems like actually effort is easier for people with higher talent—that the same hour spent running on a track is easier for Usain Bolt than for me, and the same hour studying calculus is easier for me than it would be for Usain Bolt. So in the end our equilibrium effort isn’t the same—but rather than compensating, this effect only serves to exaggerate the difference in innate talent between us.

It’s simple enough to generalize the model to allow for such a thing. For instance, I could say that the cost of producing a unit of effort is inversely proportional to your talent; then instead of h_x / (h_x + h_z), in equilibrium the probability of X succeeding would become h_x^2 / (h_x^2 + h_z^2). The equilibrium effort would also be different, with x > z if h_x > h_z.
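Both versions of the contest can be checked with a small sketch. It assumes each employee maximizes prize times success probability minus the cost of effort, with a prize normalized to 1 and a linear effort cost; those payoff details are assumptions, since the algebra is skipped here. The closed-form efforts follow from the first-order conditions under those assumptions, and all function names are hypothetical.

```python
import math


def success_prob(h_x, x, h_z, z):
    """Probability that X succeeds: X's share of total output quality."""
    return h_x * x / (h_x * x + h_z * z)


def best_response(h_own, h_other, e_other, unit_cost, prize=1.0):
    """Effort maximizing prize * P(win) - unit_cost * effort (first-order condition)."""
    root = math.sqrt(prize * h_own * h_other * e_other / unit_cost)
    return max(0.0, (root - h_other * e_other) / h_own)


def equilibrium(h_x, h_z, talent_lowers_cost=False, prize=1.0):
    """Closed-form equilibrium efforts (x, z) under the assumed payoffs."""
    if talent_lowers_cost:
        # Cost of effort e is e / h, so more talented people find effort cheaper.
        k = prize * h_x**2 * h_z**2 / (h_x**2 + h_z**2) ** 2
        return k * h_x, k * h_z
    # Cost of effort e is simply e, identical for both employees.
    e = prize * h_x * h_z / (h_x + h_z) ** 2
    return e, e


if __name__ == "__main__":
    h_x, h_z = 2.0, 1.0
    for flag in (False, True):
        x, z = equilibrium(h_x, h_z, talent_lowers_cost=flag)
        p = success_prob(h_x, x, h_z, z)
        print(f"talent lowers cost = {flag}: x = {x:.3f}, z = {z:.3f}, P(X succeeds) = {p:.3f}")
```

With h_x = 2 and h_z = 1, the first case reproduces h_x / (h_x + h_z) = 2/3 and the second h_x^2 / (h_x^2 + h_z^2) = 4/5, and `best_response` can be used to confirm that neither employee wants to deviate.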

Once we acknowledge that talent is genuinely important, we face an ethical problem. Do we want to reward people for their accomplishment (A), or for their effort (B)? There are good cases to be made for each.

Rewarding for accomplishment, which we might call meritocracy, will tend to, well, maximize accomplishment. We’ll get the best basketball players playing basketball, the best surgeons doing surgery. Moreover, accomplishment is often quite easy to measure, even when effort isn’t.

Rewarding for effort, which we might call egalitarianism, will give people the most control over their lives, and might well feel the most fair. Those who succeed will be precisely those who work hard, even if they do things they are objectively bad at. Even people who are born with very little talent will still be able to make a living by working hard. And it will ensure that people do work hard, which meritocracy can actually fail at: If you are extremely talented, you don’t really need to work hard because you just automatically succeed.

Capitalism, as an economic system, is very good at rewarding accomplishment. I think part of what makes socialism appealing to so many people is that it tries to reward effort instead. (Is it very good at that? Not so clear.)

The more extreme differences are actually in terms of disability. There’s a certain baseline level of activities that most people are capable of, which we think of as “normal”: most people can talk; most people can run, if not necessarily very fast; most people can throw a ball, if not pitch a proper curveball. But some people can’t throw. Some people can’t run. Some people can’t even talk. It’s not that they are bad at it; it’s that they are literally not capable of it. No amount of effort could have made Stephen Hawking into a baseball player—not even a bad one.

It’s these cases when I think egalitarianism becomes most appealing: It just seems deeply unfair that people with severe disabilities should have to suffer in poverty. Even if they really can’t do much productive work on their own, it just seems wrong not to help them, at least enough that they can get by. But capitalism by itself absolutely would not do that—if you aren’t making a profit for the company, they’re not going to keep you employed. So we need some kind of social safety net to help such people. And it turns out that such people are quite numerous, and our current system is really not adequate to help them.

But meritocracy has its pull as well. Especially when the job is really important—like surgery, not so much basketball—we really want the highest quality work. It’s not so important whether the neurosurgeon who removes your tumor worked really hard at it or found it a breeze; what we care about is getting that tumor out.

Where does this leave us?

I think we have no choice but to compromise, on both principles. We will reward both effort and accomplishment, to greater or lesser degree—perhaps varying based on circumstances. We will never be able to entirely reward accomplishment or entirely reward effort.

This is more or less what we already do in practice, so why worry about it? Well, because we don’t like to admit that it’s what we do in practice, and a lot of problems seem to stem from that.

We have people acting like billionaires are such brilliant, hard-working people just because they’re rich—because our society rewards effort, right? So they couldn’t be so successful if they didn’t work so hard, right? Right?

Conversely, we have people who denigrate the poor as lazy and stupid just because they are poor. Because it couldn’t possibly be that their circumstances were worse than yours? Or hey, even if they are genuinely less talented than you—do less talented people deserve to be homeless and starving?

We tell kids from a young age, “You can be whatever you want to be”, and “Work hard and you’ll succeed”; and these things simply aren’t true. There are limitations on what you can achieve through effort—limitations imposed by your environment, and limitations imposed by your innate talents.

I’m not saying we should crush children’s dreams; I’m saying we should help them to build more realistic dreams, dreams that can actually be achieved in the real world. And then, when they grow up, they either will actually succeed, or when they don’t, at least they won’t hate themselves for failing to live up to what you told them they’d be able to do.

If you were wondering why Millennials are so depressed, that’s clearly a big part of it: We were told we could be and do whatever we wanted if we worked hard enough, and then that didn’t happen; and we had so internalized what we were told that we thought it had to be our fault that we failed. We didn’t try hard enough. We weren’t good enough. I have spent years feeling this way—on some level I do still feel this way—and it was not because adults tried to crush my dreams when I was a child, but on the contrary because they didn’t do anything to temper them. They never told me that life is hard, and people fail, and that I would probably fail at my most ambitious goals—and it wouldn’t be my fault, and it would still turn out okay.

That’s really it, I think: They never told me that it’s okay not to be wildly successful. They never told me that I’d still be good enough even if I never had any great world-class accomplishments. Instead, they kept feeding me the lie that I would have great world-class accomplishments; and then, when I didn’t, I felt like a failure and I hated myself. I think my own experience may be particularly extreme in this regard, but I know a lot of other people in my generation who had similar experiences, especially those who were also considered “gifted” as children. And we are all now suffering from depression, anxiety, and Impostor Syndrome.

All because nobody wanted to admit that talent, effort, and success are not the same thing.

Scalability and inequality

May 15 JDN 2459715

Why are some molecules (e.g. DNA) billions of times larger than others (e.g. H2O), but all atoms are within a much narrower range of sizes (only a few hundred)?

Why are some animals (e.g. elephants) millions of times as heavy as others (e.g. mice), but their cells are basically the same size?

Why does capital income vary so much more (factors of thousands or millions) than wages (factors of tens or hundreds)?

These three questions turn out to have much the same answer: Scalability.

Atoms are not very scalable: Adding another proton to a nucleus causes interactions with all the other protons, which makes the whole atom unstable after a hundred protons or so. But molecules, particularly organic polymers such as DNA, are tremendously scalable: You can add another piece to one end without affecting anything else in the molecule, and keep on doing that more or less forever.

Cells are not very scalable: Even with the aid of active transport mechanisms and complex cellular machinery, a cell’s functionality is still very much limited by its surface area. But animals are tremendously scalable: The same exponential growth that got you from a zygote to a mouse only needs to continue a couple years longer and it’ll get you all the way to an elephant. (A baby elephant, anyway; an adult will require a dozen or so years—remarkably comparable to humans, in fact.)

Labor income is not very scalable: There are only so many hours in a day, and the more hours you work the less productive you’ll be in each additional hour. But capital income is perfectly scalable: We can add another digit to that brokerage account with nothing more than a few milliseconds of electronic pulses, and keep doing that basically forever (due to the way integer storage works, above 2^63 it would require special coding, but it can be done; and seeing as that’s over 9 quintillion, it’s not likely to be a problem any time soon—though I am vaguely tempted to write a short story about an interplanetary corporation that gets thrown into turmoil by an integer overflow error).

This isn’t just an effect of our accounting either. Capital is scalable in a way that labor is not. When your contribution to production is owning a factory, there’s really nothing to stop you from owning another factory, and then another, and another. But when your contribution is working at a factory, you can only work so hard for so many hours.

When a phenomenon is highly scalable, it can take on a wide range of outcomes—as we see in molecules, animals, and capital income. When it’s not, it will only take on a narrow range of outcomes—as we see in atoms, cells, and labor income.

Exponential growth is also part of the story here: Animals certainly grow exponentially, and so can capital when invested; even some polymers function that way (e.g. under polymerase chain reaction). But I think the scalability is actually more important: Growing rapidly isn’t so useful if you’re going to immediately be blocked by a scalability constraint. (This actually relates to the difference between r- and K- evolutionary strategies, and offers further insight into the differences between mice and elephants.) Conversely, even if you grow slowly, given enough time, you’ll reach whatever constraint you’re up against.

Indeed, we can even say something about the probability distribution we are likely to get from random processes that are scalable or non-scalable.

A non-scalable random process will generally converge toward the familiar normal distribution, a “bell curve”:

[Image from Wikipedia: By Inductiveload – self-made, Mathematica, Inkscape, Public Domain, https://commons.wikimedia.org/w/index.php?curid=3817954]

The normal distribution has most of its weight near the middle; most of the population ends up near there. This is clearly the case for labor income: Most people are middle class, while some are poor and a few are rich.

But a scalable random process will typically converge toward quite a different distribution, a Pareto distribution:

[Image from Wikipedia: By Danvildanvil – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=31096324]

A Pareto distribution has most of its weight near zero, but covers an extremely wide range. Indeed it is what we call fat tailed, meaning that really extreme events occur often enough to have a meaningful effect on the average. A Pareto distribution has most of the people at the bottom, but the ones at the top are really on top.
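To make the contrast concrete, here is a minimal Python sketch that draws a sample from each distribution and asks what share of the total goes to the top 1%. The parameters (a normal with mean 50 and standard deviation 10, a Pareto with shape 1.5) and the helper name top_share are assumptions chosen purely for illustration; they are not estimates of any real income data.

```python
import random

def top_share(values, fraction=0.01):
    """Share of the total held by the top `fraction` of the sample."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000

# Non-scalable outcome: an illustrative normal distribution.
normal_sample = [rng.gauss(50, 10) for _ in range(N)]
# Scalable outcome: an illustrative Pareto distribution (shape 1.5 is an assumption).
pareto_sample = [rng.paretovariate(1.5) for _ in range(N)]

normal_top = top_share(normal_sample)
pareto_top = top_share(pareto_sample)
print(f"top 1% share, normal: {normal_top:.3f}")
print(f"top 1% share, Pareto: {pareto_top:.3f}")
```

Under these assumed parameters the top 1% of the normal sample should hold only slightly more than 1% of the total, while the top 1% of the Pareto sample holds a far larger share.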

And indeed, that’s exactly how capital income works: Most people have little or no capital income (indeed only about half of Americans and only a third(!) of Brits own any stocks at all), while a handful of hectobillionaires make utterly ludicrous amounts of money literally in their sleep.

Indeed, it turns out that income in general is pretty close to normally (or maybe lognormally) distributed for most of the income range, and then becomes very much Pareto at the top—where nearly all the income is capital income.

This fundamental difference in scalability between capital and labor underlies much of what makes income inequality so difficult to fight. Capital is scalable, and begets more capital. Labor is non-scalable, and we only have so much to give.

It would require a radically different system of capital ownership to really eliminate this gap—and, well, that’s been tried, and so far, it hasn’t worked out so well. Our best option is probably to let people continue to own whatever amounts of capital, and then tax the proceeds in order to redistribute the resulting income. That certainly has its own downsides, but they seem to be a lot more manageable than either unfettered anarcho-capitalism or totalitarian communism.

The fragility of encryption

Feb 13 JDN 2459620

I said in last week’s post that most of the world’s online security rests upon public-key encryption. It’s how we do our shopping, our banking, and paying our taxes.

Yet public-key encryption has an Achilles’ Heel. It relies entirely on the assumption that, even knowing someone’s public key, you can’t possibly figure out what their private key is. Yet obviously the two must be deeply connected: In order for my private key to decrypt all messages that are encrypted using my public key, they must, in a deep sense, contain the same information. There must be a mathematical operation that will translate from one to the other—and that mathematical operation must be invertible.

What we have been relying on to keep public-key encryption secure is the notion of a one-way function: A function that is easy to compute, but hard to invert. A typical example is multiplying two numbers: Multiplication is a basic computing operation that is extremely fast, even for numbers with thousands of digits; but factoring a number into its prime factors is far more difficult, and currently cannot be done in any reasonable amount of time for numbers that are more than a hundred digits long.


“Easy” and “hard” in what sense? The usual criterion is in polynomial time.

Say you have an input that is n bits long—i.e. n digits, when expressed as a binary number, all 0s and 1s. A function that can be computed in time proportional to n is linear time; if it can only be done in time proportional to n^2, that is quadratic time; n^3 would be cubic time. All of these are examples of polynomial time.

But if instead the time required were 2^n, that would be exponential time. 3^n and 1.5^n would also be exponential time.

This is significant because of how much faster exponential functions grow relative to polynomial functions, for large values of n. For example, let’s compare n^3 with 2^n. When n=3, the polynomial is actually larger: n^3=27 but 2^n=8. At n=10 they are nearly equal: n^3=1000 but 2^n=1024. But by n=20, n^3 is only 8000 while 2^n is over 1 million. At n=100, n^3 is a manageable (for a modern computer) 1 million, while 2^n is a staggering 10^30; that’s a million trillion trillion.
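The comparison is easy to reproduce. This short Python sketch just tabulates the cube and the power of two for the same four values of n used above; the variable names are mine.

```python
# Compare cubic (polynomial) growth with base-2 exponential growth.
rows = [(n, n**3, 2**n) for n in (3, 10, 20, 100)]

for n, cubic, exponential in rows:
    print(f"n={n:>3}  n^3={cubic:>9,}  2^n={exponential:,}")
```

Python integers have arbitrary precision, so the n=100 row is computed exactly rather than rounded.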

You may see that there is already something a bit fishy about this: There are lots of different ways to be polynomial and lots of different ways to be exponential. Linear time n is clearly fast, and for many types of problems it seems unlikely one could do any better. But is n^100 time really all that fast? It’s still polynomial. It doesn’t take a large exponential base to make for very fast growth—2 doesn’t seem that big, after all, and when dealing with binary digits it shows up quite naturally. But while 2^n grows very fast even for reasonably-sized n, 1.0000001^n grows slower than most polynomials—even linear!—for quite a long range before eventually becoming very fast growth when n is in the hundreds of millions. Yet it is still exponential.
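As a rough check on that last claim, here is a minimal Python sketch that finds where the slow exponential finally overtakes plain linear growth. It compares logarithms to avoid computing enormous numbers, and uses bisection; the function name gap and the search bounds are my own choices.

```python
import math

BASE = 1.0000001

def gap(n):
    """log(BASE**n) - log(n); positive once the exponential is ahead of n."""
    return n * math.log(BASE) - math.log(n)

# Bisection between a point where n is ahead and one where BASE**n is ahead.
lo, hi = 2.0, 1e10
assert gap(lo) < 0 < gap(hi)
while hi - lo > 1:
    mid = (lo + hi) / 2
    if gap(mid) < 0:
        lo = mid
    else:
        hi = mid

crossover = hi
print(f"1.0000001^n overtakes n at about n = {crossover:,.0f}")
```

The crossover lands between one hundred million and three hundred million, consistent with "hundreds of millions".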


So, why do we use these categories? Well, computer scientists and mathematicians have discovered that many types of problems that seem different can in fact be translated into one another, so that solving one would solve the other. For instance, you can easily convert between the Boolean satisfiability problem and the subset-sum problem or the travelling salesman problem. These conversions always take time that is a polynomial in n (usually somewhere between linear and quadratic, as it turns out). This has allowed us to build complexity classes, classes of problem such that any problem can be converted to any other in polynomial time or better.

Problems that can be solved in polynomial time are in class P, for polynomial.

Problems that can be checked—but not necessarily solved—in polynomial time are in class NP, which actually stands for “non-deterministic polynomial” (not a great name, to be honest). Given a problem in NP, you may not be able to come up with a valid answer in polynomial time. But if someone gave you an answer, you could tell in polynomial time whether or not that answer was valid.

Boolean satisfiability (often abbreviated SAT) is the paradigmatic NP problem: Given a Boolean formula like (A OR B OR C) AND (¬A OR D OR E) AND (¬D OR ¬C OR B) and so on, it isn’t a simple task to determine if there’s some assignment of the variables A, B, C, D, E that makes it all true. But if someone handed you such an assignment, say (¬A, B, ¬C, D, E), you could easily check that it does in fact satisfy the expression. It turns out that in fact SAT is what’s called NP-complete: Any NP problem can be converted into SAT in polynomial time.
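The gap between checking and solving can be sketched in a few lines of Python, using only the three clauses written out above (the "and so on" part of the formula is left out). The representation of clauses as lists of (variable, wanted value) pairs and the function names check and brute_force are my own illustrative choices, and the brute-force search is the naive method, not a serious SAT solver.

```python
from itertools import product

# Each clause is a list of (variable, wanted_value) literals;
# ("A", False) stands for ¬A.
FORMULA = [
    [("A", True), ("B", True), ("C", True)],
    [("A", False), ("D", True), ("E", True)],
    [("D", False), ("C", False), ("B", True)],
]

def check(formula, assignment):
    """Verify an assignment with a single pass over the clauses."""
    return all(
        any(assignment[var] == wanted for var, wanted in clause)
        for clause in formula
    )

def brute_force(formula, variables):
    """Search for a satisfying assignment by trying all 2^n candidates."""
    for values in product([False, True], repeat=len(variables)):
        candidate = dict(zip(variables, values))
        if check(formula, candidate):
            return candidate
    return None

# The assignment from the text: (¬A, B, ¬C, D, E).
given = {"A": False, "B": True, "C": False, "D": True, "E": True}
given_ok = check(FORMULA, given)
found = brute_force(FORMULA, "ABCDE")
print(given_ok, found)
```

With five variables the search only has 32 candidates to try, so it finishes instantly; the point is that the candidate count doubles with every added variable while the cost of one check grows only with the length of the formula.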

This is important because in order to be useful as an encryption system, we need our one-way function to be in class P (otherwise, we couldn’t compute it quickly). Yet, by definition, this means its inverse must be in class NP.


Thus, simply because it is easy to multiply two numbers, I know for sure that factoring numbers must be in NP: All I have to do to verify that a factorization is correct is multiply the numbers. Since the way to get a public key from a private key is (essentially) to multiply two numbers, this means that getting a private key from a public key is equivalent to factorization—which means it must be in NP.
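Here is a small Python sketch of that asymmetry. Verifying a claimed factorization is one multiplication, while the naive way to find the factors is trial division. The two primes are tiny ones picked only for illustration, and trial division stands in for "a slow search"; real factoring algorithms are much cleverer, though still not polynomial-time as far as anyone knows.

```python
def verify_factorization(n, p, q):
    """Checking a claimed factorization is a single multiplication."""
    return p > 1 and q > 1 and p * q == n

def factor_by_trial_division(n):
    """Naive search for the smallest nontrivial factor of n.

    The loop can run about sqrt(n) times, which is exponential in the
    number of digits of n.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # no nontrivial factor found

p, q = 10007, 10009          # two small primes, chosen only for illustration
n = p * q                    # easy direction: multiply
factors = factor_by_trial_division(n)   # hard direction: search
print(n, factors, verify_factorization(n, *factors))
```

At this size the search takes about ten thousand steps; each additional digit in the primes multiplies that by ten, while the multiplication barely gets slower.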

This would be fine if we knew some problems in NP that could never, ever be solved in polynomial time. We could just pick one of those and make it the basis of our encryption system. Yet in fact, we do not know any such problems—indeed, we are not even certain they exist.

One of the biggest unsolved problems in mathematics is P versus NP, which asks the seemingly-simple question: “Are P and NP really different classes?” It certainly seems like they are—there are problems like multiplying numbers, or even finding out whether a number is prime, that are clearly in P, and there are other problems, like SAT, that are definitely in NP but seem to not be in P. But in fact no one has ever been able to prove that P ≠ NP. Despite decades of attempts, no one has managed it.

To be clear, no one has managed to prove that P = NP, either. (Doing either one would win you a Clay Millennium Prize.) But since the conventional wisdom among most mathematicians is that P ≠ NP (99% of experts polled in 2019 agreed), I actually think the possibility that P = NP has not been as thoroughly considered.

Vague heuristic arguments are often advanced for why P ≠ NP, such as this one by Scott Aaronson: “If P = NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found.”

That really doesn’t follow at all. Doing something in polynomial time is not the same thing as doing it instantly.

Say for instance someone finds an algorithm to solve SAT in n^6 time. Such an algorithm would conclusively prove P = NP. n^6; that’s a polynomial, all right. But it’s a big polynomial. The time required to check a SAT solution is linear in the number of terms in the Boolean formula—just check each one, see if it works. But if it turns out we could generate such a solution in time proportional to the sixth power of the number of terms, that would still mean it’s a lot easier to check than it is to solve. A lot easier.

I guess if your notion of a “fundamental gap” rests upon the polynomial/exponential distinction, you could say that’s not “fundamental”. But this is a weird notion to say the least. If n = 1 million can be checked in 1 million processor cycles (that is, milliseconds, or with some overhead, seconds), but only solved in 10^36 processor cycles (that is, over a million trillion years), that sounds like a pretty big difference to me.
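The arithmetic behind those two figures can be spelled out in Python. The clock speed of one billion cycles per second is my assumption (a round number for an ordinary processor), and one "step" of either algorithm is treated as one cycle, which is generous to both.

```python
CLOCK_HZ = 1e9                        # assumed: one billion cycles per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

n = 1_000_000
check_cycles = n                      # linear-time verification
solve_cycles = n ** 6                 # the hypothetical n^6 solver

check_seconds = check_cycles / CLOCK_HZ
solve_years = solve_cycles / CLOCK_HZ / SECONDS_PER_YEAR

print(f"check: {check_seconds * 1000:.0f} ms")
print(f"solve: {solve_years:.1e} years")
```

Under that assumption checking takes about a millisecond, and solving takes on the order of 10^19 years, which is indeed more than a million trillion (10^18).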

Even an n^2 algorithm wouldn’t show there’s no difference. The difference between n and n^2 is, well, a factor of n. So finding the answer could still take far longer than verifying it. This would be worrisome for encryption, however: Even a million times as long isn’t really that great actually. It means that if something would work in a few seconds for an ordinary computer (the timescale we want for our online shopping and banking), then, say, the Russian government with a supercomputer a thousand times better could spend half an hour on it. That’s… a problem. I guess if breaking our encryption was only feasible for superpower national intelligence agencies, it wouldn’t be a complete disaster. (Indeed, many people suspect that the NSA and FSB have already broken most of our encryption, and I wouldn’t be surprised to learn that’s true.)

But what I really want to say here is that since it may be true that P=NP—we don’t know it isn’t, even if most people strongly suspect as much—we should be trying to find methods of encryption that would remain secure even if that turns out to be the case. (There’s another reason as well: Quantum computers are known to be able to factor numbers in polynomial time—though it may be a while before they get good enough to do so usefully.)

We do know two such methods, as a matter of fact. There is quantum encryption, which, like most things quantum, is very esoteric and hard to explain. (Maybe I’ll get to that in another post.) It also requires sophisticated, expensive hardware that most people are unlikely to be able to get.

And then there is onetime-pad encryption, which is shockingly easy to explain and can be implemented on any home computer.

The problem with substitution ciphers is that you can look for patterns. You can do this because the key ultimately contains only so much information, based on how long it is. If the key contains 100 bits and the message contains 10,000 bits, at some point you’re going to have to repeat some kind of pattern—even if it’s a very complex, sophisticated one like the Enigma machine.

Well, what if the key were as long as the message? What if a 10,000 bit message used a 10,000 bit key? Then you could substitute every single letter for a different symbol each time. What if, on its first occurrence, E is D, but then it’s Q, and then it’s T—and each of these was generated randomly and independently each time? Then it can’t be broken by searching for patterns—because there are no patterns to be found.

Mathematically, it would look like this: Take each bit of the plaintext, and randomly generate another bit for the key. Add the key bit to the plaintext bit (technically you want to use bitwise XOR, but that’s basically adding), and you’ve got the ciphertext bit. At the other end, subtracting out each key bit will give back each plaintext bit. Provided you can generate random numbers efficiently, this will be fast to encrypt and decrypt—but literally impossible to break without the key.
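In code, the whole scheme fits in a dozen lines. This is a minimal Python sketch working byte by byte rather than bit by bit; the function names make_pad and xor_bytes and the sample message are mine. One caveat worth flagging: secrets.token_bytes draws from the operating system's cryptographic random source, which I am assuming is an acceptable stand-in for the truly random key that the scheme calls for.

```python
import secrets

def make_pad(length):
    """Generate `length` random key bytes from the OS's secure source."""
    return secrets.token_bytes(length)

def xor_bytes(data, pad):
    """XOR data with the pad; the same call both encrypts and decrypts."""
    if len(pad) < len(data):
        raise ValueError("pad must be at least as long as the message")
    return bytes(d ^ k for d, k in zip(data, pad))

message = "Transfer $100 to savings".encode("utf-8")
pad = make_pad(len(message))            # key as long as the message

ciphertext = xor_bytes(message, pad)    # sender
recovered = xor_bytes(ciphertext, pad)  # receiver, using the same pad
print(recovered.decode("utf-8"))
```

XOR is its own inverse, which is why "subtracting out" the key is the same operation as adding it. The pad must never be reused for a second message; that is where the "onetime" comes from.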

Indeed, onetime-pad encryption is so secure that it is a proven mathematical theorem that there is no way to break it. Even if you had such staggering computing power that you could try every possible key, you wouldn’t even know when you got the right one—because every possible message can be generated from a given ciphertext, using some key. Even if you knew some parts of the message already, you would have no way to figure out any of the rest—because there are no patterns linking the two.
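The claim that every message is consistent with a given ciphertext is easy to demonstrate. In this Python sketch the ciphertext is just nine arbitrary bytes I made up, and the three candidate messages are likewise invented; for each candidate, XORing it with the ciphertext yields a key under which the ciphertext decrypts to exactly that candidate.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

ciphertext = bytes.fromhex("8f3a11c45e907b22d6")   # arbitrary 9 bytes

# For ANY candidate message of the right length, some key "explains" it.
candidates = [b"ATTACK AT", b"RETREAT!!", b"NO ACTION"]
keys = {m: xor_bytes(ciphertext, m) for m in candidates}

for m, key in keys.items():
    decrypted = xor_bytes(ciphertext, key)
    print(decrypted.decode(), "<- key", key.hex())
```

An attacker who tries every key therefore sees every possible nine-byte message, with nothing to mark one as the real one.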

The downside is that you need to somehow send the keys. As I said in last week’s post, if you have a safe way to send the key, why can’t you send the message that way? Well, there is still an advantage, actually, and that’s speed.

If there is a slow, secure way to send information (e.g. deliver it physically by armed courier), and a fast, insecure way (e.g. send it over the Internet), then you can send the keys in advance by the slow, safe way and then send ciphertexts later the fast, risky way. Indeed, this kind of courier-based onetime-pad encryption is how the “red phone” (really a fax line) linking the White House to the Kremlin works.

Now, for online banking, we’re not going to be able to use couriers. But here’s something we could do. When you open a bank account, the bank could give you a, say, 128 GB flash drive of onetime-pad keys for you to use in your online banking. You plug that into your computer every time you want to log in, and it grabs the next part of key each time (there are some tricky technical details with synchronizing this that could, in practice, create some risk—but, done right, the risk would be small). If you are sending 10 megabytes of encrypted data each time (and that’s surely enough to encode a bank statement, though they might want to use a format other than PDF), you’ll get over 10,000 uses out of that flash drive. If you’ve been sending a lot of data and your key starts to run low, you can physically show up at the bank branch and get a new one.
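Here is a rough Python sketch of the bookkeeping involved, under some assumptions of mine: the drive holds 128 decimal gigabytes, each login uses exactly 10 decimal megabytes of key, and a hypothetical PadCursor class tracks how far into the pad you have gotten. It ignores the synchronization problem entirely; both sides would need to agree on the offset.

```python
PAD_BYTES = 128 * 10**9        # 128 GB drive (decimal gigabytes assumed)
SESSION_BYTES = 10 * 10**6     # 10 MB of encrypted traffic per login

uses = PAD_BYTES // SESSION_BYTES

class PadCursor:
    """Hypothetical bookkeeping: hand out each stretch of key only once."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.offset = 0

    def take(self, length):
        """Reserve the next `length` bytes; return their (start, end) range."""
        if self.offset + length > self.total_bytes:
            raise RuntimeError("pad exhausted: time to visit the bank branch")
        start = self.offset
        self.offset += length
        return start, self.offset

cursor = PadCursor(PAD_BYTES)
first = cursor.take(SESSION_BYTES)
second = cursor.take(SESSION_BYTES)
print(uses, first, second)
```

Under these assumptions the drive is good for 12,800 logins, and the cursor only ever moves forward, so no stretch of key is handed out twice.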

Similarly, you could have onetime-pad keys on flash drives (more literal flash keys) given to you by the US government for tax filing, and another from each of your credit card issuers. For online purchases, the sellers would probably need to have their own onetime-pad keys set up with the banks and credit card companies, so that you send the info to VISA encrypted one way and they send it to the seller encrypted another way. Businesses with large sales volume would go through keys very quickly—but then, they can afford to keep buying new flash drives. Since each transaction should only take a few kilobytes, the cost of additional onetime-pad keys should be small compared to the cost of packing, shipping, and the items themselves. For larger purchases, businesses could even get in the habit of sending you a free flash key with each purchase so that future purchases are easier.

This would render paywalls very difficult to implement, but good riddance. Cryptocurrency would die, but even better riddance. It would be most inconvenient to deal with things like, well, writing a blog like this; needing to get a physical key from WordPress sounds like quite a hassle. People might actually just tolerate having their blogs hacked on occasion, because… who is going to hack your blog, and who really cares if your blog gets hacked?

Yes, this system is awkward and inconvenient compared to our current system. But unlike our current system, it is provably secure. Right now, it may seem like a remote possibility that someone would find an algorithm to prove P=NP and break encryption. But it could definitely happen, and if it did happen, it could happen quite suddenly. It would be far better to prepare for the worst than be unprepared when it’s too late.