Please indulge me for a second in a weird analogy that comes from my Catholic upbringing.
People who haven’t read the gospels before are often surprised to find that Jesus spends a lot of his time giving lectures and then getting bogged down in Q&As with guys who are like “um yeah less of a question and more of a comment, really.” There’s this particular group of scholars called the Pharisees who are always trying to nail Jesus for saying stupid things. (They never succeed, but they get their revenge at the end of the story by nailing Jesus in a more literal sense.)
One day, the Pharisees ask Jesus, “Is it okay to get divorced or nah?” And Jesus says, “Nah.” Specifically, he says, “What God has joined together, let no one separate.” At this, the Pharisees think they've got Jesus dead to rights:
“Why then,” they asked, “did Moses command that a man give his wife a certificate of divorce and send her away?”
Jesus replied, “Moses permitted you to divorce your wives because your hearts were hard. But it was not this way from the beginning.”
This idea stuck deep in my young mind: there is a perfect, virtuous way that God wants us to be, but he knows it’s hard, so he makes exceptions. I'm bringing this up not because I care about the biblical teachings on divorce (I'm good, thanks), but because there's a terrific analogy here for science. Which is: God wants us to discover truths about the universe, but our hearts are hard, so he begrudgingly allows us to use statistics.
CONGRATULATIONS, YOU’VE GIVEN BIRTH TO A BEAUTIFUL ERROR CONTROL BABY
I used to see statistics as the secret numerical language that smarty-pantses use to discover truth.
You can't do anything important, I figured, until you can tell your Cohen's d from your Hedges' g, your Mahalanobis distance from your Cook's distance, and your betas from your etas. Only after I mastered all that would a statistics Yoda slice off my padawan rattail with a lightsaber, pronounce me wise in the ways of the stats, and allow me to create knowledge. “Creating knowledge,” of course, meant running studies, analyzing the results, and reporting p-values, allowing me to claim that my results were “statistically significant”—unlikely to have arisen through chance alone. To me, that's what psychology was: a bunch of stories that end in a number.
(I took to this idea with religious fervor: almost ten years ago, I wrote a now-cringeworthy article for the Association for Psychological Science called Psychology Significantly More Scientific than Previously Thought (p < .01)1 in which I gushed over how science-y psychology is because it has, among other things, error bars.)
This also meant you should show extreme deference to people who have greater statistical knowledge, for they are closer to God. When I was just a pipsqueak research assistant working in a psychology lab and I got stuck on some statistics problem, I would go to a place I called the Den of Hate, a basement room on campus where a grumpy PhD student would give you condescending advice on your data problems. I would sit in his presence like a sinner in front of a parish priest, confessing all of my statistical sins and swearing to do his prescribed penance, which often meant giving up on an idea entirely. He knew God's will, and I didn't, so what else could I do but apologize for my ignorance and follow his orders? This, having grown up Catholic, was all very familiar to me.
Apparently you don't need to spend years as an altar server to end up feeling this way, because the rest of my colleagues were also obsessed with stats. According to this list, the most-cited psychology paper of all time is an explanation of how to do mediation and moderation analysis, two popular statistical techniques (123,000 cites and counting!). Not far down the list is this book on multiple regression. One of the most-cited psychology papers in recent history is “False-Positive Psychology,” a demonstration of how easy it is to manipulate statistics to get whatever result you want. The entire replication crisis was one big argument about numbers and how to use them.
In fact, one of psychologists' favorite pastimes is to complain about how other psychologists don't know how to do statistics. “It's too easy for naughty researchers to juice their stats, so we should lower the threshold of statistical significance from p = .05 to p = .005!” “Actually, we should use these new statistics instead!” “In fact, we should ban p-values entirely!” One article advocating for an alternative statistical framework begins with this hilariously dramatic quote:
Dark and difficult times lie ahead. Soon we must all face the choice between what is right and what is easy.
A. P. W. B. Dumbledore
It's not just psychologists, of course. 96% of biomedical articles published between 1990 and 2015 included at least one p-value. Neuroscientists have done a lot of soul-searching since a few of them demonstrated that common techniques could show statistically significant brain activity in a dead salmon. The American Statistical Association has also waded into the fray, publishing an official statement on p-values accompanied by 21 different commentaries sporting such titles as, “Don't throw out the error control baby with the bad data bathwater.”
LESSONS FROM BIRD MURDER
If there's one thing they're really clear on in Catholic school, it's that you shouldn't worship idols. And that's exactly what statistics has become for us—we've turned it from a useful servant into a bad master.
Here's what I wish the Den of Hate guy would have told me:
Hey dude, remember that statistics are just a tool for distinguishing signal from noise. If we really knew what was going on, we wouldn’t need statistics. It’s fine to use them, just like it’s fine to use training wheels and water wingies. But we should aspire to shed them eventually. You’ll never really learn to bike or swim until you ditch the extra wheels and the inflatables, and you'll never really understand the universe if you're stuck studying it with statistics.
At the time, I probably would have protested. The signal is so quiet and the noise is so loud! How could you ever tell the difference between them without a bunch of complicated math?
Well, a lot of people figured out how to do exactly that, because most of scientific history happened before the invention of statistics. Modern stats only started popping off in the late 1800s and early 1900s, which is when we got familiar and now-indispensable concepts like correlation, standard deviation, and regression.2 (In fact, all of those came from one guy, which is why I reviewed his autobiography.) The p-value itself didn't get popular until the 1920s, thanks in part to a paper published pseudonymously by the head brewer at Guinness (yes, the beer company). So the whole ritual of “run study, apply statistical test, report significance” is only about 100 years old, and the people who invented it were probably drunk.
What did scientists do before that? They dropped cannonballs off of towers. They ate meat tainted with cholera in an attempt to win a bet. They stuck a bunch of stuff under a microscope and wrote “unhinged” letters about what they saw. They watched the planets go around, or at least they did until some Teutonic Knights showed up and smashed all their stuff. They made things glow in the dark. They drew pictures of birds, put birds in vacuum chambers, and trained birds to spin around (not the same birds they put in the vacuum chambers).3 Somehow we managed to bank several hundred years of scientific progress without a p-value in sight.
Okay, so, when those guys were dropping their cannonballs and doing all that bird murder, what were they actually doing? Three things, I think:
They were chasing ginormous effects because they had no way of detecting teeny-tiny effects. Most things don't glow in the dark, for instance, so when something does, you go, “Hey, that's glowing in the dark.”
They were studying self-evident phenomena, things that they definitely knew happened and yet could not explain. A big chunk of scientific history can be boiled down to: “One day Saturn is over here and the next day it's over there and excuse me but what the hell is going on.”
They were making gutsy, falsifiable hypotheses about how the world works. None of this namby-pamby “sometimes this sort of thing happens, a little bit, kind of, if the situation is just right in ways that I can't articulate.” No, they were saying things like “Birds need air to live, full stop, and if you find a bird that can survive in my vacuum chamber, I will eat my hat.”
Using statistics isn't inherently at odds with doing any of those three important activities. But in practice, stats can seduce you into studying minuscule effects that may or may not actually exist, which then allows you to spin hand-wavy theories about why those effects sometimes appear and sometimes don't. This can go on for years, a perpetual motion machine powered entirely by an inexhaustible supply of p-values.
Here's an example from psychology. In 2008, researchers showed a random sample of participants a tiny picture of an American flag and found that people “primed” with the flag were more likely to vote for John McCain for president two weeks later, and had less positive attitudes toward Barack Obama's job performance eight months after that. In 2020, the same researchers published another paper in which they couldn't find any effect of flag priming anymore, and wondered whether the effect had declined over time or had never existed at all.
This “flag priming” study became one of the poster children for the replication crisis, and the overwhelming reaction is now, “People should use statistics correctly!” But that misses the point: studies like this only exist because of statistics. In this case:
The effect is not ginormous. In fact, it's so tiny—if it exists at all—that you can only see it with the aid of stats.
It's not self-evident. It's not like people sometimes see an American flag and go “I must go vote for a Republican right away!” and hurry off to their nearest precinct. You would only go looking for a wobbly, artificial phenomenon like this when you have statistical tools for finding it.
It doesn't have a gutsy, falsifiable hypothesis behind it. If you run another flag priming study and fail to find an effect, you don't have to eat your hat. You can just say, “Well of course there's a lot of noise and it's hard to find the signal. Maybe it's the wrong context, or the wrong group of people, or maybe flags mean something different now.”
Would studies like this be better if they always did all their stats perfectly? Of course. But the real improvement would be not doing this kind of study at all.
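Just to put a rough number on that first point about effect size: if an effect is as tiny as, say, a tenth of a standard deviation (a made-up figure, not one from the flag studies), textbook power math says you'd need on the order of 1,500 people per group just to have a decent shot at detecting it at p < .05. Here's a quick sketch of that calculation, assuming you have Python and the statsmodels library around:

```python
# Back-of-the-envelope power calculation: how many participants per group
# does it take to reliably detect a tiny effect at the usual p < .05 cutoff?
# (Illustrative numbers only; the effect size is hypothetical.)
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.1,         # a "tiny" effect: one tenth of a standard deviation
    alpha=0.05,              # conventional significance threshold
    power=0.8,               # 80% chance of detecting the effect if it's real
    alternative="two-sided",
)
print(f"Participants needed per group: {n_per_group:.0f}")  # about 1,570
```

No human eyeball is going to notice a difference that small in the wild; an effect like that lives and dies entirely inside the statistics.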
I get why the social sciences, and psychology in particular, fight about stats so much. Using numbers makes you look smart. “Did they crunch the numbers correctly?” is a much easier question to answer than “Does this study matter?” Most of all, though, I think it's because statistics are largely the rules of the academic game; they decide what gets published and who gets hired. So arguing about statistics is much like arguing about whether a dice roll still counts if it falls off the table, or whether proper nouns should be allowed in Scrabble—the kind of thing that can turn normal people into table-flipping maniacs.
That's what I think, anyway! But perhaps that's just because, eight months ago, I saw a little flag with “BE GRUMPY ABOUT STATS” written on it.
DR. DOWN VS. DR. UP
I've steeped for so long in the idea that doing science = doing statistics that it's hard for me to think in any other way, so here's an example that helped me see both how discoveries can happen without needing lots of numbers, and how employing numbers might have actually slowed that discovery down.
Today we know that if you have an extra copy of chromosome 21, you will have a collection of symptoms we call Down syndrome. That's a guarantee, no statistics necessary. To figure that out, though, you need to know what a chromosome is, and you have to be able to count them. That's why J. Langdon Down, the English physician who first described the syndrome in 1866, had no idea what caused it—he figured it was because the child's parents had tuberculosis. We didn't discover chromosomes until the 1880s, we didn't know how many chromosomes humans had until 1956, and we didn't connect Down syndrome to having an extra chromosome until 1958.
The original Dr. Down could have spent the rest of his life trying to understand the cause of his eponymous syndrome and he would have failed. He could have run study after study and employed all kinds of statistical tests—not that they had been invented yet—and he almost certainly would have found something. It's not hard to happen across a significant p-value—a bit of confirmation bias, a hint of selection effect, some sloppy measurement, and bingo bango, you get p < .05, a statistically significant result. Then another researcher tries something similar and finds nothing, and that's not a failure; that's the beginning of a field! You can support several careers on a dispute like that! It could have looked like this:
Dr. Down: Parental tuberculosis is associated with Down syndrome in children, p < .05.
Dr. Up: I studied a larger sample and I didn't find your effect, p = .78!
Dr. Down: The parents in your sample didn't have severe enough tuberculosis! You have to analyze tuberculosis severity!
Dr. Up: Your original study was underpowered for an analysis like that! It's spurious!
Dr. Down: I find the same thing using a structural equation model and a flat Bayesian prior!
Dr. Up: Your model is overfitted and your prior is wrong!
Dr. Down: I HATE YOU
Dr. Up: GO DIE
And so on, forever, consuming people's lives and millions of dollars in scientific funding along the way. All of this might look like very important scientific discourse, but remember that, no matter how long Dr. Down and Dr. Up argue, they are never going to figure out what causes Down syndrome. Their problem is that they don't know what a chromosome is, and that can't be solved with statistics.
In fact, statistics are counterproductive here, because they create the illusion that there's a discoverable truth, and that the doctors are coming closer and closer to discovering it, if they can just figure out how to do the numbers right. That is to say: statistics can, paradoxically, drown out the signals that would tell you just how ignorant you are.
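If you want to see how cheaply Dr. Down could keep the argument going, here's a toy simulation (mine, not anything from the historical record): there is no real effect anywhere in the data, but a researcher who tests twenty subgroups and reports only the best one will stumble into p < .05 most of the time.

```python
# Toy demonstration: pure noise plus a little selective reporting
# yields "statistically significant" findings most of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 1_000     # imaginary research careers
n_subgroups = 20      # tuberculosis severity, parental age, region, etc.
lucky_studies = 0

for _ in range(n_studies):
    p_values = []
    for _ in range(n_subgroups):
        # Both groups come from the SAME distribution: there is no effect.
        group_a = rng.normal(loc=0, scale=1, size=50)
        group_b = rng.normal(loc=0, scale=1, size=50)
        p_values.append(stats.ttest_ind(group_a, group_b).pvalue)
    if min(p_values) < 0.05:  # report only the most flattering subgroup
        lucky_studies += 1

print(f"Studies with a 'significant' finding: {lucky_studies / n_studies:.0%}")
# Expect roughly 64%, since 1 - 0.95**20 is about 0.64.
```

None of those “discoveries” would bring anyone an inch closer to chromosomes, which is the whole problem.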
GRINK AND BEAR IT
If I had heard this argument even a few years ago, I know how I would have responded:
“Sure, we can discover important truths without statistics in areas like chemistry or medicine, where there are ironclad laws that create huge effects. But the social sciences aren't like that. Once you're trying to study humans walking around and talking to each other and making governments and economies and what not, there's simply too much complexity. It's like trying to predict the weather more than a few days out—the signals get drowned out by randomness. That's why, in the social sciences, we'll always need statistics to help us detect our delicate little effects.”
I don't feel that way anymore, and I'll explain why with the assistance of The Grink and a hole in the head.
I hang out a lot with improv people, who are all about bits—silly little jokes and riffs on things people say.
Occasionally there will be a non-comedy person in the mix who will say something like “Haha you guys are so random!” and everybody will wince because we're not being random at all. We're laughing because each bit is a play on something. If you think it's random, it's because you have no idea what's going on. And that's fine, there's no shame in not knowing stuff. But you shouldn't assume that just because the conversation makes no sense to you, it must not make sense at all. C'mon, dude! You thought “grink” was just some sound we like to make?
The lesson here is things seem random until you understand 'em.
If Dr. Down had a hundred years to study his syndrome, but never discovered chromosomes, he would probably end up thinking that Down syndrome arises pretty much randomly. What gives rise to this condition, Dr. Down? “I don't know, it's mostly noise!”
We've made it farther than he did, but we are again stuck staring into a void that looks like randomness: we don't really know why some people end up with an extra chromosome 21. The National Institutes of Health will tell you that it's because of a “random error in cell division.” Well, things seem random until you understand 'em. Something causes those errors; we just don't know what it is yet.
So yes, today the social sciences mainly produce a bunch of flimflam phenomena that you can only detect with statistics and that disappear with the tiniest change in context. But that doesn't mean we're doomed to do that forever. Plenty of perfectly respectable scientific endeavors have begun this way! It's all flimflam until you start to see the regularities in the randomness.
For several thousand years, the state of the art in treating head wounds was to cut an even larger hole in your skull. If you had been a canny observer of medicine between Hippocrates and the 1800s, you might have assumed that we'd progressed as far in cranial medicine as we were ever going to get. What can doctors do for patients who bonked their heads big time? Who knows, it's too complicated, too random! And yet, today a person with a cracked-open skull can visit their local hospital and receive surgical interventions that might actually save their life, or at least not kill them faster than their injuries. Progress: it can happen even when it looks impossible!
We will never stop producing flimflam phenomena, however, if that's all we ever look for. And it's our reliance on statistics that sends us looking for them. Nobody ever explicitly tells you to search for itsy bitsy, teenie weenie results, but if someone's like “Here's a microscope made out of numbers, go discover truth with it,” of course you're going to point it at tiny things. Do that long enough and you might come to believe, like I did, that only tiny things exist.
At that point, changing your mind might require some divine intervention. So let's get some.
THE SOMETIMES SCIENCE
Imagine God came thundering down to Earth and was like: “Hey guys! Oh, um, no, I’m not here to do the rapture. I just want to make a point about statistics. I’m tired of you guys arguing about them! From now on, I forbid you from using statistical tests. You may only say that things happen always, sometimes, or never. That’s all, peace out, I’m God.”
There are some swaths of science where this wouldn't be a big deal. An atom of lithium always has three protons. Things never go faster than the speed of light. Animal cells never have cell walls.
(That's not to say we're right about all of these things! You can, in fact, trace the evolution of our understanding by following the transformation of sometimes, always, and never. Like this:
200s BC - 1600s AD: Life sometimes arises from non-life! Oysters come from mud, etc.
1600s - 1800s: No it doesn't! Life always comes from life!
1800s - now: Actually, life sometimes arises from non-life! But not in the “oysters sometimes come from mud” sense, more in the “chains of amino acids can form in the right kind of soup” sense.)
We psychologists would look around and realize that we mainly have a big pile of sometimes. Does cognitive-behavioral therapy help with anxiety disorders? Sometimes. Do people feel more free to do something naughty after doing something nice? Sometimes. Does adding a third option make people better at discriminating between two other options? Sometimes. Do people prefer universally implementing a policy rather than testing it with a randomized controlled trial? Sometimes. Can “growth mindset” interventions improve students' performance? Our current answers are sometimes, but said with a smile, and sometimes, but said with a frown.
That's not a dig, exactly, because it's not that sometimes is inherently uninteresting. If a treatment works only sometimes, that's still better than never. And if you expect something to happen always or never, then discovering it happens sometimes is pretty cool. For instance, how often should you be able to get regular people to shock other regular people to death just by asking them to do it? Seems like that would happen pretty much never, right? Well, it’s sometimes.4
But it's not enough to just stack up sometimes-es forever. In the great jigsaw puzzle of science, always and never are the corner pieces. Does psychology have any of those?
Well, here’s one: everything we consider human psychology—thoughts, attitudes, memories, language, all that stuff—is created by the nervous system. You should never be able to find mental activity without corresponding neuronal activity. That may seem obvious now, but it wasn’t obvious to our ancestors. Aristotle thought the purpose of the brain was to cool the blood, like a big meaty radiator. In his model, all sensation happened in the heart. (Presumably Aristotle would have predicted that heart transplants would create a Freaky Friday-type situation, where the personality of the donor would be transferred to the body of the recipient. This, as far as I can tell, never happens.) Even up until Leonardo da Vinci's time, people believed that one of the brain's purposes was producing semen.5
(This is also apparently still big news to the neuroscientists. I used to watch lots of neuroscience talks where the presenter would be like, “How do humans do this amazing thing?” and then they'd show a lit-up patch of brain and go “I found it, it's right here!” After the talk, one of my colleagues would always say, “Where did they think it was going to be? The elbow?”)
Here's another one: some visual illusions seem pretty always.
Take the Kanizsa triangle, where three Pac-Man shapes are arranged just so: it kinda looks like there's an invisible triangle nestled between them, right? And it doesn't go away, even when you rationally explain to yourself that there's no triangle there at all. The point is that your brain doesn't just peek out through your eyes and behold the world as it is; it adds a lot of assumptions to make incoming visual information more meaningful.
I could keep listing psychological always and nevers, but not for very long, because we don't have that many of them yet. That's fine, more for us to discover! But we won't discover them with statistics alone, for that is the language of sometimes.
When used wisely, statistics can instead be like a trail of breadcrumbs that leads us out of the dark woods into the light of understanding. “There are more cancer cases in this town than we would expect by chance, what's going on there?” We might have to follow that trail of bread crumbs for a very long time, and there's nothing wrong with that. But the point is to leave the woods, not to spend our lives eating the crumbs. “Man shall not live on bread[crumbs] alone.”
HOW TO GET A BIG KISS ON THE LIPS FROM ME
I’m thinking about all this right now because I recently finished William Powers’ Behavior: The Control of Perception, a book from 1973 that lays out a theory-of-everything for psychology. There’s too much to go into right now, but here's something crazy that he says toward the end, when he's talking about how to test his theory:
The experimental methods we will look at in this chapter are largely nonstatistical. They do not concern the behavior of masses of subjects, but of single individuals considered one at a time. They are all aimed at testing or elaborating on one organizational model; that model, not random guessing, is the source of hypotheses to be tested, and the tests, not being statistical, make room for no counterexamples. All counterexamples are critical and require revision of the theory.
Daaaamn! Go off, Willie P! I don’t think he figured everything out, but this is definitely the right way to figure things out. If you can find me a psychologist today who is willing to say “All counterexamples are critical and require revision of my theory,” I’d give ‘em a big kiss on the lips.
If we're going to turn psychology into a paradigmatic science, not just a science of sometimes, this is where we have to get eventually. Godspeed to us all, see you in heaven.
2. There are sporadic examples of people using proto-statistics to answer scientific questions before this. For instance, in 1710, Dr. John Arbuthnot used some probability theory to show that women give birth to more boys than girls every year because, uh, God makes it happen.
3. Okay, this paper is from 1948, but a) rule of three, you gotta have a third bird thing, and b) it doesn't include any statistical tests.
4. I know people have attempted to debunk the Milgram shock studies, but unlike many other classic psychology experiments, I believe these remain bunked.
5. da Vinci includes a tube from the brain to the penis in an earlier drawing, but omits it in a later drawing. Perhaps he changed his mind, most likely after draining it of semen.