Statistics and psychology: Multiple comparisons give spurious results.

Do Women Really Wear Red When They’re Fertile?

July 24 2013 12:37 PM

Too Good to Be True

Statistics may say that women wear red when they’re fertile … but you can’t always trust statistics.

And this happens all the time. For example, the July 2013 issue of Psychological Science included a paper claiming an association between men's upper-body strength and their attitudes about economic redistribution. The authors wrote, "We showed that upper-body strength in modern adult men influences their willingness to bargain in their own self-interest over income and wealth redistribution. These effects were replicated across cultures and, as expected, found only among males." In fact, two of their three studies were of college students, and they did not measure anybody's upper-body strength; they just took measurements of arm circumference. It's a longstanding tradition to do research studies using proxy measures on students, but if it's OK to do so, you should be open about it: Instead of writing about "upper-body strength" and "men," be direct and say "arm circumference" and "students." Own your research choices!

But, to return to the main theme here, these researchers had enough degrees of freedom to find any number of apparent needles in the haystack of their data. Most obviously, the authors report a statistically significant interaction with no statistically significant main effect. That is, they did not find that men with bigger arm circumference had more conservative positions on economic redistribution. What they found was that the correlation of arm circumference with opposition to redistribution of wealth was higher among men of high socioeconomic status. But had they seen the main effect (in either direction), I'm sure they could have come up with a good story for that, too. And if there had been no main effect and no interaction, they could have looked for other interactions. Perhaps, for example, the correlations could have differed when comparing students with or without older siblings?

Researchers’ decisions about which variables to analyze may make perfect sense, but they indicate the difficulty of taking these p-values at anything like face value. There is no reason to assume the researchers were doing anything nefarious or trying to manipulate the truth. Rather, like sculptors, they were chipping away the pieces of the data that did not fit their story, until they ended up with a beautiful and statistically significant structure that confirmed their views.
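To see why a nominal p-value is hard to take at face value, here is a back-of-the-envelope sketch. It assumes, unrealistically, that the comparisons a researcher could have reported are independent of one another, that none reflects a real effect, and that each is tested at the conventional 0.05 level. The function name and the counts of comparisons are mine, chosen for illustration; they are not taken from any of the papers discussed here.

```python
def chance_of_some_significant_result(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent comparisons,
    none of which reflects a real effect, comes out with p < alpha."""
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    # Hypothetical numbers of comparisons a researcher might have available.
    for k in (1, 3, 5, 10, 20):
        print(f"{k:2d} comparisons: {chance_of_some_significant_result(k):.2f}")
```

Under these assumptions, a researcher with 20 comparisons to choose from has about a 64 percent chance of finding at least one "statistically significant" result in pure noise. Real comparisons are usually correlated, so the exact figure will differ, but the direction of the problem is the same.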


At this point you might be thinking that I'm getting pretty picky here. Sure, these researchers made some judgment calls. But are a few judgment calls enough to conjure up publishable findings in the absence of any real effect?

The answer is yes, and in fact there have been famous cases in recent years of scientists demonstrating that, with enough researcher degrees of freedom, you can get publication-quality statistical results when there is nothing going on at all. I will give two examples.

The first example was inadvertent on the scientist's part, and it is rather sad. Daryl Bem, a distinguished research psychologist at Cornell, garnered headlines two years ago when he published a paper in the Journal of Personality and Social Psychology (another leading journal in the field) purporting to find ESP. The paper included nine different experiments and many statistically significant results. Unfortunately (or perhaps fortunately, for those of us who don't want the National Security Agency to be reading our minds as well as our emails), these experiments had multiple degrees of freedom that allowed Bem to keep looking until he could find what he was searching for. In his first experiment, in which 100 students participated in visualizations of images, he found a statistically significant result for erotic pictures but not for nonerotic pictures. But consider all the other possible comparisons: If the subjects had identified all images at a rate statistically significantly higher than chance, that certainly would have been reported. Or what if performance had been higher for the nonerotic pictures? One could easily argue that the erotic images were distracting and only the nonerotic images were a good test of the phenomenon. Or what if participants had performed statistically significantly better in the second half of the trial than in the first half? That would be evidence of learning. Or if they performed better on the first half? Evidence of fatigue. Bem reports, "There were no significant sex differences in the present experiment." If there had been (for example, if men had performed better with erotic images and women with romantic but nonerotic images), this certainly could have been presented as convincing evidence. And so on.
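The list of comparisons above can be turned into a small simulation. This is only a sketch under assumptions of my own, not a reconstruction of Bem's analysis: every score is pure noise with known variance, each of 100 hypothetical participants contributes one score per combination of picture type and half of the session, and the analyst is free to report whichever of five overlapping analyses reaches p < 0.05. All function and variable names are invented for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_statistic(scores):
    """z statistic for 'mean differs from zero', assuming unit-variance noise."""
    return sum(scores) / math.sqrt(len(scores))


def simulate_experiment(rng, n_participants=100):
    """Return p-values for five analyses of one experiment with no real effect."""
    cells = {
        (kind, half): [rng.gauss(0, 1) for _ in range(n_participants)]
        for kind in ("erotic", "nonerotic")
        for half in ("first", "second")
    }

    def pooled(keep):
        # Pool the scores from every cell that the analysis chooses to keep.
        return [s for key, scores in cells.items() if keep(key) for s in scores]

    analyses = {
        "all images": pooled(lambda key: True),
        "erotic only": pooled(lambda key: key[0] == "erotic"),
        "nonerotic only": pooled(lambda key: key[0] == "nonerotic"),
        "first half": pooled(lambda key: key[1] == "first"),
        "second half": pooled(lambda key: key[1] == "second"),
    }
    return {name: two_sided_p(z_statistic(s)) for name, s in analyses.items()}


def false_positive_rates(n_experiments=2000, alpha=0.05, seed=0):
    """Compare one analysis chosen in advance with 'report whatever works'."""
    rng = random.Random(seed)
    planned_hits = 0
    any_hits = 0
    for _ in range(n_experiments):
        p_values = simulate_experiment(rng)
        planned_hits += p_values["erotic only"] < alpha
        any_hits += min(p_values.values()) < alpha
    return planned_hits / n_experiments, any_hits / n_experiments


if __name__ == "__main__":
    planned, flexible = false_positive_rates()
    print(f"one analysis chosen in advance: {planned:.3f}")
    print(f"best of five analyses:          {flexible:.3f}")
```

The analysis chosen in advance should come out significant in roughly 5 percent of the simulated experiments, as it is supposed to. The "best of five" rate is noticeably higher, even though nothing is going on in any of the data and no individual analysis is unreasonable.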

Lots of people want to believe in ESP. After all, it would be cool to read minds. (It wouldn’t be so cool, maybe, if other people could read your mind and you couldn’t read theirs, but I suspect most people don’t think of it that way.) And ESP seems so plausible, in a wish-fulfilling sort of way. It really feels like if you concentrate really hard, you can read minds, or predict the future, or whatever. Heck, when I play squash I always feel that if I really, really try hard, I should be able to win every point. The only thing that stops me from really believing this is that I realize that the same logic holds symmetrically for my opponent.

In the years since Bem's experiment, researchers have tried and failed to replicate his results. It is too bad that resources had to be wasted on this, but perhaps this is all worth it if it brings more attention to the ubiquitous problem of researcher degrees of freedom.

My final example is a wonderful study by psychologist Craig Bennett and colleagues, who found statistically significant correlations in a functional MRI scan of a dead salmon. They were using the same sort of analysis that non-joking political scientists use in making claims such as "Red Brain, Blue Brain: Evaluative Processes Differ in Democrats and Republicans," but the difference is that Bennett and his colleagues are open about the fact that these imaging studies have hundreds of thousands of degrees of freedom. The salmon study is beautiful because everyone knows a dead fish can't be thinking, but it's still possible to find patterns if you look hard enough.
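The dead-salmon logic can also be sketched in a few lines. This is not the analysis Bennett and colleagues ran. It treats each voxel as an independent draw of pure noise (real voxels are spatially correlated), and the voxel count and threshold are hypothetical values I picked for illustration. For contrast it applies a Bonferroni correction, which is one standard and conservative way of accounting for the number of tests; I am not claiming it is the correction used in the salmon paper.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def count_significant_voxels(n_voxels=100_000, alpha=0.001, seed=1):
    """Count 'active' voxels in pure noise, with and without correction.

    n_voxels and alpha are illustrative assumptions, not values from any study.
    """
    rng = random.Random(seed)
    p_values = [two_sided_p(rng.gauss(0, 1)) for _ in range(n_voxels)]
    # Uncorrected: every voxel is tested at the same per-voxel threshold.
    uncorrected = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of tests performed.
    corrected = sum(p < alpha / n_voxels for p in p_values)
    return uncorrected, corrected


if __name__ == "__main__":
    uncorrected, corrected = count_significant_voxels()
    print(f"voxels passing the uncorrected threshold: {uncorrected}")
    print(f"voxels passing the Bonferroni threshold:  {corrected}")
```

With 100,000 noise voxels and a per-voxel threshold of 0.001, about 100 voxels are expected to look "active" by chance alone, while the corrected threshold should almost always leave none.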

In one of his stories, science fiction writer and curmudgeon Thomas Disch wrote, "Creativeness is the ability to see relationships where none exist." We want our scientists to be creative, but we have to watch out for a system that allows any hunch to be ratcheted up to a level of statistical significance that is then taken as scientific proof.

Even if something is published in the flagship journal of the leading association of research psychologists, there's no reason to believe it. The system of scientific publication is set up to encourage publication of spurious findings.

Update, July 31, 2013: Jessica Tracy and Alec Beall, the authors of the red clothing study, posted a response to this story, and Andrew Gelman republished their comments along with his reactions on his blog.  

Correction, July 24, 2013: Due to an editing error, this story misnamed the Association for Psychological Science.