Too Good to Be True

Statistics may say that women wear red when they’re fertile … but you can’t always trust statistics.

July 24, 2013, 12:37 PM

Woman wearing a red shirt. Does the red shirt mean she’s ovulating? Not so fast …

Photo by Ghenadie Rusu/Thinkstock

Are women three times more likely to wear red or pink when they are most fertile? No, probably not. But here’s how hardworking researchers, prestigious scientific journals, and gullible journalists have been fooled into believing so.

The paper I’ll be talking about appeared online this month in Psychological Science, the flagship journal of the Association for Psychological Science, which represents the serious, research-focused (as opposed to therapeutic) end of the psychology profession.*

“Women Are More Likely to Wear Red or Pink at Peak Fertility,” by Alec Beall and Jessica Tracy, is based on two samples: a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia. Here’s the claim: “Building on evidence that men are sexually attracted to women wearing or surrounded by red, we tested whether women show a behavioral tendency toward wearing reddish clothing when at peak fertility. … Women at high conception risk were more than three times more likely to wear a red or pink shirt than were women at low conception risk. … Our results thus suggest that red and pink adornment in women is reliably associated with fertility and that female ovulation, long assumed to be hidden, is associated with a salient visual cue.”

Pretty exciting, huh? It’s (literally) sexy as well as being statistically significant. And the difference is by a factor of three—that seems like a big deal.

Really, though, this paper provides essentially no evidence about the researchers’ hypotheses, for three little reasons and one big reason.

First, some specific problems with this particular study:

1. Representativeness. What color clothing you wear has a lot to do with where you live and who you hang out with. Participants in an Internet survey and University of British Columbia students aren’t particularly representative of much more than … participants in an Internet survey and University of British Columbia students.

2. Measurement. The researchers asked people when their last menstrual period started. People might not remember. The researchers did ask respondents how certain they were, but respondents often overstate their certainty.

3. Bias. The article defines the “high-conception risk group” as women who had onset of menses six to 14 days earlier. I saw this and was suspicious. (I have personal experience with fertility schedules because my wife and I had a child in our mid-40s.) According to womenshealth.gov, the most fertile days are between days 10 and 17 of a 28-day menstrual cycle. Babycenter.com says days 12 to 17. I looked at Beall and Tracy’s paper and followed some references, and it appears they followed a 2000 paper by Penton-Voak and Perrett, which points to a 1996 paper by Regan, which points to the 14th day as the best estimate of ovulation. Regan claims that “the greatest amount of sexual desire was experienced” on Day 8. So my best guess (but it’s just a guess) is that Penton-Voak and Perrett misread Regan, and then Beall and Tracy just followed Penton-Voak and Perrett.

4. And now the clincher, the aspect of the study that allowed the researchers to find patterns where none likely exist: “researcher degrees of freedom.” That’s a term used by psychologist Uri Simonsohn to describe researchers’ ability to look at many different aspects of their data in a search for statistical significance. This doesn’t mean the researchers are dishonest; they can be sincerely looking for patterns in the data. But our brains are such that we can, and do, find patterns in noise. In this case, the researchers asked people “What color is the shirt you are currently wearing?” but they don’t say what they did about respondents who were wearing a dress, nor do they say if they asked about any other clothing. They gave nine color options and then decided to lump red and pink into a single category. They could easily have chosen red or pink on its own, and of course they also could’ve chosen other possibilities (for example, lumping all dark colors together and looking for a negative effect). They report that other colors didn’t yield statistically significant differences, but the point here is that these differences could have been notable. The researchers ran the comparisons and could have reported any other statistically significant outcome. They picked Days 0 to 5 and 15 to 28 as comparison points for the time of supposed peak fertility. There are lots of degrees of freedom in those choices. They excluded some respondents they could’ve included and included other people they could’ve excluded. They did another step of exclusion based on responses to a certainty question.

I just gave a lot of detail here, but in a sense the details are the point. The way these studies fool people is that they are reduced to sound bites: Fertile women are three times more likely to wear red! But when you look more closely, you see that there were many, many possible comparisons in the study that could have been reported, with each of these having a plausible-sounding scientific explanation had it appeared as statistically significant in the data.

The standard in research practice is to report a result as “statistically significant” if its p-value is less than 0.05; that is, if there is less than a 1-in-20 chance that a pattern at least as strong as the one observed would have occurred if there were really nothing going on in the population. But of course if you are running 20 or more comparisons (perhaps implicitly, via choices involved in including or excluding data, setting thresholds, and so on), it is not a surprise at all if some of them happen to reach this threshold.
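To put a rough number on that, here is a minimal Python sketch. It assumes the comparisons are independent and that nothing is really going on, so each p-value is uniform between 0 and 1. Comparisons run on a single data set are correlated, so treat the result as a guide rather than an exact figure. The function names are mine, chosen for illustration.

```python
import random


def chance_of_false_alarm(alpha: float, k: int) -> float:
    """Probability that at least one of k independent null comparisons has p < alpha."""
    return 1.0 - (1.0 - alpha) ** k


def simulated_false_alarm_rate(alpha: float, k: int,
                               trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check of the formula above."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: under the null hypothesis each p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(k)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20, 50):
        print(k, round(chance_of_false_alarm(0.05, k), 3))
```

With a threshold of 0.05 and 20 independent comparisons, the formula gives 1 - 0.95^20, or about 0.64: nearly two chances in three of at least one "significant" result from pure noise.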

The headline result, that women were three times as likely to be wearing red or pink during peak fertility, occurred in two different samples, which looks impressive. But it’s not really impressive at all! Rather, it’s exactly the sort of thing you should expect to see if you have a small data set and virtually unlimited freedom to play around with the data, and with the additional selection effect that you submit your results to the journal only if you see some catchy pattern.
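The claim that small data plus analytic freedom manufactures findings can be checked by simulation. The sketch below is my own illustration, not a reconstruction of Beall and Tracy's analysis. It invents studies of 100 women in which shirt color (nine options, equally likely) has nothing to do with cycle day, then lets a hypothetical analyst choose among three fertile windows (the paper's days 6 to 14 and the two website ranges mentioned above) and among 45 color groupings (each color alone or any two lumped together). The sample size, the uniform colors, the menu of groupings, and the use of Fisher's exact test are all assumptions.

```python
import random
from itertools import combinations
from math import comb

N_COLORS = 9  # the survey offered nine color options
# Hypothetical menu of defensible "fertile windows" (days since onset of menses).
WINDOWS = [(6, 14), (10, 17), (12, 17)]
# Hypothetical menu of color groupings: each color alone, plus every pair lumped together.
GROUPINGS = [{c} for c in range(N_COLORS)] + [set(p) for p in combinations(range(N_COLORS), 2)]


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def p_value(days, shirts, window, group) -> float:
    """Test whether wearing a color in `group` is associated with being inside `window`."""
    lo, hi = window
    a = b = c = d = 0
    for day, shirt in zip(days, shirts):
        fertile, wears = lo <= day <= hi, shirt in group
        if fertile and wears:
            a += 1
        elif fertile:
            b += 1
        elif wears:
            c += 1
        else:
            d += 1
    return fisher_two_sided(a, b, c, d)


def false_positive_rates(n_studies: int = 200, alpha: float = 0.05, seed: int = 1):
    """Return (rate for one prespecified comparison, rate when any comparison may be reported)."""
    rng = random.Random(seed)
    fixed = flexible = 0
    for _ in range(n_studies):
        # Null world: cycle day and shirt color are drawn independently.
        days = [rng.randrange(28) for _ in range(100)]
        shirts = [rng.randrange(N_COLORS) for _ in range(100)]
        fixed += p_value(days, shirts, WINDOWS[0], {0, 1}) < alpha
        flexible += any(p_value(days, shirts, w, g) < alpha
                        for w in WINDOWS for g in GROUPINGS)
    return fixed / n_studies, flexible / n_studies


if __name__ == "__main__":
    print(false_positive_rates())
```

The first number returned is the false-positive rate when the analyst commits in advance to one window and one grouping; the second is the rate when any of the 135 comparisons may be reported. The second can never be smaller than the first, and how much larger it comes out depends on the assumed menu of choices.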

In focusing on this (literally) colorful example, I don’t mean to be singling out this particular research team for following what are, unfortunately, standard practices in experimental research. Indeed, that this article was published in a leading journal is evidence that its statistical methods were considered acceptable. Statistics textbooks do warn against multiple comparisons, but there is a tendency for researchers to consider any given comparison alone without considering it as one of an ensemble of potentially relevant responses to a research question. And then it is natural for sympathetic journal editors to publish a striking result without getting hung up on what might be viewed as nitpicking technicalities. Each person in this research chain is making a decision that seems scientifically reasonable, but the result is a sort of machine for producing and publicizing random patterns.

There’s a larger statistical point to be made here, which is that as long as studies are conducted as fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant, we will see lots of dramatic claims based on data patterns that don’t represent anything real in the general population. Again, this fishing can be done implicitly, without the researchers even realizing that they are making a series of choices enabling them to over-interpret patterns in their data.

And this happens all the time. For example, the July 2013 issue of Psychological Science included a paper claiming an association between men’s upper-body strength and their attitudes about economic redistribution. The authors wrote, “We showed that upper-body strength in modern adult men influences their willingness to bargain in their own self-interest over income and wealth redistribution. These effects were replicated across cultures and, as expected, found only among males.” Actually, two of their three studies were of college students, and they did not actually measure anybody’s upper-body strength; they just took measurements of arm circumference. It’s a longstanding tradition to do research studies using proxy measures on students—but if it’s OK to do so, you should be open about it; instead of writing about “upper-body strength” and “men,” be direct and say “arm circumference” and “students.” Own your research choices!

But, to return to the main theme here, these researchers had enough degrees of freedom for them to be able to find any number of apparent needles in the haystack of their data. Most obviously, the authors report a statistically significant interaction with no statistically significant main effect. That is, they did not find that men with bigger arm circumference had more conservative positions on economic redistribution. What they found was that the correlation of arm circumference with opposition to redistribution of wealth was higher among men of high socioeconomic status. But, had they seen the main effect (in either direction), I’m sure they could have come up with a good story for that, too. And if there had been no main effect and no interaction, they could have looked for other interactions. Perhaps, for example, the correlations could have differed when comparing students with or without older siblings?

Researchers’ decisions about which variables to analyze may make perfect sense, but they indicate the difficulty of taking these p-values at anything like face value. There is no reason to assume the researchers were doing anything nefarious or trying to manipulate the truth. Rather, like sculptors, they were chipping away the pieces of the data that did not fit their story, until they ended up with a beautiful and statistically significant structure that confirmed their views.

At this point you might be thinking that I’m getting pretty picky here. Sure, these researchers made some judgment calls. But are a few judgment calls enough to conjure up publishable findings in the absence of any real effect?

The answer is yes, and in fact there have been famous cases in recent years of scientists demonstrating that, with enough researcher degrees of freedom, you can get publication-quality statistical results when there is nothing going on at all. I will give two examples.

The first example was inadvertent on the scientist’s part, and it is rather sad. Daryl Bem, a distinguished research psychologist at Cornell, garnered headlines two years ago when he published a paper in the Journal of Personality and Social Psychology (another leading journal in the field) purporting to find ESP. The paper included nine different experiments and many statistically significant results. Unfortunately (or perhaps fortunately, for those of us who don’t want the National Security Agency to be reading our minds as well as our emails), these experiments had multiple degrees of freedom that allowed Bem to keep looking until he could find what he was searching for. In his first experiment, in which 100 students participated in visualizations of images, he found a statistically significant result for erotic pictures but not for nonerotic pictures. But consider all the other possible comparisons: If the subjects had identified all images at a rate statistically significantly higher than chance, that certainly would have been reported. Or what if performance had been higher for the nonerotic pictures? One could easily argue that the erotic images were distracting and only the nonerotic images were a good test of the phenomenon. Or what if participants had performed statistically significantly better in the second half of the trial than in the first half? That would be evidence of learning. Or if they performed better on the first half? Evidence of fatigue. Bem reports, “There were no significant sex differences in the present experiment.” If there had been (for example, if men had performed better with erotic images and women with romantic but nonerotic images), this certainly could have been presented as convincing evidence. And so on.

Lots of people want to believe in ESP. After all, it would be cool to read minds. (It wouldn’t be so cool, maybe, if other people could read your mind and you couldn’t read theirs, but I suspect most people don’t think of it that way.) And ESP seems so plausible, in a wish-fulfilling sort of way. It really feels like if you concentrate really hard, you can read minds, or predict the future, or whatever. Heck, when I play squash I always feel that if I really, really try hard, I should be able to win every point. The only thing that stops me from really believing this is that I realize that the same logic holds symmetrically for my opponent.

In the years since Bem’s experiment, researchers have tried and failed to replicate his results. It is too bad that resources had to be wasted on this, but perhaps this is all worth it if it brings more attention to the ubiquitous problem of researcher degrees of freedom.

My final example is a wonderful study by psychologist Craig Bennett and colleagues, who found statistically significant correlations in a functional MRI scan of a dead salmon. They were using the same sort of analysis that non-joking political scientists use in making claims such as “Red Brain, Blue Brain: Evaluative Processes Differ in Democrats and Republicans,” but the difference is that Bennett and his colleagues are open about the fact that these imaging studies have hundreds of thousands of degrees of freedom. The salmon study is beautiful because everyone knows a dead fish can’t be thinking, but it’s still possible to find patterns if you look hard enough.

In one of his stories, science fiction writer and curmudgeon Thomas Disch wrote, “Creativeness is the ability to see relationships where none exist.” We want our scientists to be creative, but we have to watch out for a system that allows any hunch to be ratcheted up to a level of statistical significance that is then taken as scientific proof.

Even if something is published in the flagship journal of the leading association of research psychologists, there’s no reason to believe it. The system of scientific publication is set up to encourage publication of spurious findings.

Update, July 31, 2013: Jessica Tracy and Alec Beall, the authors of the red clothing study, posted a response to this story, and Andrew Gelman republished their comments along with his reactions on his blog.

Correction, July 24, 2013: Due to an editing error, this story misnamed the Association for Psychological Science.
