Are women three times more likely to wear red or pink when they are most fertile? No, probably not. But here's how hardworking researchers, prestigious scientific journals, and gullible journalists have been fooled into believing so.
The paper I'll be talking about appeared online this month in Psychological Science, the flagship journal of the Association for Psychological Science, which represents the serious, research-focused (as opposed to therapeutic) end of the psychology profession.*
"Women Are More Likely to Wear Red or Pink at Peak Fertility," by Alec Beall and Jessica Tracy, is based on two samples: a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia. Here's the claim: "Building on evidence that men are sexually attracted to women wearing or surrounded by red, we tested whether women show a behavioral tendency toward wearing reddish clothing when at peak fertility. ... Women at high conception risk were more than three times more likely to wear a red or pink shirt than were women at low conception risk. ... Our results thus suggest that red and pink adornment in women is reliably associated with fertility and that female ovulation, long assumed to be hidden, is associated with a salient visual cue."
Pretty exciting, huh? It’s (literally) sexy as well as being statistically significant. And the difference is by a factor of three—that seems like a big deal.
Really, though, this paper provides essentially no evidence about the researchers' hypotheses, for three little reasons and one big reason.
First, some specific problems with this particular study:
1. Representativeness. What color clothing you wear has a lot to do with where you live and who you hang out with. Participants in an Internet survey and University of British Columbia students aren't particularly representative of much more than ... participants in an Internet survey and University of British Columbia students.
2. Measurement. The researchers asked people when their last menstrual period started. People might not remember. The researchers also asked respondents how certain they were, but respondents often overstate their certainty.
3. Bias. The article defines the "high-conception risk group" as women who had onset of menses six to 14 days earlier. I saw this and was suspicious. (I have personal experience with fertility schedules because my wife and I had a child in our mid-40s.) According to womenshealth.gov, the most fertile days are between days 10 and 17 of a 28-day menstrual cycle. Babycenter.com says days 12 to 17. I looked at Beall and Tracy's paper and followed some references, and it appears they followed a 2000 paper by Penton-Voak and Perrett, which points to a 1996 paper by Regan, which points to the 14th day as the best estimate of ovulation. Regan claims that "the greatest amount of sexual desire was experienced" on Day 8. So my best guess (but it’s just a guess) is that Penton-Voak and Perrett misread Regan, and then Beall and Tracy just followed Penton-Voak and Perrett.
4. And now the clincher, the aspect of the study that allowed the researchers to find patterns where none likely exist: "researcher degrees of freedom." That's a term used by psychologist Uri Simonsohn to describe researchers’ ability to look at many different aspects of their data in a search for statistical significance. This doesn't mean the researchers are dishonest; they can be sincerely looking for patterns in the data. But our brains are such that we can, and do, find patterns in noise. In this case, the researchers asked people "What color is the shirt you are currently wearing?" but they don't say what they did about respondents who were wearing a dress, nor do they say if they asked about any other clothing. They gave nine color options and then decided to lump red and pink into a single category. They could easily have chosen red or pink on its own, and of course they also could've chosen other possibilities (for example, lumping all dark colors together and looking for a negative effect). They report that other colors didn't yield statistically significant differences, but the point here is that any of those differences could have turned out to be notable: the researchers ran the comparisons and could have reported whichever one reached statistical significance. They picked Days 0 to 5 and 15 to 28 as comparison points for the time of supposed peak fertility. There are lots of degrees of freedom in those choices. They excluded some respondents they could've included and included other people they could've excluded. They did another step of exclusion based on responses to a certainty question.
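To get a feel for how much room these choices leave, here is a rough simulation sketch in Python. It is not the authors' analysis, and nearly everything in it is an assumption made for illustration: the nine color names are hypothetical (the paper's list isn't given here), cycle day and shirt color are drawn independently so that no real effect exists, the analyst is allowed to pick any single color or any pair of colors, and the "peak fertility" window can be any of the three ranges mentioned above (6 to 14, 10 to 17, or 12 to 17). A Fisher exact test stands in as a convenient test for a two-by-two table; it is not necessarily the test the researchers used.

```python
import itertools
import math
import random
from collections import Counter

# Hypothetical labels: the study offered nine color options, but these names are assumed.
COLORS = ["black", "blue", "gray", "green", "pink", "red", "white", "yellow", "other"]

# Candidate "peak fertility" windows (cycle days, inclusive): the paper's 6-14,
# plus the 10-17 and 12-17 ranges quoted from the two websites.
WINDOWS = [(6, 14), (10, 17), (12, 17)]

# Assumed analysis options: each color alone, or any two colors lumped together.
COLOR_GROUPS = [(c,) for c in COLORS] + list(itertools.combinations(COLORS, 2))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: math.comb(row1, x) * math.comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    cutoff = probs[a] * (1 + 1e-9)  # tolerance for floating-point ties
    return sum(p for p in probs.values() if p <= cutoff)


def simulate_null_dataset(n_women, rng):
    """Draw cycle day and shirt color independently, so no real effect exists."""
    return [(rng.randrange(28), rng.choice(COLORS)) for _ in range(n_women)]


def smallest_p_value(data):
    """Try every window/color-group combination and keep the best-looking p-value."""
    overall = Counter(color for _, color in data)
    best = 1.0
    for start, end in WINDOWS:
        high = Counter(color for day, color in data if start <= day <= end)
        n_high = sum(high.values())
        n_low = len(data) - n_high
        for group in COLOR_GROUPS:
            a = sum(high[c] for c in group)             # high group, wearing these colors
            c_low = sum(overall[c] for c in group) - a  # low group, wearing these colors
            p = fisher_exact_two_sided(a, n_high - a, c_low, n_low - c_low)
            best = min(best, p)
    return best


def share_with_some_significant_result(n_sims=200, n_women=100, alpha=0.05, seed=0):
    """Fraction of pure-noise datasets in which at least one comparison has p < alpha."""
    rng = random.Random(seed)
    hits = sum(smallest_p_value(simulate_null_dataset(n_women, rng)) < alpha
               for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    print(share_with_some_significant_result())
```

The number this prints is the share of pure-noise data sets in which an analyst free to choose among these options could report something "significant." The options coded here cover only the color grouping and the fertility window; the exclusion rules and the certainty cutoff described above would add further forks.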
I just gave a lot of detail here, but in a sense the details are the point. The way these studies fool people is that they are reduced to sound bites: Fertile women are three times more likely to wear red! But when you look more closely, you see that there were many, many possible comparisons in the study that could have been reported, with each of these having a plausible-sounding scientific explanation had it appeared as statistically significant in the data.
The standard in research practice is to report a result as “statistically significant” if its p-value is less than 0.05; that is, if there is less than a 1-in-20 chance that a pattern at least as strong as the one observed would have occurred if there were really nothing going on in the population. But of course if you are running 20 or more comparisons (perhaps implicitly, via choices involved in including or excluding data, setting thresholds, and so on), it is not a surprise at all if some of them happen to reach this threshold.
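The arithmetic behind that last point can be sketched in a few lines of Python. The function name is made up for illustration, and the calculation assumes the comparisons are independent, which is an idealization: the comparisons available in a real study overlap and are correlated, so treat the numbers as a rough guide rather than an exact rate.

```python
def chance_of_any_false_positive(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent null comparisons has p < alpha."""
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    for k in (1, 5, 20, 50):
        print(f"{k:>2} comparisons: {chance_of_any_false_positive(k):.2f}")
```

Under these assumptions, 20 comparisons with nothing going on give roughly a 64 percent chance that at least one of them crosses the 0.05 line.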
The headline result, that women were three times as likely to be wearing red or pink during peak fertility, occurred in two different samples, which looks impressive. But it's not really impressive at all! Rather, it's exactly the sort of thing you should expect to see if you have a small data set and virtually unlimited freedom to play around with the data, especially given the additional selection effect that you submit your results to a journal only if you see some catchy pattern.
In focusing on this (literally) colorful example, I don’t mean to be singling out this particular research team for following what are, unfortunately, standard practices in experimental research. Indeed, that this article was published in a leading journal is evidence that its statistical methods were considered acceptable. Statistics textbooks do warn against multiple comparisons, but there is a tendency for researchers to consider any given comparison alone without considering it as one of an ensemble of potentially relevant responses to a research question. And then it is natural for sympathetic journal editors to publish a striking result without getting hung up on what might be viewed as nitpicking technicalities. Each person in this research chain is making a decision that seems scientifically reasonable, but the result is a sort of machine for producing and publicizing random patterns.
There's a larger statistical point to be made here, which is that as long as studies are conducted as fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant, we will see lots of dramatic claims based on data patterns that don't represent anything real in the general population. Again, this fishing can be done implicitly, without the researchers even realizing that they are making a series of choices enabling them to over-interpret patterns in their data.