



everyday economics: How the dismal science applies to your life.

Women Are Chokers
Studies show they cave under pressure. Why?


Among the highest paid corporate executives, only 2.5 percent are women. Among the most elite scientists (those who have been elected to the National Academy of Sciences), fully 9 percent are women. Depending on your biases, you can read that as evidence that women are better at science than business, that corporations discriminate against women, or (if you believe that profit-maximizing corporations get everything just right) that the National Academy discriminates against men.

If you have access to the World Wide Web, you'll have no problem finding theories, evidence, counterevidence, and polemics galore on this subject. Here I just want to talk about one bit of evidence regarding one of the many factors that might be in play: Women—especially high-achieving women—choke under pressure.

You can observe a lot of high achievers under pressure at a Grand Slam tennis tournament. Better yet, you can observe them under variable pressure: Things are a lot tenser when the score is 5-5 than when it's 0-0. Professor Daniele Paserman of Hebrew University made good use of this variability at the 2006 French Open, U.S. Open, and Wimbledon tournaments. First, he assigned an "importance" to each point in each match. He did this by assigning probabilities to every way the match might unfold, accounting for players' ratings, the surface they were playing on, and the identity of the server. That allowed him to say things like, "If Roger Federer wins this point, he has a 60 percent chance to win the match; if he loses the point, he has a 55 percent chance." The 5 percent difference measures the point's importance.
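The importance measure described above is simple arithmetic: the gap between a player's chance of winning after taking the point and after losing it. The sketch below reproduces the Federer example and then, as a simplifying assumption, applies the same idea inside a single game rather than a whole match. The function names and the constant point-win probability `p` are illustrative and are not taken from Paserman's paper, which works at the match level and accounts for ratings, surface, and server.

```python
from functools import lru_cache


def importance(win_prob_if_point_won: float, win_prob_if_point_lost: float) -> float:
    """Importance of a point: the swing in win probability that the point decides."""
    return win_prob_if_point_won - win_prob_if_point_lost


@lru_cache(maxsize=None)
def game_win_prob(a: int, b: int, p: float) -> float:
    """Chance the server wins the game when leading a points to b.

    Points are counted 0, 1, 2, 3 for love, 15, 30, 40.
    Assumption: the server wins every point with the same probability p.
    """
    if a >= 4 and a - b >= 2:
        return 1.0  # server has already won the game
    if b >= 4 and b - a >= 2:
        return 0.0  # receiver has already won the game
    if a == b and a >= 3:
        # Deuce: the server must take two points in a row before the receiver does.
        return p * p / (p * p + (1 - p) * (1 - p))
    return p * game_win_prob(a + 1, b, p) + (1 - p) * game_win_prob(a, b + 1, p)


def game_point_importance(a: int, b: int, p: float) -> float:
    """Game-level analogue of importance for the point played at score a-b."""
    return importance(game_win_prob(a + 1, b, p), game_win_prob(a, b + 1, p))


# The article's example: 60 percent if the point is won, 55 percent if lost.
federer_example = importance(0.60, 0.55)
```

Under these assumptions, with evenly matched players (`p = 0.5`) the point at 40-30 has a game-level importance of 0.5, while the point at 40-0 has an importance of 0.125.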



It turns out that by at least one measure—the number of unforced errors—men play equally well throughout the match. They make unforced errors on about 30 percent of the most important points, about 30 percent of the least important, and about 30 percent of all those in between. But women show a very different pattern: 34 percent unforced errors on the least important points, steadily rising to almost 40 percent on the most important. That's almost surely too big a difference to be mere coincidence.

What, besides choking, could explain those numbers? Maybe the closest games are usually played late in the match, when players are more fatigued; maybe more of those games involve weak players; maybe more of them occur at the French Open, where the court is harder to play. But professor Paserman tests all these theories, and none stands up to statistical analysis.

Another countertheory: Maybe women play more defensively when the score is tight. If both players just keep lobbing the ball back and forth, there can't be any forced errors, so all errors are recorded as unforced. In support of this theory, professor Paserman observes that women do play more defensively when the score is tight. (He measures defensive play by speed of serve, length of rallies, and so forth.) But, unfortunately for the countertheory, so do men. When the pressure's on, both men and women get more defensive (and by about the same amount)—but only women make more errors.

Meanwhile, another band of researchers (Uri Gneezy, Muriel Niederle, and Aldo Rustichini, of the University of California at San Diego, Stanford, and the University of Minnesota) has been running experiments to see how men and women perform in competitive environments. First they have subjects solve mazes on their own; then they pit the same subjects against each other in maze-solving contests. The result?

Competition—against anyone—improves men's performance.

Competition against women improves women's performance.

But in competition against men, women do no better than when they're working in isolation.

In spirit, that seems opposite to what professor Paserman is telling us. Women in championship tennis tournaments are always pitted against women, so based on the maze research you'd expect heightened performance—and it's a reasonable guess that when the competition gets tougher, performance should get even better.

But of course there's no real contradiction here: It's perfectly possible that same-sex competition in general improves women's performance, while extremely tight competition degrades it. And of course, these are very different populations of women. The subjects in the maze experiments were engineering students at first-class universities, which is to say they were a pretty elite bunch. But that's still a far cry from being a world-class tennis player.

And of course, just as tennis players differ from engineering students, both groups differ from scientists and business executives. So it would be a bit of a stretch to conclude that what happens on the tennis court must happen in the boardroom or the biology lab. But it might be worth looking into.

Steven E. Landsburg is the author, most recently, of More Sex Is Safer Sex: The Unconventional Wisdom of Economics.
Join the Fray: our reader discussion forum

Dear Slate Editors,

As a longtime reader of your site and a scientist, I am very disappointed in your decision to publish the article "Women Are Chokers. Studies show they cave under pressure. Why?" by Steven E. Landsburg. I am furthermore amazed that Steven E. Landsburg, who got his Ph.D. from the University of Chicago, an economics school famous for its mathematical rigor, would be taken in by such amateurish scholarship.

There are two claims being made by the article: first, that women fold under competitive pressure more than men in professional tennis, and second, that this weakness is somehow generalizable to other professions. Need I point out to the editors that this is no small claim? Surely some caution is warranted before telling your readers that science has now demonstrated that women "can't take the heat".

I hope I can show you in this short letter how M. Daniele Paserman's research, which is the basis for this article, is filled with sloppy thinking and poor applications of standard techniques. I believe that it is important to show how such pseudoscientific analyses are flawed. First, however, I would like to begin with some more general remarks that I hope will convince you that the basic errors in Paserman's research should be evident to even a non-scientific observer.

Consider, for a moment, that you have been assigned to test who chokes more, men or women. The first thing you will need is some test of skill. Since you only want to test whether or not one group chokes more than the other, you should pick a test on which the two groups perform equally well when not in competition. Sounds easy enough, but Paserman, for his competition, chooses professional tennis. Not only do men and women not play at the same level in tennis, but there isn't even any meaning to non-competitive tennis, which would be the tennis equivalent of one hand clapping. (If you think this is a problem for every sport, I remind you of golf.) There is thus no notion of a control to compare the two groups when we are sure that there is no choking.

Without knowing that (when not in competition) both groups perform equally well, any comparison between the groups must somehow subtract out the differences in ability. This is difficult enough, but women and men also play tennis in different styles, emphasizing their own strengths, making such ad hoc subtractions suspicious. Surely if Paserman were comparing the extent to which male ping pong champions and female racquetball champions choke under pressure you would treat his results with scorn, but what if one group was playing tennis on grass and the other on clay courts? Are the differences then minor enough not to matter? If we can't remove the built-in confounds from the setup of the experiment, which in Paserman's case is completely fixed, how can we be sure we are comparing like things when we compare the rate of choking?

In spite of these problems it is still possible that a study of this kind can succeed. It simply requires that the effect one is looking for (women choke more) be so large that it is clearly evident in spite of all the noise in the background. Now, if you are a sensible scientist, your next move is not to start crunching numbers and creating various statistics about tennis matches. There's no need, since the only trustworthy effect in such an unfortunate experimental design would be a qualitatively large effect. In other words, if you haven't already noticed the effect from just watching men's and women's tennis, the effect probably doesn't exist.

Now I have watched women's and men's tennis and I have not noticed women choking more than men. That you, as editors, would support Paserman's research can only lead me to believe that you have. But if this is so, I suggest that, as honest journalists, you just come out and say it instead of hiding behind pseudoscientific arguments. Stripping away all the junk statistics, you have just published an article which reads simply, "We think women choke in key moments in tennis. In fact, we think they choke pretty much all the time and that's why they aren't CEOs." Does Slate stand for this?

Let me now turn to the details of Paserman's study.

The paper is online here:
http://economics.huji.ac.il/facultye/paserman/Paserman_TennisGender_January2007.pdf

Paserman uses two tests of choking. The first is that he checks whether in "crucial" sets women make more unforced errors than men. This is a pretty obvious idea. In important sets, we expect that people are much more likely to choke, and if Paserman's idea is right, women should choke even more. There is some ambiguity here since Paserman has invented a rather suspicious measure of choking, but it doesn't matter yet since Paserman finds no result in this analysis:

"The set-level analysis indicates that both men and women perform less well in the final and decisive set of the match. This result is robust to controls for the length of the match and to the inclusion of match and player-specific fixed effects. The drop in performance of women in the decisive set is slightly larger than that of men, but the difference is not statistically significant at conventional levels."

This non-result is "explained" by Paserman in the following remarkable paragraph:

"Overall, the results from the set-level analysis indicate that both men and women perform less well in high-stakes situations. Part of the explanation for men's drop in performance may be due to fatigue, but a fatigue-based explanation does not seem very plausible for women. It therefore appears possible that women cope less well with pressure, but most of the gender differences are small and not statistically significant."

As I argued before, comparing two unlike things presents some difficulty. Here Paserman would have us believe that, even though women and men choke at the same rate (according to his own made-up measure), men have an excuse. By the end of the match, men have played more sets and so are more tired and thus should be making more unforced errors. That they don't is because men choke less. Even Paserman is not particularly sold on this argument, claiming only that it "appears possible". If you aren't even sure if something is possible, let alone true, is it publishable? Reading this convoluted attempt to support his own failed hypothesis, I have to ask: just what kind of scientist is Paserman anyway? Did Slate check?

Paserman's second analysis is more fine-grained. Instead of just looking set by set, he looks point by point using a concocted measure of the importance of the point; we are now comparing a dubious measure of choking with a fishy measure of importance. His analysis is summarized at the end of the paper in Figure 3a and Figure 3b, which I invite you to look at as I discuss it.

Looking at this figure, which plots unforced errors (which supposedly measure a decrease in performance from stress) vs. importance, Paserman's result amounts to the difference in slope between the two lines. For men he shows us a mostly flat line, indicating that higher importance means only slightly more errors, and for women we see a significant slope, meaning that on more important points they are making more errors.

First note Paserman's classic bad-science use of a change of axes to heighten his effect. For men, the range spreads from .3 to .45, which has the convenient effect of squashing his data to make it appear that the line is flatter than it is, while for women it only runs from .35 to .45. On a proper set of axes, these plots would not appear as dissimilar. This sloppy trick is so basic I've seen it explained on a children's cartoon on PBS. Still, this does not affect the slope, which is indeed higher for women. So, do statistics prove Paserman's point?

Paserman's lines are obtained from a standard technique; however, as is visually apparent, they do not accurately convey the distribution of the points. Consider that, for both men and women, the largest number of unforced errors occurs on points which are supposedly of only moderate importance. For men there is the enormous number of unforced errors on 0-40, while for women it is 40-30.

Note also that, for men, the score 0-30 is of higher importance than 0-40 yet has by far the lowest number of unforced errors for men. Naively, I might suggest that the huge jump in errors between 0-30 and 0-40, the largest feature visible in either of the plots is a demonstration of how men choke right at the moment when someone is about to break serve. You are welcome to come to your own pseudo-science explanations of this jump, but, before you do, you should recognize that there are no error bars on these plots. For all I can tell, this might be noise.

More broadly, looking at these plots, you should notice how dissimilar they are. Even the ordering, from left to right of the scores is not preserved between men and women. The means (i.e. the averages) of the two data sets are clearly not the same, nor is the spread of the points (which is larger for men).

At the very least, before comparing the two sets of data, we should allow the possibility that the relation between the number of unforced errors and the extent to which a player is choking is not the same between men and women. Unforced errors, after all, are also a measure of athletic ability, not just player psychology, and the athletic ability is not the same between the two groups. This is a very large problem for Paserman's result since if we allow a rescaling and shifting of the axes, it is possible to change the slopes of the fitted lines arbitrarily, erasing any difference between the two groups. Turning this argument around, if we are to believe Paserman that the slopes of these lines are important, we must believe that unforced errors are an absolute measure of player psychology that is not affected by differences in the way men and women play tennis. I personally doubt that this is a reasonable assumption.

Paserman also presents his data in a less user-friendly form in Table 6. Here, again, we have the percent of unforced errors as a function of importance of the point. Paserman has grouped all the results into four quartiles of importance, so we cannot see the data as directly as we could with the plot, but this table is what is quoted in the Slate article so I will discuss it:

Importance quartile            1      2      3      4
% unforced errors, women   34.25  37.06  38.68  39.67
% unforced errors, men     30.54  29.75  30.79  30.61

Note that 1 is the least important, while 4 is the most important. Once again, we are struck by the difference in magnitude of these two groups of numbers. Should we subtract some number from the women's data before comparing it to the men's? Should they be rescaled by some number? It is not clear.
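The rescaling question can be made concrete with the table's own numbers. The sketch below fits a least-squares line to each row, using the quartile index 1 to 4 as a stand-in for importance (an assumption; the paper's own regression uses the actual importance values). It then checks what a shift and a rescaling of the women's numbers do to the fitted slope. The shift of 5 points and the rescaling factor are hypothetical, chosen only for illustration.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Assumption: the quartile index stands in for the importance measure.
quartile = [1, 2, 3, 4]
women = [34.25, 37.06, 38.68, 39.67]  # % unforced errors, from the table
men = [30.54, 29.75, 30.79, 30.61]    # % unforced errors, from the table

slope_women = ols_slope(quartile, women)
slope_men = ols_slope(quartile, men)

# Subtracting a constant (hypothetical: 5 points) leaves the slope unchanged.
slope_shifted = ols_slope(quartile, [y - 5.0 for y in women])

# Multiplying by a constant multiplies the slope by that constant, so a
# factor chosen after the fact can make the two slopes coincide.
factor = slope_men / slope_women
slope_rescaled = ols_slope(quartile, [y * factor for y in women])
```

On these four points the fitted slopes come out to about 1.79 percentage points per quartile for women and about 0.13 for men. A shift alone does not change either slope; only a rescaling does, so any argument that the slopes are or are not comparable rests on whether a common scale for unforced errors is justified.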

For men, we see that the numbers go up and down in a random manner. This is nicely explained if we note that the error on these numbers should be about 1% (the four digits of precision shown are not statistically meaningful). Thus, Paserman's data is claiming that there is no choking in men's tennis, a result which would seem to go against the common-sense belief that choking is in fact a common phenomenon in sports. Could it be that this statistical measure of choking is in fact bunk? Something to ponder as we look at the data for women.
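One way to read the "about 1%" figure is as the binomial standard error of a proportion, sqrt(p(1 - p)/n). That reading is an assumption, since the letter does not say how the estimate was obtained, and the sample size used below is hypothetical rather than taken from the paper. A minimal sketch:

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an observed proportion p over n points."""
    return math.sqrt(p * (1 - p) / n)


def points_needed(p: float, target_se: float) -> float:
    """Number of points at which the standard error of p falls to target_se."""
    return p * (1 - p) / target_se ** 2


# Hypothetical sample size, chosen only to illustrate the formula.
se_at_30_percent = proportion_standard_error(0.30, 2100)
n_for_one_point = points_needed(0.30, 0.01)
```

Under that reading, an error rate near 30 percent has a standard error of one percentage point when each quartile holds roughly 2,100 points.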

For women, Paserman would like us to note (as is quoted in Slate) that between the first and fourth quartile, the percent of errors jumps from 34.25 to 39.67. Looking at the full data, though, this result is much more muddled than Paserman lets on. Consider that the last 3 data points are all within error of each other and thus represent a sort of plateau. The big jump in unforced errors happens between the first and second quartiles. In other words, we are supposed to believe that when a point becomes mildly important, there is a jump in the number of unforced errors; however, when the game really gets important, e.g. in a tie-breaker, there is very little change in the percent of unforced errors. Does this fit with our intuition about choking? Doesn't choking happen only in the big moments of the game? If Paserman were seeing an effect due to choking I would expect a big jump from third to fourth quartiles, not from the first to the second. (Note that you can see this same bizarre behavior in Figure 3.)
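The plateau claim can be checked against the table under the letter's own rough error estimate. The sketch below assumes each quartile's figure carries an independent error of one percentage point (the letter's estimate, not a value reported in the paper) and expresses every pairwise gap in units of the standard error of a difference.

```python
import math
from itertools import combinations

# % unforced errors for women by importance quartile, from the table.
women = {1: 34.25, 2: 37.06, 3: 38.68, 4: 39.67}

# Assumption: one percentage point of error per quartile, as the letter estimates.
ASSUMED_SE = 1.0


def gap_in_standard_errors(a: float, b: float, se: float = ASSUMED_SE) -> float:
    """Gap between two independent estimates, in standard errors of the difference."""
    return abs(a - b) / math.sqrt(2 * se ** 2)


gaps = {
    (i, j): gap_in_standard_errors(women[i], women[j])
    for i, j in combinations(women, 2)
}
```

Under that assumed error, quartiles 2 through 4 all sit within two standard errors of one another, while quartiles 1 and 4 are more than three apart.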

One of the most basic things one would expect Paserman to achieve is a demonstration that stressful situations in sports can lead to precipitous declines in ability. Does he achieve this? It is certainly not clear from the numbers he presents. Instead we are only left with doubts about his techniques and his definitions. Why is importance really a good measure of when an athlete feels pressure? Why is the unforced error such a good measure of choking? Paserman presents many arguments but there is little actual science to back them up. At the end, looking at his final results, I see very little. I certainly do not see a revelation about gender differences in sports.

Let me conclude with some of Paserman's own words about his research:

"To what extent then can we draw from this study more general lessons about gender differences in the labor market? An unforced error is by definition an error that cannot be attributed to any factor other than poor judgment by the player. Can we extrapolate from our findings that in general women's judgment becomes more clouded as the stakes become higher, and this may hinder their advancement to the upper echelons of management, science, and the professions? Clearly, the answer must be negative. The results are only relevant for the specific context, and it is questionable whether the conclusions can be even extended to athletes in other sports, let alone to managers, surgeons, or other professionals who must make quick and accurate decisions in high pressure situations."

Aside from Paserman's amazing faith in the importance of the unforced error, I would like you to note how he seems to be admitting that his research has little applicability. In the next paragraph, however, his true intentions are revealed:

"Nevertheless, there are at least two striking features in this study that still deserve attention. First, the women in our sample are among the very best in the world in their profession, and are without question extremely competitive. They are probably quite distant from the typical woman in experimental studies, which underperforms in competitive settings and shies away from competition. Therefore, it is doubly surprising that even these highly competitive women exhibit a decline in performance in high pressure situations. In many respects, this sample is more representative of the extreme right tail of the talent distribution that is of interest for understanding the large under-representation of women in top corporate jobs, prestigious professions and academia. Second, some experimental studies (e.g., Gneezy, Niederle, and Rustichini, 2003) found that women's tendency to underperform in competitive environments occurs only when they compete against men. By contrast, here we find that women's performance deteriorates as competitive pressure rises, even when the competition is clearly restricted to women alone. This may have implications for educational policies such as single-sex schooling, and deserves further investigation."

Is Paserman a serious researcher whose results should be published on Slate? I don't think so. No serious researcher would dare extrapolate so far from such fuzzy and muddled results.

I sincerely hope that Slate reviews why it would publish such poor scholarship whose suggested implications were so obviously offensive to many of its readers.

Thank you for your time,

-Ian T. Ellwood

--eolianwold


(2/14)




