Bitwise

An Extraordinary Kind of Stupid

The weirdest thing about the Apple security bug is how simple it was.

Some iPads are among the devices affected by Apple’s security bug.

Photo by Jessica Rinaldi/Reuters

So for the past 18 months, there has been a horrific security hole in many of Apple’s products that has allowed “man in the middle” attacks on supposedly secure Internet communications. Most iPhone, iPad, and iPod Touch devices were affected, as well as a tremendous number of Macs running the current version of OS X (any release of 10.9 Mavericks). (You can go to gotofail.com to see if your device is affected.) This vulnerability is exceptionally bad and ubiquitous, but it’s still the same sort of bug that gets patched constantly in various pieces of software and for which hacker groups are constantly on the lookout.

Aside from its severity, though, this bug has another extraordinary quality: It’s extremely simple. (Simple enough that the bug is already on a T-shirt.) Stupid, even. Ninety-nine percent of the time, these sorts of stupid mistakes aren’t that damaging. But that 1 percent of the time, the gods won’t save you.

Because the code in question was open source, some folks on Y Combinator’s Hacker News quickly located it; they pegged it as first appearing in the 10.9 release of the OS X code. Google Web security guru Adam Langley posted a good technical analysis of the bug. But noncoders should know something about it too, because this bug is an object lesson in just how fragile the code that increasingly controls our lives can be. The simplicity with which a single mistaken line of code snowballed into one of the biggest security holes ever strikes fear into the hearts of engineers. It’s good to peek under the hood.

Below is the C code containing the bug, which occurs deep down in a security function called SSLVerifySignedServerKeyExchange. This function makes sure that the site your computer is talking to over an encrypted line (like google.com or chase.com) is really that site, rather than some “man in the middle” pretending to be that site. The bug causes the function to claim that the site is legit, even if it’s not.

        OSStatus err;

            if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
                goto fail;
            if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
                goto fail;
            if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
                goto fail;
            if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
                goto fail;
                goto fail;
            if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
                goto fail;

        fail:
            return err;

Even if you’ve never seen code before, you might pick up on a glaring structural anomaly here, which is that one “if” statement is followed by two “goto fail”s instead of one. If that jumped out at you, congratulations! You found the bug.

Let me simplify the code a bit to get at the essence of the bug:

        OSStatus status = EVERYTHING_IS_GREAT;

            if ((status = DoSomeSecurityStuff) is DANGER)
                goto fail;
            if ((status = KeepDoingSecurityStuff) is DANGER)
                goto fail;
                goto fail;
            if ((status = DoTheMostImportantSecurityStuff) is DANGER)
                goto fail;

        fail:
            return status;

These lines perform some calculations that test the validity of the authenticating data that the server (real or fake) has sent back to you, the client. On each “if” line, a variable labeled “status” is set to either EVERYTHING_IS_GREAT or DANGER. If status is still EVERYTHING_IS_GREAT by the end of the code, the function tells the rest of the program that everything is indeed great and that authentication took place. If something goes wrong, it returns DANGER to the rest of the program. Whenever any part of the check returns DANGER, the code doesn’t bother finishing the security check (it’s already failed, so why bother?); it just jumps straight to the “fail:” label at the bottom via the “goto fail” statement and returns whatever status it has.
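If you’d like to see the intended logic actually run, here is a small, self-contained C sketch of the same pattern. The function and constant names are the same made-up placeholders as above, not Apple’s real calls, and I’ve wired the last check to fail so you can watch the failure propagate out as DANGER.

        #include <stdio.h>

        typedef int OSStatus;
        #define EVERYTHING_IS_GREAT 0
        #define DANGER              (-1)

        /* Stand-in "security checks": placeholders, not Apple's real functions. */
        static OSStatus DoSomeSecurityStuff(void)             { return EVERYTHING_IS_GREAT; }
        static OSStatus KeepDoingSecurityStuff(void)          { return EVERYTHING_IS_GREAT; }
        static OSStatus DoTheMostImportantSecurityStuff(void) { return DANGER; }

        static OSStatus VerifyTheServer(void)
        {
            OSStatus status = EVERYTHING_IS_GREAT;

            if ((status = DoSomeSecurityStuff()) == DANGER)
                goto fail;
            if ((status = KeepDoingSecurityStuff()) == DANGER)
                goto fail;
            if ((status = DoTheMostImportantSecurityStuff()) == DANGER)
                goto fail;

        fail:
            return status;
        }

        int main(void)
        {
            /* Prints -1 (DANGER), because the most important check failed. */
            printf("status: %d\n", VerifyTheServer());
            return 0;
        }

Written correctly, the function reports DANGER whenever any check fails. Hold that thought.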

As long as the status remains EVERYTHING_IS_GREAT, the computer will continue going through the “if” statements and, because there’s no DANGER, it skips the “goto fail” statements.

Except it doesn’t. What the computer actually makes of the code above is this:

        OSStatus status = EVERYTHING_IS_GREAT;

            if ((status = DoSomeSecurityStuff) is DANGER)
                goto fail;
            if ((status = KeepDoingSecurityStuff) is DANGER)
                goto fail;
            goto fail;                                                 // always runs
            if ((status = DoTheMostImportantSecurityStuff) is DANGER)  // never runs
                goto fail;                                             // never runs

        fail:
            return status;

An “if” statement controls only the single statement that immediately follows it. If a second statement follows, like that extra “goto fail,” it executes every time, whether the “if” was true or not. In other words, that last “if” statement, with DoTheMostImportantSecurityStuff in it, never executes (that’s the part marked as never running above). Even worse, if KeepDoingSecurityStuff returns EVERYTHING_IS_GREAT, then status will still be EVERYTHING_IS_GREAT, even though DoTheMostImportantSecurityStuff never happened. And so the security check “succeeds” when it shouldn’t have.
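If you want to convince yourself of that rule, here’s a trivial C program you can compile and run (the variable and the messages are placeholders I made up). The indentation of the second print statement suggests it belongs to the “if,” but the compiler doesn’t care about indentation; only braces group statements.

        #include <stdio.h>

        int main(void)
        {
            int danger = 0;   /* pretend every security check succeeded */

            if (danger)
                printf("A: prints only when danger is set\n");
                printf("B: prints every time, despite the indentation\n");

            if (danger) {
                printf("C: with braces, prints only when danger is set\n");
                printf("D: and so does this line\n");
            }

            return 0;
        }

Run it and the only output is line B: the goto fail bug in miniature. Had the body of each “if” been wrapped in braces, the stray second “goto fail” would have been merely redundant rather than catastrophic.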

How on earth did that second “goto fail” get added? Some conspiracy buffs have suggested it’s a clever National Security Agency hack to let the agency spy on everyone that much more easily. I don’t buy it. I also don’t think an engineer made the change in a manual edit; I hope not, at any rate. Incremental changes to crucial security code are generally made with great care, and in any professional environment they are always reviewed by another engineer. Slip-ups like this one are more common when a programmer isn’t actually working seriously on the code but is doing some sort of maintenance or housekeeping on it.

Most likely (based on this diff, or set of changes, between the clean and buggy versions of the code), the extra line crept in when a programmer did a partially automated merge of two different versions of this source code, and the resulting file was approved without anyone looking too closely at it. Code bases contain different “versions” and “branches” of code so that people can work on long-term projects and short-term fixes without stepping on one another’s toes. From time to time, you have to merge the changes made between branches. So if I were at Apple, I might have to merge two different versions of this file, and I’d look it over to make sure it was OK. I might not pay too close attention to every change, because I’d assume that a change simply reflected a difference between the two versions, not the introduction of new, buggy code. I could be wrong, but this is mission-critical code, and I cannot believe Apple was going at it with a hacksaw.

Preventing bugs like these is one of the biggest challenges of software engineering, and this incident should make it pretty damn clear why. A single extra line of code compromised the security of millions and millions of users, and no one caught it for more than a year.
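There are mechanical safety nets, too, though none of them is a cure-all. Many teams simply require braces after every “if,” which makes this particular shape of bug much harder to write. And compilers can flag the symptom: clang, for instance, has a -Wunreachable-code warning (off by default, and not part of -Wall) that, on a small reproduction like the hypothetical sketch below, should complain that the final check can never run. I haven’t tried it against Apple’s full build, so take this as an illustration rather than a guarantee.

        /* A minimal reproduction of the buggy shape, with invented names.
         * Assumption: with a reasonably recent clang, something like
         *     clang -Wunreachable-code -c gotofail_sketch.c
         * should warn that the last check will never be executed. */
        typedef int OSStatus;

        static OSStatus check_one(void)   { return 0; }
        static OSStatus check_two(void)   { return 0; }
        static OSStatus check_final(void) { return 0; }

        OSStatus verify(void)
        {
            OSStatus err = 0;

            if ((err = check_one()) != 0)
                goto fail;
            if ((err = check_two()) != 0)
                goto fail;
                goto fail;                    /* the stray, unconditional jump */
            if ((err = check_final()) != 0)   /* dead code: can never run */
                goto fail;

        fail:
            return err;
        }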

Even in a rigorous environment of code reviews, automated testing, and high-quality development, bugs like this slip through; the sheer complexity of today’s code makes it very difficult to catch everything. Some people have bashed the code as too sloppy and careless. While I’m not thrilled with all the stylistic and structural choices, I think the code is reasonably good by today’s standards. Apple wouldn’t have released the code as open source if it weren’t good, and even if they had, there would have been quite an outcry from the open-source community if they’d looked it over and found it to be garbage. The people who wrote it knew what they were doing. The bug is indeed the result of negligence, but the sad and scary thing is that, for software developers, this is not a “What were they thinking?” moment. It’s an “Oh my God, that could have been me” moment.