Bitwise

Err Engine Down

What really went wrong with healthcare.gov?

Of all the terrible websites I’ve seen, healthcare.gov ranks somewhere in the middle. It has been difficult if not impossible to sign up, and customer service has been inadequate. But it’s certainly better than the NYC Department of Education site that I attempted to help a friend navigate two years ago, in hopes of her getting paid her actual salary instead of a default salary; the blatantly inept Web code got the best of us. And it’s better than the evanescent Web encyclopedia Cpedia, which rolled out with pages that literally consisted of nonsense (such as “Clickbooth Cuil but not avail due to flooding traffics and making their servers ‘too hot’ to handle”). The problems plaguing healthcare.gov aren’t due to a unique coding failure or a unique government failure—plenty of products have had similar early deficits, such as the Electronic Arts server bugs that rendered SimCity unplayable by most for more than a week after it was released this March. So healthcare.gov’s failures are not uncommon—they’re just exceptionally high-profile.

The Redditors picking apart the client code have found some genuine issues with it, but healthcare.gov’s biggest problems are most likely not in the front-end code of the site’s Web pages, but in the back-end, server-side code that handles (or doesn’t handle) the registration process, which no one can see. Consequently, I would be skeptical of any outside claim to have identified the problem with the site. Bugs rarely manifest in obvious forms, often cascading and metamorphosing into seemingly different issues entirely, and one visible bug usually masks others.

There are a few clues, however. The site’s front end (the actual Web pages and bits of script) doesn’t look too bad, but it is not coping well with whatever scaling issues the back end (account storage, database lookups, etc.) is having. I tried to sign up for the federal marketplace six days after rollout. The site claimed to be working, but after I started the registration process, I sat on a “Please Wait” page for 10 minutes before being redirected to an error page:

“Sorry, we can’t find that page on HealthCare.gov.”

Except that wasn’t the problem, since the error message immediately below read:

“Error from: https%3A//www.healthcare.gov/oberr.cgi%3Fstatus%253D500%2520errmsg%253DErrEngineDown%23signUpStepOne.”

To translate, that’s an Oracle database complaining that it can’t do a signup because its “engine” server is down. So you can see Web pages with text and pictures, but the actual meat-and-potatoes account signup “engine” of the site was offline. A good site would have translated that error into a more helpful error message, such as “The system is temporarily down,” or “President Obama personally apologizes to you for this engine failure.” But it didn’t.
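For readers who want to check that translation, here is a minimal Python sketch that decodes the string. It assumes the message was percent-encoded twice (hence %253D, which is an encoded %3D, which is in turn an encoded =) and that the fields are space-separated key=value pairs. The function name decode_error is mine, not anything from the site.

```python
from urllib.parse import unquote, urlsplit

# The error string as the page displayed it (sentence-final period omitted).
RAW = ("https%3A//www.healthcare.gov/oberr.cgi%3Fstatus%253D500"
       "%2520errmsg%253DErrEngineDown%23signUpStepOne")


def decode_error(raw: str) -> dict:
    """Undo both layers of percent-encoding and pull out the fields."""
    once = unquote(raw)            # first layer: %3A -> ':', %253D -> '%3D'
    parts = urlsplit(once)         # separates path, query and #fragment
    query = unquote(parts.query)   # second layer: %3D -> '=', %20 -> ' '
    # Assumption: fields are space-separated key=value pairs, as they appear here.
    fields = dict(pair.split("=", 1) for pair in query.split())
    fields["page"] = parts.path
    fields["step"] = parts.fragment
    return fields


print(decode_error(RAW))
```

Decoded, the message reports status 500 and errmsg ErrEngineDown, raised from /oberr.cgi during the signUpStepOne step.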

This failure points to the fundamental cause of the larger failure, which lies in the end-to-end process. That is, the front-end static website and the back-end servers (and possibly some dynamic components of the Web pages) were developed by two different contractors. Coordination between them appears to have been nonexistent, or else front-end architect Development Seed never would have given an interview to the Atlantic a few months back in which it embraced open source and envisioned a new world of government agencies sharing code with one another. (It didn’t work out, apparently.) Development Seed now seems to be struggling to distance itself from the site’s problems, having realized that however good its work was, the site will be judged in its totality, not piecemeal. Back-end developer CGI Federal, which was awarded a much larger contract in 2010 for federal health care tech, has made itself rather scarce, providing no spokespeople at all to reporters. Its source code isn’t available anywhere, though I would dearly love to take a gander (and so would Reddit). I fear the worst, given that CGI is also being accused of screwing up Vermont’s health care website.

So we had (at least) two sets of contracted developers, apparently in isolation from each other, working on two pieces of a system that had to run together perfectly. Anyone in software engineering will tell you that cross-group coordination is one of the hardest things to get right, and also one of the most crucial, because while programmers are great at testing their own code, testing that their code works with everybody else’s code is much more difficult.

Look at it another way: Even if scale testing was done, that testing involves seeing what happens when a site is overrun. The poor, confusing error handling indicates that there was no ownership of the end-to-end experience: no one was tasked with making sure everything worked together and at full capacity, not just in isolated tests. (I can’t even figure out who was supposed to own it.) No end-to-end ownership means that questions like “What is the user experience if the back end gets overloaded or has such-and-such an error?” are never asked, because they cannot be answered by either group in isolation. Writing on Medium in defense of Development Seed, technologist and contractor CTO Adam Becker complains of “layers upon layers of contractors, a high ratio of project managers to programmers, and a severe lack of technical ownership.” Sounds right to me.

Likewise, the bugs around username and password standards—for example, the fact that the username required a number but the user interface didn’t tell the user about it—are not problems of scale. They’re problems of poor cross-group communication. I’d bet that plenty of people knew what was going to happen when the site rolled out, but none of them were in a position to mitigate the damage.
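One common way to avoid this kind of mismatch is to keep the rules in a single place that both sides read. The Python sketch below is hypothetical: only the "must contain a number" rule comes from the reports about the site, and the names USERNAME_RULES, username_hints, and username_problems are invented for illustration.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical shared rule list. Only the "must contain a number" rule is taken
# from the article; a real system would list every requirement here.
USERNAME_RULES: List[Tuple[str, Callable[[str], bool]]] = [
    ("Username must contain at least one number.",
     lambda name: re.search(r"\d", name) is not None),
]


def username_hints() -> List[str]:
    """What the front end would show next to the input box."""
    return [message for message, _ in USERNAME_RULES]


def username_problems(name: str) -> List[str]:
    """What the back end would check: the message for each rule that fails."""
    return [message for message, check in USERNAME_RULES if not check(name)]


print(username_hints())
print(username_problems("jsmith"))
print(username_problems("jsmith42"))
```

Because the hint text and the check live in the same entry, the interface cannot silently omit a rule that the server enforces.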

I imagine there was a dialogue last Monday afternoon that went something like this:

FRONT-END DEVELOPER: Why does the username have to have a number in it?

BACK-END DEVELOPER: It’s in the government username regulations. Didn’t you read them?

FRONT-END DEVELOPER: No, we don’t do accounts, we just hand the input to you.

BACK-END DEVELOPER: And we told your front end the input was no good! See the ErrEngineDown in the URL?

FRONT-END DEVELOPER: Fine, fine. Sigh. Nice to finally talk to you, by the way.

BACK-END DEVELOPER: Yeah, you too. Are you in D.C.?

FRONT-END DEVELOPER: San Francisco.

BACK-END DEVELOPER: Know any good jobs in D.C.? I hate this place and they’re furloughing me as soon as we fix this mess.

Each group got its piece “working” in isolation and prayed that when they hooked them together, things would be okay. When they didn’t, it was too late. It is entirely possible that back-end developer CGI is primarily at fault here, but no one will care because they just see that the whole thing doesn’t work. As you learn early on in software development, there is no partial credit in programming. A site that half-works is worse than one that doesn’t work at all, which is why the bad error handling is so egregious. You always handle errors.
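To make "always handle errors" concrete, here is a small Python sketch of the missing translation layer. Everything in it is an assumption for illustration: BackendError stands in for whatever the real back end returns, and the message wording is invented. The only thing taken from the site is the code ErrEngineDown.

```python
# Hypothetical mapping from internal error codes to text a user can act on.
FRIENDLY = {
    "ErrEngineDown": "The system is temporarily down. Please try again later.",
}
FALLBACK = "Something went wrong on our end. Please try again later."


class BackendError(Exception):
    """Stand-in for whatever the real back end raises or returns."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def sign_up(submit, form: dict) -> str:
    """Run the sign-up call and always return a message fit for display."""
    try:
        submit(form)
        return "Your account has been created."
    except BackendError as err:
        # Known codes get a specific message; unknown ones get the fallback.
        return FRIENDLY.get(err.code, FALLBACK)
    except Exception:
        # Anything unexpected still gets a readable message, never a raw code.
        return FALLBACK


def engine_down(form: dict) -> None:
    """Simulated back end that is offline."""
    raise BackendError("ErrEngineDown")


print(sign_up(engine_down, {"username": "jsmith42"}))
```

The point of the sketch is the shape, not the wording: someone has to own the function that sits between the two contractors' code and decides what the user sees when the far side fails.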

There’s also evidence that the problems go beyond technical considerations. The “Please Wait” page source had this line in it:

<!-- In a hurry?  You might be able to apply faster at our Marketplace call center.  Call 1-800-318-2596 to talk with one of our trained representatives about applying over the phone. -->

Most users will never see this message, because it was commented out (i.e., marked in the HTML to be ignored by the browser instead of displayed), forcing you to wait the full 10 minutes to get the error before they even give you the phone number. Again, this sort of problem was clearly not anticipated.
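A pre-launch check for this kind of thing is not hard to write. The Python sketch below uses the standard library's HTML parser to list every comment in a page so a human can decide whether any of it was meant to be visible. The PAGE snippet is an abbreviated illustration, not the actual page source, and CommentFinder is a name I made up.

```python
from html.parser import HTMLParser


class CommentFinder(HTMLParser):
    """Collects the text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called by the parser once for each <!-- ... --> block.
        self.comments.append(data.strip())


# Illustrative snippet only; the real page was longer.
PAGE = """
<h1>Please wait</h1>
<!-- In a hurry? Call 1-800-318-2596 to apply over the phone. -->
"""

finder = CommentFinder()
finder.feed(PAGE)
print(finder.comments)
```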

Bugs can be fixed. Systems can even be rearchitected remarkably quickly. So nothing currently the matter with healthcare.gov is fatal. But the ability to fix it will be affected by organizational and communication structures. People are no doubt scrambling to get healthcare.gov into some semblance of working condition; the fastest way would be to appoint a person with impeccable engineering and site delivery credentials to a government position. Give this person wide authority to assign work and reshuffle people across the entire project and all contractors, and keep his or her schedule clear. If you find the right person (often called the “schedule asshole” on large software projects), things will come together quickly. Sufficient public pressure will result in things getting fixed, but the underlying problems will most likely remain, due to the ossified corporatist structure of governmental contracts.

We live in a world that embodies the paradox of George W. Bush’s responsibility society (aka the “ownership society”), where authority and accountability are increasingly separated. Keep an eye out to see if CGI suffers any consequences whatsoever. Vermont just decided to double CGI’s contract pay, so I’m not optimistic. Power flows upward while responsibility flows downward, which is why you couldn’t pay me to work as a government contractor. It’d be like going back to Microsoft.