I, Cringely - The Survival of the Nerdiest with Robert X. Cringely

The Pulpit


Weekly Column

Better to Start a Single Generator Than to Curse the Darkness: Reflections on a Seven-Hour Blackout

By Robert X. Cringely
bob@cringely.com

Last week, San Francisco was without power for more than seven hours. A Pacific Gas and Electric substation less than a mile from my office in San Mateo went down. The utility's backup system failed to, well, back up. Cringely Intergalactic HQ was plunged into darkness, uninterruptible power supplies blaring for 20 minutes or so until they died, too. My mail server, Web server, FTP server and an application server shut down in a semi-orderly process. But up in San Francisco, things were anything but orderly. The Financial District went dark and so did South of Market, where all those Internet companies are. People were trapped in elevators. The trains stopped running, sometimes in tunnels underneath the city. Train passengers had to walk out guided by flashlight-carrying workers. Smart people just walked home. People who were at home hoped their notebook computer batteries wouldn't give out and thanked God that the phone system was still up and running.

In computer rooms and phone closets across the city, backup generators started up. Where generators didn't exist, they were immediately written on shopping lists and on next year's budget. For almost the entire workday, the power stayed off. San Francisco learned yet another lesson about emergency planning. It's a lesson we all could learn.

Down in San Mateo, my power was back up in just over an hour, and only my mail server failed to automatically reboot. Three times as much UPS capacity would have gotten me through the entire episode, but what if the next outage were longer? For the cost of three times as much UPS capacity I could buy a backup generator, so that's what I'm buying myself for Christmas.
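For the curious, the arithmetic behind that estimate is simple, and a back-of-the-envelope sketch makes it concrete. The numbers below are invented for illustration, not my actual equipment; the point is only that if the existing batteries lasted about 20 minutes and the outage ran just over an hour, roughly three times the capacity would have bridged it.

    # Back-of-the-envelope UPS sizing with made-up numbers (illustration only).
    server_load_watts = 600       # assumed combined draw of the servers
    observed_runtime_min = 20     # how long the current UPSes actually lasted
    outage_minutes = 62           # the San Mateo outage, just over an hour

    # Usable energy implied by the observed runtime, in watt-hours
    usable_wh = server_load_watts * observed_runtime_min / 60

    # Energy needed to ride out the whole outage, and the capacity multiple
    needed_wh = server_load_watts * outage_minutes / 60
    multiple = needed_wh / usable_wh

    print(f"Current usable capacity: {usable_wh:.0f} Wh")
    print(f"Needed for the outage:   {needed_wh:.0f} Wh ({multiple:.1f}x current)")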

It takes an experience like this to make us aware of both how dependent we are on other people's systems like the power grid and on the Internet in general. The best planning generally doesn't start until it is already too late. I realized, for example, that I no longer have a backup Internet Service Provider. That has to change, and it would be a good idea if the alternate provider used some different delivery technology like wireless, just in case the next disaster involves the loss of phone lines.

For the most part, the big server farms in San Francisco stayed up and running, but that doesn't mean network administrators can relax. No matter what they do, the Internet as it is used today will still be inherently unreliable. It's a matter of design.

Providers of Internet services should worry about the problem of maintaining mission-critical service in a world where bad weather, someone else's network error, and rogue backhoes will always exist. Beyond providing backup power, the usual solution for making Web service more reliable is to purchase redundant network connections or build mirror sites. But what if those redundant network connections are hit by the same earthquake or the same sewer construction crew? It's the same kind of design problem that plagued the DC-10 airliner, which had redundant control systems that could be disabled at the same time by a turbine failure in the middle engine. Oops.

Turbine failure isn't an issue for Internet service providers, but keeping mission-critical applications up and running is an issue, and one that isn't generally being handled well. Mainframe computer terms like "fault tolerant," "nonstop," and "load-balancing" are bandied about, but the people saying those words often don't really mean them. Even the best of efforts are still just good enough to be criticized.

Web hosting companies like Conxion.com offer large Internet connections to mirrored servers in secure network operations centers. This solution is expensive and not particularly network-efficient, since it relies on a single point of presence. And there is always the problem of a single site being vulnerable to natural or man-made catastrophes. I remember the 1989 earthquake, which would have easily toppled a Yahoo or Excite, much less cringely.com.

The final line of defense, of course, is the mirror server running on some other backbone in some other network operations center somewhere on the other side of the country. But even mirrors have their problems. They generally aren't as fast as the primary system because most companies don't want to pay for fully redundant capacity. And they ultimately rely on the end-user to figure out that something is amiss and click on the mirror. If that end-user is buying something from you and they can just as easily buy from someone else, it would be nice to do that clicking for them, automatically.
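To give a flavor of what doing that clicking automatically might look like, here is a minimal sketch in Python with invented hostnames: a redirector checks whether the primary site answers an HTTP request and quietly hands out the mirror's address when it doesn't. A real operation would do this in DNS or at the network layer rather than in a script, but the idea is the same.

    # Minimal failover sketch: send visitors to the mirror when the primary
    # site stops answering. Hostnames are made up for illustration.
    import urllib.request

    PRIMARY = "http://www.example-store.com/"
    MIRROR = "http://mirror.example-store.net/"

    def reachable(url, timeout=3):
        # True if the server answers an HTTP request within the timeout.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 500
        except Exception:
            return False

    def pick_site():
        # Use the primary if it is up, otherwise fall back to the mirror.
        return PRIMARY if reachable(PRIMARY) else MIRROR

    print("Redirecting to", pick_site())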

Automatically switching to a mirror turns out to be harder than you'd guess. There are several ways to do it, none of them very good. MCI and UUNET both offer a feature called "diverse routing," which allows more than one machine to share a single IP address, but this is inherently limited to a single backbone. But what if that's the backbone that is out of service? IBM has a technique that's not often promoted called "ping triangulation" that very cleverly maps in multidimensional space the location of every mirror and the end-user, sending that user not only to an open server, but to the closest open server. But ping triangulation is such a network resource hog, and takes so much time to run, that it isn't practical.
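To illustrate the closest-open-server idea, and only the idea, here is a conceptual sketch that times a connection to each of several invented mirrors and picks the fastest one that answers. This is not IBM's ping triangulation; the real thing maps every mirror against every user, which is exactly why it eats so much network and time.

    # Conceptual sketch: probe each mirror and pick the fastest one that
    # responds. Hostnames are invented for illustration.
    import socket
    import time

    MIRRORS = [
        ("sf.example-store.com", 80),
        ("ny.example-store.com", 80),
        ("dallas.example-store.com", 80),
    ]

    def probe(host, port, timeout=2):
        # Round-trip connect time in seconds, or None if unreachable.
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return time.monotonic() - start
        except OSError:
            return None

    def closest_open_mirror():
        timings = [(probe(host, port), host) for host, port in MIRRORS]
        alive = [(t, host) for t, host in timings if t is not None]
        return min(alive)[1] if alive else None

    print("Best mirror:", closest_open_mirror())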

What we need is a model for truly distributed network service for companies that are serious about making all their money on the World Wide Web. These companies need to be up and running all the time no matter what. They need to be immune to any disaster. They need to distribute their services in such a way that not only is service guaranteed, but service is faster. As far as I can tell, the technology to do this doesn't really exist despite lots of contrary marketing claims. There continues to be a single point of failure, which I wouldn't like if I were amazon.com.

Please tell me if I am wrong in this, but it seems that most companies are simply ignoring this problem, pretending that it doesn't exist. As for me, I'm getting my generator and my wireless backup ISP, and hoping for the best.
