Wednesday, March 05, 2008

RE: 99.9% Uptime

I seem to recall a recent Slashdot article asking why anything less than 99.9% uptime has become acceptable, and I think I have finally formulated a response in easy-to-chew form: because the digital age is full of acceptable errors.

To put it more colloquially, the Analog Age was full of engineers while the Digital Age is full of programmers.

I don't mean there are necessarily fewer engineers; I'm sure it's quite the contrary. I only mean that they have been thrust to the back of the crowd, behind the programmers, in all the company photos.

Imagine, if you will, fifty or so years ago, when electrical engineering was at its peak and computers were warehouses, not set-top boxes. They ran on those lovely vacuum tubes instead of transistors, and the tubes were constantly blowing out. Tolerances were tight because you only had a few hundred thousand tubes' worth (if you were lucky) of computing power. Ahh, the good 'ole days.

Fast-forward ten years, when transistors really started taking off. Now you could get a few hundred thousand instructions per second (same joke applies) on a silicon wafer instead of a room full of tubes! Who cared if they weren't as reliable as vacuum tubes - since they needed a tiny fraction of the space, electricity and manpower to run, it was a fair trade-off. Behold, the beginning of the end of Analog-scale uptime.

Stay with me here. Every computer in this day and age processes a gazillion errors per day, but we never see them. They come in the form of missing memory addresses, instruction collisions, bottlenecking, et cetera. It has been ingrained upon our very souls that these things are unavoidable; chalk them up to chaos theory. The irony of all this - at least to me - is that the smaller our processors become, the more unstable they theoretically become. It's my understanding that each new iteration of smaller silicon lithography is exponentially more difficult than the last, which does nothing to bolster my confidence in a next-generation product. I'm serious: who cares that Atom (née Silverthorne) is brutally fast if it produces more errors than the last piece of droolworthy hardware? (I'm reasoning deductively here: assuming there are more and smaller transistors, and assuming the per-transistor error rate stays the same [~0.00001%], more errors will occur naturally.)
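For the pedants, here's the back-of-the-napkin version of that deduction. Hold the per-transistor error rate at the ~0.00001% I'm assuming above, and the expected error count simply scales with the transistor count. The counts in this little C sketch are round, made-up numbers for illustration, not real chip specs.

```c
#include <stdio.h>

int main(void) {
    /* Per-transistor error rate assumed above (~0.00001% = 1e-7).
       The transistor counts below are round, made-up figures. */
    const double error_rate = 1e-7;
    const double counts[]   = { 50e6, 500e6, 5e9 };
    const char  *labels[]   = { "older part", "newer part", "future part" };

    for (int i = 0; i < 3; i++) {
        /* At a fixed rate, expected errors grow linearly with device count. */
        double expected = counts[i] * error_rate;
        printf("%-11s ~%.0f transistors -> ~%.0f expected errors\n",
               labels[i], counts[i], expected);
    }
    return 0;
}
```

Same rate, a hundred times the transistors, a hundred times the errors. That's the whole worry.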

Programmers must account for these errors. The lower-level the language, the harder it becomes to program: more detail is required, and the margin for error shrinks accordingly. So while you're posting on the virtues of Ruby or Drupal, there are at least an equal number of souls sweating over thirty thousand lines of Assembler, wondering aloud where the bits go. Hell, they damn near have to bitchslap a computer to get anything accomplished at the machine level.
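To make that concrete, here's a rough sketch - in plain C rather than Assembler - of the kind of babysitting I mean. Down at this level nothing checks your errors for you, so every open, allocation and read gets tested by hand. The file name and buffer size are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Low-level code has to account for every failure itself: no exceptions,
 * no garbage collector, no runtime quietly retrying on your behalf. */
int main(void) {
    FILE *f = fopen("uptime.log", "r");   /* hypothetical file name */
    if (f == NULL) {                      /* check #1: the open can fail */
        perror("fopen");
        return EXIT_FAILURE;
    }

    char *buf = malloc(4096);
    if (buf == NULL) {                    /* check #2: so can the allocation */
        fclose(f);
        return EXIT_FAILURE;
    }

    size_t n = fread(buf, 1, 4095, f);    /* check #3: and the read */
    if (ferror(f)) {
        free(buf);
        fclose(f);
        return EXIT_FAILURE;
    }
    buf[n] = '\0';

    printf("read %zu bytes\n", n);
    free(buf);
    fclose(f);
    return EXIT_SUCCESS;
}
```

Ruby hides all of that behind one line. Somebody still has to write the part that doesn't.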

Analog networks don't have these problems. They're either on or they're off. Even when computer-controlled, at the end of the day it's just a circuit.

And so the very issue of uptime was introduced with the inception of the digital computer. The digital equivalents of analog networks are laughable in terms of uptime. I wonder what a chart would look like showing AT&T's landline uptime versus its wireless service, month over month. I have a feeling it would be disgustingly lopsided in favor of the wired service. (And would only depress me further, since I am an AT&T customer.)
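For perspective, here's what those percentages mean in raw minutes. The 99.9% is the figure from the title; the 99.999% line is the "five nines" target traditionally thrown around for telephone switching gear, not actual AT&T data.

```c
#include <stdio.h>

/* Convert uptime percentages into allowed downtime per month (~30 days). */
int main(void) {
    const double minutes_per_month = 30.0 * 24.0 * 60.0;  /* ~43,200 */
    const double uptimes[] = { 0.999, 0.9999, 0.99999 };

    for (int i = 0; i < 3; i++) {
        double downtime = (1.0 - uptimes[i]) * minutes_per_month;
        printf("%.3f%% uptime -> ~%.1f minutes of downtime per month\n",
               uptimes[i] * 100.0, downtime);
    }
    return 0;
}
```

Roughly 43 minutes a month at three nines versus about 26 seconds at five. That's the gap I suspect the chart would show.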

So there you have it. Now that I've blogged myself into a ball of plasma, I'm going to go hide in the corner until the gestapo arrive.
