Talk About Quality

Tom Harris

Archive for the ‘Exception handling’ Category

The Day they Broke the Lever

leave a comment »

We may never know all the details of the recent Toyota failures, because, unlike the Challenger and NASA, Toyota is a commercial entity rather than taxpayer-funded government. And it’s a bit early to be looking for definitive analysis. What we can do, as embedded software developers, is realize what we’re up against, and how it didn’t start yesterday.

When microprocessors first came out the buzz on the news was that soon they were going to be inside everything. I remember wondering what do we need them in everything for? My washer and my microwave oven, for example, worked just fine with a mechanical dial. But I was wrong and now they are in everything.

Much earlier on, complex systems were based on the Six Simple Machines. I can’t say which is simplest, but maybe it’s the lever. What’s important here is that the behavior of the lever is governed by physical properties of solids, which by now are pretty well understood. Push down on one end, and the other end goes up. Exceptions exist: solids can stretch under tension, deform under pressure, and if you push too hard, they can break. Apparently Toyota had problems with this too, in their gas pedal problem and fix.

Technical advances brought hydraulics — based on the properties of fluids and gases. I also remember replacing a brake line on an old car once, and getting to see how a leak in the line leads to spongy braking, and, if I hadn’t been refilling it weekly before the replacement, to brake failure.

In both of those technologies, the basic idea is that the system user (e.g. the driver) pushes on the control, force is transmitted immediately through a physical medium, and the desired effect occurs. Failure modes are well understood and (usually) are prevented.

Now replace the lever with software. No more physical properties: the behavior of the software transmission line is governed by the behavior of people — software developers writing lists of instructions which are translated several times before reaching the hardware. Now you push on the control, and a cacophony of instructions (okay, I’m being unfair — if the software is well-written, it’s a concert of instructions) rush down a wire to an actuating motor at the other end.

Let’s be clear about this: real-time software is a myth. The metal of the lever, or even the fluid of the hydraulic line, has been removed and replaced by your entire software development organization. Materials science out the door, organizational sociology in its place. If you want to know where that leaves us, read this short post about drive-by-wire, and the many comments there.

What’s the answer? In the code itself — perhaps it’s avoiding interrupts and multi-threading, and moving to synchronous programming where everything proceeds in lock-step. At least then a modern car would run like clockwork. Other than that, notice that many of the suggestions in the discussion talk about mechanical back-up systems, which are basically putting the lever back into the system.

But software development is not, as commonly believed, about technology. It is about writing, reading, and communication in ever-larger groups. Those who have said, “The pen is mightier than the sword” were thinking of politics. Now writing, in the form of software, interposes itself between every hand and hammer. We have broken the lever; do we know how to wield its replacement?

Cancel that!

leave a comment »

A medical radiation machine operator types the letter ‘x’, realizes it’s an error, backspaces, types ‘e’, and continues. Consequences of the error and related defects lead to the patient’s untimely death.

Ten years later, a Japanese stock broker mistakenly switches the share price and amount on the sell order of a new stock. He tries to cancel the order, but fails. His employers lose $225 million, and are involved in lawsuits for years.

A blogger clicks the “back arrow” by mistake, gets a warning message, concludes he meant to continue writing, and clicks “OK” to continue. He loses his work, and has to rewrite it.  Here, the consequences are annoying, if trivial by comparison.

What do all these cases have in common? Canceling a request. If it’s not hard enough to program computers to do what we want them to do, who would have thought that telling them not to do it would be hard too?

The cancellation scenario — or “use case” as it’s called in software design — is the silent partner of every positive request supported by a piece of software. It has to cover giving the user clear options, executing the cancellation, and rolling back any partial results. Things get more complicated if authorization is required, or if the transaction has already gone through (both of those requirements figured into the Japanese broker error story). Cancellation and rollback are also part of automatic requests that may occur if one software module (the “server”) cannot complete a request by another (the “client”) and has to make sure to put everything back the way it was, and send the proper response code.

So the next time you’re designing a piece of software, no matter how simple, think what it’s supposed to do, but also what it will do if the user or client calls out, “Cancel that!”

Errors are always cumulative

leave a comment »

Errors are always cumulative
Nobody likes to write error-handling code, but at least it’s easy (if boring): check inputs and results with “if” statements, and reject or recover on failure. But is it really that simple?
A little thought shows that errors are cumulative, and that failure is always the gathering or intensification of some faulty condition. Let’s prove that by contradiction. One of the simplest and most common cases of error handling is input data checking in a user interface. For example, password checking for your bank account login. The simple error-handling code is:
if username.password <> password then reject login
That should work fine, right? Nothing cumulative there. Every time I submit a wrong password, it tells me “wrong password” and prompts for a retry. But software developers (and many bank website users) will recognize the problems with that solution, among them:
1. Unbounded retry loop — if I keep getting it wrong, I can’t escape login
2. Denial of service — bring down the server by overwhelming it with bad logins
3. Eliminating wrong passwords — if it tells me “wrong”, I cross that try off my list
All of these real outcomes have something cumulative in them:
1. Time — user may run out of patience
2. Load — too much for server to handle
3. Learning — revealing more and more information about correct password
Take another common example: the elevator. Would simple limit-checking work for stopping at the right floor? Let’s try.
if floor.location <> floor.desired then keep descending
Bang! I wouldn’t want to be on that elevator. What’s cumulative? The momentum of the elevator, and the decreasing distance between current and desired location. (Even though nothing is wrong, that distance is commonly called “error”.) But a better example is the elevator’s door-closing protection. It started out with a simple:
if not (door.can_close) then door.re-open
Wasn’t it fun back then to keep waving your hand in the door and watch it open again, and keep everyone on the other floors waiting? Quickly, though, elevator software designers realized that even if it’s totally unacceptable to ignore a deliberate foot in the door and start moving, it’s equally wrong to go on closing and re-opening forever. So they added that unpleasant buzzing sound, triggered when the retries reach a certain number or amount of time. Cumulative again.
Real-life error-handling, then, has to do more than test for the limit and reject it. It has to recognize faults, count or measure them, and prevent them from growing and leading to failure. In fact, since a fault may be just the limit of an otherwise acceptable condition (e.g. buffer almost full — OK; buffer overflow — fault), error prevention requires identifying and tracking resources even before they reach their limits.

Nobody likes to write error-handling code, but at least it’s easy: check inputs and results with “if” statements, and reject or recover on failure. But is it really that simple?

A little thought shows that errors are cumulative, and that failure is always the gathering or intensification of some faulty condition. Let’s prove that by contradiction. One of the simplest and most common cases of error handling is input data checking in a user interface. For example, password checking for your bank account login. The simple error-handling code is:

if username.password <> password login.reject(“wrong password”)

That should work fine, right? Every time I submit a wrong password, it rejects the attempt and prompts for a retry. But software developers (and many bank website users) will recognize the problems with that solution, among them:

  1. Unbounded retry loop — if I keep getting it wrong, I can’t escape login
  2. Denial of service — bring down the server by overwhelming it with bad logins
  3. Eliminating wrong passwords — if it tells me “wrong”, I cross that try off my list

All of these real outcomes have something cumulative in them:

  1. Time — user may run out of patience
  2. Load — too much for server to handle
  3. Learning — revealing more and more information about correct password

Take another common example: the elevator. Would simple limit-checking work for stopping at the right floor? Let’s try.

if floor.location <> floor.desired elevator.descend

Bang! I wouldn’t want to be on that elevator. What’s cumulative? The momentum of the elevator, and the decreasing distance between current and desired location. (Even when nothing is wrong, that distance is commonly called “error”.) But a better example is the elevator’s door-closing protection. It started out with a simple:

if not (door.can_close) door.re-open

Wasn’t it fun back then to keep waving your hand in the door and watch it open again, and keep everyone on the other floors waiting? Quickly, though, elevator software designers realized that even if it’s totally unacceptable to ignore a deliberate foot in the door and start moving, it’s equally wrong to go on closing and re-opening forever. So they added that unpleasant buzzing sound, triggered when the retries reach a certain number or amount of time. Cumulative again.

Real-life error-handling, then, has to do more than test for the limit and reject it. It has to recognize faults, count or measure them, and prevent them from growing and leading to failure. In fact, since a fault may be just the limit of an otherwise acceptable condition (e.g. buffer almost full — OK; buffer overflow — fault), error prevention requires identifying and tracking resources even before they reach their limits.

Written by Tom Harris

July 6, 2009 at 9:26 am

The Tip of the Iceberg

leave a comment »

We all like to think that functional requirements are the main thing, and successfully designing and coding to them is enough. Who wants to worry about all the surprises from users, data, and even hardware?

But as Professor Behrooz Parhami shows, in a short (2-page!) article, Defect, Fault, Error,…, or Failure? (pdf), the “Ideal” state that we focus on is just one of 7 common possibilities. The other 6, descending into unpleasantness, are Defective, Faulty, Erroneous, Malfunctioning, Degraded, and Failed.

Our job is really twofold:

  1. Meet the functional requirements of the ideal state
  2. Keep the system in that ideal state, and avoid failure

Does failure avoidance have to take 86% (6/7) of the code? I don’t know. But it certainly sounds like the bottom half of an iceberg–a lot more than half is underwater.

Don’t get stuck

leave a comment »

Having a standalone consumer application get stuck or crash, requiring reboot, is not the worst thing that can happen. (Worse is incorrect behavior that causes data loss or physical harm.) But requiring a reboot is the most annoying failure in non-safety-critical systems.

If there’s any good news, it’s that the list of fault modes is short:

  • System resources exhausted
  • Mistakenly idling
  • Waiting for acknowledgement that never comes
  • Deadlock

Did I miss any?

Only exception-safe code can avoid these undesired end states.

Design by Contract (DbC) is one way to exception safety.

Failure mode and effects analysis (FMEA) helps you plan a path to get there.

Follow

Get every new post delivered to your Inbox.