Archive for the ‘Recovery’ Category
John Henry the Maintenance Programmer
Thanks to Bulkan Evcimen for tweeting about an article I doubt I would have found otherwise: “A Genetic Programming Approach to Automated Software Repair” (pdf), by Stephanie Forrest et al. In it, they describe feeding a genetic programming algorithm the following inputs:
- Source code with a known defect
- A negative test case
- Several positive test cases
The authors show that the genetic algorithm can repair the defect: the repaired program no longer fails the negative test case, and it still passes the positive test cases. They finish with additional software that reviews the change and reduces its scope to the minimum.
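To make the acceptance criterion concrete, here is a minimal sketch of just the test-based check: a candidate counts as a repair only if it passes the negative test case and every positive one. This is not the authors' algorithm. The step where variants are generated by mutation and crossover is left out, the candidates are written by hand, and every name here (`buggy_max`, `is_repair`, and so on) is hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Candidate = Callable[[int, int], int]
TestCase = Tuple[Tuple[int, int], int]  # ((a, b), expected result)


def passes(candidate: Candidate, case: TestCase) -> bool:
    """A test passes if the candidate returns the expected value without raising."""
    args, expected = case
    try:
        return candidate(*args) == expected
    except Exception:
        return False


def is_repair(candidate: Candidate, negative: TestCase,
              positives: List[TestCase]) -> bool:
    """Accept only if the negative case is fixed and no positive case regresses."""
    return passes(candidate, negative) and all(passes(candidate, c) for c in positives)


def first_repair(candidates: Iterable[Candidate], negative: TestCase,
                 positives: List[TestCase]) -> Optional[Candidate]:
    """Return the first accepted candidate, or None if none qualifies."""
    for candidate in candidates:
        if is_repair(candidate, negative, positives):
            return candidate
    return None


# Hypothetical defect: the else branch returns a instead of b.
def buggy_max(a: int, b: int) -> int:
    return a if a >= b else a


# Hand-written stand-ins for variants a search might propose.
def variant_always_b(a: int, b: int) -> int:
    return b  # fixes the negative case but breaks a positive one


def variant_fixed(a: int, b: int) -> int:
    return a if a >= b else b


NEGATIVE: TestCase = ((1, 2), 2)                   # fails on the buggy code
POSITIVES: List[TestCase] = [((3, 1), 3), ((5, 5), 5)]  # pass on the buggy code

repair = first_repair([buggy_max, variant_always_b, variant_fixed],
                      NEGATIVE, POSITIVES)
```

Note how `variant_always_b` is rejected: it makes the failing test pass but breaks existing behavior, which is exactly what the positive test cases are there to catch.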
Is this the end of the maintenance programmer? If so, this time there would be a happy ending: John Henry lives, and goes on to be a railroad engineer.
Cancel that!
A medical radiation machine operator types the letter ‘x’, realizes it’s an error, backspaces, types ‘e’, and continues. Consequences of the error and related defects lead to the patient’s untimely death.
Ten years later, a Japanese stock broker mistakenly switches the share price and amount on the sell order of a new stock. He tries to cancel the order, but fails. His employers lose $225 million, and are involved in lawsuits for years.
A blogger clicks the “back arrow” by mistake, gets a warning message, concludes he meant to continue writing, and clicks “OK” to continue. He loses his work, and has to rewrite it. Here, the consequences are annoying, if trivial by comparison.
What do all these cases have in common? Canceling a request. As if it weren’t hard enough to program computers to do what we want them to do, who would have thought that telling them not to do it would be hard too?
The cancellation scenario, or “use case” as it’s called in software design, is the silent partner of every positive request supported by a piece of software. It has to cover giving the user clear options, executing the cancellation, and rolling back any partial results. Things get more complicated if authorization is required, or if the transaction has already gone through (both of those requirements figured in the Japanese broker story). Cancellation and rollback also apply to automatic requests: when one software module (the “server”) cannot complete a request from another (the “client”), it has to put everything back the way it was and send the proper response code.
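Here is a minimal sketch of that rollback idea, assuming each step of a request can be paired with an action that undoes it. The names (`run_request`, `Cancelled`, the ledger) and the response codes are hypothetical, and the sketch assumes each step either completes fully or changes nothing.

```python
import threading
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (do, undo)


class Cancelled(Exception):
    """Raised internally when a cancel request is seen between steps."""


def run_request(steps: List[Step], cancel: threading.Event) -> str:
    """Run steps in order; on cancel or failure, undo completed steps in reverse."""
    undo_stack: List[Callable[[], None]] = []
    try:
        for do, undo in steps:
            if cancel.is_set():        # check for "Cancel that!" before each step
                raise Cancelled()
            do()                       # assumed atomic: all or nothing
            undo_stack.append(undo)    # remember how to reverse it
    except Exception as exc:
        for undo in reversed(undo_stack):
            undo()                     # put everything back the way it was
        return "CANCELLED" if isinstance(exc, Cancelled) else "FAILED"
    return "OK"


# Usage: a transfer that is cancelled after the debit but before the credit.
ledger = {"alice": 100, "bob": 0}
cancel = threading.Event()


def debit() -> None:
    ledger["alice"] -= 30


def undo_debit() -> None:
    ledger["alice"] += 30


def user_cancels() -> None:
    cancel.set()                       # stands in for the user pressing Cancel


def credit() -> None:
    ledger["bob"] += 30


def undo_credit() -> None:
    ledger["bob"] -= 30


status = run_request(
    [(debit, undo_debit), (user_cancels, lambda: None), (credit, undo_credit)],
    cancel,
)
```

After the call, `status` is `"CANCELLED"` and the ledger is back to its starting values. What the sketch does not handle is the harder case from the broker story: a cancellation that arrives after the last step has already gone through.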
So the next time you’re designing a piece of software, no matter how simple, think what it’s supposed to do, but also what it will do if the user or client calls out, “Cancel that!”
The Tip of the Iceberg
We all like to think that functional requirements are the main thing, and successfully designing and coding to them is enough. Who wants to worry about all the surprises from users, data, and even hardware?
But as Professor Behrooz Parhami shows, in a short (2-page!) article, Defect, Fault, Error,…, or Failure? (pdf), the “Ideal” state that we focus on is just one of 7 common possibilities. The other 6, descending into unpleasantness, are Defective, Faulty, Erroneous, Malfunctioning, Degraded, and Failed.
Our job is really twofold:
- Meet the functional requirements of the ideal state
- Keep the system in that ideal state, and avoid failure
Does failure avoidance have to take 86% (6/7) of the code? I don’t know. But it certainly sounds like an iceberg: a lot more than half is underwater.
Don’t get stuck
Having a standalone consumer application get stuck or crash, requiring reboot, is not the worst thing that can happen. (Worse is incorrect behavior that causes data loss or physical harm.) But requiring a reboot is the most annoying failure in non-safety-critical systems.
If there’s any good news, it’s that the list of fault modes is short:
- System resources exhausted
- Mistakenly idling
- Waiting for acknowledgement that never comes
- Deadlock
Did I miss any?
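For the third item, the usual defense is to never wait without a limit. Here is a minimal sketch, assuming acknowledgements arrive on an in-process `queue.Queue`; the function name, the retry count, and the timeout are hypothetical choices, not a recommendation.

```python
import queue


def wait_for_ack(acks: "queue.Queue[str]", timeout_s: float, retries: int) -> bool:
    """Wait for an acknowledgement, but give up after a bounded number of tries."""
    for _attempt in range(1 + retries):
        try:
            acks.get(timeout=timeout_s)   # blocks for at most timeout_s seconds
            return True
        except queue.Empty:
            continue                      # a real system would resend the request here
    return False                          # caller can now report failure instead of hanging


# Usage: one peer that never answers, and one that does.
silent_peer: "queue.Queue[str]" = queue.Queue()
answering_peer: "queue.Queue[str]" = queue.Queue()
answering_peer.put("ACK")

got_silent = wait_for_ack(silent_peer, timeout_s=0.01, retries=2)
got_answer = wait_for_ack(answering_peer, timeout_s=0.01, retries=2)
```

With the silent peer the function returns `False` after three short waits, so the application stays responsive and can tell the user what happened.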
Only exception-safe code can avoid these undesired end states.
Design by Contract (DbC) is one route to exception safety.
Failure mode and effects analysis (FMEA) helps you plan a path to get there.
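To illustrate the Design by Contract idea, here is a minimal sketch with explicit precondition and postcondition checks. Python has no built-in contract support, so the helpers `require` and `ensure`, the `ContractViolation` exception, and the `withdraw` example are all hypothetical.

```python
class ContractViolation(Exception):
    """Raised when a caller or the implementation breaks the contract."""


def require(condition: bool, message: str) -> None:
    """Precondition: the caller's obligation."""
    if not condition:
        raise ContractViolation("precondition failed: " + message)


def ensure(condition: bool, message: str) -> None:
    """Postcondition: the implementation's obligation."""
    if not condition:
        raise ContractViolation("postcondition failed: " + message)


def withdraw(balance: int, amount: int) -> int:
    """Return the new balance after withdrawing amount."""
    require(amount > 0, "amount must be positive")
    require(amount <= balance, "amount must not exceed balance")

    new_balance = balance - amount

    ensure(new_balance >= 0, "balance must stay non-negative")
    ensure(new_balance + amount == balance, "no money created or lost")
    return new_balance


# Usage: a valid call, and a call that breaks the contract.
ok_balance = withdraw(100, 30)
try:
    withdraw(10, 20)
    rejected = False
except ContractViolation:
    rejected = True
```

The point is that a bad request is refused at the boundary, before any state changes, so there is nothing to clean up afterwards.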