Archive for the ‘Reliability’ Category
The Tip of the Iceberg
We all like to think that functional requirements are the main thing, and successfully designing and coding to them is enough. Who wants to worry about all the surprises from users, data, and even hardware?
But as Professor Behrooz Parhami shows in a short (2-page!) article, “Defect, Fault, Error,…, or Failure?” (pdf), the “Ideal” state that we focus on is just one of 7 common possibilities. The other 6, descending into unpleasantness, are Defective, Faulty, Erroneous, Malfunctioning, Degraded, and Failed.
Our job is really twofold:
- Meet the functional requirements of the ideal state
- Keep the system in that ideal state, and avoid failure
Does failure avoidance have to take 86% (6/7) of the code? I don’t know. But it certainly sounds like an iceberg: a lot more than half is underwater.
Don’t get stuck
Having a standalone consumer application get stuck or crash, requiring a reboot, is not the worst thing that can happen. (Incorrect behavior that causes data loss or physical harm is worse.) But in non-safety-critical systems, a forced reboot is the most annoying failure.
If there’s any good news, it’s that the list of fault modes is short:
- System resources exhausted
- Mistakenly idling
- Waiting for acknowledgement that never comes
- Deadlock
Did I miss any?
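Two items on the list above, the acknowledgement that never comes and the deadlock, share one remedy: never wait without a deadline. The following is only a minimal sketch of that idea in Python, not a recipe from any particular system. The function names, the `resend` hook, and the default timeouts are all hypothetical, chosen for illustration.

```python
import threading
from typing import Callable


def wait_for_ack(ack: threading.Event,
                 resend: Callable[[], None],
                 timeout_s: float = 0.5,
                 max_attempts: int = 3) -> bool:
    """Wait for an acknowledgement, but never forever.

    Returns True as soon as `ack` is set. After each timeout the request is
    resent; once `max_attempts` waits have expired, returns False so the
    caller can recover instead of hanging.
    """
    for attempt in range(max_attempts):
        if ack.wait(timeout=timeout_s):
            return True
        if attempt < max_attempts - 1:
            resend()  # hypothetical retry hook supplied by the caller
    return False


def with_lock_or_give_up(lock: threading.Lock,
                         action: Callable[[], None],
                         timeout_s: float = 0.5) -> bool:
    """Run `action` under `lock`, or return False if the lock stays busy."""
    if not lock.acquire(timeout=timeout_s):
        return False  # possible deadlock: report it rather than block forever
    try:
        action()
        return True
    finally:
        lock.release()
```

In both cases the caller gets a `False` it can act on (log, reset the peripheral, degrade gracefully) instead of a system that sits there until someone pulls the plug.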
Only exception-safe code can avoid these undesired end states.
Design by Contract (DbC) is one way to achieve exception safety.
Failure mode and effects analysis (FMEA) helps you plan a path to get there.
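To make the Design by Contract idea concrete, here is a small sketch in Python. Python has no built-in contract support, so the preconditions and postcondition are written as explicit checks. `ContractError`, `require`, `mean_adc`, and the 12-bit ADC range are assumptions made up for this example.

```python
class ContractError(Exception):
    """Raised when a caller or the function itself breaks the contract."""


def require(condition: bool, message: str) -> None:
    # Explicit check rather than `assert`, so it still runs under `python -O`.
    if not condition:
        raise ContractError(message)


ADC_MAX = 4095  # assumed 12-bit converter; purely illustrative


def mean_adc(samples: list[int]) -> int:
    """Integer mean of raw ADC samples, guarded by a simple contract."""
    # Preconditions: what the caller must guarantee.
    require(len(samples) > 0, "precondition: at least one sample")
    require(all(0 <= s <= ADC_MAX for s in samples),
            "precondition: every sample within 0..ADC_MAX")

    result = sum(samples) // len(samples)

    # Postcondition: what the function guarantees in return.
    require(min(samples) <= result <= max(samples),
            "postcondition: mean lies between smallest and largest sample")
    return result
```

A violated contract surfaces immediately at the function boundary, with a message that says who broke the deal, rather than as a mysterious stuck state much later.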
Keeping Embedded Software on Track
Consider a function in software code, and how to write it properly so that it meets its requirements and doesn’t fail. The standard way of looking at it is that a function has an API — a signature — of input and output variables and types. The code for the function has a pretty standard form. There are checks for invalid input, and a function body which implements a mini-flowchart of conditional checks, and maybe a state machine if needed. All this to take every given input vector to the correct output.
Unit tests, whether they’re written before coding (test-driven development) or after, apply test input vectors to the function and check whether the expected output is produced. The same story applies to a module: a collection of functions or classes that implements a feature. There are still valid inputs that must give correct output, and invalid inputs that must be rejected. Best-practice test design methods, such as boundary-value analysis and equivalence partitioning, help the developer choose the best test vectors from the sea of possibilities: the ones that expose failures early, while the code is still in the hands of the developer.
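As a rough illustration of how boundary values and equivalence classes cut the sea of possibilities down to a handful of vectors, consider the sketch below. The function under test, `is_valid_channel`, and its range of 1 to 999 are hypothetical and stand in for any input check.

```python
CHANNEL_MIN, CHANNEL_MAX = 1, 999  # assumed valid range, for illustration only


def is_valid_channel(n: int) -> bool:
    """Function under test: accept only channel numbers in the valid range."""
    return CHANNEL_MIN <= n <= CHANNEL_MAX


def boundary_values(lo: int, hi: int) -> list[int]:
    """Values just outside, on, and just inside each edge of [lo, hi]."""
    return [lo - 1, lo, lo + 1, hi - 1, hi, hi + 1]


# (input, expected result) pairs, written out by hand so the expected values
# do not depend on the code being tested.
TEST_VECTORS = [
    # Boundary values around 1 and 999.
    (0, False), (1, True), (2, True), (998, True), (999, True), (1000, False),
    # One representative per equivalence class: below, inside, above.
    (-50, False), (500, True), (5000, False),
]


def run_vectors() -> list[tuple[int, bool]]:
    """Return the vectors that failed; an empty list means all passed."""
    return [(n, want) for n, want in TEST_VECTORS
            if is_valid_channel(n) != want]
```

Nine vectors cover the three classes and both edges; an off-by-one in either comparison would show up in `run_vectors()` right away.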
So why do we still have so many “surprise” failures? Unit-tested code that nevertheless comes back from system testing with failures? And the Steps to Reproduce — so simple — like, “I left the system running overnight and when I came back in the morning it had crashed.”
The answer may be in the flow of data. Functions and modules in embedded software don’t just deal with single inputs one by one. Generally they are expected to process streams of data in real time. For example, a modern television set-top box with a built-in disk has to process multiplexed video data and metadata, the same kind of data read back from the hard disk, and a much slower stream of user input from the TV remote control. If the code slips up or leaks memory, sooner or later you get wrong output, a hang, or a crash that QC proudly reports.
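One way to bring the “left it running overnight” scenario back to the developer’s desk is a soak-style unit test: feed the code a long stream and check that its state stays bounded. The sketch below is only an assumption about how that might look. `PacketStats`, `soak`, and the history limit are invented for the example, and the 188-byte packet size simply mimics an MPEG transport stream packet.

```python
from collections import deque


class PacketStats:
    """Tracks recent packet sizes without growing as the stream goes on."""

    def __init__(self, history_limit: int = 64) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full,
        # so memory use is capped no matter how long the stream runs.
        self.recent = deque(maxlen=history_limit)
        self.total = 0

    def feed(self, packet: bytes) -> None:
        self.recent.append(len(packet))
        self.total += 1


def soak(stats: PacketStats, n_packets: int, packet_size: int = 188) -> int:
    """Feed a long synthetic stream; return the peak history length seen."""
    peak = 0
    for i in range(n_packets):
        stats.feed(bytes([i % 256]) * packet_size)
        peak = max(peak, len(stats.recent))
    return peak
```

If `recent` were a plain list, the same test would report a peak equal to the number of packets fed, which is exactly the slow leak that a single-input unit test never sees.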
Where have we seen this before? How about those robot car competitions that university engineering departments often hold? These contests have expanded into worldwide challenges to design, for example, an auto-piloted car (see “The DARPA Urban Challenge”). They demand software that will keep a driverless car on track for long periods of time, avoiding stationary and moving obstacles (other cars).
Could it help, then, to see a lowly function in embedded software not as a flowchart of if-then-else statements, but as a little car, or perhaps a delivery truck, that must be kept on track, avoiding obstacles in the data stream while delivering packages to the right place? How would that view affect code layout, code review, and unit test design? See you at the track!