Error Handling

26 Aug 2015

On Episode 112 (My Brain Is My Resource), Chris and I briefly answered a listener question about how to handle errors. Our response went something along the lines of “bubble errors up to wherever the system is intelligent enough to handle them.”

This answer was not very actionable, and the question of effective error handling is a good one, so we dug into it more on Episode 114 with Andrei Chichak (Wild While Loops). Let’s start with some good tactical advice you can use today.

Default to errors and logging

Put error logger calls into all of the default cases of switch statements to pick up stuff that can’t happen but does. For cases that you don’t care about, group them into a don’t-care fall through. Many people don’t handle the default case on the premise that they’d never get there. That is why it is such a great error handling state. This default-is-always-an-error mentality will help with stack errors as the system falls into a weird case that is absolutely impossible.
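Here is a minimal sketch of that pattern. The state names and the log_error() stand-in are made up for illustration; a real project would call whatever logger it already has.

```c
#include <stdio.h>

/* Hypothetical states, for illustration only. */
typedef enum {
    STATE_IDLE,
    STATE_MEASURING,
    STATE_SLEEPING,
    STATE_CHARGING
} device_state_t;

/* Counts logged errors so other code (or a test) can see them. */
static unsigned int error_count = 0;

/* Stand-in for a real error logger (assumption: yours looks different). */
static void log_error(const char *file, int line, int value)
{
    error_count++;
    fprintf(stderr, "ERROR %s:%d unexpected value %d\n", file, line, value);
}

/* Returns 0 if the state was recognized, -1 if it hit the default case. */
int handle_state(device_state_t state)
{
    switch (state) {
    case STATE_MEASURING:
        /* ... the real work would go here ... */
        return 0;

    /* Don't-care cases, grouped into one explicit fall through. */
    case STATE_IDLE:
    case STATE_SLEEPING:
    case STATE_CHARGING:
        return 0;

    /* "Can't happen": corrupted variable, unhandled new enum value, ... */
    default:
        log_error(__FILE__, __LINE__, (int)state);
        return -1;
    }
}
```

Listing the don’t-care cases explicitly means the default case is reserved for values that really should never show up.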

This assumes you have an error logging subsystem. Error logging is often something I bring from project to project and it goes in very early in my system because it is critical to the way I develop. But often, as a consultant, I get hired at a place where many things are already in place but not necessarily a good, flexible logging system. I get nervous without one as you can develop a complex system and not understand how the code is working, making it much more difficult to fix.

Please, when starting a project, put in an error handling routine and use it immediately, even if it doesn’t do much of anything. At a minimum, blink an LED then reboot. In my book, I used a logging interface example to discuss system architecture and the adapter design pattern. One important part of logging is separating the output method from the logging needs in the code.

Logging can have different ways of outputting an error to the engineer (or user, more on that later). Better, with a good interface, the underlying methodology may change over the life of the product. Initially the prototype boards can pull out a debug serial port that later gets hidden in an enclosure (or under some potting). Later, logging output may move to a tiny ring buffer in flash that doesn’t print strings, only codes which can be read out in case of catastrophe. Perhaps an LED flashing a fault code if you are really constrained. The logging interface shouldn’t depend on the logging method.
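One way to sketch that separation, assuming a function pointer is an acceptable cost on your platform: the calling code only ever sees log_report(), and the output method is swapped underneath it. All names here (log_report, log_set_output, the two backends) are hypothetical, and the "flash" ring buffer is just a RAM array standing in for the real thing.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* An output method is any function that can record one error. */
typedef void (*log_output_fn)(uint16_t code, const char *text);

/* Backend 1: debug serial port, stubbed here with stdout. */
void log_output_serial(uint16_t code, const char *text)
{
    printf("[0x%04X] %s\n", (unsigned int)code, text);
}

/* Backend 2: tiny ring buffer that keeps codes only, no strings. */
#define LOG_RING_SIZE 8u
uint16_t log_ring[LOG_RING_SIZE];
size_t log_ring_next = 0;

void log_output_ring(uint16_t code, const char *text)
{
    (void)text;                                   /* strings are dropped */
    log_ring[log_ring_next] = code;
    log_ring_next = (log_ring_next + 1u) % LOG_RING_SIZE;
}

/* The output currently in use; serial by default. */
static log_output_fn current_output = log_output_serial;

/* Swap the output method without touching any calling code. */
void log_set_output(log_output_fn fn)
{
    if (fn != NULL) {
        current_output = fn;
    }
}

/* The only function the rest of the code needs to know about. */
void log_report(uint16_t code, const char *text)
{
    current_output(code, text);
}
```

Moving from the prototype’s serial port to the ring buffer is then a single log_set_output(log_output_ring) call at startup.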

Also, I like to have levels of logging (i.e. info, warning, error, critical). This lets me log everything (info and warning) in development, without losing the more severe errors in production. (Note: logging is great and I could go on (and on) about it, but be aware that it can change timing of the system and take up additional code space.)
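A rough sketch of level filtering, using the four levels named above. The function names and the run-time threshold are assumptions for the example; a compile-time threshold is another reasonable choice if code space is tight.

```c
#include <stdio.h>

typedef enum {
    LOG_INFO = 0,
    LOG_WARNING,
    LOG_ERROR,
    LOG_CRITICAL
} log_level_t;

/* Messages below this level are dropped. */
static log_level_t log_threshold = LOG_INFO;

/* Number of messages actually emitted. */
static unsigned int log_emitted = 0;

void log_set_threshold(log_level_t level)
{
    log_threshold = level;
}

/* Returns 1 if the message was emitted, 0 if it was filtered out. */
int log_message(log_level_t level, const char *text)
{
    static const char *const names[] = { "INFO", "WARN", "ERROR", "CRIT" };

    if (level > LOG_CRITICAL) {
        level = LOG_CRITICAL;        /* treat unknown levels as the worst */
    }
    if (level < log_threshold) {
        return 0;
    }
    log_emitted++;
    printf("%s: %s\n", names[level], text);
    return 1;
}
```

In development the threshold stays at LOG_INFO; a production build might call log_set_threshold(LOG_ERROR) once at startup.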

Detect and handle

By definition, you won’t be able to log an error (or handle an error) until there is a problem. That means you have to detect errors. (Note: Production is not the time to start trying to detect errors, so check early and often.)

It seems like there are infinitely many different kinds of errors: software, hardware, environmental. Some errors are persistent, requiring intervention, some can be solved with retries. Consider the type of error you have as you try to figure out how to handle it.

Say you are working on an I2C driver. If a system is tuned for power optimization, then you may get an I2C error as a temporary glitch even though the system is generally working to spec. Do you:

Retry at the I2C transaction layer?
Immediately ASSERT and watchdog reset?
Retry in the higher driver level for the peripheral controlled via I2C?
Bubble up the error to the user via LED or LCD?
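As an illustration of the first option only, here is a sketch of a bounded retry at the transaction layer. The status codes, the transfer function type, and the limit of three attempts are all assumptions; fake_xfer() is a host-side stand-in for the hardware so the example can run on a PC.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { I2C_OK = 0, I2C_ERR_NACK, I2C_ERR_TIMEOUT } i2c_status_t;

/* Signature of one raw bus transaction (hypothetical driver API). */
typedef i2c_status_t (*i2c_xfer_fn)(uint8_t addr, const uint8_t *data,
                                    size_t len);

#define I2C_MAX_ATTEMPTS 3u

/* Tries the transfer up to I2C_MAX_ATTEMPTS times. Returns the last status
 * and reports how many attempts were used, so the caller can log glitches
 * even when the write eventually succeeds. */
i2c_status_t i2c_write_with_retry(i2c_xfer_fn xfer, uint8_t addr,
                                  const uint8_t *data, size_t len,
                                  unsigned int *attempts_used)
{
    i2c_status_t status;
    unsigned int attempts = 0;

    do {
        status = xfer(addr, data, len);
        attempts++;
    } while (status != I2C_OK && attempts < I2C_MAX_ATTEMPTS);

    if (attempts_used != NULL) {
        *attempts_used = attempts;
    }
    return status;   /* a persistent failure still bubbles up */
}

/* Host-side stand-in for the bus: fails this many times, then succeeds. */
unsigned int fake_failures_left = 0;

i2c_status_t fake_xfer(uint8_t addr, const uint8_t *data, size_t len)
{
    (void)addr; (void)data; (void)len;
    if (fake_failures_left > 0) {
        fake_failures_left--;
        return I2C_ERR_NACK;
    }
    return I2C_OK;
}
```

The retry hides temporary glitches, but the error code still goes up to the peripheral driver when the retries run out, so the other options remain available at higher layers.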

Depending on the situation, a case can be made for each of these. Of course to handle the error, you likely have to check a return code somewhere. On Andrei’s previous episodes, we focused on MISRA (podcast, Linker post). One of the MISRA rules is to always check your return codes. Raise your hand: do you check all the return codes? Even on printf?

If you want to show that you are intentionally ignoring a return, you can cast it to show intentionality.
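For example, a small sketch with one return code that is checked and one that is deliberately thrown away. The format_status() function is hypothetical; the snprintf and printf calls are standard C.

```c
#include <stdio.h>

/* Formats a status line into buf.
 * Returns 0 on success, -1 if the text did not fit. */
int format_status(char *buf, size_t size, int code)
{
    /* This return code matters: a truncated message could mislead. */
    int n = snprintf(buf, size, "status=%d", code);
    if (n < 0 || (size_t)n >= size) {
        return -1;
    }

    /* Debug echo: the cast to void says "ignored on purpose". */
    (void)printf("%s\n", buf);
    return 0;
}
```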

This provides a little reminder that you are throwing away the result. Personally, I find this ugly. Though, as with most coding style guidelines, with enough use it becomes a habit and useful to your team if you all are consistent. [Insert rant about how I don’t care what your style guidelines are but I think you should have them.]

On the other hand, verifying that printf sent all of your characters wherever they were supposed to go, especially if it is the debug serial port in the error logging subsystem, seems silly. Where do you draw the line with checking return codes? Some return codes are purely informational and we don’t need to be rigorous about checking them. But which ones? It is a slippery slope that requires judgment.

Of course, you must check the return codes of important things (like malloc, if you are using it, or the success/failure of a subsystem). Mostly, as you consider whether a return code is worth checking, you have to be aware of cascading consequences so you can handle potentially big problems early.

Does error handling clutter your code? Yes. Is it worth it? Almost always. Well-done error handling (and logging) can provide documentation to your code as a side benefit.

Asserting errors do not exist

What about asserts? Asserts are usually implemented as a macro, something like the following.
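A minimal sketch of such a macro, modeled loosely on how the standard assert behaves with NDEBUG. MY_ASSERT is a made-up name to avoid clashing with the real one in assert.h, and actual library implementations differ in the details.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef NDEBUG
/* Production build: the check (and its side effects) disappear entirely. */
#define MY_ASSERT(expr) ((void)0)
#else
/* Debug build: print where the failure happened, then abort. */
#define MY_ASSERT(expr)                                              \
    ((expr) ? (void)0                                                \
            : (fprintf(stderr, "Assertion failed: %s, %s:%d\n",      \
                       #expr, __FILE__, __LINE__),                   \
               abort()))
#endif
```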

In an embedded system, the abort() is sometimes implemented as a reset and sometimes implemented as a while(1) {;} which causes a watchdog reset.

When I use asserts, I tend to write my own so that I can write an assembly instruction to breakpoint before resetting the processor. Something like this, though it is processor (and compiler) dependent.
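A sketch of the idea, not a drop-in implementation. The bkpt instruction is the ARM breakpoint as written for GCC; on any other target the macro compiles to nothing. The platform_debugger_attached() and platform_reset() hooks are hypothetical and stubbed so the example runs on a PC; how you detect a debugger and how you reset are both part-specific.

```c
#include <stdio.h>

/* Software breakpoint: ARM with GCC only; a no-op elsewhere in this sketch. */
#if defined(__GNUC__) && defined(__arm__)
#define PLATFORM_BREAKPOINT() __asm volatile ("bkpt #0")
#else
#define PLATFORM_BREAKPOINT() ((void)0)
#endif

/* Counts reset requests so the stub is observable on a host build. */
static unsigned int reset_requests = 0;

/* Hypothetical hook, stubbed: a real one would query the debug hardware. */
static int platform_debugger_attached(void)
{
    return 0;
}

/* Hypothetical hook, stubbed: a real one would reset the processor or
 * spin until the watchdog does it. */
static void platform_reset(void)
{
    reset_requests++;
}

void assert_failed(const char *expr, const char *file, int line)
{
    fprintf(stderr, "ASSERT %s at %s:%d\n", expr, file, line);
    if (platform_debugger_attached()) {
        PLATFORM_BREAKPOINT();   /* stop here so the call stack is intact */
    }
    platform_reset();
}

#define ASSERT(expr) \
    ((expr) ? (void)0 : assert_failed(#expr, __FILE__, __LINE__))
```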

This only happens when the debugger is attached (and it means I don’t need to waste my limited hardware breakpoints).

If you use asserts as your primary error detection, then when you get to production and flip the debug definition, you’ve lost all your error detection. Sure, none of that should have happened, but shouldn’t we still be checking it? Either errors that should be handled in production are no longer being handled, or the assert checks were around things that really didn’t matter.

Turning off asserts like this also changes the timing of your code and your memory map file. I think it is better to build a logging subsystem that can be turned off more gently. And there is no reason the logging subsystem can’t assert on critical level errors.
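One possible shape for this, as a sketch: the check stays in the production build, only the output is gated by a compile-time minimum level, and a critical error always reaches a fatal handler. LOG_MIN_LEVEL, log_event(), and system_fatal() are hypothetical names, and system_fatal() is a stub standing in for a reset or safe-state routine.

```c
#include <stdio.h>

typedef enum { LOG_INFO, LOG_WARNING, LOG_ERROR, LOG_CRITICAL } log_level_t;

/* Production builds could pass -DLOG_MIN_LEVEL=LOG_ERROR to quiet output. */
#ifndef LOG_MIN_LEVEL
#define LOG_MIN_LEVEL LOG_INFO
#endif

/* Counts fatal events so the stub is observable on a host build. */
static unsigned int fatal_count = 0;

/* Stub: a real handler would go to a safe state and/or reset. */
static void system_fatal(void)
{
    fatal_count++;
}

void log_event(log_level_t level, unsigned int code)
{
    /* Output can be turned down without removing the call itself. */
    if ((int)level >= (int)LOG_MIN_LEVEL) {
        fprintf(stderr, "level=%d code=0x%04X\n", (int)level, code);
    }

    /* Critical errors act like an assert, in every build. */
    if (level >= LOG_CRITICAL) {
        system_fatal();
    }
}
```

Because the call stays in place and only the output is gated, the code path differs less between development and production than it does when asserts compile away to nothing.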

Sending Errors to the User

After detecting, handling, and logging the error, the next step may be to report it. Maybe.

Start with hardware: can you tell the user about errors? Should you? In a system with an extremely limited interface, how can you report errors? LEDs can blink Morse code, sometimes handy for you in debugging, but do you really want to torture users and/or your support folk with that?

It really depends on the user. Some are ok with that. Other users don’t necessarily care about errors until something happens: they aren’t looking at your blinking LED until the device has ceased functioning. For example, when my router fails, I don’t know/care until I try to surf the internet and it fails. If I go over there and it is blinking, I power cycle it. But if it isn’t blinking, I also power cycle it. The LED is pointless for me as a user. (Even more so because my LEDs are covered with electrical tape because I don’t like to see them blinking. And, as you design fancy LED interfaces, don’t forget that a large section of the population is colorblind.)

Andrei spoke of a system he was working on that should output 1-3V. When it was not powered, of course, it output zero volts. However, when there is an error that must be handled, he sets the output to 0.5V which makes the system not work (so the user knows) but also tells the support engineer something very useful.

Thus, an important question in error handling is “Who are you sending your errors to?” Errors you send yourself are easy and it’s ok if they are obscure. Errors to the user are much more difficult. Assuming you can give errors to the user, what do you show them?

Consider the classic DOS error message, “Abort, Retry, Fail?”, shown when it cannot find a floppy disk drive. Abort means return without a failure code, fail returns an error code to the program, and retry is the most useful because the user may have forgotten to put the disk in. But these words don’t mean much to a busy professional.

It would be nice if we’d learned our lesson from DOS. However, in the Windows 10 installation, it is possible to get a dialog with the title “something happened” where the window text again says “something happened”, then includes a hex code and an ok button.

One thing to remember about your users: there is a reason they didn’t go into computers.

Bad error reporting (and detection and handling) is often due to trading off greatness for time to market. One easy-ish thing to do is to bundle the text of all your errors, send it to someone who can act as your user, and ask them what they think each error means and what they should do about it. You can’t rely on your QA department to see all errors in testing (and if they do, they may be so familiar with the system that they no longer qualify as typical users). The extra perspective is necessary for making good products.

Sometimes errors are critical and can hurt people (e.g. medical devices). The goal there is to fail to a safe state and report the error. With the stakes high, you may not want to reset or take corrective action unless it is possible to be confident that the error was temporary or any necessary intervention was taken. At the other end of the spectrum, when I worked on children’s toys, it was fine to reset if the system got into a truly bad state. There is no panacea that works for all systems.

Development would be easier if there were a way to mandate how to detect, handle, log, and report errors for all systems. Unfortunately, there isn’t, so some care and thought will always be necessary.

Acknowledgements

While I wrote this up based on the Embedded.fm episode, Andrei Chichak put together an excellent list of error handling techniques for us to discuss on the show. And, of course, Chris was an integral and helpful part of our discussion.