Blog #7: Debugging Maxim’s Firmware Framework Crash

7 Nov 2022

Hello everyone, and welcome to my seventh blog in the Experimenting with Gesture Sensor Design Challenge sponsored by Maxim Integrated (now ADI). It is the seventh post on my journey with the MAX25405 Gesture Sensor Evaluation Kit. This blog is out of order: it does not depend on any previous blog, and most probably no future blog will depend on it. In the previous blog I described the WebSerial API, which I want to use in one of my competition projects, and I promised that the next blog would describe the final implementation of that project. In the meantime, however, I was helping other challengers on the forum and ended up debugging a very interesting and complex problem, so I decided to describe it in detail now.

To summarize, there are currently six previous blogs in which I described my plans for the competition and the MAX25405 evaluation kit from both the software and hardware points of view, including detailed posts about the software provided by Maxim and the serial API provided by the firmware that is preprogrammed in the microcontroller shipped with the kit. Here are links to all my previous blog posts. If you skipped or missed some and are interested, feel free to read them.

Maxim Firmware Framework

The Firmware Framework is an Arm Mbed project that is available for download on the MAX25405 webpage. I originally thought it was the same project that forms the basis of the firmware programmed in the factory into the MAX32620 MCUs shipped with the MAX25405 Gesture Sensor Evaluation Kits. It also has a serial API, and at first glance it looks compatible with the evaluation kit software. Later, however, I and some other challengers realized that there are big differences: the Firmware Framework sources are much less capable than the firmware shipped in the EVKITs, and they lack many very important features. One difference is that the Firmware Framework lets you choose whether the serial API is exposed over a virtual UART, as in the firmware shipped with the EVKITs, or over a physical UART. Because one member was interested in using the kit with an Arduino over UART, he tried to compile this project, but he very soon faced several issues, so he asked on the forum and I (and one other challenger) attempted to help him. First we resolved compilation issues and incompatibilities with the latest Mbed OS, then we faced an invalid memory layout in the generated firmware, and after resolving all of that we hit another issue: the generated firmware worked, but after executing one specific command (poll) it stopped responding to commands.

In this blog I will describe the debugging procedure I used to find the root cause of the firmware crash.

First idea

My first idea was that the firmware was stuck in some (possibly infinite) loop. On an MCU it is easy to find such a loop if you start the code from an IDE with debugging capability, but because I had flashed the firmware to the device without any debugger attached, it became significantly more complicated and required a lot of additional manual steps.

Connecting debugger

The first step was connecting a debugger. The MAX32620FTHR has a standard ARM Cortex-M 10-pin connector with 1.27 mm pitch for attaching a JTAG/SWD debugger. As a debug probe I used the MAX32625PICO board, which is the best fit for Maxim’s MCUs. As software I used OpenOCD, which is bundled with the Maxim SDKs that I had used earlier when developing for Maxim’s MCUs, for example in the previous Experimenting with Current Sense Amplifiers competition.

The physical connection looked as follows:

After starting, OpenOCD successfully found the CPU and started its Tcl and GDB servers. I started OpenOCD with bindto 0.0.0.0, which allows connections to these servers from remote computers. I will later connect from a Linux machine with GDB.

First I checked that the firmware behaves as expected before executing the poll command that causes it to crash.

I connected to the Tcl server using the PuTTY telnet client and manually executed the halt command. It printed some information, including the program counter (also known as the instruction pointer), which indicates the address of the instruction the CPU was executing when the halt command was issued. Then I resumed execution and repeated this procedure three times.

As expected, I received three different addresses. At the random moments when I pressed Enter to issue the halt command, the MCU was running instructions at addresses 0x15A46, 0x13E70 and 0x12FC8. As you can see, the addresses differ and are far from each other (thousands of bytes and instructions). This makes sense: every second this MCU executes 96,000,000 clock cycles. Because ARM Cortex-M4 instructions can take more than one cycle, there is no direct mapping to an instruction count, but it is still in the range of tens of millions of instructions per second. So it makes sense that the three halt commands landed on three different instructions.

At this stage the firmware was working and functional, so I executed the poll command, which caused it to crash:

Then I repeated the procedure of halting the MCU and resuming it, with the following output:

As you can see, in the broken state the firmware was always executing the instruction at address 0x123a6. This indicates that the firmware is stuck in an infinite loop that was compiled and placed at this address. Also note that the debugger retrieved from the CPU the information that the current mode of operation is the HardFault handler, which indicates that the firmware faulted. Note that if you try this on your own you may see different addresses, because the compiler may place code at different addresses in each build.

At this point I connected to the GDB server from my Linux machine using the gdb-multiarch utility. Gdb-multiarch is GDB compiled with support for several instruction sets (for example ARMv7) instead of only the native instruction set of the computer (x86-64 in my case). After starting it, I connected remotely to OpenOCD running on my PC.

As a next step I reset the MCU using the Tcl client, let execution continue in GDB, triggered the faulty command and paused execution with Ctrl+C. Now I could see the mapping to the C source code of the fault handler.

Next I used the bt (backtrace) command to print the call stack. Here we can see that the call chain originates in some assembler code. It is common in the ARM world to implement low-level fault handlers in assembler because it allows better control over the values dumped to the stack.

I switched from the C view to assembler with the layout asm command, and the TUI showed me the instruction currently being executed. It is the instruction at address 0x123A6. This address is not new to us: we already know that the code is stuck here. But now we can see that it is a jump instruction (written as b.n in the ARM instruction set).

I stepped one instruction with the stepi command.

The next instruction executed after the jump is WFI. It is a power-saving instruction that halts the CPU core until an interrupt occurs. The next stepi took me back to 0x123A6.

After executing stepi several more times I found that the loop alternates between 0x123a6 and 0x123a4 indefinitely.

So yes, the code is stuck in this infinite loop as a consequence of the hard fault. Because GDB has debug symbols loaded, we can find that the infinite loop is implemented as part of the mbed_error function. It is the infinite loop highlighted in the following screenshot. Note that the names do not match, but mbed_halt_system was listed by the bt command. Most probably the compiler inlined this function into mbed_error because it is quite simple, so inlining makes sense.

Let’s look at the code that called this function. It starts in the assembler file except.S. This code is lightly commented. The last five instructions are responsible for calling the C code: the instruction at line 184 loads the function address, the next three instructions load the arguments for the call, and the last instruction calls the error handler in C.

The function called by this assembly code tries to print some information about the fault and at the end calls the function for halting the system:

The called function does a minor check and then calls mbed_halt_system:

And this is the function containing the infinite loop:

Analysing hard fault registers

Most ARM Cortex-M cores have a register containing more information about the fault. On the Cortex-M4 it is the CFSR register. You can find its description in the ARM Cortex-M4 User Guide, available for download in PDF format from the ARM website: https://developer.arm.com/documentation/dui0553/b/?lang=en. On the Cortex-M4 the register is at address 0xE000ED28, as mentioned in the documentation. So I read this register from GDB:

The CFSR reports three types of faults that can end up as a HardFault: UsageFault (for example when executing a non-existing instruction), BusFault (for example when attempting to read from non-existing memory) and MemManage fault (for example when reading from existing memory without the proper access rights). As you can see, the last four digits are 0 and only one bit in the UFSR part is set. This indicates that we are facing a UsageFault.

The UsageFault part is described in the documentation one page later:

In our case the value of the UsageFault part is 0x0008, in binary 0b0000000000001000, so only the NOCP bit is set. From the description it looks like our firmware attempted to use a non-existing coprocessor. This is interesting because I did not expect the firmware to use any coprocessor at all; the only “coprocessor” I expected to be used was the floating-point unit. Note that the FPU has dedicated instructions, so I was not sure whether using the FPU can cause this hard fault. Moreover, it cannot be caused by a missing FPU, because a full-featured FPU exists in Maxim’s MCUs…

This was an interesting finding for me. At the time I had no idea how to use this information, but it became helpful later.
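The manual decoding above can be sketched as a few lines of C++ that run on a PC. This is only an illustration: the helper names (split_cfsr, usage_fault_flags) are made up for this post and are not part of Mbed OS or the Firmware Framework, and the value 0x00080000 is reconstructed from the description above (UFSR part 0x0008, last four digits 0). The bit positions follow the CFSR description in the Cortex-M4 User Guide.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Sub-fields of the Cortex-M4 Configurable Fault Status Register (0xE000ED28).
struct CfsrFields {
    std::uint8_t  mmfsr;  // bits 7:0   MemManage Fault Status Register
    std::uint8_t  bfsr;   // bits 15:8  BusFault Status Register
    std::uint16_t ufsr;   // bits 31:16 UsageFault Status Register
};

// Split a raw CFSR value, as read with the debugger, into its three parts.
constexpr CfsrFields split_cfsr(std::uint32_t cfsr) {
    return CfsrFields{
        static_cast<std::uint8_t>(cfsr & 0xFFu),
        static_cast<std::uint8_t>((cfsr >> 8) & 0xFFu),
        static_cast<std::uint16_t>(cfsr >> 16),
    };
}

// Return the names of all UFSR flags that are set.
std::vector<std::string> usage_fault_flags(std::uint16_t ufsr) {
    struct Flag { unsigned bit; const char* name; };
    static const Flag flags[] = {
        {0, "UNDEFINSTR"}, {1, "INVSTATE"},  {2, "INVPC"},
        {3, "NOCP"},       {8, "UNALIGNED"}, {9, "DIVBYZERO"},
    };
    std::vector<std::string> names;
    for (const Flag& f : flags) {
        if (ufsr & (1u << f.bit)) {
            names.push_back(f.name);
        }
    }
    return names;
}
```

Calling split_cfsr(0x00080000) gives ufsr == 0x0008 with both other parts zero, and usage_fault_flags(0x0008) returns the single name NOCP, which matches the manual reading of the register.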

Retrieving Debug Prints

You may have noticed that the hard fault handlers I browsed earlier contain some debugging prints. These prints are attached to some serial port, but they did not appear on the serial line in my case. Maybe they are routed to another UART, maybe to the SWO signal or something similar, but I did not find them. Because I had a debugger attached to the CPU, I decided to retrieve them in a different way. I browsed through the code responsible for printing and eventually found the function mbed_error_vfprintf. As you can see, my IDE thinks that the content of this function is removed by the preprocessor, but I later found that it is in fact compiled in. Most probably my VS Code was configured slightly differently than the compiler.

First I placed a breakpoint at the beginning of the function:

I let the code run, reset the MCU, ran the faulty command and waited until it hit the breakpoint:

I switched to assembler again.

Now I could see the assembly code. After a quick look around I decided to place a breakpoint after the call to vfprintf. After this call, the message has been formatted with its parameters and stored in a buffer. The buffer is the first parameter, so the pointer to it is passed in the first register (R0). Readers skilled in assembler will have noticed that the buffer is at the address the stack pointer points to (this makes sense because the buffer is allocated on the stack, as we can see in the C source code) and that the function at address 0x1210C is then called. I decided to place another breakpoint after this function returns and then read the content of the buffer from memory.

Then I let the code run to the newly added breakpoint:

And I used the x command with the s (string) format to print the memory at the address held in the stack pointer:

This is the message that would normally be sent to some debugging terminal. Then I removed the first breakpoint (which was no longer useful):

I let the code continue until it reached my second breakpoint again and then printed the next formatted text.

I repeated these steps until I had the full output, which was the following:

++ MbedOS Fault Handler ++

FaultType: HardFault

Context:
R1   : 000000E8
R2   : 20013290
R3   : 00000000
R4   : 200021EC
R5   : 20013274
R6   : 00000000
R7   : 000205F8
R8   : 2000101C
R9   : 00000000
R10  : 00000000
R11  : 00000000
R12  : 00013C45
SP   : 20012458
LR   : 000105E1
PC   : 000105E8
xPSR : 610F0000
PSP  : 200123F0
MSP  : 2003FF90
CPUID: 410FC241
HFSR : 40000000
MMFSR: 00000000
BFSR : 00000000
UFSR : 00000008
DFSR : 00000000
AFSR : 00000000
Mode : Thread
Priv : Privileged
Stack: PSP

-- MbedOS Fault Handler --



++ MbedOS Error Info ++
Error Status: 0x80FF013D Code: 317 Module: 255
Error Message: Fault exception
Location: 0x12CEF
Error Value: 0x105E8
Current Thread: Id: 0x200025D8 Entry: 0x12D45 StackSize: 0x10000 StackMem: 0x20002620 SP: 0x2003FF28 
-- MbedOS Error Info --
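Two values in this dump can be decoded with simple bit arithmetic. The sketch below is a hedged illustration with made-up helper names. The HFSR bit meaning comes from the Cortex-M4 documentation: bit 30 (FORCED) says that a configurable fault, such as a UsageFault, was escalated to HardFault. The split of the Mbed error status word into a code (low 16 bits) and a module (next 8 bits) is my assumption about the layout; it does reproduce the Code: 317 and Module: 255 values printed in the log.

```cpp
#include <cassert>
#include <cstdint>

// HFSR bit 30 (FORCED): a configurable fault was escalated to HardFault.
constexpr bool hardfault_was_forced(std::uint32_t hfsr) {
    return (hfsr & (1u << 30)) != 0;
}

// Assumed layout of the Mbed error status word: bits 15:0 hold the error code.
constexpr std::uint32_t error_code(std::uint32_t status) {
    return status & 0xFFFFu;
}

// Assumed layout of the Mbed error status word: bits 23:16 hold the module.
constexpr std::uint32_t error_module(std::uint32_t status) {
    return (status >> 16) & 0xFFu;
}

// Compile-time checks against the values from the dump.
static_assert(hardfault_was_forced(0x40000000u), "HFSR from the dump has FORCED set");
static_assert(error_code(0x80FF013Du) == 317, "matches 'Code: 317' in the log");
static_assert(error_module(0x80FF013Du) == 255, "matches 'Module: 255' in the log");
```

The FORCED bit fits the rest of the dump: FaultType says HardFault, while the only status bit set is in UFSR, so the UsageFault was reported through the HardFault handler.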

Finding code which caused Hard Fault

The previous debug output is interesting because it contains the register values at the time of the crash. For me the most important was the program counter, because it contains the address of the instruction that caused the crash: 0x105E8. So I placed a breakpoint there:

From the output we can see that this instruction is in the file cmd.cpp, which is part of the Firmware Framework application. It is not source code of Mbed OS.

I reset the MCU, triggered the faulty command and waited until it hit the faulty instruction.

I switched to assembly mode again.

This was the most surprising part of the debugging. The faulty instruction is the last instruction of the cmd_poll function, and it is cdp2 0, 11, cr0, cr13, cr1, {0}. After googling I found that cdp2 is an instruction for triggering a coprocessor operation. Now the flag from the fault status register makes sense: the firmware really executed a coprocessor instruction.

Then I tried to find out the purpose of this instruction. The last instruction that makes sense is the bl instruction that calls printf. Note that cmd_poll contains a printf call:

But all the instructions after this call are curious. There are mov instructions in their flag-setting movs variants, but note that they appear in pairs: they write a constant to a register and then immediately overwrite the same register with another constant. This does not make sense. Why would you write a value to a register and then overwrite it without ever using the previously written value? These instructions did not make sense to me. Note that the instructions after cdp2 are valid instructions of the next function. The first of them, for example, pushes 6 registers to the stack, which is a common technique for backing up register values that are later used for local variables in the function.

I originally thought that maybe some corruption had happened when flashing the firmware, so I checked by disassembling the function on my computer from the firmware ELF image:

As you can see, the code is the same, including the last five curious instructions. Also note that this idea was nonsense anyway: if the crash were caused by wrongly flashed firmware, it would happen only to me and not to all contestants. In addition, GDB uses the ELF file for code listings and does not reread the code from the MCU, because that would be a waste of time (debugging would be much slower if instruction memory were reread).

So as a next step I tried a different tool for analysing the assembly. I originally wanted to better understand what happens before the printf call, but after opening the code in the reverse engineering tool Ghidra I noticed some new things. In Ghidra the code looks as follows:

Here we can see the same assembly code but with more information recovered. First, we can see that the printf call is the last instruction. Now I also understood the purpose of the curious mov instructions: they are not instructions. They are hard-coded constants containing the absolute addresses of global variables (gesResult and serial). This is the correct approach on ARM. You may be confused by this. On x86-64 there is no need to do so, because x86-64 has variable-length instructions and a 32-bit constant (for example a memory address) can be included as part of an instruction. In the ARM world, however, instructions have a fixed length of 2 or 4 bytes, so there is no room for wide constants. Instead, the compiler stores large constants near the function and reads these cells with PC-relative addressing. That is the case here. Now I had also demystified the coprocessor instruction: it is not a coprocessor instruction but another large constant, in this case the address of the string containing the format for printf.
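To illustrate how a constant can masquerade as an instruction, here is a small sketch that builds the two halfwords of cdp2 0, 11, cr0, cr13, cr1, {0} and then reads the same four bytes as one little-endian 32-bit word. It assumes the Thumb-2 CDP2 encoding as I understand it from the ARM architecture manual (first halfword 1111 1110 opc1 CRn, second halfword CRd coproc opc2 0 CRm); the function names are hypothetical. The post does not state the actual address of the format string, so treat the resulting number as an illustration rather than a confirmed value.

```cpp
#include <cassert>
#include <cstdint>

// First halfword of a Thumb-2 CDP2 instruction: 1111 1110 opc1 CRn.
constexpr std::uint16_t cdp2_first_halfword(unsigned opc1, unsigned crn) {
    return static_cast<std::uint16_t>(0xFE00u | ((opc1 & 0xFu) << 4) | (crn & 0xFu));
}

// Second halfword of a Thumb-2 CDP2 instruction: CRd coproc opc2 0 CRm.
constexpr std::uint16_t cdp2_second_halfword(unsigned crd, unsigned coproc,
                                             unsigned opc2, unsigned crm) {
    return static_cast<std::uint16_t>(((crd & 0xFu) << 12) | ((coproc & 0xFu) << 8) |
                                      ((opc2 & 0x7u) << 5) | (crm & 0xFu));
}

// A 32-bit Thumb instruction is stored as two little-endian halfwords, the
// first one at the lower address. Read as a single little-endian 32-bit word,
// the first halfword therefore becomes the low half of the value.
constexpr std::uint32_t as_literal_word(std::uint16_t first, std::uint16_t second) {
    return static_cast<std::uint32_t>(first) | (static_cast<std::uint32_t>(second) << 16);
}

// cdp2 0, 11, cr0, cr13, cr1, {0}: coproc=0, opc1=11, CRd=0, CRn=13, CRm=1, opc2=0
static_assert(cdp2_first_halfword(11, 13) == 0xFEBD, "first halfword");
static_assert(cdp2_second_halfword(0, 0, 0, 1) == 0x0001, "second halfword");
static_assert(as_literal_word(0xFEBD, 0x0001) == 0x0001FEBDu, "same bytes read as data");
```

Under these assumptions the bytes that the disassembler shows as cdp2 are the word 0x0001FEBD, which looks like an ordinary address in the same flash region as the code, exactly the kind of value a compiler places in a literal pool.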

So now we know that the curious instructions are not instructions but addresses, but this does not solve our problem. Now we know that the code generated by the compiler is wrong: it calls the function and then does not return but starts interpreting data as instructions, and this causes the firmware to crash. I thought about it and could not find any reason why the compiler would generate faulty binary code. So I tried a different approach and used a different compiler. So far I had used the ARM GCC downloaded from the ARM website, and now I switched to the GCC installed from the Debian repository. After recompiling I disassembled the function. Look at the last instructions:

As you can see, the instructions are different. There is still the printf call (which is correct), but the code then contains two additional instructions: add, which frees the stack variables, and pop, which restores the saved register values and returns from the function. After these instructions there are also movs and a crazy usat16 instruction, but now we know they are not instructions but data. This code is correct: the CPU will never interpret the constants as instructions because there is a return before them.

I flashed the firmware compiled by the Debian GCC to the MCU, and it ran correctly and did not crash after executing the poll command.

Undefined behaviour

I kept thinking about what was wrong with the compiler. Why did the flagship ARM compiler generate faulty assembly code? It is a simple function calling one other function, and yet the compiler generated this mistake. I went back to the C code and studied it again.

After several minutes I noticed the mistake. The function is defined with an int return type but has no return statement. This leads to undefined behaviour according to the language standard (for a C++ file such as cmd.cpp, simply flowing off the end of a non-void function is enough). You should never rely on undefined behaviour. In that case the compiler can do whatever it wants: generate an implicit return 0, an implicit return 42, format your hard drive or anything else. It can also generate faulty code, as you have seen.

Solution

After adding return 0 (zero is represented by the CMD_ACK macro in the Firmware Framework; functions for processing commands should return CMD_ACK or CMD_NACK) and recompiling with ARM GCC, the compiler generated a proper return and the code stopped crashing after executing the poll command.
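The shape of the bug and of the fix can be sketched in a few lines. This is not the real cmd_poll from the Firmware Framework: the function names, the report_result stand-in for the printf call and the value of CMD_NACK are assumptions made for illustration (only CMD_ACK being zero comes from the framework). The buggy variant is shown only as a comment, because actually running it would be undefined behaviour.

```cpp
#include <cassert>

// CMD_ACK is zero in the Firmware Framework; the CMD_NACK value is assumed here.
constexpr int CMD_ACK = 0;
constexpr int CMD_NACK = 1;  // assumption, for illustration only

// Stand-in for the printf call inside the real handler.
static int reports_sent = 0;
void report_result() {
    ++reports_sent;
}

// Buggy shape (kept as a comment on purpose):
//
//   int cmd_poll_buggy() {
//       report_result();
//   }   // no return statement: the compiler may emit no function epilogue at all
//
// Fixed shape: every path through a non-void function returns a value.
int cmd_poll_fixed() {
    report_result();
    return CMD_ACK;
}
```

GCC reports the buggy shape with its -Wreturn-type warning, and -Werror=return-type turns that single warning into a build error, which helps when a project already produces too many warnings for anyone to read.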

Summary

And this is all. As you can see, one missing return can result in invalid code generated by the compiler, execution of curious coprocessor instructions, a hard fault due to accessing a non-existing coprocessor and a crash of the whole firmware. Pay attention to returns! The Maxim engineers who wrote the Firmware Framework did not pay attention and caused big trouble for us. The whole debugging procedure described in this blog took me over 7 hours. Collecting screenshots and writing this blog post took me an additional three hours. I hope this will be helpful to you or to someone else who finds this blog. In it I described techniques for determining the cause of a hard fault, locating the faulty instruction and collecting logs from memory when the serial output does not work, and I showed some techniques for decompiling and analysing assembly code. All of these are generic and useful when debugging hard faults or other crashes.

Thank you for reading this blog. I am currently working on my first project for the competition, so I think the next blog will describe it. If you have any questions or feedback, feel free to write in the comments below.

Next blog: Blog #8: 12V Accident

rsjawale24 over 3 years ago in reply to misaz

I understood many things from your blog. I'm now trying to understand the firmware framework code and the missing algorithms. I have also contacted maxim for the same but till now there is no response from Maxim technical support.
misaz over 3 years ago in reply to dougw

Thank you for feedback.
dougw over 3 years ago

Wow, Great documentation of in depth troubleshooting techniques. Finding the problem with someone else's sloppy code can be a very tough task. Thanks for posting.
misaz over 3 years ago in reply to shabaz

https://www.reddit.com/r/ProgrammerHumor/comments/ggj0v7/just_ignore_the_warnings_right/
misaz over 3 years ago in reply to shabaz

Thank you for feedback.

Almost every IDE notices it and compiler also outputs this warning but it was simply missed by all of us because there are tons of warnings in this project.