Debugging ARM Cortex-M0+ HardFaults

To me, one of the most frustrating things about working with ARM Cortex-M cores is the hard fault exception. I lost several hours this week debugging and tracking down a hard fault on an ARM Cortex-M0+ device.

Next assembly step will cause a hard fault

The background: I’m porting a project (NXP Kinetis KW40Z160) to Eclipse and the GNU toolchain for ARM. The application seems to run fine after downloading it with the debugger.

However, it crashes with a hard fault if I either do a ‘restart’ with the debugger or reset the microcontroller with a SYSRESETREQ (see “How to Reset an ARM Cortex-M with Software”).

Interestingly, it crashed during startup in the ANSI library, in the _init() function:

Next assembly step will cause a hard fault

Pushing the registers in _init() will cause the hard fault exception:

triggered hard fault

Interestingly, that same code with the same registers, stack, etc. works fine the first time, but not later, after the application has been running.

The usual suspects

The first step was to check the usual suspects:

  • Output of the hard fault handler? Did not show anything usable.
  • Stack pointer not aligned? No, that’s fine, and it works with the same values the first time.
  • ARM/Thumb mode? Nope, all in Thumb mode.
  • Stack readable/writable? Yes, and no special protection or anything of that kind was used.
  • Interrupt priorities? Nope, it happens with interrupts disabled.
  • Problem with the debugger? Nope, tried both Segger and P&E, same for both, and it happens with and without a debugger attached or used.
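
Several of these checks start from the exception stack frame. The following is a minimal sketch, not the handler used in this project: on exception entry the core pushes eight words (R0-R3, R12, LR, PC, xPSR), and the Thumb state shows up as bit 24 of the stacked xPSR. The names `StackFrame`, `DecodeStackFrame()`, `IsThumbState()` and `IsWordAligned()` are hypothetical, and how the handler obtains the stack pointer value (MSP or PSP) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Words pushed by the core on exception entry, lowest address first. */
typedef struct {
  uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} StackFrame;

/* Hypothetical helper: copy the eight stacked words starting at 'sp'. */
static StackFrame DecodeStackFrame(const uint32_t *sp) {
  StackFrame f;
  f.r0   = sp[0];
  f.r1   = sp[1];
  f.r2   = sp[2];
  f.r3   = sp[3];
  f.r12  = sp[4];
  f.lr   = sp[5];
  f.pc   = sp[6]; /* address of the instruction that was executing */
  f.xpsr = sp[7];
  return f;
}

/* Thumb state is flagged by the T bit (bit 24) of the stacked xPSR. */
static bool IsThumbState(const StackFrame *f) {
  return ((f->xpsr >> 24) & 1u) != 0u;
}

/* The stack pointer has to be word (4 byte) aligned. */
static bool IsWordAligned(uint32_t sp) {
  return (sp & 3u) == 0u;
}
```

The stacked PC is usually the most useful value, because it points at (or near) the instruction that triggered the fault.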

Well, that was the point where I ran out of ideas :-(. What remained was the desperate attempt of asking colleagues, who came back with a list of points similar to the one above. So no progress.

Internet Help?

Next: search the internet. Lots of other people have issues with hard faults too. At least this turned up an interesting article about imprecise hard faults. Very interesting reading, and I did my digging. Unfortunately it only applies to Cortex-M3/M4 and not to the M0+. Anyway, I have added an option to the hard fault handler so I can deal with this better in other projects:

Hard Fault Handler with option to disable write buffer
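
Such an option boils down to setting one bit. As a rough sketch, assuming a Cortex-M3/M4 (not the M0+): the DISDEFWBUF bit (bit 1) of the Auxiliary Control Register (ACTLR, at 0xE000E008) disables the default write buffer, which turns imprecise bus faults into precise ones. `DisableWriteBuffer()` is a hypothetical helper name, and the register is passed in as a pointer so the logic can be tried on a host as well.

```c
#include <stdint.h>

#define ACTLR_ADDRESS    0xE000E008u /* Auxiliary Control Register (Cortex-M3/M4) */
#define ACTLR_DISDEFWBUF (1u << 1)   /* disable the default write buffer */

/* Hypothetical helper: set DISDEFWBUF and leave the other bits untouched. */
static void DisableWriteBuffer(volatile uint32_t *actlr) {
  *actlr |= ACTLR_DISDEFWBUF;
}
```

On the target the call would be `DisableWriteBuffer((volatile uint32_t *)ACTLR_ADDRESS);`. It is best used only while debugging, because disabling the write buffer costs performance.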

I still had no clue. What I ended up doing was a ‘binary’ search: turning parts of the application on and off to find out which part was causing the problem. Time consuming, but that was my only option.

Gotcha!

And indeed, after hours of searching I narrowed it down to the flash programming: the application stores data internally by reprogramming a 1 KByte flash page, and for this it erases a 1 KByte block in flash. In the linker file it was allocated like this:

  .nvm :
  {
    . = ALIGN(8);
    NV_STORAGE_START_ADDRESS = .;
    . += 1024;
    NV_STORAGE_END_ADDRESS = .;
  } > m_text

Erasing seemed to work fine. But after that block erase the microcontroller would crash after a restart.

Looking closer at the linker file, I finally spotted the problem: the section was only aligned to 8 bytes, but it should have been aligned to the 1 KByte flash block size:

  .nvm :
  {
    . = ALIGN(1024); /* must be 1k aligned! */
    NV_STORAGE_START_ADDRESS = .;
    . += 1024;
    NV_STORAGE_END_ADDRESS = .;
  } > m_text

With this change, the problem went away 🙂
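
A run-time guard could catch this kind of mistake before the erase is started. The following is only a sketch: `FLASH_SECTOR_SIZE`, `IsSectorAligned()` and `IsValidNvmRegion()` are hypothetical names and not part of the NXP flash driver. The start and end values would be the addresses of the linker symbols NV_STORAGE_START_ADDRESS and NV_STORAGE_END_ADDRESS.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLASH_SECTOR_SIZE 1024u /* 1 KByte sector size of the KW40Z160 */

/* True if 'addr' is the first address of a flash sector. */
static bool IsSectorAligned(uint32_t addr) {
  return (addr & (FLASH_SECTOR_SIZE - 1u)) == 0u;
}

/* True if [start, end) is non-empty and covers whole sectors only. */
static bool IsValidNvmRegion(uint32_t start, uint32_t end) {
  return end > start && IsSectorAligned(start) && IsSectorAligned(end);
}
```

With the original ALIGN(8), a start address such as 0x1A008 would fail this check, while a 1 KByte aligned address such as 0x1A000 passes. The example addresses are made up for illustration.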

Summary

Dealing with hard faults on ARM is not easy. This particular one was caused by erasing a flash memory block which was not aligned. It seems to me that somehow the internals of the processor got screwed up by this. The challenge was that it then crashed in a way not tied to the cause of the problem. These kinds of problems are not easy to find and solve. What usually works best, and worked in this case, is trying to reduce the problem: have a ‘working’ version of the code and a ‘failing’ one. Try to eliminate all variables (board, debugger, power supply, host machine) and then narrow down the problem as much as possible. In any case, I hope this post can help others in such a situation. And yes, some luck helps to accelerate that process ;-).

Happy Debugging 🙂

13 thoughts on “Debugging ARM Cortex-M0+ HardFaults”

  1. Erich, thanks for this post! I would love to see more like it. I’ve had similar struggles narrowing down culprits during the board bring-up phase while bringing many peripherals online. After that phase, it’s usually more obvious as I’m focusing on developing a single peripheral. But, still would love a better method to more quickly narrow down hard fault issues.

    My first thought was that you inadvertently erased your interrupt vector table, and therefore your reset vector, but if this nonvol section is at the “end” as you say, that shouldn’t be the case.
    Also, shouldn’t the FLASH be aligned to 2KB boundary? Or, is it 1KB for the KW40?
    Good luck on the KW40. I’ll be diving into that one myself in the very near future.

    • Hi Joe,
      ‘problem solving’ is something rarely taught. A lot is about experience. And yes, after finding such a problem things are sometimes obvious too.
      I had that thought on the vector table too (plus I had the vectors in RAM which can complicate things). About the FLASH boundaries: I have checked the reference manual for the KW40Z160 and they mention a sector size of 1 KByte.

  2. Erich, thanks for your helpful post. I think I asked you about this almost a year ago, although it was RAM access that was problematic. Almost every hard fault I encountered during a 6-month project using the KW40 was due to word alignment problems.
    In particular, I was receiving 16-bit words from BLE that were packed (1-byte aligned) in the OTA packets and were not aligned properly in RAM. Accessing them as 16-bit values caused the hard faults and the solution was to access them as byte arrays instead.

    • Hi Michael,
      yes, the M0+ hard faults on misaligned access. The M4 tolerates it and does two fetches. I have not entered the OTA area with the KW40 yet, so thanks for the heads up. I have seen similar hard faults from subtle stack overflows resulting in misaligned accesses too.
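
The byte-wise access described above can be sketched like this. It is a minimal example that assumes the 16-bit value is stored in little-endian byte order, and `ReadU16LE()` is a hypothetical helper name. Because only single bytes are read, it works for any address, aligned or not.

```c
#include <stdint.h>

/* Assemble a 16-bit little-endian value from two single-byte reads. */
static uint16_t ReadU16LE(const uint8_t *p) {
  return (uint16_t)((uint16_t)p[0] | ((uint16_t)p[1] << 8));
}
```

Casting the same pointer to `uint16_t *` and dereferencing it is what triggers the hard fault on the M0+ when the address is odd.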

  3. Thanks for this post Erich, really interesting. It’s a bit off-topic but could you give more information about the nature of the problem ? Why the block has to be 1k aligned ? I mean, what exactly goes wrong if it isn’t, that makes the MCU go hardfault ?

    • Hi Tim,
      I can only speculate. It is common that flash memory is block oriented, and that accesses have to be aligned accordingly. I would have to re-create the problem to see exactly what is happening. I would have expected the flash programming code to return an error if the base address is not aligned. And what exactly is causing the hard fault is a question to me too: I can only speculate that things get screwed up internally and that this then causes an issue when the startup code tries to push things on the stack. I tried different stack addresses (as I thought the stack memory was the problem), but the problem was still there. I’ll see if I can post some more findings in this article.

  4. I don’t know if you’re aware, but at least a couple of the more specific Cortex fault handlers are disabled by default, and result in a HardFault instead. Off the top of my head I think BusFault and UsageFault are like that. To get these handlers called, and then get access to their diagnostic registers, you need to enable them in some of the core registers. If I ever hit a strange HardFault in the future, that would be my first debugging step.

    • Thanks for that point. And I’m aware of these :-). But BusFault and UsageFault exist on ARMv7-M (Cortex-M3/M4), not on ARMv6-M (Cortex-M0(+)), which this article is about. So on a Cortex-M0+ every fault ends up as a HardFault.
