Every now and then I run into a really, really nasty problem. I spend hours (if not weeks) getting it resolved. Strong coffee and the problem keep me up through long nights. I think every embedded systems engineer knows what I’m talking about. Yeah, most of the time it is my own fault or an oversight. But once in a while I’m convinced that I have found a real bug. Then I report it to the vendor so it can be fixed. I hope my report will prevent other engineers from running into the same problem. Or that I learn something else as a by-product. Oh yes….
I ran into this kind of nasty problem a while back. The circumstances made things really bad for me and cost me a lot of time. But the consequences of that problem have broadened my approach to problems, and I have learned a lot besides the actual problem itself.
I ported an application from a Freescale ColdFire V1 to a Freescale ColdFire V2 (MCF52259) microcontroller. If you are a frequent reader of this blog, then you know that I’m using MCU10.x with Processor Expert for this. Processor Expert makes the porting easy and straightforward. The application was a DC motor controller with a wireless transceiver, some quadrature encoding and the usual stuff like an operating system, some additional timers and so on. Not a trivial application, but well: pretty decent. Running the application on the new processor seemed to work fine, until…… Booom! An interrupt fired which I did not expect. Read ‘Oh my! An Interrupt…‘ on how to catch interrupts with Processor Expert.
Interestingly, this was a USB interrupt! But I was not using USB in my application! I verified that the USB block was inactive and disabled. USB interrupts were not enabled. Retry. And this time it was an Ethernet interrupt! Again, I’m not using Ethernet at all in that application. And so on: the processor fired interrupts in a random fashion.
Geee, at that point I guessed that this was going to be difficult. All the usual suspects were clean: stack, hardware, power supply, etc. Interestingly, removing functionality from my application seemed to ‘fix’ the problem, or at least it took longer until it happened. First I was thinking that the debugger was causing the problem. At first it only happened while debugging. So is the debugger guilty? Maybe the debugger accidentally raises an interrupt? That would be really bad. But then I managed to reproduce the problem without debugging: it was just harder to trigger. What now?
I was able to reduce the problem to a small program, and to reproduce it with just two timer interrupts: two timers firing interrupts at a high frequency caused all kinds of spurious interrupts. Gotcha! And while debugging, the timing was slightly different, which somehow made the problem more likely to show up. So here I had found a bug in the silicon! At least this was my conclusion. Time to send that case to the support team.
Support acknowledged that they could reproduce the problem I saw. After a while I received an answer: they had checked with the silicon design team. They pointed me to the following part in the CPU reference manual:
16.3.6 Interrupt Control Registers (ICRnx)
Each ICRnx, where x = 1, 2,…, 63, specifies the interrupt level (1–7) and the priority within the level (0–7).
It is the responsibility of the software to program the ICRnx registers with unique and non-overlapping level and priority definitions. Failure to program the ICRnx registers in this manner can result in undefined behavior.
(Source: MCF52259RM Reference Manual, MCF52259RM.pdf, p. 259)
Ouch: RTFM! So the silicon designers did not spend the extra gates to implement proper arbitration of the interrupt signals when two interrupts have the same level and priority? :-( I’m still wondering whether this was an oversight in the silicon design, or a missing feature. I have used many controllers, including the smaller ColdFire V1, and this was new to me. What’s more, the ColdFire V4 has proper interrupt arbitration implemented. And I was pulling out my hair over that ColdFire V2 issue :-(.
And now the source of the problem was clear: if I add an interrupt with Processor Expert, it uses the default interrupt priority of 4.3.
If I then add another interrupt (e.g. another timer), it again gets the 4.3 interrupt priority by default. So I end up with overlapping interrupt priorities. That might run fine, until by chance the two interrupts happen at the same time: boom, there is the ‘undefined behaviour’! Using the debugger and stepping just adds some extra interrupt delays, which increases the chance of having two pending interrupts with the same priority.
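To make the overlap concrete, here is a minimal sketch in C of a check for it. It rests on a few assumptions that are not spelled out in the post: that an ICR holds the interrupt level in bits 5..3 and the priority in bits 2..0, that a level of 0 means the source is not in use, and that ‘4.3’ reads as level 4, priority 3. The names icr_encode and icr_find_overlap are made up for this example and are not part of Processor Expert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ICR layout: bits 5..3 = interrupt level, bits 2..0 = priority. */
#define ICR_LEVEL(icr)    (((icr) >> 3) & 0x7u)
#define ICR_PRIORITY(icr) ((icr) & 0x7u)

/* Pack a level (0..7) and a priority (0..7) into one ICR value. */
static uint8_t icr_encode(unsigned level, unsigned priority)
{
    assert(level <= 7u && priority <= 7u);
    return (uint8_t)((level << 3) | priority);
}

/*
 * Scan a table of ICR values and return the index of the first entry
 * that reuses the level/priority pair of an earlier entry, or -1 if
 * all pairs are unique. Entries with level 0 are skipped (assumed to
 * be unused sources).
 */
static int icr_find_overlap(const uint8_t *icr, size_t count)
{
    uint8_t seen[64] = {0};   /* one flag per 6-bit level/priority pair */

    for (size_t i = 0; i < count; i++) {
        uint8_t key = (uint8_t)(icr[i] & 0x3Fu);
        if (ICR_LEVEL(key) == 0u) {
            continue;         /* level 0: treated as not in use */
        }
        if (seen[key]) {
            return (int)i;    /* same level and priority seen before */
        }
        seen[key] = 1u;
    }
    return -1;
}
```

Fed with a table in which two timers both carry 4.3, icr_find_overlap returns the index of the second timer, which is exactly the situation that the manual calls undefined.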
So what now? Because Processor Expert does not catch this overlap, I have to check the interrupts and their priorities myself in Generated_Code\vectors.c.
By quickly browsing through the vector table I can hopefully spot the problem. In the end, assigning distinct, non-overlapping interrupt priorities made the problem go away.
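If the priorities are set by hand rather than in the Processor Expert dialog, one way to avoid the overlap is to hand out the pairs by construction instead of by default. The following is only a sketch under that assumption: the irq_slot_t type, the irq_assign_unique function and the idea of filling one level from priority 7 downwards are all hypothetical choices for illustration, not something Processor Expert generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping entry for one interrupt source. */
typedef struct {
    const char *name;     /* label for the source, for humans only */
    uint8_t     level;    /* interrupt level, 1..7 */
    uint8_t     priority; /* priority within the level, 0..7 */
} irq_slot_t;

/*
 * Give each source in 'slots' its own priority within 'level',
 * counting down from 7. A level only has 8 priorities, so at most
 * 8 sources are assigned; the number actually assigned is returned
 * so the caller can notice when a level is full.
 */
static size_t irq_assign_unique(irq_slot_t *slots, size_t count, uint8_t level)
{
    assert(level >= 1u && level <= 7u);

    size_t n = (count < 8u) ? count : 8u;
    for (size_t i = 0; i < n; i++) {
        slots[i].level = level;
        slots[i].priority = (uint8_t)(7u - i);
    }
    return n;
}
```

Because every source gets a different priority within the level, two of them can never share a level/priority pair. If the return value is smaller than the number of sources, the rest have to go to another level.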
- Even if I am using Processor Expert: RTFM!
- Whenever I add an interrupt source with Processor Expert and ColdFire V2, I change the priority to something other than the default.
- Always use the vector table in vectors.c to check if there are any overlapping interrupt priorities.
- If weird things happen with interrupts on a microcontroller: it could be that the interrupt arbitration is not working as you expect it to work. See above.
- Don’t make assumptions about how things work. Things might be different between different controllers even from the same vendor.
- Always report possible bugs to the silicon/tool/software vendor/producer. If it is not a bug, I might learn something new.
Maybe I need to add something: I usually report a bug back to the vendor, but sometimes I’m lazy. Maybe it is just some minor thing, or something I can live with. Surprisingly, these kinds of things tend to come back and bite me. Too many times I was thinking “oh, that’s just a minor thing”, or “well, that’s just a one-time thing”, and then there was a serious problem behind it. But hey: reporting things and providing a reproducible case takes time, and there is hope that somebody else might report it? There is hope. Maybe.
What is your experience?
There is excellent news on this problem: Processor Expert in MCU10.3 is able to catch it and reports an error.