Be aware: Floating Point Operations on ARM Cortex-M4F

My mantra is *not* to use any floating point data types in embedded applications, or at least to avoid them whenever possible: for most applications they are not necessary and can be replaced by fixed point operations. Not only do floating point operations have numerical problems, they can also lead to performance problems, as in the following (simplified) example:

#include <stdint.h>

#define NOF  64
static uint32_t samples[NOF];
static float Fsamples[NOF];
float fZeroCurrent = 8.0;

static void ProcessSamples(void) {
  int i;

  for (i=0; i < NOF; i++) {
    Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;
  }
}

ARM designed the Cortex-M4 architecture so that an FPU can optionally be added. For example, the NXP ARM Cortex-M4 on the FRDM-K64F board has an FPU.



The question is: how long will that function need to perform the operations?

Looking at the loop, it does

Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;

which is to load a 32bit value, then perform a floating point multiplication, followed by a floating point division and floating point subtraction, then store the result back in the result array.

The NXP MCUXpresso IDE has a cool feature showing the number of CPU cycles spent (see Measuring ARM Cortex-M CPU Cycles Spent with the MCUXpresso Eclipse Registers View). So running that function (without any special optimization settings in the compiler) takes:

Cycle Delta

0x4b9d or 19’357 CPU cycles for the whole loop. Measuring only one iteration of the loop takes 0x12f or 303 cycles. One might wonder why it takes such a long time, as we do have a FPU?

The answer is in the assembly code:

This actually shows that it does not use the FPU, but instead uses software floating point routines from the standard library. Why?

The answer is the way the operation is written in C:

Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;

We have here a uint32_t multiplied with a floating point number:


The thing is that a constant such as ‘3.3’ in C is of type *double*. As such, the operation will first convert the uint32_t to a double, and then perform the multiplication as a double operation.
Same for the division: it will be performed as a double operation:


Same for the subtraction with the float variable: because the left operation result is double, it has to be performed as double operation.

samples[i]*3.3/4096.0 - fZeroCurrent

Finally the result is converted from a double to a float to store it in the array:

Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;
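The promotion rules described above can be checked by the compiler itself. The following is a minimal sketch, assuming a C11 compiler; the macros IS_DOUBLE and IS_FLOAT and the function names are hypothetical helpers, not part of the original example. They use _Generic to report which type the compiler assigns to each expression (this assumes the code is built without an option that changes the type of constants):

```c
#include <stdint.h>

/* Hypothetical helper macros: evaluate to 1 if the expression has the
   given type after the usual arithmetic conversions, else 0. */
#define IS_DOUBLE(x) _Generic((x), double: 1, default: 0)
#define IS_FLOAT(x)  _Generic((x), float: 1, default: 0)

static const uint32_t sample = 2048u;   /* example raw ADC value */
static const float fZeroCurrent = 8.0f;

/* uint32_t * double constant: the result type is double */
int MulWithDoubleConstantIsDouble(void) {
  return IS_DOUBLE(sample * 3.3);
}

/* double - float: the float operand is promoted, the result stays double */
int FullExpressionIsDouble(void) {
  return IS_DOUBLE(sample * 3.3 / 4096.0 - fZeroCurrent);
}

/* with 'f' suffixed constants the whole expression is of type float */
int SuffixedExpressionIsFloat(void) {
  return IS_FLOAT(sample * 3.3f / 4096.0f - fZeroCurrent);
}
```

All three functions should return 1 with a conforming compiler, which matches the conversion chain walked through above.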

Now the library routines called in the assembly code above should be clear:

  • __aeabi_ui2d: convert unsigned int to double
  • __aeabi_dmul: double multiplication
  • __aeabi_ddiv: double division
  • __aeabi_f2d: float to double conversion
  • __aeabi_dsub: double subtraction
  • __aeabi_d2f: double to float conversion
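To make the mapping to these routines easier to follow, here is a sketch that writes every implicit conversion out as an explicit step. The function names and local variable names are hypothetical; the comments show which runtime routine from the list above would typically be called for each step on a target without a double precision FPU:

```c
#include <stdint.h>

/* Same arithmetic as the original expression, with each implicit
   conversion spelled out as a separate statement. */
float ConvertSampleExplicit(uint32_t sample, float zeroCurrent) {
  double d    = (double)sample;       /* __aeabi_ui2d */
  double prod = d * 3.3;              /* __aeabi_dmul */
  double quot = prod / 4096.0;        /* __aeabi_ddiv */
  double zero = (double)zeroCurrent;  /* __aeabi_f2d  */
  double diff = quot - zero;          /* __aeabi_dsub */
  return (float)diff;                 /* __aeabi_d2f  */
}

/* The original one-line form, for comparison. */
float ConvertSampleImplicit(uint32_t sample, float zeroCurrent) {
  return sample * 3.3 / 4096.0 - zeroCurrent;
}
```

Both functions perform the same sequence of double operations; the first one only makes visible what the compiler does silently in the second.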

But why is this done in software and not in hardware, as we have a FPU?

The answer is that the ARM Cortex-M4F has only a *single precision* (float) FPU, not a double precision (double) FPU. As such, it can do float operations in hardware, but double operations have to be done in software.

The solution in this case is to use float (and not double) constants. In C the ‘f’ suffix can be used to mark constants as float:

Fsamples[i] = samples[i]*3.3f/4096.0f - fZeroCurrent;

With this, the code changes to this:

Using Single Precision FPU Instructions

So now it is using the single precision instructions of the FPU :-). This takes only 0x30 (48) cycles for a single iteration, or 0xc5a (3162) for the whole loop: 6 times faster :-).

The example can be even further optimized with:

Fsamples[i] = samples[i]*(3.3f/4096.0f) - fZeroCurrent;
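As a sketch of that last variant in a complete function: the parenthesized constant expression can be folded by the compiler into a single float constant, so the loop body is left with one multiplication and one subtraction. The names ADC_SCALE and ScaleSamples are hypothetical, as is passing the arrays as parameters instead of using the globals from the example above:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: 3.3 (full scale) divided by 4096 (ADC steps),
   written as one float constant expression the compiler can fold. */
#define ADC_SCALE (3.3f / 4096.0f)

/* Convert n raw samples using float operations only: no double
   constants and no division inside the loop. */
void ScaleSamples(const uint32_t *in, float *out, size_t n, float zeroCurrent) {
  for (size_t i = 0; i < n; i++) {
    out[i] = (float)in[i] * ADC_SCALE - zeroCurrent;
  }
}
```

Note that multiplying by a pre-computed factor may round differently in the last bits than multiplying by 3.3f and then dividing by 4096.0f, so the results are not guaranteed to be bit-identical between the two forms.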

Other Considerations

Using float or double is not bad per se: it all depends on how they are used and whether they are really necessary. Fixed-point arithmetic is not without issues either, and the standard sin/cos functions use double, so you don’t want to re-invent the wheel.


One way is to use a float type, say for a temperature value:

float temperature; /* e.g. -37.512 */

Instead, it might be a better idea to use a ‘centi-temperature’ or ‘milli-temperature’ integer variable:

int32_t milliTemperature; /* -37512 corresponds to -37.512 */

That way, normal integer operations can be used.
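Applied to the ADC example from the beginning, the same idea could look like the sketch below. It assumes a 12-bit sample (0..4095) and works in thousandths of the original unit; the function name SampleToMilli and the zeroMilli parameter are hypothetical:

```c
#include <stdint.h>

/* Integer-only version of sample*3.3/4096 - zero, in milli-units.
   Assumes sample is in the range 0..4095, so sample*3300 is at most
   13513500 and fits easily into 32 bits. Adding 2048 (half of 4096)
   before the division rounds to the nearest milli-unit. */
int32_t SampleToMilli(uint32_t sample, int32_t zeroMilli) {
  int32_t scaled = (int32_t)((sample * 3300u + 2048u) / 4096u);
  return scaled - zeroMilli;
}
```

For example, SampleToMilli(2048, 8000) gives -6350, which corresponds to -6.350 in the original unit. Since 4096 is a power of two, a compiler can usually turn the division into a shift.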

GCC Single Precision Constants

The GNU gcc compiler offers an option to treat double constants such as 3.0 as single precision constants (3.0f):

-fsingle-precision-constant causes floating-point constants to be loaded in single precision even when this is not exact. This avoids promoting operations on single precision variables to double precision like in x + 1.0/3.0. Note that this also uses single precision constants in operations on double precision variables. This can improve performance due to less memory traffic.



The other consideration is: using the FPU potentially means stacking more registers. This is a possible performance problem for an RTOS like FreeRTOS. The ARM Cortex-M4 supports ‘lazy stacking’, which helps in some cases. Still, if the FPU is used, it means more stacked registers. If no FPU is used, then it is better to select the M4 port in FreeRTOS:

M4 and M4F in FreeRTOS


I recommend not using float and double data types unless necessary. If you have an FPU, check whether it is single precision only or whether the hardware supports both single and double precision. With a single precision FPU only, using the ‘f’ suffix for constants and casting things to (float) can make a big difference. But keep in mind that float and double have different precision, so this might not solve every problem.

Happy Floating 🙂

PS: if you need a double precision FPU, have a look at the ARM Cortex-M7 (e.g. First steps: ARM Cortex-M7 and FreeRTOS on NXP TWR-KV58F220M or First Steps with the NXP i.MX RT1064-EVK Board)


38 thoughts on “Be aware: Floating Point Operations on ARM Cortex-M4F”

  1. I reported this when I reviewed the M4 way back on the old NXP discussion board. I was a “Forum Expert” at the time (paid consultant to NXP). May need the Wayback Machine to find that post. There was also a surprise interview with me about that. It was a surprise because the interviewer told me he was doing it later and then recorded us just chatting (with me not prepared yet and not knowing it was being recorded).

    My current fun and games has been with the LPC844. It would be a nice part if the documentation and header files matched the actual chip. I also found what appear to be hardware bugs in both the SCT and USART. Such is the life of an embedded programmer on a tight schedule.


  2. I don’t use float, like you I prefer integer / fixed point solutions. But for those that need float, this article might be enhanced with some discussion of compiler directives … a quick look through gcc docs suggested:


    • There are flags for gcc to treat all fp constants as single precision, and single precision floating point operations take the same amount of time (single cycle) or are faster than fixed point operations (for instance when you are trying to divide 32bit fixed point values, you have to use a 64bit variable and do some black magic to get the result right and fast), you have to handle arithmetic saturation and not mentioning trigonometric functions and square roots… Floating point is just a tool (like a hammer) you have to know how to use it and when but not tell people to avoid hammers because they are evil…


      • the -fsingle-precision-constant flag indeed is very useful. And yes, in some cases using float and double is completely fine if knowing what it takes to execute things.
        I agree that a full fixed point library is not an option in all cases. But using double for a sensor value which might have up to 3 digits after the dot is an overkill, as an integer data type might be used instead.


  3. the solution example looks like this in both Firefox and Brave browsers.

    Fsamples[i] = samples[i]*3.3f/4096.0f – fZeroCurrent;


  4. I can’t agree with you recommending not to use floating point arithmetics at all. Sure, there are many cases where floats should be optimized out for real-time computing. But most code (regarding LOC-count) has non-critical time constraints, even for embedded sw. This means that unconditionally eliminating floats where possible is a typical case of premature optimization.

    Still a very useful example of float-limitations!


    • There are cases where indeed float and double have to be used, e.g. because the math libraries are using them. I recommend to avoid using them if possible if there are better alternatives. I see using float and double data types for things like sensor values where using an integer data type would be a better option imho.


  5. I like the suggestion by Ian C to use the -Wunsuffixed-float-constants switch.

    However, is there a switch for the compiler that will force floating-point operations to be done by default in single precision (vs double)?


  6. The other issue here is that if HW floats are enabled, even for a single usage, then ALL push/pop stack operations will take considerably longer across the entire runtime as the register stack to be saved is considerably larger.


    • Yes, thanks for that reminder, I missed that point. The M4 can do some ‘lazy stacking’ which helps in some cases. But still, it means more work for the CPU to do a context switch because of the extra registers to save.
      I have now extended the article with an extra section about this, thank you!


      • Courtesy of my upbringing back in the dark ages when clock speeds were stated in single digit MHz and forth was the answer to the question 😉
        Thank you for all of the excellent work.


        • My Forth knowledge and usage are very, very minimal. I never was a fan of that language, but I know it is still in use in many older applications.


    • Yes, in that case the compiler can do ‘constant folding’, and this is usually how things should be written. Due to the nature of floating point numbers, the result might not be the same because of rounding issues, but that’s yet another danger zone when using floating point numbers.

  7. Eric

    If floats are needed it is also easy to check the code in visual studio and quickly look at it in disassembled form.
    Also watch out for any warnings which give hints that unexpected (and potentially time consuming) conversions are taking place.

    Fsamples[i] = samples[i] * 3.3 / 4096.0 - fZeroCurrent;

    This compiler warning that is generated should already ring bells:
    warning C4244: ‘=’: conversion from ‘double’ to ‘float’, possible loss of data

    The disassembled code on a PC already looks over-complicated
    00482D4D mov eax,dword ptr [i]
    00482D50 mov ecx,dword ptr samples (05C85F0h)[eax*4]
    00482D57 mov dword ptr [ebp-100h],ecx
    00482D5D cvtsi2sd xmm0,dword ptr [ebp-100h]
    00482D65 mov edx,dword ptr [ebp-100h]
    00482D6B shr edx,1Fh
    00482D6E addsd xmm0,mmword ptr __xmm@41f00000000000000000000000000000 (053EE70h)[edx*8]
    00482D77 mulsd xmm0,mmword ptr [__real@400a666666666666 (05946E0h)]
    00482D7F divsd xmm0,mmword ptr [__real@40b0000000000000 (05946F0h)]
    00482D87 cvtss2sd xmm1,dword ptr [fZeroCurrent]
    00482D8C subsd xmm0,xmm1
    00482D90 cvtsd2ss xmm0,xmm0
    00482D94 mov eax,dword ptr [i]
    00482D97 movss dword ptr Fsamples (05C86F0h)[eax*4],xmm0

    Consistently casting (the point of your article) helps control operations:

    Fsamples[i] = (float)((float)samples[i] * (float)3.3 / (float)4096.0 - fZeroCurrent);

    and this is immediately reflected in the PC’s subsequent disassembled code.

    00482D4D mov eax,dword ptr [i]
    00482D50 mov ecx,dword ptr samples (05C85F0h)[eax*4]
    00482D57 mov dword ptr [ebp-100h],ecx
    00482D5D cvtsi2sd xmm0,dword ptr [ebp-100h]
    00482D65 mov edx,dword ptr [ebp-100h]

    Although the VS compiler already did this optimisation (even at lowest setting), consider also

    static float fConversion = (float)((double)3.3 / (double)4096.0); // get the compiler to do the calculation work [at best precision] rather than doing it at run time
    Fsamples[i] = (float)((float)((float)samples[i] * fConversion) - fZeroCurrent);

    since it may remove the (potentially high overhead) vdiv in the loop, and/or ensure optimisation is high to avoid that such calculations that need to be done just once are not “repeated” in the loop.

    If highest speed is needed and you have a bit of free memory (in this particular example case) calculate the 4096 possible levels on initialisation to a look up table and then

    Fsamples[i] = flook_up[samples[i]];

    will fly at run time (even on a baby processor)

    And yes I agree to avoiding float/double in embedded systems whenever possible – but when absolutely needed then THINK carefully about how the calculation will be performed (based on the known C rules) and keep it well under your control to avoid inefficiencies.




    • Hi Mark,
      good hint about using the Visual Studio for this. And using a table instead of calculating data is always a good option for me, especially if the table can be stored in read-only memory and can be rather small.


  8. When I write code that’s _supposed_ to be portable (*cough*), I’ll create a typedef, as in:

    typedef float mylib_float_t;

    … and then carefully use mylib_float_t throughout. That way, if I ever switch to an architecture that uses native doubles, I only need to change one line of code.


    • The HCS08 (and HCS12) compiler had an option to specify if float/doubles are allowed in the code, and options to set float and/or double to either 32bit or 64bit.
      I like your approach to use a typedef for floating point data types. Similar as using int32_t instead of plain ‘int’.


  9. Well, at least it’s not like some Microchip compilers (for PIC24 etc.), where doubles are by default treated as float (unless you specify -fno-short-double). I’ve seen programmers horribly surprised by that behavior (why are my results so bad ???)…


  10. ARM Cortex offerings made double-precision FPU an option (maybe not M0, don’t remember), which IIRC none of the early adopters took (too much power & space). Some more recent chips have double precision, for example ST32F M7 and H7 series. So for those of us who sometimes really need double, a few options exist. IIRC PIC32 series had double FPU from the beginning (MIPS cores).


    • To my knowledge: M3 did not have an FPU. M0: no FPU at all for cost/die size reasons. M4 had the option to have a single precision FPU added by the vendor, so it is an option.
      M7 has single and double precision FPU by default (I believe it is not possible to remove it by the silicon vendor).


  11. I have not had luck with -fsingle-precision-constant, it does not seem to do what I would expect. I am using a Kinetis K64. When I compile with that option and then set a breakpoint on a line of code that looks like this:

    (float)(val)/10.0 + 233.0

    If I hover over the either one of the constants, they show as “double” in the debugger.

    On the other hand, If I use the “f” suffix, it works as expected and those constants show as “float”.

    I know the compiler flag is doing something because I run into some issues in another unrelated part of my code, but it doesn’t seem to do what it’s supposed to do, or I’m not using the right method to confirm.


    • If you hover over the constant in the source code, then the debugger will take that *text* as an expression.
      And 10.0 or 233.0 is taken as a double value (for the debugger expression).
      This has nothing to do what the compiler sees as type.
      It is the same as you would type ‘10.0’ in the debugger expression window, and the debugger expression parser will take this as double.
      And it will take 10.0f as float.
      I hope I’m able to express myself, but the thing is that the debugger expression parser is not the same as what the compiler does.
      I hope this makes sense?


      • Yes that makes sense. I suppose I need to look at the assembly to confirm and not rely on the debugger parser. Thanks!


  12. Why didn’t the compiler optimize out the 3.3/4096.0 and simply use the result? That removes division entirely from the equation and compilers have been optimizing like this for a long time.


  13. There is another useful gcc warning option which can help in this situation: -Wdouble-promotion. It warns when an implicit float to double promotion happens. It is not included with -Wall, so this
    warning option needs to be enabled explicitly. One more interesting rule from the C99 standard: if a function is called without a prototype, float arguments are implicitly promoted to double.

    • Thanks for that note! Yes, usually I do turn on this option too as it is able to flag such promotions.
      As for the float-to-double promotion in C99: I believe that rule already existed in C89: if there is no prototype (a programmer error anyway), the compiler assumes that the function returns an ‘int’ and applies the default argument promotions to the actual parameters. So if you pass 3.5 as parameter, that is a double, and 3.5f would be promoted from float to double as well.

