Choosing GNU Compiler Optimizations

Tool chains like the GNU compiler collection (gcc) have a plethora of options. Probably the most important ones are those that tell the compiler how to optimize the code. Running out of code space, or is the application not performing well? Then have a look at the compiler optimization levels!

However, which one to select can be a difficult choice. And the result might very well depend on the application and coding style too. So I’ll give you some hints and guidance with an autonomous robot application we use at the Lucerne University for research and education.

INTRO Sumo Robot

Optimization Levels

The GNU Compiler Collection (or GCC) provides a set of different options to optimize the executable. In general they are numbered in levels, from 0 (none) to 3 (highest level):

  • -O0: At this level the compiler compiles the source code in a straightforward way: each statement basically ends up as a sequence of instructions with loads and stores, with no instruction re-arrangement. This is the default level too.
  • -O1: This level includes some standard and common optimizations which should not have any trade-offs between code size and execution time. So you can expect that code size gets reduced and execution time improved.
  • -O2: This one includes the -O1 optimizations and adds even more, including instruction scheduling and re-arrangement. It only uses optimizations with no size or speed tradeoffs, so you can expect a reduction in code size with an improvement of speed. Code might be harder to debug, build time might be longer, and the compiler itself might use more memory. Many ‘released’ binaries including the GNU packages are delivered with this option.
  • -O3: With this option, the compiler uses more aggressive optimizations, including function inlining. This option should further improve the execution speed, but with the penalty of larger code size. Depending on the application code, execution might even be slower than with -O2.
  • -Os: This option selects optimizations which target a low code size, at the cost of execution time. If caches are involved, then this might speed up execution too if code can run longer inside the cache.
  • -Og: Because many optimizations might impact the information available for debugging (source and line information, variables, …), this option uses optimizations for good code size and execution performance, but without impacting debugging.
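
If it is ever unclear which kind of level a file was actually built with, the macros GCC predefines can help. The following is only a minimal sketch: __OPTIMIZE__ and __OPTIMIZE_SIZE__ are GCC's own predefined macros, while the function name optimization_summary() is made up for this example. The macros do not distinguish between -O1, -O2, -O3 and -Og.

```c
/* Returns a short description of how this translation unit was optimized.
   __OPTIMIZE__ and __OPTIMIZE_SIZE__ are predefined by GCC;
   the function name is hypothetical and only used for this sketch. */
const char *optimization_summary(void) {
#if defined(__OPTIMIZE_SIZE__)
  return "optimized for size (-Os)";
#elif defined(__OPTIMIZE__)
  return "optimized (-O1, -O2, -O3 or -Og)";
#else
  return "not optimized (-O0)";
#endif
}
```

Printing that string once at startup (for example on a debug console) is an easy way to check that a build configuration really uses the level you think it does.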

In the next sections, I’ll show you the impact with a robot application implemented mostly in C.

Sumo Robot Application

To show the impact of the different optimization levels, I’m using here an Eclipse (NXP MCUXpresso IDE 11.5.1) based project. The project is for an autonomous Sumo robot.

Sumo Robot

The application is fairly complex with using USB stack, WiFi and lots of different sensors, and it runs FreeRTOS.

The compiler version I’m using here is as below:

gcc version 10.3.1 20210621 (release) (GNU Arm Embedded Toolchain 10.3-2021.07)

No Optimizations (-O0)

In a first step, I’m compiling it with the -O0 optimization level:

This tells the compiler ‘not to use any optimization’. For that Sumo robot project, it gives the following code and data size, with a build time of 15s:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:      220968 B       512 KB     42.15%
      SRAM_UPPER:       44848 B        64 KB     68.43%
      SRAM_LOWER:         64 KB        64 KB    100.00%
        FLEX_RAM:          0 GB         4 KB      0.00%

   text	   data	    bss	    dec	    hex	filename
 219628	   1340	 108848	 329816	  50858	ADIS_MK22FX512xxx12_SumoV2.axf

The compiler produces ‘straight’ code which is easy to follow and debug. If you are not familiar with that output and information: have a read at text, data and bss: Code and Data Size Explained.

Let’s check a simple example:

int multiply(int a, int b) {
  return a*b;
}

The code it produces with -O0 is this:


00000000 <multiply>:
   0:	b480      	push	{r7}
   2:	b083      	sub	sp, #12
   4:	af00      	add	r7, sp, #0
   6:	6078      	str	r0, [r7, #4]
   8:	6039      	str	r1, [r7, #0]
   a:	687b      	ldr	r3, [r7, #4]
   c:	683a      	ldr	r2, [r7, #0]
   e:	fb02 f303 	mul.w	r3, r2, r3
  12:	4618      	mov	r0, r3
  14:	370c      	adds	r7, #12
  16:	46bd      	mov	sp, r7
  18:	f85d 7b04 	ldr.w	r7, [sp], #4
  1c:	4770      	bx	lr

For a ‘simple’ multiplication, it does a lot of things:

The assembly code shows the ‘entry’ code: setting up the stack frame (allocating space for temporary and local variables, saving parameters on the stack), followed by the multiplication code (loading the variables into registers, doing the multiplication and storing the result in a register), followed by the ‘exit’ code which unwinds the stack space and frame and returns to the caller.

As you might notice, this is a lot of code. But it is easy to follow and, because everything is stored in memory, easy to debug too.

Debugging with -O0

The other advantage of using -O0 is that compilation time should be shorter compared to optimized code generation. But I would say this is usually not a concern, especially for embedded application development.

Performance

With compiler optimizations, there is usually a trade-off between code size and speed. In general, smaller code can result in better performance (execution speed), because less code needs to be executed. But that’s not always the case: for example loop unrolling will increase the code size *and* increase the performance, as there is less loop overhead. To measure the performance with the different compiler optimizations, I’m using FreeRTOS with the SEGGER SystemView. The more time or percentage the application can be in IDLE mode, the better the performance.
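
To make the loop unrolling trade-off a bit more concrete, here is a hand-written sketch of the idea. It is not taken from the Sumo application, and the function names are made up; the compiler may or may not produce something similar on its own, depending on the optimization level.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one compare-and-branch per element. */
uint32_t sum_rolled(const uint8_t *buf, size_t len) {
  uint32_t sum = 0;
  for (size_t i = 0; i < len; i++) {
    sum += buf[i];
  }
  return sum;
}

/* Manually unrolled by four: more instructions in flash,
   but only one compare-and-branch per four elements. */
uint32_t sum_unrolled4(const uint8_t *buf, size_t len) {
  uint32_t sum = 0;
  size_t i = 0;
  for (; i + 4 <= len; i += 4) {
    sum += buf[i];
    sum += buf[i + 1];
    sum += buf[i + 2];
    sum += buf[i + 3];
  }
  for (; i < len; i++) { /* remaining 0..3 elements */
    sum += buf[i];
  }
  return sum;
}
```

Both functions return the same result; the second one simply trades code size for fewer loop iterations.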

For my tests, I’m running the application for more than 1 minute and then use the Idle percentage time as performance indicator: the higher the idle time, the better the performance.

Performance view in SEGGER SystemView

Similar information can be found in the Eclipse FreeRTOS task view:

Task list in Eclipse

Because the idle time is just one indicator, another view is to measure the execution of a code sequence. This again can be easily done with measurement markers in the SEGGER SystemView. Below it measures the time needed to do an update of the display:

Linker Dead Code and Data Elimination

So what happens if I do not call that multiply() function (or: do not use it)? The linker will simply remove it, because it is not used. For this, the code and data need to be placed into separate sections with the options below:

-fdata-sections -ffunction-sections -Wl,--gc-sections

This is called ‘dead code elimination’ or ‘dead stripping’. It applies as well to any other ‘objects’ including variables and constants. This makes a lot of sense, but note that the GNU linker only does it if --gc-sections is passed as in the options above: it is not its default behavior.

The removed objects are listed in the linker .map file under ‘Discarded input sections’:

If (for whatever reason) you want to keep unreferenced objects, you can remove the -fdata-sections and/or -ffunction-sections options, or use KEEP in the linker script file. See as well Placing Code in Sections with managed GNU Linker Scripts.
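
On the source side, one possible way to prepare a single function for KEEP is sketched below. This is an assumption for illustration, not how the Sumo project does it: the section name .keep_me, the macro KEEP_ME and the function multiply_kept() are made up. As far as I know, the ‘used’ attribute on its own only makes sure the compiler emits the function; with --gc-sections the linker script still needs a matching KEEP(*(.keep_me)) entry.

```c
/* ".keep_me" is a hypothetical section name. On ELF targets (which includes
   the ARM Cortex-M toolchain) the function is placed into that section, so a
   KEEP(*(.keep_me)) entry in the linker script can protect it from
   --gc-sections. On other object formats only 'used' is applied. */
#if defined(__ELF__)
  #define KEEP_ME __attribute__((used, section(".keep_me")))
#else
  #define KEEP_ME __attribute__((used))
#endif

/* Not called from anywhere in the application itself. */
KEEP_ME int multiply_kept(int a, int b) {
  return a * b;
}
```

After the build, the .map file is the place to verify the result: the function should no longer show up under ‘Discarded input sections’.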

-O1 Optimization

The next level is to use the -O1 compiler optimization.

-O1 compiler optimization

This optimization already does a lot of different things, see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for the list of optimizations.

-fauto-inc-dec 
-fbranch-count-reg 
-fcombine-stack-adjustments 
-fcompare-elim 
-fcprop-registers 
-fdce 
-fdefer-pop 
-fdelayed-branch 
-fdse 
-fforward-propagate 
-fguess-branch-probability 
-fif-conversion 
-fif-conversion2 
-finline-functions-called-once 
-fipa-modref 
-fipa-profile 
-fipa-pure-const 
-fipa-reference 
-fipa-reference-addressable 
-fmerge-constants 
-fmove-loop-invariants 
-fmove-loop-stores
-fomit-frame-pointer 
-freorder-blocks 
-fshrink-wrap 
-fshrink-wrap-separate 
-fsplit-wide-types 
-fssa-backprop 
-fssa-phiopt 
-ftree-bit-ccp 
-ftree-ccp 
-ftree-ch 
-ftree-coalesce-vars 
-ftree-copy-prop 
-ftree-dce 
-ftree-dominator-opts 
-ftree-dse 
-ftree-forwprop 
-ftree-fre 
-ftree-phiprop 
-ftree-pta 
-ftree-scev-cprop 
-ftree-sink 
-ftree-slsr 
-ftree-sra 
-ftree-ter 
-funit-at-a-time

It is possible to disable some optimizations too, see the -fno-… options listed on https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
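
Besides the command line options, GCC also allows changing the optimization for a single function with the ‘optimize’ function attribute. The sketch below is only an illustration: the function scale_sensor_value() and the macro NO_OPT are made up. As far as I remember, the GCC manual describes this attribute as a debugging aid rather than something for production code, so treat it as a temporary tool.

```c
/* The 'optimize' attribute is GCC specific. Other compilers may not know it,
   so it is hidden behind a (hypothetical) macro that expands to nothing. */
#if defined(__GNUC__) && !defined(__clang__)
  #define NO_OPT __attribute__((optimize("O0")))
#else
  #define NO_OPT
#endif

/* Hypothetical function under investigation: GCC compiles it as with -O0,
   even if the rest of the file is built with -O1 or higher. */
NO_OPT int scale_sensor_value(int raw, int gain) {
  int scaled = raw * gain; /* intermediate value stays visible in the debugger */
  return scaled / 16;
}
```

This can help to step through one suspicious function while the rest of the application keeps its normal optimization level.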

With this -O1 option, the code size is now:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:      145128 B       512 KB     27.68%
      SRAM_UPPER:       44332 B        64 KB     67.65%
      SRAM_LOWER:         64 KB        64 KB    100.00%
        FLEX_RAM:          0 GB         4 KB      0.00%

   text	   data	    bss	    dec	    hex	filename
 144468	    660	 108844	 253972	  3e014	ADIS_MK22FX512xxx12_SumoV2.axf

Flash usage dropped from 42.15% down to 27.68%, with a build time of 17s.

The impact can be very well seen with our ‘multiply()’ example, which is now reduced to this:

00000000 <multiply>:
   0:	fb01 f000 	mul.w	r0, r1, r0
   4:	4770      	bx	lr

Wow! All the entry/exit/load/store code from above is gone :-).

But debugging might not be as easy. The compiler tries to provide as much information as possible, but because the parameters are in registers, it might be harder to follow the program flow. For example there might be additional symbols shown as local variables:

The compiler already does some inlining. So if the call to multiply() is in the same module, it will inline that code:

Inlined Code

Which means in that case setting a breakpoint in multiply() will not stop the debugger, as the code is directly executed on the caller side.
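
If a breakpoint inside such a function is needed, one option is to tell GCC not to inline it. The following is a minimal sketch using the GCC ‘noinline’ function attribute; the caller area() is made up for this example.

```c
/* 'noinline' is a GCC function attribute: the compiler is asked to keep
   multiply() as a separate function instead of copying its body into callers. */
__attribute__((noinline))
int multiply(int a, int b) {
  return a * b;
}

/* Hypothetical caller in the same module. Without the attribute, -O1 and
   higher may place the multiplication directly here. */
int area(int width, int height) {
  return multiply(width, height);
}
```

With the attribute in place, a breakpoint set in multiply() should be hit again, at the price of the call overhead that inlining had removed.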

In principle, build time can be reduced too, because with the optimizations less data probably needs to be processed by the compiler internally. For this project, however, it went up slightly from 15s to 17s.

Optimize more with -O2

The optimization -O2 turns on all -O1 optimizations plus others which increase the compilation time as well as the performance of the generated code. -O2 should not affect the size-speed tradeoff much: it should increase the performance of the code without adding much (if any) code size penalty.

If I compile the Sumo robot application, I get this:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:      148160 B       512 KB     28.26%
      SRAM_UPPER:       44332 B        64 KB     67.65%
      SRAM_LOWER:         64 KB        64 KB    100.00%
        FLEX_RAM:          0 GB         4 KB      0.00%

   text	   data	    bss	    dec	    hex	filename
 147500	    660	 108844	 257004	  3ebec	ADIS_MK22FX512xxx12_SumoV2.axf

So code size here even increased from 27.68% to 28.26%. The reason is that the compiler is doing more inlining. While this improves the code speed, it adds some extra code size. Build time is still at 17s.

Optimize Most (-O3)

The -O3 optimization level includes -O2 and adds even more optimization, but at the expense of code size. So this is mainly an optimization for speed.

For the Sumo application it needs 19s for the build and gives:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:      166108 B       512 KB     31.68%
      SRAM_UPPER:       44336 B        64 KB     67.65%
      SRAM_LOWER:         64 KB        64 KB    100.00%
        FLEX_RAM:          0 GB         4 KB      0.00%
Finished building target: ADIS_MK22FX512xxx12_SumoV2.axf
 
   text	   data	    bss	    dec	    hex	filename
 165448	    660	 108848	 274956	  4320c	ADIS_MK22FX512xxx12_SumoV2.axf

As expected, code size increased again from 28.26% to 31.68%.

Optimize for Size (-Os)

If code size is the biggest concern, then -Os should be used. It needs 17s for building and shows the following size information:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:      131700 B       512 KB     25.12%
      SRAM_UPPER:       44328 B        64 KB     67.64%
      SRAM_LOWER:         64 KB        64 KB    100.00%
        FLEX_RAM:          0 GB         4 KB      0.00%
Finished building target: ADIS_MK22FX512xxx12_SumoV2.axf
 
   text	   data	    bss	    dec	    hex	filename
 131040	    660	 108840	 240540	  3ab9c	ADIS_MK22FX512xxx12_SumoV2.axf

Compared to -O1, this is a reduction from 27.68% to 25.12%.

Optimize for Debug (-Og)

There is even an option in gcc to ‘optimize for debug’: this one should only use optimizations which do not affect the debug experience.

Below are the results, with a build time of 16s:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:      147244 B       512 KB     28.08%
      SRAM_UPPER:       44336 B        64 KB     67.65%
      SRAM_LOWER:         64 KB        64 KB    100.00%
        FLEX_RAM:          0 GB         4 KB      0.00%

   text	   data	    bss	    dec	    hex	filename
 146584	    660	 108848	 256092	  3e85c	ADIS_MK22FX512xxx12_SumoV2.axf

Risks

Are there risks or problems expected with optimizations? Like anything, the compiler can have bugs too. So in any case I recommend using the same optimization level during development as the one you intend to release the application with. It does not make sense IMHO to use -O0 during development and then release it with -O3, even with some testing: along the line of “eating your own dog food”. In my experience, the GNU compiler is very mature now, but there are higher risks in higher optimization levels.

If an application does not work correctly with increased optimization levels, then this is in many cases a sign of bad programming: for example uninitialized variables or relying on hard-coded timing. Optimizations then just expose these bugs in the application code.
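
A typical example of such a hidden bug is a flag shared between an interrupt and the main loop which is not declared volatile: at -O0 the flag is read from memory on every loop iteration, while an optimizing compiler is allowed to read it only once. Below is a minimal sketch of the correct version. All names are made up, and the ‘interrupt’ is simulated by a normal function call so that the code can also run on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Written by the interrupt handler, read by the polling loop.
   'volatile' forces the compiler to re-read it from memory on every access. */
static volatile bool data_ready = false;

/* Hypothetical interrupt handler. */
void sensor_isr(void) {
  data_ready = true;
}

/* Polls until the flag is set and returns the number of polls.
   simulate_after (must be >= 1) is a host-only stand-in for the hardware:
   after that many polls the handler is called directly. */
uint32_t wait_for_data(uint32_t simulate_after) {
  uint32_t polls = 0;
  while (!data_ready) {
    polls++;
    if (polls == simulate_after) {
      sensor_isr(); /* on real hardware this would be triggered by the interrupt */
    }
  }
  data_ready = false; /* consume the event */
  return polls;
}
```

On real hardware, without the volatile and without the simulated call, the loop could legally be turned into an endless loop as soon as optimizations are enabled.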

Another problem could be that tools or debuggers expect certain symbols to be present in the application, for example for FreeRTOS debugging: if the compiler (or linker) removes the symbols by in-lining, this can cause problems. But this can be handled by adding gcc __attribute__ markers as used in the McuLib library which includes a FreeRTOS usable even in the highest optimization levels.

Summary

Using compiler optimizations can have a big impact on code size and speed. Of course this depends on the application and coding style used. Below is the summary with the Sumo application used in this article.

Option  Idle Time  Update    Flash Used  Build Time
-O0     95.3%      259.9 ms  42.15%      15s
-O1     97.1%      146.7 ms  27.68%      17s
-O2     97.3%      134.4 ms  28.26%      17s
-O3     97.4%      132.2 ms  31.68%      19s
-Os     97.1%      139.4 ms  25.12%      17s
-Og     96.9%      145.6 ms  28.08%      16s

Impact of different optimization levels

  • The Option lists the compiler optimization level.
  • Idle Time is the percentage the FreeRTOS application is idle (the application is just refreshing the OLED display): the higher the number, the better.
  • The Update column tells how much time the display update needs: the lower the better.
  • Flash Used tells about the code size and how much of the total memory is used: the lower the better.
  • Build Time is the time needed to make the build: the lower the better.

Personally, I end up with -O1 for most of my projects: it greatly improves code size and code speed, and while debugging experience is affected, it is still manageable for me.

I hope you find that information useful.

Happy optimizing 🙂


15 thoughts on “Choosing GNU Compiler Optimizations”

  1. “use the same optimization level during development as the one you intend the application to be released” – yes absolutely! It horrifies me that people might do other than this, and pretend to themselves that they have tested the code!


  2. My own early optimization testing while writing BOOT code gave 8500 bytes and 900us execution at O0, 5700 bytes and 525us at O1 … My notes about Osize said “resulting code pretty hard to follow [and debug]”; an early “skeleton” project went from 27K code at O0 to 16.5K at O1 and 14K at Osize … because of the debug challenges and the increased concerns whether the code is doing what I coded, I do not use Osize, I stay with O1 throughout development and production.


  3. One other comment about optimization that caught me out once; I had “array bounds checking” selected in the compiler options but discovered sadly that it doesn’t work with O0 and O1 optimization levels, only with O2 and higher! That has sadly forced me to use O2 for production, but it definitely makes debug a bit harder!


  4. Some time ago I had problems when I tried to use the optimization, there were programs that stopped working because some variables lost their value. In the end I declared all those variables as VOLATILE and then the optimization worked for me.

    I never knew the reason, I commented on it in some forum and they told me that it was bad programming, but without specifically specifying where the problem was and how to solve it.


  5. I am using the following trick
    using preprocessor directives for functions in a program
    For example
    void __attribute__((optimize("Os"))) FcnSpeed(){……}
    For other functions, optimization by default.


  6. Erich, I know you are aware of this. We should say a bit more of -Os ‘small’.

    That ‘small’ can come at a price. -Os will become very aggressive at
    eliminating any code that does not have any side-effects in the C
    Abstraction Machine. Since the code has no side-effects GCC thinks
    the code is unneeded and removes it.

    Typical things that GCC will kill off are busy loops and Interrupt Functions.
    For example this empty ‘delay loop’ will be removed, because the
    compiler has no sense of temporal time and this code has no
    side-effects:

    for( uint16_t loop_u16 = 0U; loop_u16 < 1000U; ++loop_u16 )
    {
    }

    Declaring the loop variable as 'volatile' will prevent the loop's removal.

    Interrupt functions should be marked with the 'USED' attribute.
    Especially if Link Time Optimization is enabled.

    Also how optimization works can depend to some degree on the back end.
    In my experience the AVR port of GCC -Os is far more aggressive than
    what I've seen from the ARM back end for -Os.

    Many projects and Makefiles come with a default setting of -Os. So
    let's head off 'Optimization broke my code with -Os'. The compiler
    probably did what you asked it to do, it made the code smaller.
    Probably in ways a new Embedded System programmer was not expecting…


    • Hi Bob,
      Many thanks for your input and thoughts! Yes, such delay loops are exactly what can be considered as ‘careless programming’, and the compiler is correct to eliminate them. Interrupts should not be removed if they belong to a special ‘isr’ or similar section which is ‘kept’ in the linker file or marked as ‘used’.

