Toolchains like the GNU Compiler Collection (GCC) have a plethora of options. Probably the most important ones are those which tell the compiler how to optimize the code. Running out of code space, or is the application not performing well? Then have a look at the compiler optimization levels!
However, which level to select can be a difficult choice, and the result may well depend on the application and coding style too. So I'll give you some hints and guidance, using an autonomous robot application we use at the Lucerne University for research and education.

Optimization Levels
The GNU Compiler Collection (GCC) provides a set of options to optimize the executable. In general, they are numbered as levels, from 0 (none) to 3 (highest):
- -O0: At this level the compiler compiles the source code in a straightforward way: each statement basically ends up as a sequence of instructions with loads and stores, with no instruction re-arrangement. This is the default level too.
- -O1: This level includes standard and common optimizations which should not involve a trade-off between code size and execution time. So you can expect code size to shrink and execution time to improve.
- -O2: This one includes the -O1 optimizations and adds even more, including instruction scheduling and re-arrangement. It only uses optimizations without a size-speed trade-off, so you can expect a reduction in code size along with an improvement in speed. Code might be harder to debug, build time might be longer, and the compiler itself might use more memory. Many 'released' binaries, including the GNU packages, are built with this option.
- -O3: With this option, the compiler uses more aggressive optimizations, including function inlining. It should further improve execution speed, but with the penalty of larger code size. Depending on the application code, execution speed might even end up slower than with -O2.
- -Os: This option selects optimizations which target a small code size, at the cost of execution time. If caches are involved, it might speed up execution too, because more of the code fits into the cache.
- -Og: Because many optimizations impact the information available for debugging (source and line information, variables, ...), this option uses only optimizations which give good code size and execution performance without impacting debugging.
In the next sections, I’ll show you the impact with a robot application implemented mostly in C.
Sumo Robot Application
To show the impact of the different optimization levels, I’m using here an Eclipse (NXP MCUXpresso IDE 11.5.1) based project. The project is for an autonomous Sumo robot.

The application is fairly complex: it uses a USB stack, WiFi and lots of different sensors, and it runs FreeRTOS.

The compiler version I’m using here is as below:
gcc version 10.3.1 20210621 (release) (GNU Arm Embedded Toolchain 10.3-2021.07)
No Optimizations (-O0)
In a first step, I’m compiling it with the -O0 optimization level:

This tells the compiler ‘not to use any optimization’. For that Sumo robot project, it gives the following code and data size, with a build time of 15s:
Memory region Used Size Region Size %age Used
PROGRAM_FLASH: 220968 B 512 KB 42.15%
SRAM_UPPER: 44848 B 64 KB 68.43%
SRAM_LOWER: 64 KB 64 KB 100.00%
FLEX_RAM: 0 GB 4 KB 0.00%
text data bss dec hex filename
219628 1340 108848 329816 50858 ADIS_MK22FX512xxx12_SumoV2.axf
The compiler produces ‘straight’ code which is easy to follow and debug. If you are not familiar with that output and information: have a read at text, data and bss: Code and Data Size Explained.
Let’s check a simple example:
int multiply(int a, int b) {
  return a*b;
}
The code it produces with -O0 is this:
00000000 <multiply>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 6039 str r1, [r7, #0]
a: 687b ldr r3, [r7, #4]
c: 683a ldr r2, [r7, #0]
e: fb02 f303 mul.w r3, r2, r3
12: 4618 mov r0, r3
14: 370c adds r7, #12
16: 46bd mov sp, r7
18: f85d 7b04 ldr.w r7, [sp], #4
1c: 4770 bx lr
For a ‘simple’ multiplication, it does a lot of things:

The assembly code shows the 'entry' code, which sets up the stack frame (allocating space for temporary and local variables, saving parameters on the stack). This is followed by the multiplication itself (loading the variables into registers, multiplying, and storing the result in a register), and finally the 'exit' code, which unwinds the stack frame and returns to the caller.
As you might notice, this is a lot of code. But it is easy to follow, and because everything is stored in memory, easy to debug too.

The other advantage of using -O0 is that compilation time should be shorter compared to optimized code generation. But I would say this is usually not a concern, especially for embedded application development.
Performance
With compiler optimizations, there is usually a trade-off between code size and speed. In general, smaller code can result in better performance (execution speed), because less code needs to be executed. But that's not always the case: loop unrolling, for example, increases the code size *and* the performance, because there is less loop overhead. To measure the performance with the different compiler optimizations, I'm using FreeRTOS with SEGGER SystemView: the more time (or the higher the percentage) the application can spend in IDLE mode, the better the performance.
For my tests, I’m running the application for more than 1 minute and then use the Idle percentage time as performance indicator: the higher the idle time, the better the performance.

Similar information can be found in the Eclipse FreeRTOS task view:

Because the idle time is just one indicator, another approach is to measure the execution of a code sequence. This again can easily be done with measurement markers in SEGGER SystemView. Below it measures the time needed to do an update of the display:

Linker Dead Code and Data Elimination
So what happens if I do not call that multiply() function (or: do not use it)? The linker will simply remove it, because it is not used. For this, the code and data need to be placed into separate sections with the options below:
-fdata-sections -ffunction-sections -Wl,--gc-sections
This is called 'dead code elimination' or 'dead stripping'. It applies to any other 'objects' too, including variables and constants. This makes a lot of sense; note that the GNU linker only performs it when --gc-sections is passed, which many embedded IDE projects enable by default.
The removed objects are listed in the linker .map file, under 'Discarded input sections':

If (for whatever reason) you want to keep unreferenced objects, you can remove the -fdata-sections and/or -ffunction-sections options, or use KEEP in the linker script file. See as well Placing Code in Sections with managed GNU Linker Scripts.
-O1 Optimization
The next level is to use the -O1 compiler optimization.

This optimization already does a lot of different things, see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for the list of optimizations.
-fauto-inc-dec
-fbranch-count-reg
-fcombine-stack-adjustments
-fcompare-elim
-fcprop-registers
-fdce
-fdefer-pop
-fdelayed-branch
-fdse
-fforward-propagate
-fguess-branch-probability
-fif-conversion
-fif-conversion2
-finline-functions-called-once
-fipa-modref
-fipa-profile
-fipa-pure-const
-fipa-reference
-fipa-reference-addressable
-fmerge-constants
-fmove-loop-invariants
-fmove-loop-stores
-fomit-frame-pointer
-freorder-blocks
-fshrink-wrap
-fshrink-wrap-separate
-fsplit-wide-types
-fssa-backprop
-fssa-phiopt
-ftree-bit-ccp
-ftree-ccp
-ftree-ch
-ftree-coalesce-vars
-ftree-copy-prop
-ftree-dce
-ftree-dominator-opts
-ftree-dse
-ftree-forwprop
-ftree-fre
-ftree-phiprop
-ftree-pta
-ftree-scev-cprop
-ftree-sink
-ftree-slsr
-ftree-sra
-ftree-ter
-funit-at-a-time
It is possible to disable individual optimizations too: see the -fno-... options listed on https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
With this -O1 option, the code size is now:
Memory region Used Size Region Size %age Used
PROGRAM_FLASH: 145128 B 512 KB 27.68%
SRAM_UPPER: 44332 B 64 KB 67.65%
SRAM_LOWER: 64 KB 64 KB 100.00%
FLEX_RAM: 0 GB 4 KB 0.00%
text data bss dec hex filename
144468 660 108844 253972 3e014 ADIS_MK22FX512xxx12_SumoV2.axf
Flash usage dropped from 42.15% down to 27.68%, with a build time of 17s.
The impact can be very well seen with our ‘multiply()’ example, which is now reduced to this:
00000000 <multiply>:
0: fb01 f000 mul.w r0, r1, r0
4: 4770 bx lr
Wow! All the entry/exit/load/store code from above is gone :-).
But debugging might not be as easy anymore. The compiler tries to provide as much debug information as possible, but because parameters now live in registers, it can be harder to follow the program flow. For example, there might be additional symbols shown as local variables:

The compiler already does some inlining at this level: if the call to multiply() is in the same module, it will inline that code:

This means that setting a breakpoint in multiply() will not stop the debugger in that case, as the code is executed directly on the caller side.
Build time might be reduced too, because with the optimizations the compiler probably needs to process less data internally.
Optimize more with -O2
The -O2 optimization turns on all -O1 optimizations plus others which increase the compilation time, but also the performance of the generated code. -O2 should not affect the size-speed trade-off much: it should increase the performance of the code without adding much (if any) code size penalty.
If I compile the Sumo robot application, I get this:
Memory region Used Size Region Size %age Used
PROGRAM_FLASH: 148160 B 512 KB 28.26%
SRAM_UPPER: 44332 B 64 KB 67.65%
SRAM_LOWER: 64 KB 64 KB 100.00%
FLEX_RAM: 0 GB 4 KB 0.00%
text data bss dec hex filename
147500 660 108844 257004 3ebec ADIS_MK22FX512xxx12_SumoV2.axf
So the code size here even increased, from 27.68% to 28.26%. The reason is that the compiler is doing more inlining: while this improves speed, it adds some extra code size. The build time is still at 17s.
Optimize Most (-O3)
The -O3 optimization level includes -O2 and adds even more optimizations, but at the expense of code size. So this is mainly an optimization for speed.
For the Sumo application it needs 19s for the build and gives:
Memory region Used Size Region Size %age Used
PROGRAM_FLASH: 166108 B 512 KB 31.68%
SRAM_UPPER: 44336 B 64 KB 67.65%
SRAM_LOWER: 64 KB 64 KB 100.00%
FLEX_RAM: 0 GB 4 KB 0.00%
Finished building target: ADIS_MK22FX512xxx12_SumoV2.axf
text data bss dec hex filename
165448 660 108848 274956 4320c ADIS_MK22FX512xxx12_SumoV2.axf
As expected, code size increased again from 28.26% to 31.68%.
Optimize for Size (-Os)
If code size is the biggest concern, then -Os should be used. It needs 17s for building and shows the following size information:
PROGRAM_FLASH: 131700 B 512 KB 25.12%
SRAM_UPPER: 44328 B 64 KB 67.64%
SRAM_LOWER: 64 KB 64 KB 100.00%
FLEX_RAM: 0 GB 4 KB 0.00%
Finished building target: ADIS_MK22FX512xxx12_SumoV2.axf
text data bss dec hex filename
131040 660 108840 240540 3ab9c ADIS_MK22FX512xxx12_SumoV2.axf
Compared to -O1, this is a reduction from 27.68% to 25.12%.
Optimize for Debug (-Og)
There is even an option in GCC to 'optimize for debug': it should only use optimizations which do not affect the debug experience.
Below are the results, with a build time of 16s:
Memory region Used Size Region Size %age Used
PROGRAM_FLASH: 147244 B 512 KB 28.08%
SRAM_UPPER: 44336 B 64 KB 67.65%
SRAM_LOWER: 64 KB 64 KB 100.00%
FLEX_RAM: 0 GB 4 KB 0.00%
text data bss dec hex filename
146584 660 108848 256092 3e85c ADIS_MK22FX512xxx12_SumoV2.axf
Risks
Are there risks or problems to expect with optimizations? Like any software, the compiler can have bugs too. So in any case I recommend using the same optimization level during development as the one you intend to release the application with. It does not make sense, IMHO, to use -O0 during development and then release with -O3, even with some testing: along the lines of "eating your own dog food". In my experience, the GNU compiler is very mature now, but higher optimization levels carry higher risks.
If an application does not work correctly at higher optimization levels, this is in many cases a sign of bad programming: for example uninitialized variables or reliance on hard-coded timing. The optimizations then just expose these bugs in the application code.
Another problem can be that tools or debuggers expect certain symbols to be present in the application, for example for FreeRTOS debugging: if the compiler (or linker) removes these symbols through inlining or dead stripping, this can cause problems. But this can be handled by adding GCC __attribute__ markers, as done in the McuLib library, which includes a FreeRTOS port usable even at the highest optimization levels.
Summary
Using compiler optimizations can have a big impact on code size and speed. Of course this depends on the application and the coding style used. Below is a summary for the Sumo application used in this article.
Option | Idle Time | Update | Flash Used | Build Time |
-O0 | 95.3% | 259.9 ms | 42.15% | 15s |
-O1 | 97.1% | 146.7 ms | 27.68% | 17s |
-O2 | 97.3% | 134.4 ms | 28.26% | 17s |
-O3 | 97.4% | 132.2 ms | 31.68% | 19s |
-Os | 97.1% | 139.4 ms | 25.12% | 17s |
-Og | 96.9% | 145.6 ms | 28.08% | 16s |
- The Option lists the compiler optimization level.
- Idle Time is the percentage the FreeRTOS application is idle (the application is just refreshing the OLED display): the higher the number, the better.
- The Update column tells how much time the display update needs: the lower, the better.
- Flash Used tells about the code size and how much of the total memory is used: the lower the better.
- Build Time is the time needed to make the build: the lower the better.
Personally, I end up with -O1 for most of my projects: it greatly improves code size and speed, and while the debugging experience is affected, it is still manageable for me.
I hope you find that information useful.
Happy optimizing 🙂
Links
- Options That Control Optimization: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- GCC -O options: https://www.rapidtables.com/code/linux/gcc/gcc-o.html
- StackOverflow: https://stackoverflow.com/questions/1778698/how-to-turn-off-specific-optimization-flags-in-gcc
“use the same optimization level during development as the one you intend the application to be released” – yes absolutely! It horrifies me that people might do other than this, and pretend to themselves that they have tested the code!
In my experience, a lot of developers are not eating their own dog food as they should.
My own early optimization testing while writing BOOT code gave 8500 bytes and 900us execution at O0, 5700 bytes and 525us at O1 … My notes about Osize said "resulting code pretty hard to follow [and debug]"; an early "skeleton" project went from 27K code at O0 to 16.5K at O1 and 14K at Osize … for the debug challenges and the increased concerns whether the code is doing what I coded, I do not use Osize; I stay with O1 throughout development and production.
Hi Ian,
thanks for sharing your numbers. To me, -O1 is a good compromise for code size, speed and the ability to debug too.
One other comment about optimization that caught me out once: I had "array bounds checking" selected in the compiler options, but discovered sadly that it doesn't work with the O0 and O1 optimization levels, only with O2 and higher! That has forced me to use O2 for production, but it definitely makes debugging a bit harder!
Yes, indeed. Some checks depend on compiler-internal data flow information which is only available at higher optimization levels. There is a discussion on this topic at https://stackoverflow.com/questions/6681154/gcc-optimization-affects-bounds-checking.
And indeed, I'm using higher optimization settings to get warnings about possible bugs too, just as an extra check, besides using static analysis tools. For example, they can find and spot branches in the code with not-always-initialized local variables.
I did not touch in this article on LTO, but this one has been very useful for me too finding issues: https://mcuoneclipse.com/2018/05/31/gnu-link-time-optimization-finds-non-matching-declarations/
Well, I just turned on "link time optimization" to see what happened, and it threw loads of errors which make no sense and don't occur with it off!
Is this an application with FreeRTOS? I know that 'plain vanilla' or 'silicon vendor' FreeRTOS distributions have issues with LTO because they do not properly mark certain objects. I have marked them in the McuLib version, so it works fine with LTO. Maybe this is the problem? What kind of errors are reported?
Some time ago I had problems when I tried to use optimization: there were programs that stopped working because some variables lost their value. In the end I declared all those variables as volatile, and then the optimization worked for me.
I never knew the reason; I asked about it in some forum and was told that it was bad programming, but without specifying where the problem was and how to solve it.
It really depends on the code whether it was a programmer or a compiler issue. I have used volatile in the past too, to work around compiler issues, sometimes, say about 10 years ago; but recently GCC seems to have become very mature.
Just in case, on the subject ‘volatile’: https://mcuoneclipse.com/2021/10/12/spilling-the-beans-volatile-qualifier/
I am using the following trick: GCC function attributes to set the optimization level per function in a program. For example:
void __attribute__((optimize("Os"))) FcnSpeed() { ... }
For other functions, the default optimization applies.
Thanks for sharing! I’m using the __attribute__ for things like this too.
Erich, I know you are aware of this. We should say a bit more about -Os and 'small'. That 'small' can come at a price: -Os becomes very aggressive at eliminating any code that does not have side effects in the C abstract machine. Since the code has no side effects, GCC thinks it is unneeded and removes it.
Typical things that GCC will kill off are busy loops and Interrupt Functions.
For example, this empty 'delay loop' will be removed, because the compiler has no sense of time and the code has no side effects:
for (uint16_t loop_u16 = 0U; loop_u16 < 1000U; ++loop_u16) {
}
Declaring the loop variable as 'volatile' will prevent the loop's removal.
Interrupt functions should be marked with the 'used' attribute, especially if Link Time Optimization is enabled.
Also, how optimization works can depend to some degree on the back end. In my experience, the AVR port of GCC is far more aggressive with -Os than what I've seen from the ARM back end.
Many projects and Makefiles come with a default setting of -Os. So let's head off "optimization broke my code with -Os": the compiler probably did what you asked it to do, it made the code smaller. Probably in ways a new embedded systems programmer was not expecting…
Hi Bob,
Many thanks for your input and thoughts! Yes, such delay loops are exactly what can be considered 'careless programming', and the compiler is correct to eliminate them. Interrupts are not removed if they belong to a special 'isr' or similar section which is KEEP'd in the linker file, or if they are marked as 'used'.