Optimizing the Kinetis gcc Startup

The GNU gcc tool chain integration in CodeWarrior/Eclipse MCU10.3 has a nice feature: it shows the code and data size of my application after linking (see this article on how to enable it). So if I create an ‘empty’ project with the wizard, I get the code and data size without consulting the linker map file:

Console View with Code and Data Size

But wait! 2604 bytes of code for doing almost nothing? That’s not what I want! There are ways to get that puppy much, much slimmer: down to 284 bytes.

Startup Code

What is added to my application code is the so-called ‘startup code’. That code is executed after reset, before the application reaches main(). Understanding the startup code allows me to tweak it to my needs and make it much, much smaller than what I have above.

What the startup code usually does is:

  1. Initializing critical processor registers (stack or frame register, clock settings, …). This is required for the processor to run properly.
  2. Initializing global memory with zero (sometimes named ‘zero-out’). This is required by ANSI-C/C++ and means that global variables like ‘int i;’ get initialized with zero values.
  3. Initializing global initialized variables with values from ROM (sometimes named ‘copy-down’). This is required by ANSI-C/C++ and means that global initialized variables like ‘int myVar=5;’ get properly initialized.
  4. Only for C++: calling global constructors. This is required by C++ so my global objects with a constructor get properly initialized.
  5. Calling main(). That’s probably the most obvious thing :-).
  6. Only for C++: If main() returns, calling global destructors. This again is required for C++ applications as counterpart to step 4 above.

So quite some stuff is performed before my application reaches main(). And this is what adds the roughly 2 KByte of code mentioned above.
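To make the order of these steps concrete, here is a minimal host-side sketch of steps 2, 3 and 5. It is not the CodeWarrior startup code: the arrays and all sim_ names are hypothetical stand-ins for the memory regions a linker file would define, and the 0xAA fill only imitates RAM that is not guaranteed to be zero after power-on.

```c
#include <string.h>

/* Hypothetical stand-ins for the memory regions a linker file would define. */
static unsigned char sim_bss[8];                  /* variables like 'int i;' */
static unsigned char sim_data[4];                 /* variables like 'int myVar=5;' */
static const unsigned char sim_rom_init[4] = {5, 0, 0, 0}; /* initial values kept in ROM */

/* Imitates RAM content after power-on, which is not guaranteed to be zero. */
void sim_power_on(void) {
  memset(sim_bss, 0xAA, sizeof(sim_bss));
  memset(sim_data, 0xAA, sizeof(sim_data));
}

/* Stand-in for the application's main(): reads one variable from each region. */
static int sim_app_main(void) {
  return sim_bss[0] + sim_data[0];
}

/* Steps 2, 3 and 5 of the list above, in that order. */
int sim_startup(void) {
  memset(sim_bss, 0, sizeof(sim_bss));               /* step 2: zero-out */
  memcpy(sim_data, sim_rom_init, sizeof(sim_data));  /* step 3: copy-down */
  return sim_app_main();                             /* step 5: call main() */
}
```

Calling sim_power_on() followed by sim_startup() lets the stand-in application see a zeroed variable and an initialized one. Without the two steps before the call, it would read the 0xAA pattern instead.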

So, knowing what I really need, I can slim down my startup code. The good thing is: I have full control and power. But it means as well that I’m fully responsible ;-).

Startup Files

In my FRDM-KL25Z project, all the startup-related files are located in the Startup_Code folder inside my Eclipse project:

Startup_Files inside the project

Optimizing Thumb Startup Code

The startup code is located in __arm_start.c, inside the function __thumb_startup(). By default, I have the following:

void __thumb_startup(void)
{
  // Setup registers
  __init_registers();
  // setup hardware
  __init_hardware();
#if defined(__APCS_ROPI) || defined(__APCS_RWPI)
  //  static base register initialization
  __load_static_base();
#endif
#if defined(__APCS_RWPI)
  //  -pid
  //  setup static base for SB relative position independent data
  //  perform runtime relocation
  __init_pid();
#endif
#if defined(__APCS_ROPI)
  //  -pic
  //  perform runtime relocation for position independent code
  __init_pic();
#endif
  //  zero-fill the .bss section
  zero_fill_bss();
#if SUPPORT_ROM_TO_RAM
  if (__S_romp != 0L)
    __copy_rom_sections_to_ram();
#endif
  //      initialize the floating-point library
#ifdef __VFPV4__
  __fp_init();
#endif
  //  call C++ static initializers
  __call_static_initializers();
  // initializations before main, user specific
  __init_user();
#if defined(__SEMIHOSTING)
  // semihost initializations
  __init_semihost();
#endif
  //  call main(argc, &argv)
#if SUPPORT_SEMIHOST_ARGC_ARGV
  exit(main(__argc_argv(argv, __MAX_CMDLINE_ARGS), argv));
#else
  exit(main(0, argv));
#endif
  //  should never get here
  while (1);
}

Most of the stuff is configured with macros and not enabled. A lot of things are generic but do not apply to my FRDM-KL25Z project, so I’m going to remove things to make the source match what I need. I’m going to remove the following:

  • I do not use static base registers: I can remove that part.
  • I do not use position independent code and data. I can remove the register initialization for PIC and PID.
  • I do not have a floating point unit. I can remove the block to initialize the floating point enable register.
  • I do not use C++: I can remove the constructor/destructor code.
  • I do not use semihosting (semihosting means that the debugger routes printf()/console output to a console on the host): I can remove the semihosting initialization.

Applying this, with some source ‘beautifying’, gives me a more compact startup:

void __thumb_startup(void) {
  __init_registers(); /* Setup registers */
  __init_hardware();  /* setup hardware */
  zero_fill_bss();    /*  zero-fill the .bss section */
  if (__S_romp != 0L) { /* copy down */
    __copy_rom_sections_to_ram();
  }
  __init_user(); /* initializations before main, user specific */
  exit(main(0, argv)); /*  call main(argc, &argv) */
  while (1); /*  should never get here */
}

So this startup code now does what is really required for my ANSI-C application: register and hardware initialization, zero-out and copy-down, user initialization and calling main(). And the positive effect is that I have already saved 76 bytes of code (2604 down to 2528) :-):

text       data        bss        dec        hex    filename
2528         36       2076       4640       1220    startup.elf
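The columns of this output relate to each other in a simple way: dec is the sum of text, data and bss, and hex is the same total in hexadecimal. As a rough rule of thumb (an assumption that fits typical flash-based microcontroller builds, not something the tool reports itself), text plus data ends up in flash, while data plus bss ends up in RAM. A small sketch with hypothetical helper names:

```c
/* One row of the size output shown above. */
typedef struct {
  unsigned long text; /* code and constants */
  unsigned long data; /* initialized variables */
  unsigned long bss;  /* zero-initialized variables */
} SizeRow;

/* The 'dec' column: sum of all three sections. */
unsigned long size_total(SizeRow r) {
  return r.text + r.data + r.bss;
}

/* Rule-of-thumb flash usage: code plus the initial values of .data. */
unsigned long size_flash(SizeRow r) {
  return r.text + r.data;
}

/* Rule-of-thumb RAM usage: initialized plus zero-initialized variables. */
unsigned long size_ram(SizeRow r) {
  return r.data + r.bss;
}
```

For the row above (2528, 36, 2076), size_total() gives 4640, which matches the dec column.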

Calling main() the Embedded Way

Looking at how main() is declared, it shows as:

extern int main(int, char **);

and is called like this:

exit(main(0, argv)); /*  call main(argc, &argv) */
while (1); /*  should never get here */

Wow! That really looks more like a desktop application ;-). In embedded system programming, I do *not* leave main(), and I do not call it with arguments. But gcc really wants main() to return a value, and gcc is smart enough to optimize this, so I change the declaration to this:

extern int main(void);

And since I got rid of the arguments passed to main(), I can remove that ‘argv’ global variable as well and slim the startup down to this:

extern int main(void);
extern void __init_registers(void);
extern void __init_hardware(void);
extern void __init_user(void);

extern void __copy_rom_sections_to_ram(void);
extern char __S_romp[];

static void zero_fill_bss(void) {
  extern char __START_BSS[];
  extern char __END_BSS[];

  memset(__START_BSS, 0, (__END_BSS - __START_BSS));
}

void __thumb_startup(void) _EWL_NAKED;

void __thumb_startup(void) {
  __init_registers(); /* Setup registers */
  __init_hardware();  /* setup hardware */
  zero_fill_bss();    /*  zero-fill the .bss section */
  if (__S_romp != 0L) { /* copy down */
    __copy_rom_sections_to_ram();
  }
  __init_user(); /* initializations before main, user specific */
  (void)main(); /*  call main() */
  /*  should never get here */
}

The result: saved 168 bytes of code and 16 bytes of RAM:

text       data        bss        dec        hex    filename
2360         36       2060       4456       1168    startup.elf

As I do not use exit() any more, I can also remove the Startup_Code/__arm_end.c file from my project, although this will not save anything (the code in it is no longer used and was not linked in anyway).

Optimizing for Code Size

To squeeze code size a bit more, I set the gcc compiler optimization level to optimize for size (the -Os option):

Optimize for Code Size

That saves me another 52 bytes of code:

text       data        bss        dec        hex    filename
2308         36       2060       4404       1134    startup.elf

Linker and Libraries

But 2308 bytes of code is still *a lot*. There must be more fat around. And indeed, checking the linker map file shows something like this:

.text.__pformatter_
                0x00000aa0      0x438 C:/Freescale/CW MCU v10.3_b120917i/MCU/ARM_GCC_Support/ewl/lib/armv6-m\libc.a(printformat_.o)
                0x00000aa0                __pformatter_

What’s that? It sounds like a printf() formatter or something similar, probably used for console or semihosting support. Indeed, such a symbol is specified in the linker options:

Linker defined __pformatter symbol

💡 The above settings look different in the final MCU10.3 release. See this post on how the same thing can be accomplished in MCU10.3.

I definitely do not need this for my application, so I remove the two lines shown in the linker settings above and get:

text       data        bss        dec        hex    filename
 572         36       2060       2668        a6c    startup.elf

Wow! Down to 572 bytes of code: this saved me 1736 bytes!

Zero-Out

So far I was still within the ANSI-C boundaries, which means I still have zero-out and copy-down enabled in the startup code. The zero-out initializes all my global memory with zero bytes (as required by the ANSI standard). So there is an opportunity to reduce my startup footprint even further if my application does not depend on the global memory being set to zero.

char buf[32]; /* with zero-out, this will be initialized with zeros */

Usually I have ModulePrefix_Init() functions which do the needed global initialization, so I do not depend on the startup code for this. These Init() functions are called from my application initialization code. With this I can remove the call to zero_fill_bss():

void __thumb_startup(void) {
  __init_registers(); /* Setup registers */
  __init_hardware();  /* setup hardware */
  /* zero_fill_bss() removed: no zero-out of the .bss section */
  if (__S_romp != 0L) { /* copy down */
    __copy_rom_sections_to_ram();
  }
  __init_user(); /* initializations before main, user specific */
  (void)main(); /*  call main() */
  /*  should never get here */
}

The result is that I not only save startup time, I also save 164 bytes of code :-):

text       data        bss        dec        hex    filename
 408         36       2060       2504        9c8    startup.elf
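A module that does not rely on zero-out could look like the following minimal sketch. The RB_ module, its buffer size and its function names are hypothetical and only illustrate the ModulePrefix_Init() idea: the Init function puts every module variable into a defined state, so it does not matter what the RAM contained after reset.

```c
#include <stddef.h>

#define RB_SIZE 32u /* hypothetical buffer size */

static char   rb_buf[RB_SIZE]; /* placed in .bss: content undefined without zero-out */
static size_t rb_count;        /* number of bytes currently stored */

/* Brings the module into a defined state; call once before any other RB_ function. */
void RB_Init(void) {
  size_t i;

  for (i = 0; i < RB_SIZE; i++) {
    rb_buf[i] = 0;
  }
  rb_count = 0;
}

/* Stores one byte; returns 1 on success, 0 if the buffer is full. */
int RB_Put(char c) {
  if (rb_count >= RB_SIZE) {
    return 0;
  }
  rb_buf[rb_count++] = c;
  return 1;
}

/* Returns the number of bytes currently stored. */
size_t RB_Count(void) {
  return rb_count;
}
```

The application would call RB_Init() from its own initialization code, before the first RB_Put(). Any variable that is used without such an explicit initialization has an undefined value once the zero-out is gone.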

Copy-Down

Copy-down is similar: it ensures that my variables in RAM are initialized properly with values from ROM.

int var = 7; /* initialized with value 7 at startup time */
char str[] = "hello"; /* initialized at startup time */

What happens is that the compiler/linker places all the initial values (‘7’ and ‘hello’ in the above case) in ROM/Flash, and the startup code copies them ‘down’ to the variables in RAM.
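A copy-down routine is typically driven by a table that the linker generates. The following is a minimal sketch of that idea only: the CopyEntry layout, the table terminator and all names are assumptions for illustration and are not the actual format behind __S_romp or __copy_rom_sections_to_ram().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry; real tool chains define their own layout. */
typedef struct {
  const void *source; /* initial values in ROM/Flash */
  void       *target; /* location of the variables in RAM */
  size_t      size;   /* number of bytes to copy; 0 terminates the table */
} CopyEntry;

/* Walks the table and copies each block of initial values to RAM. */
void copy_down(const CopyEntry *table) {
  for (; table->size != 0; table++) {
    memcpy(table->target, table->source, table->size);
  }
}

/* Demo data standing in for 'int var = 7;' and 'char str[] = "hello";'. */
static const int  rom_var   = 7;
static const char rom_str[] = "hello";
static int  ram_var;
static char ram_str[sizeof(rom_str)];

/* Builds a two-entry table, runs the copy-down, returns 1 if RAM holds the values. */
int copy_down_demo(void) {
  const CopyEntry table[] = {
    { &rom_var, &ram_var, sizeof(ram_var) },
    { rom_str,  ram_str,  sizeof(ram_str) },
    { NULL,     NULL,     0 }
  };

  copy_down(table);
  return ram_var == 7 && strcmp(ram_str, "hello") == 0;
}
```

The point of the table is that the startup code stays generic: the linker knows which sections need initial values, and the routine just copies whatever the table lists.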

So if my application does not depend on this, because I take care of the global variable initialization myself, then I can remove the call to __copy_rom_sections_to_ram():

void __thumb_startup(void) {
  __init_registers(); /* Setup registers */
  __init_hardware();  /* setup hardware */
  /* zero_fill_bss() removed: no zero-out of the .bss section */
  /* __copy_rom_sections_to_ram() removed: no copy-down */
  __init_user(); /* initializations before main, user specific */
  (void)main(); /*  call main() */
  /*  should never get here */
}

And this again reduces my footprint by 124 code bytes and 12 bytes of RAM 🙂 :

text       data        bss        dec        hex    filename
 284         36       2048       2368        940    startup.elf
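Without copy-down, an initializer like ‘int var = 7;’ no longer takes effect, so such values have to be assigned at runtime. A minimal sketch of one way to do this, with a hypothetical CFG_ module: constant data is declared const, so the tool chain can typically keep it in flash where it needs no copy-down, and the variables get their start values in an Init function.

```c
#include <string.h>

/* const data: typically placed in ROM/Flash, so it needs no copy-down. */
static const char cfg_default_name[] = "hello";

static int  cfg_level;   /* was: int cfg_level = 7; */
static char cfg_name[8]; /* was: char cfg_name[] = "hello"; */

/* Assigns the start values that copy-down would otherwise have provided. */
void CFG_Init(void) {
  cfg_level = 7;
  strcpy(cfg_name, cfg_default_name); /* 6 bytes including '\0', fits into 8 */
}

/* Returns the current level. */
int CFG_Level(void) {
  return cfg_level;
}

/* Returns the current name as a zero-terminated string. */
const char *CFG_Name(void) {
  return cfg_name;
}
```

As with the zero-out, this only works if CFG_Init() really runs before the first use of the module.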

Summary

I used CodeWarrior for MCU10.3 beta here. But the problems and approaches presented are universal and applicable to any embedded system tool chain. The startup code is 100% under my control. Default projects created by tool chains tend to care about anything and everything, which usually bloats the startup, both in the sources and in the resulting application code. Knowing the purpose and mechanics behind the startup code allows me to slim it down to my needs.

Happy Slimming :-)!

10 thoughts on “Optimizing the Kinetis gcc Startup”

  3. Excellent post. Definitely will save a link to this one.

    One minor comment. Instead of saying that the startup code initializes “global” variables, consider using the term “statically allocated” variables. Though all globals are statically allocated, not all statically allocated variables are globals.

    • That depends a little bit if you are using a bare, SDK, Processor Expert or ProcessorExpert+SDK project. But the principle remains the same: modify the startup code/source, the same way as described in this post.
      I hope this helps,
      Erich
