Tutorial: How to Optimize Code and RAM Size

It is great when vendors provide a starting point for my own projects, and a working ‘blinky’ is always a great starter. But convenience has a price: with a ‘blinky’, the code size for just toggling a GPIO pin is far larger than it needs to be. For a device with a tiny amount of RAM and FLASH this can be concerning: will my application ever fit on that device if a ‘blinky’ takes that much? Don’t worry: a blinky (or any other project) can easily be trimmed down.

Blinky on NXP LPC845-BRK Board

I use a ‘blinky’ project here just as an example: the trimming tips apply to any other kind of project too.

For this tutorial I’m using the NXP LPC845 on the BRK (breakout) board:

NXP LPC845-BRK Board

Blinky

I’m using the Eclipse-based NXP MCUXpresso IDE:

SDK board selection

I have created the ‘blinky’ project with the vendor default settings:

blinky

A ‘blinky’ is supposed to blink an LED, which makes it a good starter for any project. Building that rather minimal project gives this code size:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:       10536 B        64 KB     16.08%
            SRAM:        2424 B        16 KB     14.79%

The console shows the same information, split into text, data and bss:

   text	   data	    bss	    dec	    hex	filename
  10532	      4	   2420	  12956	   329c	lpc845breakout_led_blinky.axf
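
The two listings are consistent with each other. As a small sketch (the type and function names are made up for this illustration): text and data end up in FLASH, while data and bss occupy RAM at run time.

```c
#include <stdint.h>

/* Sizes as printed by the 'size' output above. */
typedef struct {
    uint32_t text; /* code and constants, stored in FLASH */
    uint32_t data; /* initialized variables: initial values in FLASH, copied to RAM at startup */
    uint32_t bss;  /* zero-initialized variables, RAM only */
} image_size_t;

/* FLASH usage: code plus the initial values of the initialized variables. */
uint32_t flash_used(image_size_t s)
{
    return s.text + s.data;
}

/* RAM usage: initialized plus zero-initialized variables. */
uint32_t ram_used(image_size_t s)
{
    return s.data + s.bss;
}
```

For the blinky, 10532 + 4 gives the 10536 bytes reported for PROGRAM_FLASH, and 4 + 2420 gives the 2424 bytes reported for SRAM.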

10 KB for a blinky is excessive, but we are going to trim this down in the next steps.

Size Information

For the meaning of the size information, see “text, data and bss: Code and Data Size Explained”. The normal way to see what is using space on my device is to check the linker map file (*.map):

Linker Map File

But that map file is rather hard to read and more for the experts: it lists the sections with the address and size:

Linker Map File Content

With the MCUXpresso IDE V11, there is a nice ‘Image Info’ view which is basically a better viewer for the map file information:

Image Info View

I can filter and sort the data which gives me an idea how much space is used for code and data:

Image Info Memory Content

Of course it requires some knowledge about what the application is supposed to do. I always go through that list of items in the view to see if there is anything there I would not expect: maybe the application is using something which can be removed.

Source Code

For a simple blinky, that is not exactly small. The first thing is to check what the program is doing. The main.c has this:

/*
 * Copyright 2017 NXP
 * All rights reserved.
 *
 * SPDX-License-Identifier: BSD-3-Clause
 */

#include "board.h"
#include "fsl_gpio.h"

#include "pin_mux.h"
/*******************************************************************************
 * Definitions
 ******************************************************************************/
#define BOARD_LED_PORT 1U
#define BOARD_LED_PIN 2U

/*******************************************************************************
 * Prototypes
 ******************************************************************************/

/*******************************************************************************
 * Variables
 ******************************************************************************/
volatile uint32_t g_systickCounter;

/*******************************************************************************
 * Code
 ******************************************************************************/
void SysTick_Handler(void)
{
    if (g_systickCounter != 0U)
    {
        g_systickCounter--;
    }
}

void SysTick_DelayTicks(uint32_t n)
{
    g_systickCounter = n;
    while (g_systickCounter != 0U)
    {
    }
}

/*!
 * @brief Main function
 */
int main(void)
{
    /* Define the init structure for the output LED pin*/
    gpio_pin_config_t led_config = {
        kGPIO_DigitalOutput,
        0,
    };

    /* Board pin init */
    BOARD_InitPins();
    BOARD_InitBootClocks();
    BOARD_InitDebugConsole();

    /* Init output LED GPIO. */
    GPIO_PortInit(GPIO, BOARD_LED_PORT);
    GPIO_PinInit(GPIO, BOARD_LED_PORT, BOARD_LED_PIN, &led_config);

    /* Set systick reload value to generate 1ms interrupt */
    if (SysTick_Config(SystemCoreClock / 1000U))
    {
        while (1)
        {
        }
    }

    while (1)
    {
        /* Delay 1000 ms */
        SysTick_DelayTicks(1000U);
        GPIO_PortToggle(GPIO, BOARD_LED_PORT, 1u << BOARD_LED_PIN);
    }
}

Basically, the code initializes the pins and clocks, sets up the SysTick timer, and then does the ‘blinky’ in a loop, using the SysTick counter to delay the blink period.

Debug Console

But what I can see is that it initializes a debug console (and the UART hardware for it):

BOARD_InitDebugConsole();

Getting rid of that gets us down to:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        5616 B        64 KB      8.57%
            SRAM:        2400 B        16 KB     14.65%

💡 Look for functions which get called but whose functionality is never used. In many cases demo applications set up communication channels which are then never used. The linker does a good job of removing unused objects (functions/variables), but only if they are not referenced.
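
Instead of deleting such a call, it can also be put behind a compile-time switch. The following is only a sketch: APP_USE_DEBUG_CONSOLE and app_init_board() are names I made up, and BOARD_InitDebugConsole() is replaced by a stand-in so the snippet builds on its own.

```c
#include <stdint.h>

/* Hypothetical application switch, not part of the NXP SDK:
 * set to 1 only if the application really uses the console. */
#ifndef APP_USE_DEBUG_CONSOLE
#define APP_USE_DEBUG_CONSOLE 0
#endif

/* Stand-in for the SDK function, counting how often it gets called. */
uint32_t g_consoleInitCalls;
void BOARD_InitDebugConsole(void)
{
    g_consoleInitCalls++;
}

/* Board setup as main() would do it. */
void app_init_board(void)
{
#if APP_USE_DEBUG_CONSOLE
    BOARD_InitDebugConsole(); /* this reference keeps the console code in the image */
#endif
    /* pin and clock initialization would follow here */
}
```

With the switch at 0 the call is not compiled at all, so nothing references the console code any more and the linker is free to drop it.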

Semihosting and printf()

The next thing to look at is whether there is any semihosting or printf(). The project uses ‘Redlib’, which is an optimized library compared to the ‘standard’ newlib or the smaller newlib-nano:

Redlib

Still, that library might add to the code size because it is using semihosting (sending messages through the debugger). Looking at the Memory view I can see all the standard I/O functions needed for that, directly or indirectly:

stdio functions

Having all the hooks for that functionality only makes sense if it is used, and the ‘blinky’ does not use it. So getting rid of semihosting and all the unused standard I/O means using the ‘none’ variant:

Library without standard I/O

This gets us down to this:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        3372 B        64 KB      5.15%
            SRAM:        2208 B        16 KB     13.48%

💡 Avoid using printf() and all its variants, including semihosting. Or use a smaller variant or implementation. See the links at the end of this article for more background on this.
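
As an example of what a ‘smaller implementation’ can look like when only numbers need to be printed: a minimal sketch of an unsigned-to-decimal conversion that needs no standard I/O at all. The function name u32_to_dec() is my own; how the resulting string gets sent out (UART, trace, ...) is left open here.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert 'value' to a NUL-terminated decimal string in 'buf'.
 * Returns the number of digits written, or 0 if 'buf' is too small. */
size_t u32_to_dec(uint32_t value, char *buf, size_t bufSize)
{
    char tmp[10]; /* UINT32_MAX has 10 decimal digits */
    size_t n = 0;

    do { /* collect the digits in reverse order */
        tmp[n++] = (char)('0' + (value % 10u));
        value /= 10u;
    } while (value != 0u);

    if (n + 1u > bufSize) {
        return 0; /* no room for the digits plus the terminator */
    }
    for (size_t i = 0; i < n; i++) { /* reverse into the caller's buffer */
        buf[i] = tmp[n - 1u - i];
    }
    buf[n] = '\0';
    return n;
}
```

A buffer of 11 bytes is enough for any 32-bit value. Compared to printf() this handles exactly one format, which is the point: only the code that is really needed ends up in FLASH.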

DEBUG and NDEBUG

The next thing is to check whether the compiler defines include DEBUG. And indeed, this is the case:

DEBUG define

With that define set, there is a lot of extra code in the SDK and example drivers which checks for good values with the ‘assert()’ macro:

Assert() usage in SDK code

Here again the Image Info view is helpful: it shows me all the places where assert() is used:

assert usage

It is actually a good practice to have asserts in the code to catch programming errors early. But all the assert() code really adds up. To turn off the extra code (and safety belt!), I change the macro to NDEBUG:

NDEBUG

This gets us down to this:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        3144 B        64 KB      4.80%
            SRAM:        2208 B        16 KB     13.48%
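
To illustrate what NDEBUG does to assert(), here is a minimal sketch in standard C. The functions pin_is_valid() and pin_mask() and the counter are made up for this example and only stand in for the kind of argument checks found in driver code.

```c
#define NDEBUG /* same effect as adding NDEBUG to the compiler defines */
#include <assert.h>
#include <stdint.h>

uint32_t g_checksRun; /* counts how often the argument check executes */

/* Hypothetical argument check. */
int pin_is_valid(uint32_t pin)
{
    g_checksRun++;
    return pin < 32u;
}

/* With NDEBUG defined, assert() expands to ((void)0): the check is
 * neither compiled into this function nor executed. */
uint32_t pin_mask(uint32_t pin)
{
    assert(pin_is_valid(pin));
    return (uint32_t)1u << pin;
}
```

After calling pin_mask(2u) the counter is still zero, because the whole assert() expression is gone. This is also the reason why an expression inside assert() must never have side effects the application relies on.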

Interrupts and Vectors

Again the Image Info view is a good starting point. I’m checking the used interrupts. The Blinky is using the SysTick interrupt which is expected. But there are still UART interrupts used?

Interrupts used

Most interrupt handlers are implemented as ‘weak’: there is a default/empty implementation which can be overridden by the application. But the UART ones do not make sense, as the blinky is not using any UART communication?

It turns out that the NXP SDK has the UART transactional API turned on by default:

UART Transactional API setting

The transactional API allows sending and receiving UART data in chunks (transactions). But we don’t need that in our blinky, so let’s turn it off:

Turning Off UART Transactional API

Which gives:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        2964 B        64 KB      4.52%
            SRAM:        2184 B        16 KB     13.33%
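
The ‘weak’ mechanism mentioned above can be sketched like this with the GNU toolchain. The handler and table names are made up for this illustration; the real startup code has one such entry per interrupt.

```c
#include <stdint.h>

uint32_t g_defaultHandlerCalls; /* only here to make the default observable */

/* Default handler. 'weak' lets the linker replace it with a non-weak
 * function of the same name from any other source file. */
__attribute__((weak)) void DEMO_UART_IRQHandler(void)
{
    g_defaultHandlerCalls++;
}

/* The vector table stores the handler address, so whichever definition
 * wins at link time stays in the image, even if the interrupt is never
 * enabled. */
void (*const g_demoVectors[])(void) = {
    DEMO_UART_IRQHandler,
};
```

If a driver brings its own non-weak handler of the same name, as the SDK UART driver apparently does when its transactional API is enabled, that handler and everything it calls get linked in through the vector table. That would explain why UART code shows up in an application which never uses the UART.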

💡 There would now be the option to remove the CMSIS support, which adds about 300 bytes to the above code. But I consider CMSIS (setting interrupt priority, common clock settings) very useful, so I don’t touch it here. The largest function in the application is the one used by the SysTick code to set the timer to the lowest priority; removing it would save another 220 bytes:

CMSIS as largest single function code size contributor

Optimizations

So far I have stripped off unwanted or unused functionality. Next I could turn on compiler optimizations. By default, the project is set up with -O0:

Compiler Optimizations

-O0 means no optimization: the code is straightforward and easy to debug.

-O1 mainly optimizes the function entry/exit code and is able to reduce code size a bit without really impacting debugging. In this example it cuts down code size by half!

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        1540 B        64 KB      2.35%
            SRAM:        2184 B        16 KB     13.33%

-O2 optimizes more and tries to keep things in registers as much as possible. Because the functions in the application are rather small, the improvement is not that big:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        1516 B        64 KB      2.31%
            SRAM:        2184 B        16 KB     13.33%

-O3 optimizes the most with extra inlining. -O3 is targeting speed, so no wonder the code size increases again:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        1792 B        64 KB      2.73%
            SRAM:        2184 B        16 KB     13.33%

The best option for code size optimization is -Os (optimize for size):

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        1456 B        64 KB      2.22%
            SRAM:        2184 B        16 KB     13.33%

That now looks pretty reasonable! Of course there are ways to cut off even more for a ‘bare-bare-blinky’, but everything in place (startup code, clock and GPIO initialization) makes sense for a real application, so I stop here.

RAM: Heap and Stack

What does not look right is the SRAM usage. The ‘heap’ is using a big chunk:

heap memory usage

That heap is used for dynamic memory allocation (malloc()). The general rule for embedded programming is to avoid it, but it is there by default. It can be turned off in the linker settings, where the demo reserves 1 KB each for heap and stack. As I’m not using malloc(), I can set the heap size to 0x0. The right size for the reserved stack really depends on the application. On ARM Cortex-M the MSP (Main Stack Pointer) is used for startup/main and for the interrupts (see “ARM Cortex-M Interrupts and FreeRTOS”). 0x100 (256 bytes) should be plenty for my blinky.

Heap and Stack Size

This gets me down to this:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        1456 B        64 KB      2.22%
            SRAM:         392 B        16 KB      2.39%
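
The new SRAM number can be cross-checked with a bit of arithmetic. This sketch assumes that the heap and stack reservations are counted in full in the reported SRAM size, which the numbers are consistent with; the macro and function names are my own.

```c
#include <stdint.h>

/* Figures from the build output and the default linker settings. */
#define SRAM_USED_DEFAULT  2184u  /* before touching heap and stack */
#define DEFAULT_HEAP_SIZE  0x400u /* 1 KB */
#define DEFAULT_STACK_SIZE 0x400u /* 1 KB */

/* RAM taken by everything except heap and stack. */
uint32_t sram_other(void)
{
    return SRAM_USED_DEFAULT - DEFAULT_HEAP_SIZE - DEFAULT_STACK_SIZE;
}

/* Expected SRAM usage for a given heap and stack reservation. */
uint32_t sram_expected(uint32_t heapSize, uint32_t stackSize)
{
    return sram_other() + heapSize + stackSize;
}
```

Everything besides heap and stack takes 2184 - 1024 - 1024 = 136 bytes. With no heap and a stack of 0x100 bytes this gives 136 + 0 + 256 = 392 bytes, which matches the linker output above.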

To reduce the stack size further, I can look at the Call Graph information, which tells me how much stack space is used:

Call Graph with Stack Size

There are a few items with unknown size information (marked with a ‘?’) because they are in the library. A way to verify the real stack usage is to fill the stack with a pattern (e.g. 0xffffffff), run the application for a while, and then check how much of the pattern has been overwritten:

Used Stack

This shows that 72 bytes are actually used. With a bit of a margin, setting the stack size to 128 bytes in this case looks reasonable. This gives:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        1456 B        64 KB      2.22%
            SRAM:         264 B        16 KB      1.61%
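
The pattern check described above can also be done in code instead of looking at the memory in the debugger. A minimal sketch, with function names of my own: it works on a plain array of words, and it is an assumption that on the target the linker symbols for the stack area would be passed in and that painting happens before the stack is in real use.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xFFFFFFFFu

/* Fill the (still unused) stack area with the pattern. */
void stack_paint(uint32_t *lowest, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_FILL_PATTERN;
    }
}

/* The Cortex-M stack grows towards lower addresses, so untouched pattern
 * words remain at the low end. Count them and report the rest as used. */
size_t stack_used_bytes(const uint32_t *lowest, size_t words)
{
    size_t untouched = 0;

    while (untouched < words && lowest[untouched] == STACK_FILL_PATTERN) {
        untouched++;
    }
    return (words - untouched) * sizeof(uint32_t);
}
```

For a 128 byte stack (32 words) of which the upper 18 words have been overwritten, stack_used_bytes() reports 72 bytes. Keep in mind that this only measures the code paths which were actually executed, and a stack value which happens to equal the pattern can make the result slightly too small.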

💡 Be really careful with this! Stack overflows are probably the most common problem in embedded applications. If you can, give the stack as much RAM as you can spare. If you cut the size down, make sure you have done enough analysis to justify your stack size.

MTB

There is one thing left which uses RAM space: the MTB buffer. The Micro Trace Buffer is used for tracing which can be very useful (see “Debugging ARM Cortex-M0+ Hard Fault with MTB Trace”). The buffer can be disabled with a macro:

mtb.c

__MTB_DISABLE

Which gets me down to this:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:        1456 B        64 KB      2.22%
            SRAM:         136 B        16 KB      0.83%

I think here we can be happy 🙂

Summary

Vendor examples are great: they give me a good starting point. They are not optimized, and this is intentional, but they might come with features and functions I don’t need. Knowing the different ways to trim an application, by cutting off features or by tuning settings, can be very useful to reduce RAM and FLASH usage. In this tutorial I showed how to bring a ‘blinky’ down from about 10 KB to about 1.4 KB of FLASH (1456 bytes) and 136 bytes of SRAM. Of course this all depends on features and usage, but I think this is a pretty reasonable state from which to add extra functionality for my application.

I hope these tips might be useful for your projects.

Happy Optimizing 🙂

Links

4 thoughts on “Tutorial: How to Optimize Code and RAM Size”

  1. Erich, your articles are always informative. But this is probably the cleanest, neatest, easiest to understand overview of how to simply strip out unnecessary junk that I have ever read. Brilliant !

    • Thank you 🙂 ! I try to keep things understandable, but as a non-native speaker this is sometimes hard. And I always feel I might be too much in the details. But I always try to improve, so seems I’m on the right track.

  2. From command line one can use:
    arm-none-eabi-nm --print-size --size-sort --radix=d target.elf

    It gives interesting result for STM32 HAL:
    134218896 00001100 T HAL_RCC_OscConfig

    More than 1k just for clock initialization! To be fair, this function configures 5 or 6 different clock sources.
