Execute-Only Code with GNU and gcc

“There is no ‘S’ for Security in IoT” has indeed some truth to it. With all the connected devices around us, code security should be a concern for every developer. “Preventing Reverse Engineering: Enabling Flash Security” shows how to prevent external read-out of critical code from a device. Some microcontrollers have yet another feature built in: ‘Execute-Only Sections’ or ‘Execute-Only Memory’. It means that only instruction fetches are allowed in this area, and no read access at all. Similar to ‘read-only’, ‘execute-only’ means that code can be executed there, but no other access to that memory is allowed.

Locked Code

In this article I describe the challenges execute-only memory poses for a toolchain like GNU gcc, and how to compile and link code for it.

Execute Only Memory

With the complete flash read-out protection as explained in “Preventing Reverse Engineering: Enabling Flash Security“, the door to the memory is completely closed: it is not possible to read from the device, e.g. for reverse engineering, or to change the firmware on it. The only way to get access to the device back is usually a full erase of the device memory, which prevents reading out the content with external tools.

execute only memory

In some cases it would be beneficial to be able to update or load some code into the device, e.g. to allow users to load their own code, program or applet into your device. But you don’t want that code to get access to your secret code in the device.

For example, a company sells electricity meters with a secret way to measure and store the billing information. The company sells the meters to electricity companies which add their own communication stacks and software. For this use case it is necessary to protect the secret code so that it cannot be read by any ‘untrusted’ code.

‘Execute-Only’ allows protecting areas in the firmware from read-out: instructions in such an area can be executed, but the code area itself cannot be read. This allows running untrusted code (e.g. loaded as an ‘applet’). The applet can still call functions in the protected area (for example to get the billing information), but the untrusted code cannot ‘spy out’ the protected firmware.

For example, secret encryption/decryption routines can be placed in a protected execute-only area while still allowing ‘untrusted’ code to call them. Because the area can only be executed, the ‘untrusted’ code cannot find out what is inside it:

code calling protected firmware

💡 To be clear: this is not a perfect protection. Depending on the hardware implementation (see https://community.arm.com/processors/b/blog/posts/what-is-execute-only-memory-xom) and the effort invested, reverse engineering might still be possible.

The typical hardware implementation is that only instruction fetches, but no data fetches, are allowed in this area. If the architecture has dedicated instruction and data buses, then basically the data bus is not connected to that memory. Interrupt execution and interrupt stack frames, as well as caches, have to be properly designed in the hardware to prevent read-out of the protected areas (see Meltdown and Spectre).
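The access rule can be modeled in a few lines of C. This is only a host-side sketch of the policy, not how any particular device implements it. The region bounds (borrowed from the EXECUTE_ONLY region of the linker script shown later in this article) and the names XomAccessAllowed and AccessKind are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bounds of the execute-only region (illustrative values). */
#define XOM_BASE 0x00080000u
#define XOM_SIZE 0x00080000u

/* The three kinds of bus access this model distinguishes. */
typedef enum { ACCESS_FETCH, ACCESS_READ, ACCESS_WRITE } AccessKind;

/* Returns true if the modeled bus would allow the access. */
static bool XomAccessAllowed(uint32_t addr, AccessKind kind) {
  bool inXom = (addr >= XOM_BASE) && ((addr - XOM_BASE) < XOM_SIZE);
  if (!inXom) {
    return true; /* outside the protected region: not restricted by this model */
  }
  return kind == ACCESS_FETCH; /* inside: instruction fetches only */
}
```

With this model, a fetch from 0x80100 is allowed while a data read from the same address is rejected, which is exactly the situation a PC-relative load inside protected code runs into.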

Code with Embedded Data: Literal Pools and Jump Tables

The code in an execute-only area can only be executed, and no data access to it is allowed. This can be a challenge with the ARM Cortex (Thumb-2) instruction set, as the following example shows. It should be placed into an execute-only section:

int SecretFunction(int i) {
  return i+0x1234567;
}

The disassembly (see “Creating Disassembly Listings with GNU Tools and Eclipse“) shows the following:

Disassembly of section .text.SecretFunction:

00000000 <SecretFunction>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 687a ldr r2, [r7, #4]
a: 4b04 ldr r3, [pc, #16] ; (1c <SecretFunction+0x1c>)
c: 4413 add r3, r2
e: 4618 mov r0, r3
10: 370c adds r7, #12
12: 46bd mov sp, r7
14: f85d 7b04 ldr.w r7, [sp], #4
18: 4770 bx lr
1a: bf00 nop
1c: 01234567 .word 0x01234567

The interesting part is the ldr r3, [pc, #16], which loads the constant 0x1234567 into register R3. The constant is placed at the end of the function code and is loaded PC-relative. It lives in a ‘literal pool’: an area in the code which is used to store constants.

The other case where the compiler puts data and data reads into the code is jump tables, illustrated by the following example:

 
int SecretSwitch(int i) {
	switch(i) {
	case 0: return 0;
	case 1: return 1;
	case 2: return 2;
	case 3: return 3;
	case 4: return 4;
	case 5: return 5;
	case 6: return 6;
	default: return i;
	}
}

which produces the following:

00000038 <SecretSwitch>:
38: b480 push {r7}
3a: b083 sub sp, #12
3c: af00 add r7, sp, #0
3e: 6078 str r0, [r7, #4]
40: 687b ldr r3, [r7, #4]
42: 2b06 cmp r3, #6
44: d81e bhi.n 84 <SecretSwitch+0x4c>
46: a201 add r2, pc, #4 ; (adr r2, 4c <SecretSwitch+0x14>)
48: f852 f023 ldr.w pc, [r2, r3, lsl #2]
4c: 00000069 .word 0x00000069
4c: R_ARM_ABS32 .text_exec_only
50: 0000006d .word 0x0000006d
50: R_ARM_ABS32 .text_exec_only
54: 00000071 .word 0x00000071
54: R_ARM_ABS32 .text_exec_only
58: 00000075 .word 0x00000075
58: R_ARM_ABS32 .text_exec_only
5c: 00000079 .word 0x00000079
5c: R_ARM_ABS32 .text_exec_only
60: 0000007d .word 0x0000007d
60: R_ARM_ABS32 .text_exec_only
64: 00000081 .word 0x00000081
64: R_ARM_ABS32 .text_exec_only
68: 2300 movs r3, #0
6a: e00c b.n 86 <SecretSwitch+0x4e>
6c: 2301 movs r3, #1
6e: e00a b.n 86 <SecretSwitch+0x4e>
70: 2302 movs r3, #2
72: e008 b.n 86 <SecretSwitch+0x4e>
74: 2303 movs r3, #3
76: e006 b.n 86 <SecretSwitch+0x4e>
78: 2304 movs r3, #4
7a: e004 b.n 86 <SecretSwitch+0x4e>
7c: 2305 movs r3, #5
7e: e002 b.n 86 <SecretSwitch+0x4e>
80: 2306 movs r3, #6
82: e000 b.n 86 <SecretSwitch+0x4e>
84: 687b ldr r3, [r7, #4]
86: 4618 mov r0, r3
88: 370c adds r7, #12
8a: 46bd mov sp, r7
8c: f85d 7b04 ldr.w r7, [sp], #4
90: 4770 bx lr
92: bf00 nop

The .word entries at addresses 0x4c to 0x64 in the listing above form a jump table: a table of branch target addresses placed in the code. To translate the switch statement, the compiler has decided to generate this table, and the add r2, pc, #4 and ldr.w pc, [r2, r3, lsl #2] instructions right before it load the matching entry with a PC-relative access. Here again the code of this function reads from its own code memory.

Veneer Functions

Another case where the code might use data in code memory is ‘trampoline’ or ‘veneer’ functions. The limited opcode length of ARM instructions does not allow a jump to anywhere in the 32-bit address space.

For example, the bl (branch and link) instruction encodes the branch offset from the current PC location in a 24-bit immediate (in word units in ARM state; in Thumb-2 the immediate counts halfwords, which gives a range of about ±16 MB). The offset is resolved by the linker in the link phase.
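As a rough illustration of the range limit, the check below tests whether a direct bl can reach a target. It is a sketch that assumes the Thumb-2 encoding of bl used on Cortex-M, where the offset is a signed even byte offset of about ±16 MB relative to the PC value (address of the bl plus 4). The function name BlReachesDirectly is made up for this example, and addresses are taken with bit 0 clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thumb-2 BL range in bytes, relative to the PC value (BL address + 4). */
#define THUMB_BL_MIN (-16777216LL)
#define THUMB_BL_MAX (16777214LL)

/* Returns true if a bl at blAddr can branch directly to targetAddr. */
static bool BlReachesDirectly(uint32_t blAddr, uint32_t targetAddr) {
  int64_t offset = (int64_t)targetAddr - ((int64_t)blAddr + 4);
  return offset >= THUMB_BL_MIN && offset <= THUMB_BL_MAX;
}
```

For the memory map used later in this article, a call from 0x80000 to a function at 0xa0100000 fails this check, so the linker has to go through a veneer, while a call to 0x400 is well within range.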

Consider the following case where our ‘secret’ code calls a function which is far away:

 
int SecretFarJump(int i) {
  return FarFunction(i);
}

The assembly code for this is the following:

00000020 <SecretFarJump>:
20: b580 push {r7, lr}
22: b082 sub sp, #8
24: af00 add r7, sp, #0
26: 6078 str r0, [r7, #4]
28: 6878 ldr r0, [r7, #4]
2a: f7ff fffe bl 0 <FarFunction>
2a: R_ARM_THM_CALL FarFunction
2e: 4603 mov r3, r0
30: 4618 mov r0, r3
32: 3708 adds r7, #8
34: 46bd mov sp, r7
36: bd80 pop {r7, pc}

If the distance to the called function fits into the 24-bit offset, then the linker can directly patch it into the ‘bl’ instruction. If the called function is too far away, the linker inserts the following helper/trampoline/veneer function:

Veneer

This veneer function jumps to the destination using the 32-bit address placed right after the

ldr.w pc, [pc]

instruction, which loads the program counter with that target address using the PC-relative indirect addressing mode. Here again the code performs a data access to the code area, which is not possible if that code runs in execute-only memory.

The linker will patch the ‘bl’ to jump to that veneer function:

Branching to Veneer Function

Pure Code

What is required for code to be placed into execute-only memory is ‘pure code’: code which does no data access to the code area at all. For this, the ARM gcc implements the following special command line option:

-mpure-code

From https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html:

-mpure-code do not allow constant data to be placed in code sections. Additionally, when compiling for ELF object format give all text sections the ELF processor-specific section attribute SHF_ARM_PURECODE. This option is only available when generating non-pic code for M-profile targets with the MOVT instruction.

Note: There is a similar option:

-mslow-flash-data Assume loading data from flash is slower than fetching instruction. Therefore literal load is minimized for better performance. This option is only supported when compiling for ARMv7 M-profile and off by default.

Similar to -mpure-code, the -mslow-flash-data option avoids data accesses in the code, but not completely: data fetches from the code area can still exist. Its purpose is to improve performance when loading data from flash is slower than fetching instructions, so it can be a good optimization option, but it is not sufficient for execute-only memory.

💡 The ‘pure code’ feature is available starting with the GNU ARM Embedded Toolchain 6 2016q4 release, see https://launchpad.net/gcc-arm-embedded/+announcements?memo=5&start=5. KDS V3.2 is using an older version, so you would have to upgrade the compiler, see Switching ARM GNU Tool Chain and Libraries in Kinetis Design Studio.

I can add the -mpure-code option to the files I want to put into execute-only memory. This can be accomplished in Eclipse/CDT by adding the option to the file settings:

-mpure-code for a source file

With ‘-mpure-code’

 
int SecretFunction(int i) {
  return i+0x1234567;
}

does not use any constant loads in the code. Instead it uses movw and movt instructions:

00000000 <SecretFunction>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 687a ldr r2, [r7, #4]
a: f244 5367 movw r3, #17767 ; 0x4567
e: f2c0 1323 movt r3, #291 ; 0x123
12: 4413 add r3, r2
14: 4618 mov r0, r3
16: 370c adds r7, #12
18: 46bd mov sp, r7
1a: f85d 7b04 ldr.w r7, [sp], #4
1e: 4770 bx lr
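The arithmetic behind the movw/movt pair can be checked with a small host-side model. This is only a sketch of what the two instructions do to a register; the helper names Movw, Movt and BuildConstant are made up for this example.

```c
#include <stdint.h>

/* movw: write the low 16 bits and clear the high 16 bits. */
static uint32_t Movw(uint16_t imm16) {
  return (uint32_t)imm16;
}

/* movt: replace the high 16 bits and keep the low 16 bits. */
static uint32_t Movt(uint32_t reg, uint16_t imm16) {
  return (reg & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}

/* Mirrors the two instructions at offsets 0xa and 0xe in the listing above. */
static uint32_t BuildConstant(void) {
  uint32_t r3 = Movw(17767); /* 0x4567 */
  r3 = Movt(r3, 291);        /* 0x0123 */
  return r3;
}
```

Both halves of the constant are encoded in the instructions themselves, so the full value 0x01234567 ends up in R3 without any load from code memory.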

Same for the jump table previously generated for the switch():

 
int SecretSwitch(int i) {
  switch(i) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    case 4: return 4;
    case 5: return 5;
    case 6: return 6;
    default: return i;
  }
}

generates now:

00000038 <SecretSwitch>:
38: b480 push {r7}
3a: b083 sub sp, #12
3c: af00 add r7, sp, #0
3e: 6078 str r0, [r7, #4]
40: 687b ldr r3, [r7, #4]
42: 2b03 cmp r3, #3
44: d015 beq.n 72 <SecretSwitch+0x3a>
46: 2b03 cmp r3, #3
48: dc06 bgt.n 58 <SecretSwitch+0x20>
4a: 2b01 cmp r3, #1
4c: d00d beq.n 6a <SecretSwitch+0x32>
4e: 2b01 cmp r3, #1
50: dc0d bgt.n 6e <SecretSwitch+0x36>
52: 2b00 cmp r3, #0
54: d007 beq.n 66 <SecretSwitch+0x2e>
56: e014 b.n 82 <SecretSwitch+0x4a>
58: 2b05 cmp r3, #5
5a: d00e beq.n 7a <SecretSwitch+0x42>
5c: 2b05 cmp r3, #5
5e: db0a blt.n 76 <SecretSwitch+0x3e>
60: 2b06 cmp r3, #6
62: d00c beq.n 7e <SecretSwitch+0x46>
64: e00d b.n 82 <SecretSwitch+0x4a>
66: 2300 movs r3, #0
68: e00c b.n 84 <SecretSwitch+0x4c>
6a: 2301 movs r3, #1
6c: e00a b.n 84 <SecretSwitch+0x4c>
6e: 2302 movs r3, #2
70: e008 b.n 84 <SecretSwitch+0x4c>
72: 2303 movs r3, #3
74: e006 b.n 84 <SecretSwitch+0x4c>
76: 2304 movs r3, #4
78: e004 b.n 84 <SecretSwitch+0x4c>
7a: 2305 movs r3, #5
7c: e002 b.n 84 <SecretSwitch+0x4c>
7e: 2306 movs r3, #6
80: e000 b.n 84 <SecretSwitch+0x4c>
82: 687b ldr r3, [r7, #4]
84: 4618 mov r0, r3
86: 370c adds r7, #12
88: 46bd mov sp, r7
8a: f85d 7b04 ldr.w r7, [sp], #4
8e: 4770 bx lr

which does not access any data inside the code.
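Written back as C, the compare-and-branch sequence corresponds roughly to the decision tree below. This is a hand-written sketch that follows the branches in the listing; SwitchAsCompareTree is a made-up name, and SecretSwitch is repeated only so that both can be compared on a host.

```c
/* The original function from the article, for comparison. */
static int SecretSwitch(int i) {
  switch (i) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    case 4: return 4;
    case 5: return 5;
    case 6: return 6;
    default: return i;
  }
}

/* Same decisions as the generated code: compares and branches only,
 * no table lookup and therefore no data read from code memory. */
static int SwitchAsCompareTree(int i) {
  if (i == 3) return 3;
  if (i > 3) {
    if (i == 5) return 5;
    if (i < 5) return 4; /* only 4 is left between 3 and 5 */
    if (i == 6) return 6;
    return i;            /* default */
  }
  if (i == 1) return 1;
  if (i > 1) return 2;   /* only 2 is left between 1 and 3 */
  if (i == 0) return 0;
  return i;              /* default */
}
```

The price for avoiding the table is a few more compare instructions per call, which grows with the number of cases.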

Looking at the veneer function, this one is now ‘pure’ too:

pure veneer function

How to put code into execute only memory

What remains is how to get the code into execute-only memory. First, I need a memory region for it in the linker script, something like this:

MEMORY
{
  /* Define each memory region */
  PROGRAM_FLASH (rx) : ORIGIN = 0x0, LENGTH = 0x80000 /* 512K bytes (alias Flash) */ 
  EXECUTE_ONLY (x) : ORIGIN = 0x80000, LENGTH = 0x80000 /* 512K bytes (alias Flash2) */ 
  FAR_FLASH (rx) : ORIGIN = 0xa0100000, LENGTH = 0x400 /* 1K bytes (alias Flash3) */ 
  SRAM_UPPER (rwx) : ORIGIN = 0x20000000, LENGTH = 0x30000 /* 192K bytes (alias RAM) */ 
  SRAM_LOWER (rwx) : ORIGIN = 0x1fff0000, LENGTH = 0x10000 /* 64K bytes (alias RAM2) */ 
}

To place a function into that memory, I can use __attribute__ to mark it:

 
int __attribute__((section (".text_EXECUTE_ONLY"))) mySecretCode(int i) {
  /* code */
}  
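If several functions need the attribute, a small macro keeps the section name in one place. The following is a sketch only: the macro name EXECUTE_ONLY_CODE and the two example functions are hypothetical, and the attribute is applied only for ARM ELF targets so that the same file still builds with a host compiler for testing.

```c
#include <stdint.h>

/* Hypothetical helper macro; the section name must match the linker script. */
#if defined(__arm__) && defined(__ELF__)
#define EXECUTE_ONLY_CODE __attribute__((section(".text_EXECUTE_ONLY")))
#else
#define EXECUTE_ONLY_CODE /* host build: no special section */
#endif

/* Illustrative 'secret' routines, placed in the execute-only section on target. */
EXECUTE_ONLY_CODE int32_t Secret_Scale(int32_t raw) {
  return raw * 3;
}

EXECUTE_ONLY_CODE int32_t Secret_Offset(int32_t raw) {
  return raw + 0x1234567;
}
```

The source file still has to be compiled with -mpure-code; the attribute only decides where the code is placed, not whether it contains literal pools.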

because in the linker script I have something like this to place things into the EXECUTE_ONLY region:

SECTIONS
{
.text_Flash2 : ALIGN(8)
{
FILL(0xff)
*(.text_Flash2*) /* for compatibility with previous releases */
*(.text_EXECUTE_ONLY*) /* for compatibility with previous releases */
*(.text.$Flash2*)
*(.text.$EXECUTE_ONLY*)
*(.rodata.$Flash2*)
*(.rodata.$EXECUTE_ONLY*)
} > EXECUTE_ONLY
...

An easier way might be to simply do this on a per-file basis. Say I have all my execute-only code in a file named ExecuteOnly.c (producing the object file ExecuteOnly.o), then I can use this:

 .text_Flash2 : ALIGN(8)
{
  FILL(0xff)
  *ExecuteOnly.o (.text .text*)
  *(.text_Flash2*) /* for compatibility with previous releases */
  *(.text_EXECUTE_ONLY*) /* for compatibility with previous releases */
  *(.text.$Flash2*)
  *(.text.$EXECUTE_ONLY*)
  *(.rodata.$Flash2*)
  *(.rodata.$EXECUTE_ONLY*)
} > EXECUTE_ONLY

This places all the .text* sections from ExecuteOnly.o into my special section (see “Putting Code of Files into Special Section with the GNU Linker“). If using the MCUXpresso IDE, which has a nice managed linker script feature, I add the following to the Extra linker script input section:

Extra Managed Linker Script Section

The question is: what happens with any veneer functions? The option description quoted above talks about the SHF_ARM_PURECODE attribute. What I see is that the veneer function gets the section name .text_EXECUTE_ONLY.__stub: that way it gets placed into the execute-only section too, because I used *(.text_EXECUTE_ONLY*) in the linker script :-).

Veneer Function allocation

Summary

‘Execute-only’ memory is implemented in more and more devices and used in applications which are concerned about code security. It might not be a 100% secure solution for everyone, but to me it looks like a good idea to put walls around the firmware to prevent reverse engineering. It does require understanding how the compiler generates code, and how to configure the compiler and linker for execute-only memory.

Happy Executing 🙂

7 thoughts on “Execute-Only Code with GNU and gcc”

  1. Hi Erich,
    What if the secret function uses a large amount of constant data, such as a large number of coefficients used in a DSP filter (for example)? The coefficient table would be defined outside the secret program. How do you make the data “secret” as well?

    • As you cannot read from a constant table in execute-only memory, you would have to call a function like int GetCoefficient(int index) which implements a switch statement returning the coefficient constant value. Not as efficient as a simple table access, but this would keep your coefficient table secret.
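      A minimal sketch of such an accessor is shown below. The coefficient values and the out-of-range behavior are invented for illustration; only the idea of returning constants from a switch comes from the reply above.

```c
#include <stdint.h>

/* Hypothetical coefficient accessor: each value is encoded in the
 * instructions of its case instead of being stored in a readable table. */
int32_t GetCoefficient(int index) {
  switch (index) {
    case 0: return 1024;
    case 1: return -512;
    case 2: return 256;
    case 3: return -128;
    default: return 0; /* out of range: illustrative choice */
  }
}
```

      Depending on the optimization level, gcc may turn a switch like this into a lookup table in .rodata, which would then sit outside the execute-only area. It is worth checking the disassembly, and -fno-tree-switch-conversion can be tried if such a table shows up.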
