“There is no ‘S’ for Security in IoT” has indeed some truth. With all the connected devices around us, security of code should be a concern for every developer. “Preventing Reverse Engineering: Enabling Flash Security” shows how to prevent external read-out of critical code from a device. Some microcontrollers have yet another feature built in: ‘Execute-Only-Sections’ or ‘Execute-Only-Memory’. It means that only instruction fetches are allowed in this area, with no read access at all. Similar to ‘read-only’, ‘execute-only’ means that code can be executed there, but no other access to that memory is allowed.
In this article I describe the challenges for a toolchain like the GNU gcc, and how to compile and link code for such an execute-only memory.
Execute Only Memory
With the complete flash read-out protection as explained in “Preventing Reverse Engineering: Enabling Flash Security“, the door to the memory is completely closed: it is not possible to read from the device, e.g. for reverse-engineering, or to change the firmware on it. The only way to get back access to the device is usually a full erase of the device memory, which prevents reading out the content with external tools.
In some cases it would be beneficial to update or load some code into the device, for example to allow users to load their own code, program or applet into your device. But you don’t want that code to get access to your secret code in the device.
For example, a company sells electricity meters with a secret way to measure and store the billing information. The company sells the meters to electricity companies which add their own communication stacks and software. For this use case it is necessary to protect the secret code so that it cannot be read by any ‘untrusted’ code.
‘Execute-Only’ allows protecting areas in the firmware from read-out, as instructions in such an area can only be executed, but the code area itself cannot be read. This allows running untrusted code (e.g. loaded as ‘applet’). The applet can still call functions in the protected area (for example to get the billing information), but the untrusted code cannot ‘spy out’ the protected firmware.
For example, secret encryption/decryption routines can be placed in a protected execute-only area, and ‘untrusted’ code is still allowed to call them. Because the routines can only be executed, the ‘untrusted’ code cannot find out what is inside that protected area.
💡 To be clear: this is not a perfect protection, and depending on the hardware implementation (see https://community.arm.com/processors/b/blog/posts/what-is-execute-only-memory-xom) and efforts it might be still possible to do reverse engineering.
The typical implementation in the hardware is that only instruction fetches, but no data fetches, are allowed in this area. If the architecture has dedicated instruction and data buses, then basically the data bus is not connected to that memory. Interrupt execution and interrupt stack frames, as well as caches, have to be properly designed in the hardware to prevent read-out of the protected areas (see Meltdown and Spectre).
Code with Embedded Data: Literal Pools and Jump Tables
The code in an execute-only area can only be executed, and there is no data access allowed to it. This can be a challenge with the ARM Cortex (thumb2) instruction set. This can be illustrated with the following example which should be placed into an execute-only section:
int SecretFunction(int i) { return i+0x1234567; }
Looking at the disassembly (see “Creating Disassembly Listings with GNU Tools and Eclipse“) it shows the following:
Disassembly of section .text.SecretFunction:
00000000 <SecretFunction>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 687a ldr r2, [r7, #4]
a: 4b04 ldr r3, [pc, #16] ; (1c <SecretFunction+0x1c>)
c: 4413 add r3, r2
e: 4618 mov r0, r3
10: 370c adds r7, #12
12: 46bd mov sp, r7
14: f85d 7b04 ldr.w r7, [sp], #4
18: 4770 bx lr
1a: bf00 nop
1c: 01234567 .word 0x01234567
The interesting thing is the ldr r3, [pc, #16] which loads the 0x1234567 constant into register R3. The constant is placed at the end of the function code and is loaded PC-relative. The place where it is stored is called a ‘literal pool’: an area in the code which is used to store constants.
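How the assembler arrives at the address 1c in the listing can be checked with a little arithmetic. The following is a minimal sketch, assuming the usual Thumb rule that a PC-relative load sees the PC as the instruction address plus 4, rounded down to a multiple of 4; the function name is hypothetical and only used for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: address accessed by a Thumb PC-relative load.
 * Assumption: PC reads as (instruction address + 4), aligned down to 4. */
static uint32_t thumb_pc_relative_target(uint32_t instr_addr, uint32_t imm)
{
    uint32_t pc = (instr_addr + 4u) & ~UINT32_C(3); /* Align(PC, 4) */
    return pc + imm;                                /* add the immediate */
}
```

For the ldr r3, [pc, #16] at offset a this gives thumb_pc_relative_target(0x0a, 16) == 0x1c, which is exactly where the .word 0x01234567 sits. That address is inside the code section, so the load is a data read from code memory.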
The other use case where the compiler is putting data and data reads into the code is with jump tables, illustrated by the following example:
int SecretSwitch(int i) {
  switch(i) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    case 4: return 4;
    case 5: return 5;
    case 6: return 6;
    default: return i;
  }
}
which produces the following
00000038 <SecretSwitch>:
38: b480 push {r7}
3a: b083 sub sp, #12
3c: af00 add r7, sp, #0
3e: 6078 str r0, [r7, #4]
40: 687b ldr r3, [r7, #4]
42: 2b06 cmp r3, #6
44: d81e bhi.n 84 <SecretSwitch+0x4c>
46: a201 add r2, pc, #4 ; (adr r2, 4c <SecretSwitch+0x14>)
48: f852 f023 ldr.w pc, [r2, r3, lsl #2]
4c: 00000069 .word 0x00000069
4c: R_ARM_ABS32 .text_exec_only
50: 0000006d .word 0x0000006d
50: R_ARM_ABS32 .text_exec_only
54: 00000071 .word 0x00000071
54: R_ARM_ABS32 .text_exec_only
58: 00000075 .word 0x00000075
58: R_ARM_ABS32 .text_exec_only
5c: 00000079 .word 0x00000079
5c: R_ARM_ABS32 .text_exec_only
60: 0000007d .word 0x0000007d
60: R_ARM_ABS32 .text_exec_only
64: 00000081 .word 0x00000081
64: R_ARM_ABS32 .text_exec_only
68: 2300 movs r3, #0
6a: e00c b.n 86 <SecretSwitch+0x4e>
6c: 2301 movs r3, #1
6e: e00a b.n 86 <SecretSwitch+0x4e>
70: 2302 movs r3, #2
72: e008 b.n 86 <SecretSwitch+0x4e>
74: 2303 movs r3, #3
76: e006 b.n 86 <SecretSwitch+0x4e>
78: 2304 movs r3, #4
7a: e004 b.n 86 <SecretSwitch+0x4e>
7c: 2305 movs r3, #5
7e: e002 b.n 86 <SecretSwitch+0x4e>
80: 2306 movs r3, #6
82: e000 b.n 86 <SecretSwitch+0x4e>
84: 687b ldr r3, [r7, #4]
86: 4618 mov r0, r3
88: 370c adds r7, #12
8a: 46bd mov sp, r7
8c: f85d 7b04 ldr.w r7, [sp], #4
90: 4770 bx lr
92: bf00 nop
The .word entries at 4c to 64 in the above assembly listing are a jump table: a table of branch targets stored as data in the code. Translating the switch statement, the compiler has decided to generate this table, and the ldr.w pc, [r2, r3, lsl #2] instruction at 48 loads the selected entry with a PC-relative access (r2 was set up by the preceding add r2, pc, #4). Here again the executed code of this function is reading from its own code memory.
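To make the table lookup concrete, here is a minimal C sketch of what that indexed load does. The table words are copied from the listing (they are still section-relative values, because the R_ARM_ABS32 relocations have not been applied yet); the helper name is hypothetical, and I assume the usual convention that bit 0 of a branch target only selects Thumb state.

```c
#include <stdint.h>

/* Table words copied from the listing, offsets 0x4c..0x64. */
static const uint32_t jump_table[7] = {
    0x69, 0x6d, 0x71, 0x75, 0x79, 0x7d, 0x81
};

/* Hypothetical helper: offset reached by 'ldr.w pc, [r2, r3, lsl #2]'.
 * The caller must ensure i <= 6, as the cmp/bhi.n pair does in the listing. */
static uint32_t switch_target(uint32_t i)
{
    uint32_t entry = jump_table[i]; /* this is the data read from code memory */
    return entry & ~UINT32_C(1);    /* assumption: bit 0 is the Thumb bit */
}
```

With i = 0 the result is 0x68 (movs r3, #0) and with i = 6 it is 0x80 (movs r3, #6), matching the case labels in the listing. The read of jump_table[i] is the access that would fault in execute-only memory.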
Veneer Functions
Another case where the code might use data in the code memory are ‘trampoline’ or ‘veneer’ functions. The limited opcode length of the ARM instructions does not allow jumping to anywhere in the 32-bit address space.
For example, the bl (branch and link) instruction encodes the branch offset from the current PC location as an immediate: 24 bits in word units in the ARM encoding, and a similarly limited range of roughly ±16 MB in the Thumb-2 encoding used on Cortex-M. The offset is resolved by the linker in the link phase.
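Whether a call can be reached directly is a simple range check. The following is a minimal sketch, assuming the Thumb-2 bl reach of -16 MiB to +16 MiB - 2 relative to the PC (instruction address plus 4); the function and macro names are hypothetical, and a real linker applies more rules than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: reach of a Thumb-2 'bl', relative to PC = instruction + 4. */
#define BL_MIN_OFFSET (-16777216LL)
#define BL_MAX_OFFSET (16777214LL)

/* Hypothetical helper: true if the target is too far for a direct 'bl'. */
static bool bl_needs_veneer(uint32_t bl_addr, uint32_t target_addr)
{
    int64_t pc = (int64_t)bl_addr + 4;          /* PC as seen by the branch */
    int64_t offset = (int64_t)target_addr - pc; /* signed distance */
    return offset < BL_MIN_OFFSET || offset > BL_MAX_OFFSET;
}
```

For example, a bl located at 0x0008002a calling a function at 0xa0100000 is far out of range, so bl_needs_veneer() returns true, while a call to 0x00001000 from the same place is well within reach.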
Consider the following case where our ‘secret’ code calls a function which is far away:
int SecretFarJump(int i) { return FarFunction(i); }
The assembly code for this is the following:
00000020 <SecretFarJump>:
20: b580 push {r7, lr}
22: b082 sub sp, #8
24: af00 add r7, sp, #0
26: 6078 str r0, [r7, #4]
28: 6878 ldr r0, [r7, #4]
2a: f7ff fffe bl 0 <FarFunction>
2a: R_ARM_THM_CALL FarFunction
2e: 4603 mov r3, r0
30: 4618 mov r0, r3
32: 3708 adds r7, #8
34: 46bd mov sp, r7
36: bd80 pop {r7, pc}
If the distance to the called function fits into the branch offset, then the linker can directly patch that offset into the ‘bl’ instruction. If the called function is too far away, the linker inserts a helper function, called a trampoline or veneer.
This veneer function jumps to the destination address using a 32-bit address placed directly after the
ldr.w pc, [pc]
instruction, which loads the program counter with that target address using the PC-relative indirect addressing mode. Here again, the code performs a data access to the code area, which will not be possible if that code runs in execute-only memory.
The linker will patch the ‘bl’ to jump to that veneer function.
Pure Code
What is required for code to be placed into execute-only memory is ‘pure code’: code which performs no data access to its own code memory. For this the ARM gcc implements the following special command-line option:
-mpure-code
From https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html:
-mpure-code
Do not allow constant data to be placed in code sections. Additionally, when compiling for ELF object format give all text sections the ELF processor-specific section attribute SHF_ARM_PURECODE. This option is only available when generating non-pic code for M-profile targets with the MOVT instruction.
Note: There is a similar option:
-mslow-flash-data
Assume loading data from flash is slower than fetching instruction. Therefore literal load is minimized for better performance. This option is only supported when compiling for ARMv7 M-profile and off by default.
Like -mpure-code, the -mslow-flash-data option reduces data access in the code, but it does not eliminate it. It can improve performance, especially if data reads from flash memory are slower than instruction fetches. Because data fetches from the code area can still exist with -mslow-flash-data, it is not sufficient for execute-only memory, but it might be a good optimization option.
💡 The ‘pure code’ feature is implemented from GNU ARM Embedded Toolchain 6 2016q4 release, see https://launchpad.net/gcc-arm-embedded/+announcements?memo=5&start=5. KDS V3.2 is using an older version, so you would have to upgrade the compiler, see Switching ARM GNU Tool Chain and Libraries in Kinetis Design Studio
I can add the -mpure-code option to the files I want to put into execute-only memory. This can be accomplished in Eclipse/CDT by adding the option to the file settings.
With ‘-mpure-code’
int SecretFunction(int i) { return i+0x1234567; }
does not use any constant loads in the code. Instead it uses movw and movt instructions:
00000000 <SecretFunction>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 687a ldr r2, [r7, #4]
a: f244 5367 movw r3, #17767 ; 0x4567
e: f2c0 1323 movt r3, #291 ; 0x123
12: 4413 add r3, r2
14: 4618 mov r0, r3
16: 370c adds r7, #12
18: 46bd mov sp, r7
1a: f85d 7b04 ldr.w r7, [sp], #4
1e: 4770 bx lr
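The effect of the movw/movt pair can be modeled in a few lines of C. This is only a sketch of the register semantics (movw writes the low half and clears the high half, movt writes the high half and keeps the low half); the C function names simply mirror the mnemonics and are not part of any library.

```c
#include <stdint.h>

/* movw: write a 16-bit immediate to the low half, clear the high half. */
static uint32_t movw(uint16_t imm16)
{
    return imm16;
}

/* movt: write a 16-bit immediate to the high half, keep the low half. */
static uint32_t movt(uint32_t reg, uint16_t imm16)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}
```

With the immediates from the listing, movt(movw(0x4567), 0x123) yields 0x01234567. The constant is carried inside the two instruction encodings, so it is obtained by instruction fetches only and no data read from code memory is needed.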
Same for the jump table previously generated for the switch():
int SecretSwitch(int i) {
  switch(i) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    case 4: return 4;
    case 5: return 5;
    case 6: return 6;
    default: return i;
  }
}
generates now:
00000038 <SecretSwitch>:
38: b480 push {r7}
3a: b083 sub sp, #12
3c: af00 add r7, sp, #0
3e: 6078 str r0, [r7, #4]
40: 687b ldr r3, [r7, #4]
42: 2b03 cmp r3, #3
44: d015 beq.n 72 <SecretSwitch+0x3a>
46: 2b03 cmp r3, #3
48: dc06 bgt.n 58 <SecretSwitch+0x20>
4a: 2b01 cmp r3, #1
4c: d00d beq.n 6a <SecretSwitch+0x32>
4e: 2b01 cmp r3, #1
50: dc0d bgt.n 6e <SecretSwitch+0x36>
52: 2b00 cmp r3, #0
54: d007 beq.n 66 <SecretSwitch+0x2e>
56: e014 b.n 82 <SecretSwitch+0x4a>
58: 2b05 cmp r3, #5
5a: d00e beq.n 7a <SecretSwitch+0x42>
5c: 2b05 cmp r3, #5
5e: db0a blt.n 76 <SecretSwitch+0x3e>
60: 2b06 cmp r3, #6
62: d00c beq.n 7e <SecretSwitch+0x46>
64: e00d b.n 82 <SecretSwitch+0x4a>
66: 2300 movs r3, #0
68: e00c b.n 84 <SecretSwitch+0x4c>
6a: 2301 movs r3, #1
6c: e00a b.n 84 <SecretSwitch+0x4c>
6e: 2302 movs r3, #2
70: e008 b.n 84 <SecretSwitch+0x4c>
72: 2303 movs r3, #3
74: e006 b.n 84 <SecretSwitch+0x4c>
76: 2304 movs r3, #4
78: e004 b.n 84 <SecretSwitch+0x4c>
7a: 2305 movs r3, #5
7c: e002 b.n 84 <SecretSwitch+0x4c>
7e: 2306 movs r3, #6
80: e000 b.n 84 <SecretSwitch+0x4c>
82: 687b ldr r3, [r7, #4]
84: 4618 mov r0, r3
86: 370c adds r7, #12
88: 46bd mov sp, r7
8a: f85d 7b04 ldr.w r7, [sp], #4
8e: 4770 bx lr
which does not access any data inside the code.
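Read back into C, the compare-and-branch sequence in that listing corresponds roughly to the decision tree below. This is my own hand translation for illustration, not compiler output, and the function name is hypothetical; it shows that the dispatch needs only comparisons against immediates and no table.

```c
/* Hand-written equivalent of the compare tree in the listing above. */
static int SecretSwitchTree(int i)
{
    if (i == 3) return 3;       /* cmp r3, #3 / beq */
    if (i > 3) {                /* bgt: upper half, cases 4..6 */
        if (i == 5) return 5;
        if (i < 5)  return 4;   /* blt: only 4 is left below 5 */
        if (i == 6) return 6;
        return i;               /* default */
    }
    if (i == 1) return 1;       /* lower half, cases 0..2 */
    if (i > 1)  return 2;       /* bgt: only 2 is left above 1 */
    if (i == 0) return 0;
    return i;                   /* default */
}
```

The price for avoiding the jump table is a few more instructions and branches per call, which is usually acceptable for code that has to live in execute-only memory.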
Looking at the veneer function, this one is now ‘pure’ too.
How to put code into execute only memory
What remains is how to get the code into execute-only memory. First, I need something like this for the memory definition in the linker script:
MEMORY
{
  /* Define each memory region */
  PROGRAM_FLASH (rx) : ORIGIN = 0x0, LENGTH = 0x80000 /* 512K bytes (alias Flash) */
  EXECUTE_ONLY (x) : ORIGIN = 0x80000, LENGTH = 0x80000 /* 512K bytes (alias Flash2) */
  FAR_FLASH (rx) : ORIGIN = 0xa0100000, LENGTH = 0x400 /* 1K bytes (alias Flash3) */
  SRAM_UPPER (rwx) : ORIGIN = 0x20000000, LENGTH = 0x30000 /* 192K bytes (alias RAM) */
  SRAM_LOWER (rwx) : ORIGIN = 0x1fff0000, LENGTH = 0x10000 /* 64K bytes (alias RAM2) */
}
For this I can use __attribute__ to mark a function:
int __attribute__((section (".text_EXECUTE_ONLY"))) mySecretCode(int i) { /* code */ }
because in the linker script I have something like this to place things into the EXECUTE_ONLY section:
SECTIONS
{
  .text_Flash2 : ALIGN(8)
  {
    FILL(0xff)
    *(.text_Flash2*) /* for compatibility with previous releases */
    *(.text_EXECUTE_ONLY*) /* for compatibility with previous releases */
    *(.text.$Flash2*)
    *(.text.$EXECUTE_ONLY*)
    *(.rodata.$Flash2*)
    *(.rodata.$EXECUTE_ONLY*)
  } > EXECUTE_ONLY
  ...
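If several functions need the attribute, it can be wrapped in a macro. The following is a minimal sketch, assuming GCC: the macro name EXECUTE_ONLY_CODE is hypothetical, the noinline attribute is my own addition (so the function body is not inlined into a caller that lives in normal flash), and the function body is just the earlier example. The section name has to match the pattern used in the linker script.

```c
/* Hypothetical helper macro; the section name must match the linker script.
 * On non-ARM hosts it expands to nothing so the code still builds for tests. */
#if defined(__GNUC__) && defined(__arm__)
  #define EXECUTE_ONLY_CODE \
      __attribute__((section(".text_EXECUTE_ONLY"), noinline))
#else
  #define EXECUTE_ONLY_CODE /* host build: no special placement */
#endif

/* Placed into the execute-only region by the linker script on the target. */
EXECUTE_ONLY_CODE int mySecretCode(int i)
{
    return i + 0x1234567;
}
```

The file containing such functions still has to be compiled with -mpure-code, otherwise the constant would again end up in a literal pool inside the protected section.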
An easier way might be to simply do this on a per-file basis. Say I have all my execute-only code in a file named ExecuteOnly.c (producing the object file ExecuteOnly.o), then I can use this:
.text_Flash2 : ALIGN(8)
{
  FILL(0xff)
  *ExecuteOnly.o (.text .text*)
  *(.text_Flash2*) /* for compatibility with previous releases */
  *(.text_EXECUTE_ONLY*) /* for compatibility with previous releases */
  *(.text.$Flash2*)
  *(.text.$EXECUTE_ONLY*)
  *(.rodata.$Flash2*)
  *(.rodata.$EXECUTE_ONLY*)
} > EXECUTE_ONLY
This places all the .text* from ExecuteOnly.o into my special section (see “Putting Code of Files into Special Section with the GNU Linker“). If using the MCUXpresso IDE, which has a nice managed linker script feature, I add the file entry to the Extra linker script input section.
The question is: what happens with any veneer functions? The option description above talks about the SHF_ARM_PURECODE attribute. What I see is that the veneer function gets the name .text_EXECUTE_ONLY.__stub: that way it gets placed into the execute-only section too, because I used *(.text_EXECUTE_ONLY*) in the linker script :-).
Summary
‘Execute-only’ memory is something which gets implemented in more and more devices and applications which are concerned about code security. It might not be the 100% perfect secure solution for everyone, but to me it looks like a good idea to put walls around the firmware to prevent reverse engineering. But it requires understanding how the compiler is generating code, and how to configure the compiler and linker for execute-only-memory.
Happy Executing 🙂
Links
- What is eXecute-Only-Memory (XOM): https://community.arm.com/processors/b/blog/posts/what-is-execute-only-memory-xom
- Whitepaper: Separating instructions and data with PureCode
- Putting Code of Files into Special Section with the GNU Linker
Hi Erich,
What if the secret function uses a large amount of constant data such as a large number of coefficients as used in a DSP filter (for example). The coefficient table would be defined outside the secret program. How do you make the data “secret” as well?
As you cannot read from that constant table in an execute-only memory, you would have to call a function like int GetCoefficient(int i) which implements a switch statement returning the coefficient constant value. Not as efficient as a simple table access, but this would keep your coefficient table secret.
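A minimal sketch of such an accessor could look like this. The coefficient values are made up for illustration, and the fallback value for an invalid index is an arbitrary choice; compiled with -mpure-code, the constants are expected to end up as immediates in the instructions instead of in a table.

```c
#include <stdint.h>

/* Hypothetical coefficients, for illustration only. */
int32_t GetCoefficient(int index)
{
    switch (index) {
        case 0:  return 0x00012345;
        case 1:  return 0x7FFF0000;
        case 2:  return -40000;
        default: return 0; /* invalid index: arbitrary fallback */
    }
}
```

A filter in the protected area would then call GetCoefficient(k) inside its loop instead of indexing a const array.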
This security feature / concept is broken, especially on the NXP Kinetis devices. See https://www.usenix.org/conference/woot19/presentation/schink
Thanks for that link! Indeed this is a very weak way of protection. 😦
Just thank you ♥
You are welcome 🙂