Calling convention weaknesses in 32-bit embedded ARM

David Brown via Gcc-help <gcc-help@xxxxxxxxxxx> · Mon, 30 Dec 2024 15:34:47 +0100

Hi,

I work with embedded microcontroller systems - primarily based on 32-bit 
ARM Cortex-M devices.  Efficiency of the generated code is important to 
me - it means I can use the clearest, safest high-level source code and 
rely on the tools to do the low-level optimisation.

One thing that sometimes hinders this is the calling conventions set by 
the CPU vendors.  These were often designed in the days when everything 
was an "int", memory was fast, and 32 bits were enough for anyone, and 
are not optimal for modern usage.

A general point for efficiency on RISC processors is trying to avoid 
unnecessary stack usage.  Some of the faster Cortex-M cores are now 
significantly faster than RAM, especially if off-chip RAM is used. 
Caches and tightly-coupled memories help, but the more you keep in 
registers, the better.  Cortex-M cores are not like modern x86 cores 
that have store buffers and other features specifically optimising away 
the overhead of stack usage.

The 32-bit ARM eabi calls for an 8-byte aligned stack.  That would have 
made sense for ancient ARM cores which did not support unaligned accesses 
and needed it for 64-bit doubles - AFAIK modern ARM cores all handle 
unaligned access for doubles and vectors without problems.  (For devices 
with hardware double and/or vector support, such data would almost 
always be in registers or in non-stack data anyway.)  8-byte stack 
alignment is just a waste of RAM and cycles for half of the non-leaf 
functions in the program.
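
To put a rough number on that, here is a minimal sketch of the padding 
arithmetic.  It assumes a frame made up only of 4-byte slots (pushed 
registers and word-sized locals); the helper names frame_bytes and 
padding_bytes are made up for illustration and are not part of gcc or 
the eabi.  Under that assumption, every frame with an odd number of 
words pays 4 bytes of padding for 8-byte alignment.

```c
#include <stdint.h>

/* Hypothetical model: size in bytes of a stack frame holding `words`
   4-byte slots, rounded up to `align` bytes (align is a power of two). */
static uint32_t frame_bytes(uint32_t words, uint32_t align)
{
    uint32_t raw = words * 4u;
    return (raw + align - 1u) & ~(align - 1u);
}

/* Extra bytes that 8-byte alignment costs compared with 4-byte alignment. */
static uint32_t padding_bytes(uint32_t words)
{
    return frame_bytes(words, 8u) - frame_bytes(words, 4u);
}
```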

More important, however, is the failure to use registers properly for 
function returns.  The eabi allows R0:R1 to be used for 64-bit integer 
types and 64-bit doubles (when hardware floating point registers are not 
available) - other than that, all types greater than 32-bit in size are 
returned in memory, through a hidden pointer that the caller passes in 
R0.  In practice that memory is usually on the caller's stack.

	typedef unsigned long long uint64;
	uint64 big1(void) { return 1; }

	typedef struct Uint64 { uint64 val; } Uint64;
	Uint64 big2(void) { return (Uint64) { 1 }; }

Compiles to:

big1:
        movs    r0, #1
        movs    r1, #0
        bx      lr
big2:
        movs    r2, #1
        movs    r3, #0
        strd    r2, [r0]
        bx      lr

(Code here was from godbolt.org, using ARM GCC 14.2.0 (unknown-eabi) 
with flags "-O2 -mcpu=cortex-m4".)

Simply wrapping the 64-bit integer type in a struct leads to using the 
stack for the return value.  On some quick measurements I tried on a 600 
MHz Cortex-M7 device using tightly-coupled memory for the stack, the 
"struct" version took /16/ times as long as the R0:R1 return version - 
80 cycles extra.  Timings like this are influenced by many factors, but 
the overhead here is not insignificant.
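
One possible source-level workaround, as a sketch only: keep the 
out-of-line function returning the plain 64-bit integer, so it comes 
back in R0:R1, and rebuild the struct in a static inline wrapper.  The 
name big2_raw and the split into two functions are assumptions for 
illustration.  It only helps when the wrapper really is inlined into the 
caller, and it does not scale beyond types that fit in 64 bits.

```c
typedef unsigned long long uint64;
typedef struct Uint64 { uint64 val; } Uint64;

/* Out-of-line part: returns a plain 64-bit integer, which the 32-bit
   ARM eabi returns in R0:R1 rather than through memory. */
uint64 big2_raw(void)
{
    return 1;
}

/* Inline part: wraps the scalar back into the struct at the call site,
   so no struct crosses a real function-call boundary. */
static inline Uint64 big2(void)
{
    return (Uint64) { big2_raw() };
}
```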

(For comparison, more modern ABIs like RISC-V and x86-64 will return 
structs in two registers where possible, including mixing integer and 
floating point registers where it makes sense.)

Small structs turn up regularly in modern coding, especially in newer 
C++.  std::optional<>, std::variant<>, std::expected<> - these are all 
useful for safe coding, but have a significant unnecessary overhead. 
The same problem applies to strong type wrappers around 64-bit integers.
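
As an illustration of the kind of type involved, here is a C sketch of 
an optional-like struct (a 32-bit value plus a validity flag) that is 
packed into a 64-bit integer for the actual function return and unpacked 
again inline.  All names here (OptU32, opt_pack, opt_unpack, 
checked_div_raw, checked_div) are hypothetical, and this is a manual 
workaround under the assumption that the payload fits in 64 bits, not 
what std::optional<> does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical optional-like type: 8 bytes, so the 32-bit ARM eabi
   would return it through memory if returned directly. */
typedef struct OptU32 { uint32_t value; bool valid; } OptU32;

/* Pack into a 64-bit integer: value in the low word, flag in the high. */
static inline uint64_t opt_pack(OptU32 o)
{
    return ((uint64_t) o.valid << 32) | o.value;
}

/* Inverse of opt_pack. */
static inline OptU32 opt_unpack(uint64_t raw)
{
    OptU32 o;
    o.value = (uint32_t) raw;
    o.valid = (raw >> 32) != 0;
    return o;
}

/* Out-of-line part returns the packed scalar (R0:R1 on 32-bit ARM). */
uint64_t checked_div_raw(uint32_t a, uint32_t b)
{
    OptU32 r = { 0, false };
    if (b != 0) {
        r.value = a / b;
        r.valid = true;
    }
    return opt_pack(r);
}

/* Inline wrapper gives callers the struct-based interface. */
static inline OptU32 checked_div(uint32_t a, uint32_t b)
{
    return opt_unpack(checked_div_raw(a, b));
}
```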

I can't see any good reason why all four scratch registers r0-r3 should 
not be used for return values.

I'm hoping to get some ideas or workarounds for this limitation.  Maybe 
there are appropriate gcc options or function attributes that I haven't 
noticed.  (There is plenty of precedent for different calling 
convention flags and function attributes in the x86 gcc port.)  Failing 
that, it would be nice to have opinions on whether or not any of this 
would be a good idea.  I don't imagine it would be trivial to implement 
these two suggestions (relaxing the 8-byte stack alignment and returning 
small structs in r0-r3) - there's no point in filing a bugzilla feature 
request unless other people also think they would be useful.

David