Hi,
I work with embedded microcontroller systems - primarily based on 32-bit
ARM Cortex-M devices. Efficiency of the generated code is important to
me - it means I can use the clearest, safest high-level source code and
rely on the tools to do the low-level optimisation.
One thing that sometimes hinders this is the calling conventions set by
the CPU vendors. These were often designed in the days when everything
was an "int", memory was fast, and 32 bits were enough for anyone, and
are not optimal for modern usage.
A general point for efficiency on RISC processors is trying to avoid
unnecessary stack usage. Some of the faster Cortex-M cores are now
significantly faster than RAM, especially if off-chip RAM is used.
Caches and tightly-coupled memories help, but the more you keep in
registers, the better. Cortex-M cores are not like modern x86 cores
that have store buffers and other features specifically optimising away
the overhead of stack usage.
The 32-bit ARM eabi calls for an 8-byte aligned stack. That would have
made sense for ancient ARM cores which do not support unaligned accesses
and needed it for 64-bit doubles - AFAIK modern ARM cores all handle
unaligned access for doubles and vectors without problems. (For devices
with hardware double and/or vector support, such data would almost
always be in registers or in non-stack data anyway.) 8-byte stack
alignment is just a waste of RAM and cycles for half of the non-leaf
functions in the program.
More important, however, is the failure to use registers properly for
function returns. The eabi allows R0:R1 to be used for 64-bit integer
types and 64-bit doubles (when hardware floating point registers are not
available) - other than that, all types greater than 32-bit in size are
returned in memory, through a hidden pointer that the caller passes in
r0 and that in practice points into the caller's stack frame.
typedef unsigned long long uint64;
uint64 big1(void) { return 1; }
typedef struct Uint64 { uint64 val; } Uint64;
Uint64 big2(void) { return (Uint64) { 1 }; }
Compiles to:
big1:
        movs    r0, #1
        movs    r1, #0
        bx      lr
big2:
        movs    r2, #1
        movs    r3, #0
        strd    r2, [r0]
        bx      lr
(Code here was from godbolt.org, using ARM GCC 14.2.0 (unknown-eabi)
with flags "-O2 -mcpu=cortex-m4".)
Simply wrapping the 64-bit integer type in a struct leads to using the
stack for the return value. In some quick measurements I made on a 600
MHz Cortex-M7 device using tightly-coupled memory for the stack, the
"struct" version took /16/ times as long as the R0:R1 return version -
80 cycles extra. Timings like this are influenced by many factors, but
the overhead here is not insignificant.
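To make the rule concrete, here is a minimal sketch of how I understand
the base (soft-float) AAPCS to treat struct returns: a composite type is
returned in r0 only if it is no larger than 4 bytes, and anything bigger
goes through memory. The macro RETURNED_IN_MEMORY and the struct names
Small and Pair are made up for illustration; they are not part of any
ABI header.

```c
#include <stdint.h>

typedef struct Small  { uint32_t val; }    Small;   /* 4 bytes */
typedef struct Uint64 { uint64_t val; }    Uint64;  /* 8 bytes */
typedef struct Pair   { uint32_t lo, hi; } Pair;    /* 8 bytes */

/* Hypothetical helper mirroring the AAPCS size rule for composite
 * return types: larger than 4 bytes means "returned in memory".
 * It only applies to structs and unions, not to plain integers
 * such as uint64_t, which use r0:r1. */
#define RETURNED_IN_MEMORY(T) (sizeof(T) > 4)

_Static_assert(!RETURNED_IN_MEMORY(Small),  "4-byte struct fits in r0");
_Static_assert( RETURNED_IN_MEMORY(Uint64), "wrapped 64-bit integer goes via memory");
_Static_assert( RETURNED_IN_MEMORY(Pair),   "two 32-bit fields go via memory");
```

So the wrapper struct and a pair of 32-bit fields both miss out on
r0:r1, even though each would fit in two registers.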
(For comparison, more modern ABIs like RISC-V and x86-64 will return
structs in two registers where possible, including mixing integer and
floating point registers where it makes sense.)
Small structs turn up regularly in modern coding, especially in newer
C++. std::optional<>, std::variant<>, std::expected<> - these are all
useful for safe coding, but have a significant unnecessary overhead.
The same problem applies to strong type wrappers around 64-bit integers.
I can't see any good reason why all four scratch registers r0-r3 should
not be used for return values.
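One workaround that should work with today's tools is to keep the raw
integer at the real call boundary and re-wrap it in an inline function,
so that the struct never crosses an out-of-line call. This is only a
sketch: it assumes the callee can be changed, and the name big2_raw is
hypothetical.

```c
#include <stdint.h>

typedef struct Uint64 { uint64_t val; } Uint64;

/* Hypothetical out-of-line function. Because it returns a plain
 * uint64_t, the 32-bit ARM eabi hands the result back in r0:r1.
 * In real code this would live in a separate translation unit. */
uint64_t big2_raw(void)
{
    return 1;
}

/* Inline wrapper that re-applies the strong type. The wrapping is
 * meant to happen in the caller after inlining, so no struct return
 * crosses a real call boundary. */
static inline Uint64 big2(void)
{
    return (Uint64){ big2_raw() };
}
```

Callers use big2() exactly as before. It relies on the wrapper actually
being inlined, and it gets clumsy for types like std::optional<>, where
the packing into an integer has to be done by hand.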
I'm hoping to get some ideas or workarounds for this limitation. Maybe
there are appropriate gcc options or function attributes that I haven't
noticed. (There is plenty of precedent for different calling
convention flags and function attributes in the x86 gcc port.) Failing
that, it would be nice to have opinions on whether or not any of this
would be a good idea. I don't imagine it would be trivial to implement
these two suggestions - there's no point in filing a bugzilla feature
request unless other people also think they would be useful.
David