Forgive my ignorance - I assumed the reason for the 8-byte alignment on
the stack was that some cores use an internal 64-bit memory bus? Again,
I assumed the idea was that a vendor could have a wider memory bus. For
example, I know many cores have a 64-bit or even 128-bit flash memory
bus so that they can fetch multiple instructions per access and keep up
with the CPU core despite the slower flash. As such, I assumed they did
the same for SRAM.

Please correct me if I am wrong or misunderstand, as I would like to
learn.

Thanks

On Mon, Dec 30, 2024 at 9:36 AM David Brown via Gcc-help
<gcc-help@xxxxxxxxxxx> wrote:

> Hi,
>
> I work with embedded microcontroller systems - primarily based on
> 32-bit ARM Cortex-M devices. Efficiency of the generated code is
> important to me - it means I can use the clearest, safest high-level
> source code and rely on the tools to do the low-level optimisation.
>
> One thing that sometimes hinders this is the calling conventions set
> by the CPU vendors. These were often designed in the days when
> everything was an "int", memory was fast, and 32 bits were enough for
> anyone, and are not optimal for modern usage.
>
> A general point for efficiency on RISC processors is trying to avoid
> unnecessary stack usage. Some of the faster Cortex-M cores are now
> significantly faster than RAM, especially if off-chip RAM is used.
> Caches and tightly-coupled memories help, but the more you keep in
> registers, the better. Cortex-M cores are not like modern x86 cores
> that have store buffers and other features specifically optimising
> away the overhead of stack usage.
>
> The 32-bit ARM eabi calls for an 8-byte aligned stack. That would
> have made sense for ancient ARM cores which do not support unaligned
> accesses and needed it for 64-bit doubles - AFAIK modern ARM cores all
> handle unaligned access for doubles and vectors without problems.
> (For devices with hardware double and/or vector support, such data
> would almost always be in registers or in non-stack data anyway.)
> 8-byte stack alignment is just a waste of RAM and cycles for half of
> the non-leaf functions in the program.
>
>
> More important, however, is the failure to use registers properly for
> function returns. The eabi allows R0:R1 to be used for 64-bit integer
> types and 64-bit doubles (when hardware floating point registers are
> not available) - other than that, all types greater than 32 bits in
> size are returned via the stack.
>
> typedef unsigned long long uint64;
> uint64 big1(void) { return 1; }
>
> typedef struct Uint64 { uint64 val; } Uint64;
> Uint64 big2(void) { return (Uint64) { 1 }; }
>
> Compiles to:
>
> big1:
>         movs    r0, #1
>         movs    r1, #0
>         bx      lr
> big2:
>         movs    r2, #1
>         movs    r3, #0
>         strd    r2, [r0]
>         bx      lr
>
> (Code here was from godbolt.org, using ARM GCC 14.2.0 (unknown-eabi)
> with flags "-O2 -mcpu=cortex-m4".)
>
>
> Simply wrapping the 64-bit integer type in a struct leads to using the
> stack for the return value. On some quick measurements I tried on a
> 600 MHz Cortex-M7 device using tightly-coupled memory for the stack,
> the "struct" version took /16/ times as long as the R0:R1 return
> version - 80 cycles extra. Timings like this are influenced by many
> factors, but the overhead here is not insignificant.
>
> (For comparison, more modern ABIs like RISC-V and x86-64 will return
> structs in two registers where possible, including mixing integer and
> floating point registers where it makes sense.)
>
>
> Small structs turn up regularly in modern coding, especially in newer
> C++.
> std::optional<>, std::variant<>, std::expected<> - these are all
> useful for safe coding, but carry a significant unnecessary overhead.
> The same problem applies to strong type wrappers around 64-bit
> integers.
>
>
> I can't see any good reason why all four scratch registers r0-r3
> should not be used for return values.
>
>
> I'm hoping to get some ideas or workarounds for this limitation.
> Maybe there are appropriate gcc options or function attributes that I
> haven't noticed. (There is plenty of precedent for different calling
> convention flags and function attributes in the x86 gcc port.)
> Failing that, it would be nice to have opinions on whether or not any
> of this would be a good idea. I don't imagine it would be trivial to
> implement these two suggestions - there's no point in filing a
> bugzilla feature request unless other people also think they would be
> useful.
>
>
> David
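
For anyone wanting to reproduce the C++ side of this, here is a minimal
sketch of the std::optional<> case mentioned above - the function name
and the choice of uint32_t are purely illustrative, and it assumes the
same toolchain and flags as in David's example (ARM GCC with
"-O2 -mcpu=cortex-m4" on godbolt.org):

#include <cstdint>
#include <optional>

// std::optional<uint32_t> is an 8-byte composite type, so under the
// 32-bit ARM AAPCS it is returned through caller-provided memory whose
// address is passed in r0, rather than in the r0:r1 register pair as a
// plain uint64_t would be.
std::optional<uint32_t> maybe_value(bool ok)
{
    if (ok)
        return 42u;        // value and "engaged" flag are written
                           // through the hidden result pointer
    return std::nullopt;
}

Compiling this next to big1()/big2() above shows the same pattern: the
caller has to reserve stack space for the result and pass its address,
even though the whole object would fit in two registers.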