Re: Calling convention weaknesses in 32-bit embedded ARM

Hi,

Some Cortex-M cores do indeed have a few internal 64-bit buses. (Wider buses to flash, as found in a few microcontrollers, are external to the core - the slow wide flash bus is connected to a small cache, and the CPU core connects to that with a faster but narrower bus.)

But having a wider bus does not necessitate an aligned stack. For example, instruction fetches and stack alignment are entirely independent. And for the faster devices, data access is typically either using tightly-coupled memory (for the Cortex-M7 at least, that uses two independent 32-bit buses) or is via the cache. Reads and writes of cache lines may use a 64-bit bus to internal RAM or external memory controllers, but the data paths to the CPU are generally 32-bit wide (but running at core clock speed, rather than the speed of the slower memory blocks).

I am sure there will be situations where 8-byte stack alignment is the most efficient, and many situations where it makes little difference - all depending on the details of the hardware and the particular source code. But I think on a lot of systems, particularly on small chips, a 4-byte stack alignment would be more efficient.
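To put rough numbers on the RAM side of that, here is a small sketch in C of the padding arithmetic. It assumes a hypothetical frame made up only of 32-bit words pushed on function entry; the names align_up and frame_bytes are invented for this illustration, and real frame layouts depend on the compiler and the code.

```c
#include <assert.h>

/* Round size up to the next multiple of align.
   align must be a non-zero power of two. */
static unsigned align_up(unsigned size, unsigned align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (size + align - 1u) & ~(align - 1u);
}

/* Hypothetical frame: `words` 32-bit values pushed on entry,
   padded so that the stack pointer stays aligned to stack_align bytes. */
static unsigned frame_bytes(unsigned words, unsigned stack_align)
{
    return align_up(words * 4u, stack_align);
}
```

Under these assumptions, frame_bytes(3, 4) is 12 and frame_bytes(3, 8) is 16, so a frame with an odd number of words carries 4 bytes of padding at 8-byte alignment, while a frame with an even number of words carries none.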

Would it make enough difference to be worth the effort, however? That I cannot say. I think improving structure returns would have a much more significant effect, for code that uses larger types.
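One possible source-level workaround in the meantime, sketched here in C, is to declare the out-of-line function with a plain 64-bit integer return type (which the 32-bit ARM EABI returns in r0:r1) and convert to and from the struct in inline wrappers. The type Result and the functions result_pack, result_unpack, read_sensor_raw and read_sensor are all hypothetical names made up for this example, and it assumes the compiler inlines the wrappers and optimises the packing down to register moves, which I have not measured.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical small result type: a value plus a status code. */
typedef struct Result {
    uint32_t value;
    uint32_t status;
} Result;

/* The trick only works if the struct fits in one 64-bit integer. */
static_assert(sizeof(Result) == sizeof(uint64_t), "Result must fit in 64 bits");

/* Pack the struct into a 64-bit integer: value in the low word,
   status in the high word. */
static inline uint64_t result_pack(Result r)
{
    return ((uint64_t)r.status << 32) | r.value;
}

/* Reverse of result_pack. */
static inline Result result_unpack(uint64_t raw)
{
    Result r = { (uint32_t)raw, (uint32_t)(raw >> 32) };
    return r;
}

/* Out-of-line function with an integer return type instead of a struct.
   The body is a placeholder for real work. */
uint64_t read_sensor_raw(uint32_t channel)
{
    Result r = { channel * 10u, 0u };
    return result_pack(r);
}

/* Inline wrapper that gives callers the struct type back. */
static inline Result read_sensor(uint32_t channel)
{
    return result_unpack(read_sensor_raw(channel));
}
```

Callers use read_sensor() as if it returned the struct directly. The cost is extra boilerplate per function, and it does not help with types larger than 64 bits or with C++ class types that are not trivially copyable.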

David


On 30/12/2024 16:04, Trampas Stern wrote:
Forgive my ignorance, but I assumed the reason for the 8-byte alignment on the stack was that some cores use an internal 64-bit memory bus.

Again, I assumed the idea was that a vendor could have a wider memory bus. For example, I know many cores have a 64-bit or even 128-bit flash memory bus, so that they can fetch multiple instructions at a time from slower flash and keep up with the CPU core. As such, I assumed they did the same for SRAM.

Please correct me if I am wrong or have misunderstood, as I would like to learn.

Thanks


On Mon, Dec 30, 2024 at 9:36 AM David Brown via Gcc-help <gcc-help@xxxxxxxxxxx> wrote:

    Hi,

    I work with embedded microcontroller systems - primarily based on 32-bit
    ARM Cortex-M devices.  Efficiency of the generated code is important to
    me - it means I can use the clearest, safest high-level source code and
    rely on the tools to do the low-level optimisation.

    One thing that sometimes hinders this is the calling conventions set by
    the CPU vendors.  These were often designed in the days when everything
    was an "int", memory was fast, and 32 bits were enough for anyone, and
    are not optimal for modern usage.

    A general point for efficiency on RISC processors is trying to avoid
    unnecessary stack usage.  Some of the faster Cortex-M cores are now
    significantly faster than RAM, especially if off-chip RAM is used.
    Caches and tightly-coupled memories help, but the more you keep in
    registers, the better.  Cortex-M cores are not like modern x86 cores
    that have store buffers and other features specifically optimising away
    the overhead of stack usage.

    The 32-bit ARM EABI calls for an 8-byte aligned stack.  That would have
    made sense for ancient ARM cores which do not support unaligned accesses
    and needed it for 64-bit doubles - AFAIK modern ARM cores all handle
    unaligned access for doubles and vectors without problems.  (For devices
    with hardware double and/or vector support, such data would almost
    always be in registers or in non-stack data anyway.)  8-byte stack
    alignment is just a waste of RAM and cycles for half of the non-leaf
    functions in the program.


    More important, however, is the failure to use registers properly for
    function returns.  The EABI allows R0:R1 to be used for 64-bit integer
    types and 64-bit doubles (when hardware floating point registers are not
    available) - other than that, all types greater than 32 bits in size are
    returned in memory, via a caller-supplied pointer that normally refers
    to the stack.

             typedef unsigned long long uint64;
             uint64 big1(void) { return 1; }

             typedef struct Uint64 { uint64 val; } Uint64;
             Uint64 big2(void) { return (Uint64) { 1 }; }

    Compiles to:

    big1:
              movs    r0, #1
              movs    r1, #0
              bx      lr
    big2:
              movs    r2, #1
              movs    r3, #0
              strd    r2, [r0]
              bx      lr

    (Code here was from godbolt.org, using ARM GCC 14.2.0 (unknown-eabi)
    with flags "-O2 -mcpu=cortex-m4".)


    Simply wrapping the 64-bit integer type in a struct leads to using the
    stack for the return value.  On some quick measurements I tried on a
    600 MHz Cortex-M7 device using tightly-coupled memory for the stack, the
    "struct" version took /16/ times as long as the R0:R1 return version -
    80 cycles extra.  Timings like this are influenced by many factors, but
    the overhead here is not insignificant.

    (For comparison, more modern ABIs like RISC-V and x86-64 will return
    structs in two registers where possible, including mixing integer and
    floating point registers where it makes sense.)


    Small structs turn up regularly in modern coding, especially in newer
    C++.  std::optional<>, std::variant<>, std::expected<> - these are all
    useful for safe coding, but have a significant unnecessary overhead.
    The same problem applies to strong type wrappers around 64-bit integers.


    I can't see any good reason why all four scratch registers r0-r3 should
    not be used for return values.


    I'm hoping to get some ideas or workarounds for this limitation.  Maybe
    there are appropriate gcc options or function attributes that I haven't
    noticed.  (There is plenty of precedent for different calling
    convention flags and function attributes in the x86 gcc port.)  Failing
    that, it would be nice to have opinions on whether or not any of this
    would be a good idea.  I don't imagine it would be trivial to implement
    these two suggestions - there's no point in filing a bugzilla feature
    request unless other people also think they would be useful.


    David



