Forgive my ignorance - I assumed the reason for the 8-byte alignment on
the stack was that some cores use an internal 64-bit memory bus? Again,
I assumed the idea was that a vendor could have a wider memory bus. For
example, I know many cores have a 64-bit or even 128-bit flash memory
bus so that they can fetch multiple instructions per access and keep up
with the CPU core despite the slower flash. As such, I assumed they did
the same for SRAM.

Please correct me if I am wrong or misunderstand, as I would like to
learn.

Thanks

On Mon, Dec 30, 2024 at 9:36 AM David Brown via Gcc-help
<gcc-help@xxxxxxxxxxx> wrote:

> Hi,
>
> I work with embedded microcontroller systems - primarily based on
> 32-bit ARM Cortex-M devices. Efficiency of the generated code is
> important to me - it means I can use the clearest, safest high-level
> source code and rely on the tools to do the low-level optimisation.
>
> One thing that sometimes hinders this is the calling conventions set
> by the CPU vendors. These were often designed in the days when
> everything was an "int", memory was fast, and 32 bits were enough for
> anyone, and are not optimal for modern usage.
>
> A general point for efficiency on RISC processors is trying to avoid
> unnecessary stack usage. Some of the faster Cortex-M cores are now
> significantly faster than RAM, especially if off-chip RAM is used.
> Caches and tightly-coupled memories help, but the more you keep in
> registers, the better. Cortex-M cores are not like modern x86 cores
> that have store buffers and other features specifically optimising
> away the overhead of stack usage.
>
> The 32-bit ARM eabi calls for an 8-byte aligned stack. That would
> have made sense for ancient ARM cores which do not support unaligned
> accesses and needed it for 64-bit doubles - AFAIK modern ARM cores all
> handle unaligned access for doubles and vectors without problems.
> (For devices with hardware double and/or vector support, such data
> would almost always be in registers or in non-stack data anyway.)
> 8-byte stack alignment is just a waste of RAM and cycles for half of
> the non-leaf functions in the program.
>
>
> More important, however, is the failure to use registers properly for
> function returns. The eabi allows R0:R1 to be used for 64-bit integer
> types and 64-bit doubles (when hardware floating point registers are
> not available) - other than that, all types greater than 32 bits in
> size are returned via the stack.
>
> typedef unsigned long long uint64;
> uint64 big1(void) { return 1; }
>
> typedef struct Uint64 { uint64 val; } Uint64;
> Uint64 big2(void) { return (Uint64) { 1 }; }
>
> Compiles to:
>
> big1:
>         movs    r0, #1
>         movs    r1, #0
>         bx      lr
> big2:
>         movs    r2, #1
>         movs    r3, #0
>         strd    r2, [r0]
>         bx      lr
>
> (Code here was from godbolt.org, using ARM GCC 14.2.0 (unknown-eabi)
> with flags "-O2 -mcpu=cortex-m4".)
>
>
> Simply wrapping the 64-bit integer type in a struct leads to using the
> stack for the return value. On some quick measurements I tried on a
> 600 MHz Cortex-M7 device using tightly-coupled memory for the stack,
> the "struct" version took /16/ times as long as the R0:R1 return
> version - 80 cycles extra. Timings like this are influenced by many
> factors, but the overhead here is not insignificant.
>
> (For comparison, more modern ABIs like RISC-V and x86-64 will return
> structs in two registers where possible, including mixing integer and
> floating point registers where it makes sense.)
>
>
> Small structs turn up regularly in modern coding, especially in newer
> C++.
> std::optional<>, std::variant<>, std::expected<> - these are all
> useful for safe coding, but carry a significant unnecessary overhead.
> The same problem applies to strong type wrappers around 64-bit
> integers.
>
>
> I can't see any good reason why all four scratch registers r0-r3
> should not be used for return values.
>
>
> I'm hoping to get some ideas or workarounds for this limitation.
> Maybe there are appropriate gcc options or function attributes that I
> haven't noticed. (There is plenty of precedent for different calling
> convention flags and function attributes in the x86 gcc port.)
> Failing that, it would be nice to have opinions on whether or not any
> of this would be a good idea. I don't imagine it would be trivial to
> implement these two suggestions - there's no point in filing a
> bugzilla feature request unless other people also think they would be
> useful.
>
>
> David
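
For anyone wanting to reproduce the C++ side of this, here is a minimal
sketch of the std::optional<> case mentioned above - the function name
and the choice of uint32_t are purely illustrative, and it assumes the
same toolchain and flags as in David's example (ARM GCC with
"-O2 -mcpu=cortex-m4" on godbolt.org):

#include <cstdint>
#include <optional>

// std::optional<uint32_t> is an 8-byte composite type, so under the
// 32-bit ARM AAPCS it is returned through caller-provided memory whose
// address is passed in r0, rather than in the r0:r1 register pair as a
// plain uint64_t would be.
std::optional<uint32_t> maybe_value(bool ok)
{
    if (ok)
        return 42u;        // value and "engaged" flag are written
                           // through the hidden result pointer
    return std::nullopt;
}

Compiling this next to big1()/big2() above shows the same pattern: the
caller has to reserve stack space for the result and pass its address,
even though the whole object would fit in two registers.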