Re: Stack alignment on modern 32 bit bare metal ARMs?

"Richard Earnshaw (lists) via Gcc-help" <gcc-help@xxxxxxxxxxx> · Mon, 7 Aug 2023 11:45:20 +0100

This should be on gcc-help@xxxxxxxxxxx, not the main gcc@ list.  I've 
sent my response there (and hopefully BCC gcc@).

On 06/08/2023 01:30, Barrie Slaymaker via Gcc wrote:
Hi,

I'm cross compiling for 32-bit bare metal ARMs (modern ones: Cortex-M4 and
Cortex-M33) w/ gcc 12.3.0, the latest available from ARM (see gcc -v output
below), and have found that va_arg(..., double) (i.e. __builtin_va_arg())
assumes that doubles are 64-bit aligned, but the stack is not always so.

I searched the bug database but didn't see this, so I'm guessing this isn't
a GCC bug--the ARM world would be on fire if it were. And I've searched the
gcc command line options docs, and the ARM architecture docs to no avail.
I'm hoping I didn't miss something obvious...

So, does gcc assume or require that doubles on the stack be 64-bit aligned,
or is there an option we should be passing to either allow 32-bit alignment
or force 64-bit alignment, or is the MCU vendor's startup code a wee buggy
(this is what I suspect, but wanted to be damn sure before continuing)?

Your problem is a common one.  GCC maintains 64-bit stack alignment in
the code it generates, but it does not realign the stack if the caller
gets it wrong.  Your most likely problem is that the stack was not
correctly aligned before calling main().  This is something the startup
code must ensure when setting up the program environment.

R.
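
As a rough illustration of what "correctly aligned" means here: the AAPCS
requires sp to be a multiple of 8 at every public interface, and on Cortex-M
the initial sp is loaded from the first vector table entry, so that value is
the first thing worth checking. The sketch below is not taken from any
vendor's startup code; the helper names align_stack_top and stack_is_aligned
are hypothetical, and it only shows the arithmetic the startup code (or the
linker script that defines the stack top) would need to get right.

```c
#include <assert.h>
#include <stdint.h>

/* AAPCS stack alignment at public interfaces, in bytes. */
#define AAPCS_STACK_ALIGN 8u

/* Nonzero when sp is a multiple of 8. */
static int stack_is_aligned(uint32_t sp)
{
    return (sp & (AAPCS_STACK_ALIGN - 1u)) == 0u;
}

/* Hypothetical helper: round a proposed stack top down to an 8-byte
   boundary (the stack grows downwards, so rounding down stays in bounds). */
static uint32_t align_stack_top(uint32_t top)
{
    uint32_t aligned = top & ~(AAPCS_STACK_ALIGN - 1u);
    assert(stack_is_aligned(aligned));
    return aligned;
}
```

For example, a stack top of 0x20007FFC is only word aligned and would be
rounded down to 0x20007FF8.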

Here's the test code:

void va_args_test(int i, ...) {
     va_list args;
     va_start(args, i);
     double d = (int)va_arg(args, double);
     va_end(args);
     // display code elided
}

Here's the generated assembly, with commentary mine:

void va_args_test(int i, ...) {
     3f60:→  b40f      → push→   {r0, r1, r2, r3}
     3f62:→  b580      → push→   {r7, lr}
     3f64:→  b082      → sub→sp, #8
     3f66:→  af00      → add→r7, sp, #0

     va_list args;
     3f68:→  2300      → movs→   r3, #0
     3f6a:→  607b      → str→r3, [r7, #4]

     va_start(args, i);
     3f6c:→  f107 0314 → add.w→  r3, r7, #20
     3f70:→  607b      → str→r3, [r7, #4]

     double d = (int)va_arg(args, double);
     3f72:→  f107 031b → add.w→  r3, r7, #27   ; Loads the address of the last byte of the low order word into r3.
     3f76:→  f023 0307 → bic.w→  r3, r3, #7    ; Clears the low 3 bits, which works when the double is 64-bit aligned. Not so much otherwise.
     3f7a:→  f103 0208 → add.w→  r2, r3, #8    ; Increments args' internal pointer
     3f7e:→  607a      → str→r2, [r7, #4]      ; Saves that pointer
     3f80:→  e9d3 0100 → ldrd→   r0, r1, [r3]  ; Reads the double, right or wrong...
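
The effect of the add.w/bic.w pair can be modelled on a host machine. The
sketch below assumes the frame layout implied by the prologue in the listing
(r0..r3 pushed at r7+16..r7+31, so va_start yields r7+20 and the double passed
in r2:r3 sits at r7+24); the function names are hypothetical and the r7 values
used in the note afterwards are made-up example addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets from r7, read off the prologue in the listing. */
enum { VA_START_OFFSET = 20, DOUBLE_OFFSET = 24 };

/* Model of "add.w ..., #7" followed by "bic.w ..., #7":
   round the va_list pointer up to the next multiple of 8. */
static uint32_t va_arg_double_addr(uint32_t ap)
{
    return (ap + 7u) & ~7u;
}

/* Offset from r7 at which va_arg(args, double) will actually read. */
static uint32_t double_read_offset(uint32_t r7)
{
    assert((r7 & 3u) == 0u); /* sp is at least word aligned */
    return va_arg_double_addr(r7 + VA_START_OFFSET) - r7;
}
```

With r7 = 0x20001000 (8-byte aligned) the read lands at offset 24, on the
saved r2:r3 pair. With r7 = 0x20001004 it lands at offset 20, so the ldrd
picks up the saved r1:r2 instead of the double.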

Here's the call site assembly:

     va_args_test(0, (double)1.0);
     3fc2:→  2200      → movs→   r2, #0
     3fc4:→  4b09      → ldr→r3, [pc, #36]→  ; (3fec <main+0x44>)
     3fc6:→  2000      → movs→   r0, #0
     3fc8:→  4909      → ldr→r1, [pc, #36]→  ; (3ff0 <main+0x48>)
     3fca:→  4788      → blx→r1
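
For reference, the register values at the call site follow from the IEEE 754
encoding of 1.0: in little-endian mode the low word goes in r2 and the high
word in r3 (r1 is skipped because the AAPCS places a 64-bit argument in an
even/odd register pair). The helper below is a hypothetical host-side sketch
that splits a double the same way; the literal loaded into r3 is not shown in
the listing, so treating it as the high word of 1.0 is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The two 32-bit halves of a double, as they would sit in r2 (lo)
   and r3 (hi) on a little-endian ARM target. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} reg_pair;

static reg_pair split_double(double d)
{
    uint64_t bits;
    reg_pair p;

    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits); /* well-defined way to view the bits */
    p.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    p.hi = (uint32_t)(bits >> 32);
    return p;
}
```

For 1.0 this gives lo = 0 (matching "movs r2, #0") and hi = 0x3FF00000.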

This is using GCC 12.3.0, cross-compiling for ARM on x86_64 (gcc -v output
below sig), with a command line like

arm-none-eabi-gcc -o ../build/main/PAC5524/tmp/base/src/main.o
base/src/main.c <<-I options elided>> -mcpu=cortex-m4 -march=armv7e-m
-mfpu=fpv4-sp-d16 -std=gnu99 -ffunction-sections -fno-omit-frame-pointer
-fno-strict-overflow -fsingle-precision-constant
-ftrivial-auto-var-init=zero -mthumb -mlittle-endian -mlong-calls
-mfloat-abi=hard -Og -c -MD -MP

Removing any one of the -f options happens to align the stack correctly in
most cases (I've elided the -f options that don't affect this issue as far
as I can tell).

Many thanks,

Barrie

gcc -v output:

Using built-in specs.
COLLECT_GCC=arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/usr/share/arm-gnu-toolchain-12.3.rel1-x86_64-arm-none-eabi/bin/../libexec/gcc/arm-none-eabi/12.3.1/lto-wrapper
Target: arm-none-eabi
Configured with:
/data/jenkins/workspace/GNU-toolchain/arm-12/src/gcc/configure
--target=arm-none-eabi
--prefix=/data/jenkins/workspace/GNU-toolchain/arm-12/build-arm-none-eabi/install
--with-gmp=/data/jenkins/workspace/GNU-toolchain/arm-12/build-arm-none-eabi/host-tools
--with-mpfr=/data/jenkins/workspace/GNU-toolchain/arm-12/build-arm-none-eabi/host-tools
--with-mpc=/data/jenkins/workspace/GNU-toolchain/arm-12/build-arm-none-eabi/host-tools
--with-isl=/data/jenkins/workspace/GNU-toolchain/arm-12/build-arm-none-eabi/host-tools
--disable-shared --disable-nls --disable-threads --disable-tls
--enable-checking=release --enable-languages=c,c++,fortran --with-newlib
--with-gnu-as --with-gnu-ld
--with-sysroot=/data/jenkins/workspace/GNU-toolchain/arm-12/build-arm-none-eabi/install/arm-none-eabi
--with-multilib-list=aprofile,rmprofile
--with-pkgversion='Arm GNU Toolchain 12.3.Rel1 (Build arm-12.35)'
--with-bugurl=https://bugs.linaro.org/
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 12.3.1 20230626 (Arm GNU Toolchain 12.3.Rel1 (Build arm-12.35))

Test code (the LED lights very prettily when va_arg() returns the correct
value):

#include <stdarg.h>
#include <stdbool.h>

void va_args_test(int i, ...) {
     va_list args;
     va_start(args, i);
     i = (int)va_arg(args, double);
     va_end(args);
     bal_init();
     bal_set_AUX_LED1(i == 1);
}

int main(void) {
     /* ...CPU initialization elided... */
     va_args_test(0, (double)1.0);
     while (true) {
     }
}