I've been investigating a large performance regression in ARM code
generation in GCC 4.5.0 compared to 4.4.x, but while doing so I stumbled
upon something else, which is not a regression from 4.4.x. The generated
assembly code is so horrible it's almost comical. The C code looks like
this:
struct dev_t {
    volatile unsigned R0;
    volatile unsigned R1;
};

#define DEV ((struct dev_t*)0x40011400)

void write_data(const unsigned *d)
{
    unsigned i, mask;
    for (i = 0, mask = 1; i < 8; i++, mask <<= 1) {
        if (mask & *d)
            DEV->R0 = 1U << 13;
        else
            DEV->R1 = 1U << 13;
    }
}
Compiling this with a GCC 4.5.0 cross compiler for arm-none-eabi, using:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S -O3 -o- bad.c
gives the following assembly code:
...
ldr r3, [r0, #0]
tst r3, #32
itete eq
moveq r3, #5120
movne r3, #5120
moveq r2, #8192
movne r2, #8192
itete eq
movteq r3, 16385
movtne r3, 16385
streq r2, [r3, #4]
strne r2, [r3, #0]
ldr r3, [r0, #0]
tst r3, #64
itete eq
moveq r3, #5120
movne r3, #5120
moveq r2, #8192
movne r2, #8192
itete eq
movteq r3, 16385
movtne r3, 16385
streq r2, [r3, #4]
strne r2, [r3, #0]
...
Note how the moveq/movne pairs always load the same value: if equal, load
5120; otherwise, load 5120. Also note that it rebuilds the register
address (0x40011400, via mov/movt) and the constant 8192 on every
iteration instead of just once before the (unrolled) loop.
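For comparison, here is a minimal C sketch of the shape one might expect the code to take: the base pointer and the stored value are set up once, and each iteration only selects which of the two registers gets the store. It is not a fix for the compiler, and it is not known whether 4.5.0 generates better code from it. The function name write_data_select and the dev parameter are hypothetical; the device is passed in rather than taken from DEV only so the sketch can also run on a host against an ordinary struct.

```c
struct dev_t {
    volatile unsigned R0;
    volatile unsigned R1;
};

/* Hypothetical variant of write_data(): the device is a parameter so the
 * sketch can run on a host. The stored value is computed once; each
 * iteration only picks which register receives the store. */
void write_data_select(struct dev_t *dev, const unsigned *d)
{
    const unsigned bit = 1U << 13;
    unsigned i, mask;

    for (i = 0, mask = 1; i < 8; i++, mask <<= 1) {
        /* Same value, same base: only the register offset differs. */
        volatile unsigned *reg = (mask & *d) ? &dev->R0 : &dev->R1;
        *reg = bit;
    }
}
```

On the target this would be called as write_data_select(DEV, d).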
It's also pointlessly reloading the value of *d ("[r0]") into r3 every
time, which is a regression from 4.4.3 and the root of my primary
performance regression.
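A possible source-level workaround for the reload, sketched under an explicit assumption, is to read *d into a local once before the loop. Then no compiler version has a reason to load it again. This is only equivalent to the original if d never points at one of the device registers, which is assumed here. The name write_data_cached and the dev parameter are hypothetical, chosen so the sketch can run on a host.

```c
struct dev_t {
    volatile unsigned R0;
    volatile unsigned R1;
};

/* Hypothetical variant of write_data(): *d is read exactly once, before
 * the loop. Assumes d does not point into the device's registers. */
void write_data_cached(struct dev_t *dev, const unsigned *d)
{
    const unsigned data = *d;   /* single load of *d */
    unsigned i, mask;

    for (i = 0, mask = 1; i < 8; i++, mask <<= 1) {
        if (mask & data)
            dev->R0 = 1U << 13;
        else
            dev->R1 = 1U << 13;
    }
}
```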
Should I file zero, one or two bugs about this?
/Tobias