On Mon, May 10, 2021 at 02:45:03PM +0100, Mark Rutland wrote:
> About 31% of this seems to be due to GCC (almost) always clearing x16
> and x17 (see further down for numbers). I suspect that's because GCC has
> to assume that any (non-static) functions might be reached via a PLT
> which would clobber x16 and x17 with specific values.

Wheee.

> We also have a bunch of small functions with multiple returns, where
> each return path gets the full complement of zeroing instructions, e.g.
>
> Stock:
>
> | <fpsimd_sync_to_sve>:
> |   d503245f   bti   c
> |   f9400001   ldr   x1, [x0]
> |   7209003f   tst   w1, #0x800000
> |   54000040   b.eq  ffff800010014cc4 <fpsimd_sync_to_sve+0x14>  // b.none
> |   d65f03c0   ret
> |   d503233f   paciasp
> |   a9bf7bfd   stp   x29, x30, [sp, #-16]!
> |   910003fd   mov   x29, sp
> |   97fffdac   bl    ffff800010014380 <fpsimd_to_sve>
> |   a8c17bfd   ldp   x29, x30, [sp], #16
> |   d50323bf   autiasp
> |   d65f03c0   ret
>
> With zero-call-regs:
>
> | <fpsimd_sync_to_sve>:
> |   d503245f   bti   c
> |   f9400001   ldr   x1, [x0]
> |   7209003f   tst   w1, #0x800000
> |   540000c0   b.eq  ffff8000100152a8 <fpsimd_sync_to_sve+0x24>  // b.none
> |   d2800000   mov   x0, #0x0     // #0
> |   d2800001   mov   x1, #0x0     // #0
> |   d2800010   mov   x16, #0x0    // #0
> |   d2800011   mov   x17, #0x0    // #0
> |   d65f03c0   ret
> |   d503233f   paciasp
> |   a9bf7bfd   stp   x29, x30, [sp, #-16]!
> |   910003fd   mov   x29, sp
> |   97fffd17   bl    ffff800010014710 <fpsimd_to_sve>
> |   a8c17bfd   ldp   x29, x30, [sp], #16
> |   d50323bf   autiasp
> |   d2800000   mov   x0, #0x0     // #0
> |   d2800001   mov   x1, #0x0     // #0
> |   d2800010   mov   x16, #0x0    // #0
> |   d2800011   mov   x17, #0x0    // #0
> |   d65f03c0   ret
>
> ... where we go from 12 instructions to 20, which is a ~67% bloat. Yikes.

Yeah, so that is likely a good example of a missed optimization
opportunity.

> We have a bunch of cases like the above. Also note that per the AAPCS a
> function can clobber x0-17 (and x18 if it's not reserved for something
> like SCS), and I see a few places that clobber x1-x17.

Ah, gotcha. I wasn't quite sure which registers might qualify.

> [...]
> That's 441301 new MOVs, and the equivalent of 442511 new instructions
> overall. There are 135728 new MOVs to x16 and x17 specifically, which
> account for ~31% of that.

I assume the x16/x17 case could be addressed by the compiler if it
examined the need for PLTs, or is that too late (in the sense that the
linker is doing that phase)?

Regardless, I will update the documentation on this feature. :)

-- 
Kees Cook
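
For anyone wanting to double-check the percentages quoted above, here is a minimal Python sketch of the arithmetic. The figures are copied from the quoted message; the `percent` helper and the variable names are made up purely for illustration and are not part of any tooling used in this thread.

```python
# Sanity check of the overhead figures quoted in the message above.
# The helper and variable names are hypothetical; only the numbers
# come from the quoted text.

def percent(part: int, whole: int) -> float:
    """Return `part` expressed as a percentage of `whole`."""
    return 100.0 * part / whole

# fpsimd_sync_to_sve: 12 instructions in the stock build, 20 with
# zero-call-regs (four zeroing MOVs on each of the two return paths).
stock_insns = 12
zeroed_insns = 20
growth = percent(zeroed_insns - stock_insns, stock_insns)

# Kernel-wide: 441301 new MOVs, of which 135728 target x16 or x17.
new_movs = 441301
x16_x17_movs = 135728
x16_x17_share = percent(x16_x17_movs, new_movs)

print(f"per-function growth: {growth:.1f}%")
print(f"x16/x17 share of new MOVs: {x16_x17_share:.1f}%")
```

Both values round to the ~67% and ~31% figures given in the quoted text.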