Hi!

On Fri, Nov 23, 2018 at 09:01:56PM +0100, Helmut Eller wrote:
> when compiling this example with gcc -O2 -ftrapv:
>
> long foo (long x, long y) { return x + y; }
>
> long bar (long x, long y) {
>   long z;
>   if (__builtin_add_overflow (x, y, &z))
>     __builtin_trap ();
>   return z;
> }
>
> then GCC seems to produce less efficient code for foo than for bar:
>
> foo:
>         subq    $8, %rsp
>         call    __addvdi3@PLT
>         addq    $8, %rsp
>         ret
>
> bar:
>         movq    %rdi, %rax
>         addq    %rsi, %rax
>         jo      .L9
>         rep ret
> .L9:
>         ud2
>
> I see several inefficiencies:
>
> 1.) __addvdi3 is not inlined.

It is implemented in libgcc.  The x86 target code does not handle addvdi3,
only addvdi4 (3 calls abort, 4 jumps to its 4th arg).

> 2.) %rsp is adjusted before calling __addvdi3.  Why is that needed?

To keep the stack aligned (to 16 bytes).

> 3.) Obviously __addvdi3 is not implemented as sibling-call even though
>     -O2 should enable that.

It calls via the PLT, do sibling calls via the PLT work in your ABI?

> Where should I start, if I wanted to teach GCC how to produce the same
> code for foo as for bar?  Would it be enough to add a pattern to
> i386.md?  There is already a pattern for "addv<mode>4", but apparently
> it's not used in this case.

As Marc says, -ftrapv is probably not the way to go.

Adding an addv<mode>3 to the i386 backend might help.  You do *not* want
exactly the same code, btw; addv3 calls abort on overflow, that's not the
same as executing an ud2 instruction.


Segher
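
To make the abort-versus-ud2 distinction concrete, here is a minimal sketch in
C, assuming a compiler that provides the GCC builtins (GCC or Clang).  The
names add_overflows, add_or_abort, add_or_trap and self_check are hypothetical
and exist only for this illustration; none of them is part of GCC or libgcc.
add_or_abort mirrors what the thread says the libgcc routine does on overflow
(call abort), add_or_trap mirrors bar from the quoted example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: returns true if x + y overflows long.
   The wrapped result is stored in *sum either way. */
bool add_overflows(long x, long y, long *sum)
{
    return __builtin_add_overflow(x, y, sum);
}

/* abort() semantics, as described for addv3 in the thread:
   overflow ends the process through abort(), which raises SIGABRT. */
long add_or_abort(long x, long y)
{
    long sum;
    if (add_overflows(x, y, &sum))
        abort();
    return sum;
}

/* Trap semantics, as in bar: overflow executes a trap instruction
   (ud2 on x86), which the OS typically reports differently from
   abort(), e.g. as SIGILL on Linux. */
long add_or_trap(long x, long y)
{
    long sum;
    if (add_overflows(x, y, &sum))
        __builtin_trap();
    return sum;
}

/* Small self-check covering only the non-terminating paths. */
void self_check(void)
{
    long sum;
    assert(!add_overflows(2, 3, &sum) && sum == 5);
    assert(add_overflows(LONG_MAX, 1, &sum));
    assert(add_overflows(LONG_MIN, -1, &sum));
    assert(add_or_abort(2, 3) == 5);
    assert(add_or_trap(-7, 7) == 0);
}
```

Both wrappers return the same value when there is no overflow; they differ
only in how the process dies when there is one, which is why a new
addv<mode>3 pattern could not simply reuse the code generated for bar.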