From: Peter Zijlstra
> Sent: 07 September 2022 10:40
>
> On Wed, Sep 07, 2022 at 09:06:12AM +0000, David Laight wrote:
> > From: Peter Zijlstra
> > > Sent: 07 September 2022 10:01
> > >
> > > On Wed, Sep 07, 2022 at 09:06:45AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 07, 2022 at 09:55:21AM +0900, Masami Hiramatsu (Google) wrote:
> > > >
> > > > > +/* Return the jump target address or 0 */
> > > > > +static inline unsigned long insn_get_branch_addr(struct insn *insn)
> > > > > +{
> > > > > +	switch (insn->opcode.bytes[0]) {
> > > > > +	case 0xe0:	/* loopne */
> > > > > +	case 0xe1:	/* loope */
> > > > > +	case 0xe2:	/* loop */
> > > >
> > > > Oh cute, objtool doesn't know about those, let me go add them.
> >
> > Do they ever appear in the kernel?
>
> No; that is, not on any of the random vmlinux.o images I checked this
> morning.
>
> Still, best to properly decode them anyway.

It is annoying that CPUs with adox/adcx have a slow 'loop' instruction.
You really want to be able to do:
1:	adox ...
	adcx ...
	loop 1b
That would never run at one iteration/clock.
But unrolling once would probably be enough.

What you can do (and it gives the fastest IP checksum loop) is:
1:	jcxz 2f
	....
	lea %rcx,...
	jmp 1b
2:
The extra instructions mean that it needs unrolling 4 times.
I've got over 12 bytes/clock that way.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
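
For reference, a minimal C sketch of the arithmetic an IP checksum loop like the one described above has to perform: ones' complement addition with end-around carry, kept in two independent accumulators, loosely the way adcx and adox each drive their own carry chain. This is an illustration only, not the kernel's csum_partial() and not the assembly loop from the mail; the names add_eac, fold64, csum_words and csum_ref are hypothetical, and nothing here says anything about throughput.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement add: the carry out of bit 63 wraps round to bit 0. */
static uint64_t add_eac(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);	/* s < b only when the add carried */
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t s)
{
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/*
 * Hypothetical helper: sum nwords 64-bit words, unrolled once, with two
 * accumulators that do not depend on each other (assumption: this stands
 * in for the separate CF and OF chains of adcx/adox).
 */
static uint16_t csum_words(const uint64_t *p, size_t nwords)
{
	uint64_t s0 = 0, s1 = 0;
	size_t i = 0;

	for (; i + 2 <= nwords; i += 2) {
		s0 = add_eac(s0, p[i]);
		s1 = add_eac(s1, p[i + 1]);
	}
	if (i < nwords)		/* odd word left over */
		s0 = add_eac(s0, p[i]);

	return fold64(add_eac(s0, s1));
}

/* Slow cross-check: add the same data 16 bits at a time. */
static uint16_t csum_ref(const uint64_t *p, size_t nwords)
{
	uint32_t s = 0;
	size_t i;
	int k;

	for (i = 0; i < nwords; i++) {
		for (k = 0; k < 4; k++) {
			s += (uint32_t)((p[i] >> (16 * k)) & 0xffff);
			s = (s & 0xffff) + (s >> 16);
		}
	}
	return (uint16_t)s;
}
```

Because 2^16 and 2^64 are both congruent to 1 modulo 65535, summing 64-bit words and folding at the end gives the same 16-bit result as csum_ref(), which is why the wide accumulators are safe to use.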