On Sat, Sep 16, 2023 at 09:32:40AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 15 September 2023 18:01
> >
> > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > it will load from the buffer in groups of 32 bits, and when compiled
> > for 64-bit it will load in groups of 64 bits.
> >
> ...
> > +	/*
> > +	 * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> > +	 * faster than doing 32-bit reads on architectures that support larger
> > +	 * reads.
> > +	 */
> > +	while (len > 0) {
> > +		csum += data;
> > +		csum += csum < data;
> > +		len -= sizeof(unsigned long);
> > +		ptr += 1;
> > +		data = *ptr;
> > +	}
>
> I think you'd be better adding the 'carry' bits in a separate
> variable.
> It reduces the register dependency chain length in the loop.
> (Helps if the cpu can execute two instructions in one clock.)
>
> The masked misaligned data values are max 24 bits
> (if
>
> You'll also almost certainly remove at least one instruction
> from the loop by comparing against the end address rather than
> changing 'len'.
>
> So ending up with (something like):
> 	end = buff + length;
> 	...
> 	while (++ptr < end) {
> 		csum += data;
> 		carry += csum < data;
> 		data = ptr[-1];
> 	}
> (Although a do-while loop tends to generate better code
> and gcc will pretty much always make that transformation.)
>
> I think that is 4 instructions per word (load, add, cmp+set, add).
> In principle they could be completely pipelined and all
> execute (for different loop iterations) in the same clock.
> (But that is pretty unlikely to happen - even x86 isn't that good.)
> But taking two clocks is quite plausible.
> Plus 2 instructions per loop (inc, cmp+jmp).
> They might execute in parallel, but unrolling once
> may be required.
>

It looks like GCC actually ends up generating 7 instructions in total:

ffffffff808d2acc:	97b6		add	a5,a5,a3
ffffffff808d2ace:	00d7b533	sltu	a0,a5,a3
ffffffff808d2ad2:	0721		add	a4,a4,8
ffffffff808d2ad4:	86be		mv	a3,a5
ffffffff808d2ad6:	962a		add	a2,a2,a0
ffffffff808d2ad8:	ff873783	ld	a5,-8(a4)
ffffffff808d2adc:	feb768e3	bltu	a4,a1,ffffffff808d2acc <do_csum+0x34>

The mv instruction could be avoided if the registers were shuffled
around, but perhaps this arrangement shortens some dependency chains.

> ...
> > +	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> > +	    riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
> ...
> > +	}
> > +end:
> > +	return csum >> 16;
> > +	}
>
> Is it really worth doing all that to save (I think) 4 instructions?
> (shift, shift, or with rotate twice).
> There is much more to be gained by careful inspection
> of the loop (even leaving it in C).
>

The main benefit was from using rev8 to replace swab32. However, now
that I am looking at the assembly generated in the kernel, it does not
match the asm I get from an out-of-kernel test case, so rev8 might not
be beneficial. I am going to have to look at this more to figure out
what is happening.

> > +
> > +#ifndef CONFIG_32BIT
> > +	csum += (csum >> 32) | (csum << 32);
> > +	csum >>= 32;
> > +#endif
> > +	csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
>
> Use ror64() and ror32().
>
> 	David
>

Good idea.

- Charlie

> > +	if (offset & 1)
> > +		return (unsigned short)swab32(csum);
> > +	return csum >> 16;
> > +}
> >
> > --
> > 2.42.0
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
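
P.S. For anyone following along, here is a rough, self-contained C sketch
of what the combination of the suggestions above (a separate carry
accumulator, looping against an end pointer, and folding with rotates)
could look like. It is illustrative only, not the actual patch: the names
csum_fold_sketch/buff/len are made up, it assumes a 64-bit unsigned long,
and it assumes len is a whole number of words. In the kernel the two folds
would use ror64()/ror32() as suggested instead of the open-coded rotates.

	/*
	 * Illustrative sketch only (not the real do_csum): 64-bit
	 * unsigned long assumed, len a multiple of sizeof(unsigned long).
	 */
	static unsigned int csum_fold_sketch(const unsigned long *buff,
					     unsigned int len)
	{
		const unsigned long *end = buff + len / sizeof(unsigned long);
		unsigned long csum = 0, carry = 0, data;

		/*
		 * Keep the carries in a separate accumulator so the sltu of
		 * one iteration does not feed the add of the next.
		 */
		while (buff < end) {
			data = *buff++;
			csum += data;
			carry += csum < data;
		}
		csum += carry;
		csum += csum < carry;	/* end-around carry */

		/* Fold 64 -> 32 bits (csum += ror64(csum, 32) in-kernel). */
		csum += (csum >> 32) | (csum << 32);
		csum >>= 32;

		/* Fold 32 -> 16 bits (ror32((u32)csum, 16) in-kernel). */
		csum = (unsigned int)csum +
		       (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
		return (unsigned int)csum >> 16;
	}

Splitting the carries out is exactly the dependency-chain point made
above: the add and the carry test of one iteration can then issue
independently of the next iteration's add.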