On Fri, Sep 22, 2023 at 2:01 PM David Howells <dhowells@xxxxxxxxxx> wrote: > > David Laight <David.Laight@xxxxxxxxxx> wrote: > > > > (8) Move the copy-and-csum code to net/ where it can be in proximity with > > > the code that uses it. This eliminates the code if CONFIG_NET=n and > > > allows for the slim possibility of it being inlined. > > > > > > (9) Fold memcpy_and_csum() in to its two users. > > > > > > (10) Move csum_and_copy_from_iter_full() out of line and merge in > > > csum_and_copy_from_iter() since the former is the only caller of the > > > latter. > > > > I thought that the real idea behind these was to do the checksum > > at the same time as the copy to avoid loading the data into the L1 > > data-cache twice - especially for long buffers. > > I wonder how often there are multiple iov[] that actually make > > it better than just check summing the linear buffer? > > It also reduces the overhead for finding the data to checksum in the case the > packet gets split since we're doing the checksumming as we copy - but with a > linear buffer, that's negligible. > > > I had a feeling that check summing of udp data was done during > > copy_to/from_user, but the code can't be the copy-and-csum here > > for that because it is missing support form odd-length buffers. > > Is there a bug there? > > > Intel x86 desktop chips can easily checksum at 8 bytes/clock > > (But probably not with the current code!). > > (I've got ~12 bytes/clock using adox and adcx but that loop > > is entirely horrid and it would need run-time patching. > > Especially since I think some AMD cpu execute them very slowly.) > > > > OTOH 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the > > destination is 32 byte aligned - it pretty much won't be). > > > > So you'd need a csum-and-copy loop that did 16 bytes every > > three clocks to get the same throughput for long buffers. > > In principle splitting the 'adc memory' into two instructions > > is the same number of u-ops - but I'm sure I've tried to do > > that and failed and the extra memory write can happen in > > parallel with everything else. > > So I don't think you'll get 16 bytes in two clocks - but you > > might get it is three. > > > > OTOH for a cpu where memcpy is code loop summing the data in > > the copy loop is likely to be a gain. > > > > But I suspect doing the checksum and copy at the same time > > got 'all to complicated' to actually implement fully. > > With most modern ethernet chips checksumming receive pacakets > > does it really get used enough for the additional complexity? > > You may be right. That's more a question for the networking folks than for > me. It's entirely possible that the checksumming code is just not used on > modern systems these days. > > Maybe Willem can comment since he's the UDP maintainer? Perhaps these days it is more relevant to embedded systems than high end servers.