Re: [PATCH v5 00/11] iov_iter: Convert the iterator macros into inline funcs

Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> · Sat, 23 Sep 2023 08:59:25 +0200

On Fri, Sep 22, 2023 at 2:01 PM David Howells <dhowells@xxxxxxxxxx> wrote:
>
> David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> > >  (8) Move the copy-and-csum code to net/ where it can be in proximity with
> > >      the code that uses it.  This eliminates the code if CONFIG_NET=n and
> > >      allows for the slim possibility of it being inlined.
> > >
> > >  (9) Fold memcpy_and_csum() in to its two users.
> > >
> > > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> > >      csum_and_copy_from_iter() since the former is the only caller of the
> > >      latter.
> >
> > I thought that the real idea behind these was to do the checksum
> > at the same time as the copy to avoid loading the data into the L1
> > data-cache twice - especially for long buffers.
> > I wonder how often there are multiple iov[] that actually make
> > it better than just check summing the linear buffer?
>
> It also reduces the overhead for finding the data to checksum in the case the
> packet gets split since we're doing the checksumming as we copy - but with a
> linear buffer, that's negligible.
>
> > I had a feeling that check summing of udp data was done during
> > copy_to/from_user, but the code can't be the copy-and-csum here
> > for that because it is missing support form odd-length buffers.
>
> Is there a bug there?
>
> > Intel x86 desktop chips can easily checksum at 8 bytes/clock
> > (But probably not with the current code!).
> > (I've got ~12 bytes/clock using adox and adcx but that loop
> > is entirely horrid and it would need run-time patching.
> > Especially since I think some AMD cpu execute them very slowly.)
> >
> > OTOH 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
> > destination is 32 byte aligned - it pretty much won't be).
> >
> > So you'd need a csum-and-copy loop that did 16 bytes every
> > three clocks to get the same throughput for long buffers.
> > In principle splitting the 'adc memory' into two instructions
> > is the same number of u-ops - but I'm sure I've tried to do
> > that and failed and the extra memory write can happen in
> > parallel with everything else.
> > So I don't think you'll get 16 bytes in two clocks - but you
> > might get it is three.
> >
> > OTOH for a cpu where memcpy is code loop summing the data in
> > the copy loop is likely to be a gain.
> >
> > But I suspect doing the checksum and copy at the same time
> > got 'all to complicated' to actually implement fully.
> > With most modern ethernet chips checksumming receive pacakets
> > does it really get used enough for the additional complexity?
>
> You may be right.  That's more a question for the networking folks than for
> me.  It's entirely possible that the checksumming code is just not used on
> modern systems these days.
>
> Maybe Willem can comment since he's the UDP maintainer?

Perhaps these days it is more relevant to embedded systems than high
end servers.