Re: receive side CRC computation in siw.

On 6/13/2019 7:36 AM, Bernard Metzler wrote:
-----"Tom Talpey" <tom@xxxxxxxxxx> wrote: -----

To: "Jason Gunthorpe" <jgg@xxxxxxxx>
From: "Tom Talpey" <tom@xxxxxxxxxx>
Date: 06/12/2019 10:34PM
Cc: "Bernard Metzler" <BMT@xxxxxxxxxxxxxx>,
linux-rdma@xxxxxxxxxxxxxxx
Subject: [EXTERNAL] Re: receive side CRC computation in siw.

On 6/12/2019 4:13 PM, Jason Gunthorpe wrote:
On Wed, Jun 12, 2019 at 04:07:53PM -0400, Tom Talpey wrote:
On 6/12/2019 11:21 AM, Jason Gunthorpe wrote:
On Tue, Jun 11, 2019 at 11:11:08AM -0400, Tom Talpey wrote:
On 6/11/2019 9:21 AM, Bernard Metzler wrote:
Hi all,

If enabled for siw, during receive operation a crc32c over header and
data is generated and checked. So far, siw generated that CRC from the
content of the just-written target buffer. What kept me busy last
weekend were spurious CRC errors when running qperf. I finally found
that the application constantly writes the target buffer while data is
placed concurrently, which sometimes races with the CRC computation for
that buffer and yields a broken CRC.
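
A minimal sketch of the read-back pattern described above (hypothetical
names; assuming the crc32c is driven through the kernel crypto shash
API): the payload is first placed into the registered target buffer,
then that same buffer is read back to feed the digest, which is where a
concurrently writing application can race.

#include <crypto/hash.h>
#include <linux/skbuff.h>

static int rx_place_then_crc(struct shash_desc *crc_desc,
			     struct sk_buff *skb, int offset,
			     u8 *target, unsigned int len)
{
	int rv;

	/* 1. Place the payload into the registered target buffer. */
	rv = skb_copy_bits(skb, offset, target, len);
	if (rv)
		return rv;
	/*
	 * 2. Read the target buffer back to update the crc32c digest.
	 *    A user owning that buffer may modify it right here,
	 *    breaking the CRC.
	 */
	return crypto_shash_update(crc_desc, target, len);
}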

Well, that's a clear bug in the application, assuming siw has
not yet delivered a send completion for the operation using
the buffer. This is a basic Verbs API contract.

Maybe so, but a kernel driver must not make any assumptions about the
content of memory controlled by the user. So it is clearly wrong to
write data to a user buffer and then read it again to compute a CRC.

But it's not a user buffer. It's been mapped into the kernel for the
purpose of registering and performing data transfer. This is standard
I/O processing. Both kernel and user have access.

It is a user buffer because the user has access. In fact it may not
even be mapped into the kernel address space.

Belaboring this point a bit, but SIW certainly maps it, in order to
copy. An adapter maps it, via dma_map, in order to do the same. My
point is simply that if the kernel tried to prevent that, the whole
i/o model would break down.

In other words, if a hardware adapter were doing this same thing,
would you consider it out of spec? If so, why?

Tom.

Furthermore, an RDMA hardware adapter has zero notion of user buffers.
All it gets is a registration, with memory described by DMA addresses.
It can perform whatever memory operations are required on them, and the
kernel isn't even in the loop.

Adapters cannot make assumptions about data they place in memory
buffers - i.e. they cannot write something and then read it back on the
assumption it has not changed. They cannot read something twice on the
assumption it has not changed, etc. It is a security requirement.

All the applications touching buffers without waiting for a completion
are relying on some extended behavior outside the specification, but
they cannot cause the kernel to malfunction and report bogus data
integrity errors.

Ok, this I agree with, but the RDMA specifications were quite careful
about it. And we *definitely* don't want to require that the providers
all start double-buffering incoming data, in order to shield a
non-compliant application from itself. To double-buffer RDMA Writes
(and Sends) would undo the entire direct data placement design!

Bernard, I'd still welcome your thoughts on whether you can compute the
MPA CRC inline in SIW during the copy_to_user. Avoiding the overhead of
reading back the data after copying could be a speedup for you?

It would have to be integrated into skb_copy_bits(), best by allowing
the caller to provide a digest variable which gets updated on the go.
So we might propose that to the right people?
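
A minimal sketch of what such a combined copy-and-CRC skb walk could
look like (this is not an existing kernel helper; the function name,
the use of the skb sequential reader, and the crc32c shash digest are
assumptions): the skb is traversed only once, and for every fragment
the digest is updated from the kernel-side source data before the same
bytes are copied to the target, so the target buffer is never read back.

#include <crypto/hash.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <linux/string.h>

static int skb_copy_bits_crc32c(struct sk_buff *skb, unsigned int offset,
				u8 *to, unsigned int len,
				struct shash_desc *crc_desc)
{
	struct skb_seq_state st;
	unsigned int consumed = 0, frag_len;
	const u8 *frag;
	int rv;

	skb_prepare_seq_read(skb, offset, offset + len, &st);

	while (consumed < len &&
	       (frag_len = skb_seq_read(consumed, &frag, &st)) != 0) {
		if (frag_len > len - consumed)
			frag_len = len - consumed;

		/* Update the digest from kernel-side skb data first ... */
		rv = crypto_shash_update(crc_desc, frag, frag_len);
		if (rv) {
			skb_abort_seq_read(&st);
			return rv;
		}
		/* ... then place the very same bytes into the target. */
		memcpy(to + consumed, frag, frag_len);
		consumed += frag_len;
	}
	skb_abort_seq_read(&st);

	return consumed == len ? 0 : -EINVAL;
}

Updating the digest from the source fragments keeps the checksum
independent of anything user space does to the destination, and avoids
the second skb walk that causes the slowdown described next.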

In the meantime, I may have to accept a substantial throughput
degradation if the CRC is switched on. For large data chunks,
performance is cut almost in half, compared to what I had before with
CRC.

Interestingly, the penalty mainly comes from walking the skb again, and
not from touching the data twice: if I provided an intermediate scratch
buffer into which I copy the skb content first, checksum it, and then
memcpy it to the target, I would almost maintain current performance.
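
A rough sketch of that intermediate scratch buffer idea (hypothetical
names and chunking; the scratch page is assumed to be provided by the
caller, one per core as discussed below): copy the skb payload into a
private scratch page, checksum it there, and only then memcpy it to the
target buffer.

#include <crypto/hash.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/skbuff.h>
#include <linux/string.h>

static int rx_copy_via_scratch(struct shash_desc *crc_desc,
			       struct sk_buff *skb, int offset,
			       u8 *target, unsigned int len,
			       u8 *scratch) /* one PAGE_SIZE buffer per core */
{
	unsigned int chunk;
	int rv;

	while (len) {
		chunk = min_t(unsigned int, len, PAGE_SIZE);

		/* 1. Copy the skb payload into the private scratch page. */
		rv = skb_copy_bits(skb, offset, scratch, chunk);
		if (rv)
			return rv;
		/* 2. CRC over kernel-private data: no race with user space. */
		rv = crypto_shash_update(crc_desc, scratch, chunk);
		if (rv)
			return rv;
		/* 3. Finally place the data into the target buffer. */
		memcpy(target, scratch, chunk);

		target += chunk;
		offset += chunk;
		len -= chunk;
	}
	return 0;
}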

That obviously contradicts the RDMA concept ;) And it is not really
nice. On the other hand, since we would need only one page-sized buffer
per core (since siw RX runs in softirq), it might not be a big waste of
memory though... What do others think?

I hate the idea of forcing data through a bounce buffer. And I think
that the performance hit will increase in a real-world workload, since
in single-flow tests the CPU cache basically stays warm after your
skb_copy_bits(). Once the processor gets busier, cache eviction will
change things for the worse - just when you least want it to.

Tom.




Copy and CRC is obviously the right thing to do.

Jason

I think you mean CRC first, and then copy.

Many thanks for that fruitful discussion.
Bernard.





