> On Aug 30, 2021, at 4:37 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>> On Aug 30, 2021, at 2:18 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>
>> On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote:
>>>
>>>> On Aug 30, 2021, at 1:34 PM, Trond Myklebust
>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
>>>>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
>>>>> <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi Olga-
>>>>>>
>>>>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
>>>>>>> <olga.kornievskaia@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> From: Olga Kornievskaia <kolga@xxxxxxxxxx>
>>>>>>>
>>>>>>> Given the patch "Always provide aligned buffers to the RPC read
>>>>>>> layers", RPC over RDMA doesn't need to look at the tail page and
>>>>>>> add that space to the write chunk.
>>>>>>>
>>>>>>> For the RFC 8166 compliant server, it must not write an XDR
>>>>>>> padding into the write chunk (even if space was provided).
>>>>>>> Historically (before RFC 8166) Solaris RDMA server has been
>>>>>>> requiring the client to provide space for the XDR padding and
>>>>>>> thus this client code has existed.
>>>>>>
>>>>>> I don't understand this change.
>>>>>>
>>>>>> So, the upper layer doesn't provide XDR padding any more. That
>>>>>> doesn't mean Solaris servers still aren't going to want to write
>>>>>> into it. The client still has to provide this padding from
>>>>>> somewhere.
>>>>>>
>>>>>> This suggests that "Always provide aligned buffers to the RPC
>>>>>> read layers" breaks our interop with Solaris servers. Does it?
>>>>>
>>>>> No, I don't believe "Always provide aligned buffers to the RPC
>>>>> read layers" breaks the interoperability. THIS patch would break
>>>>> the interop.
>>>>>
>>>>> If we are not willing to break the interoperability and support
>>>>> only servers that comply with RFC 8166, this patch is not needed.
>>>>
>>>> Why? The intention of the first patch is to ensure that we do not
>>>> have buffers that are not word aligned. If Solaris wants to write
>>>> padding after the end of the file, then there is space in the page
>>>> buffer for it to do so. There should be no need for an extra tail
>>>> in which to write the padding.
>>>
>>> The RPC/RDMA protocol is designed for hardware-offloaded direct data
>>> placement. That means the padding, which isn't data, must be directed
>>> to another buffer.
>>>
>>> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
>>> ambiguous, so there are implementations that write XDR padding into
>>> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
>>> NOT.
>>>
>>> I believe rpcrdma-version-two makes it a requirement not to use XDR
>>> padding in either Read or Write data payload chunks.
>>>
>>>
>> Correct, but in order to satisfy the needs of the Solaris server,
>> you've hijacked the tail for use as a data buffer. AFAICS it is not
>> being used as a SEND buffer target, but is instead being turned into a
>> write chunk target. That is not acceptable!
>
> The buffer is being used as both. Proper function depends on the
> order of RDMA Writes and Receives on the client.

I've refreshed my memory. The above statement oversimplifies a bit.

- The memory pointed to by rqst->rq_buffer is persistently registered
  to allow RDMA Send from it. It can also be registered on the fly for
  use in Read chunks.
- The memory pointed to by rqst->rq_rbuffer is NOT persistently
  registered and is never used for RDMA Send or Receive. It can be
  registered on the fly for use in a Write or Reply chunk. This is
  when the tail portion of rq_rcv_buf might be used to receive a pad.

- A third set of persistently registered buffers is used ONLY to catch
  RDMA Receive completions. When a Receive completes, the reply handler
  decides how to construct the received RPC message. If there was a
  Reply chunk, it leaves rq_rcv_buf pointing to rq_rbuffer, as that's
  where the server wrote the Reply chunk. If there wasn't a Reply
  chunk, the handler points rq_rcv_buf to the RDMA Receive buffer.

The only case where there might be overlap is when the client has
provisioned both a Write chunk and a Reply chunk. In that case, yes, I
can see that there might be an opportunity for the server to RDMA Write
the Write chunk's pad and the Reply chunk into the same client memory
buffer. However:

- Most servers don't write a pad. Modern Solaris servers behave
  "correctly" in this regard, I'm now told. Linux never writes the pad,
  and I'm told OnTAP also does not.

- Server implementations I'm aware of Write the Write chunk first and
  then the Reply chunk. The Reply chunk would overwrite the pad.

- The client hardly ever uses Write + Reply. It's most often one or
  the other.

So again, there is little to zero chance there is an existing
operational issue here. But see below.

> rpcrdma_encode_write_list() registers up to an extra 3 bytes in
> rq_rcv_buf.tail as part of the Write chunk. The R_keys for the
> segments in the Write chunk are then sent to the server as part
> of the RPC Call.
>
> As part of Replying, the server RDMA Writes data into the chunk,
> and possibly also RDMA Writes padding. It then does an RDMA Send
> containing the RPC Reply.
>
> The Send data always lands in the Receive buffer _after_ the Write
> data. The Receive completion guarantees that previous RDMA Writes
> are complete. Receive handling invalidates and unmaps the memory,
> and then it is made available to the RPC client.
>
>
>> It means that we now are limited to creating COMPOUNDs where there are
>> no more operations following the READ op because if we do so, we end up
>> with a situation where the RDMA behaviour breaks.
>
> I haven't heard reports of a problem like this.
>
> However, if there is a problem, it would be simple to create a
> persistently-registered memory region that is not part of any RPC
> buffer that can be used to catch unused Write chunk XDR padding.
>
>
>>>> This means that the RDMA and TCP cases should end up doing the same
>>>> thing for the case of the Solaris server: the padding is written
>>>> into the page buffer. There is nothing written to the tail in
>>>> either case.
>>>
>>> "Always provide" can guarantee that the NFS client makes aligned
>>> requests for buffered I/O, but what about NFS direct I/O from user
>>> space? The NIC will place the data payload in the application
>>> buffer, but there's no guarantee that the NFS READ request will be
>>> aligned or that the buffer will be able to sink the extra padding
>>> bytes.
>>>
>>> We would definitely consider it an error if an unaligned RDMA Read
>>> leaked the link-layer's 4-byte padding into a sink buffer.
>>>
>>> So, "Always provide" is nice for the in-kernel NFS client, but I
>>> don't believe it allows the way xprtrdma behaves to be changed.
>>>
>>
>> If you're doing an unaligned READ from user space then you are already
>> in a situation where you're doing something that is incompatible with
>> block device requirements.
>> If there really are any applications that contain O_DIRECT code
>> specifically for use with NFS, then we can artificially force the
>> buffers to be aligned by reducing the size of the buffer to align to a
>> 4 byte boundary. NFS supports returning short reads.
>
> Or xprtrdma can use the scheme I describe above. I think that
> would be simpler and would avoid layering violations.
>
> That would also possibly address the Nvidia problem, since a
> pre-registered MR that handles Write padding would always be a
> separate RDMA segment.
>
> Again, I doubt this is necessary to fix any operational problem
> with _supported_ use cases, but let me know if you'd like me to
> make this change.

Since RFC 8166 says "SHOULD NOT", the client has to expect that some
server might still RDMA Write the XDR pad, so the Linux client really
should always provide one. Having a persistently registered spot to
use for that gives us a potential no-cost mechanism for providing that
extra segment. The whole "pad optimization" thing exists because the
pad segment is currently registered on the fly, so it has a per-I/O
cost.

So I'm leaning toward implementing this simple solution, at least as
proof-of-concept.

--
Chuck Lever
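
For concreteness, the pre-registered pad sink might look roughly like
the sketch below. It is written against the userspace verbs API purely
for illustration; it is not xprtrdma code, and the rpcrdma_pad_* names
and the segment struct are hypothetical.

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define XDR_UNIT 4

struct rpcrdma_pad_sink {
	struct ibv_mr	*mr;			/* registered once, at setup */
	uint8_t		 bytes[XDR_UNIT];	/* server may RDMA Write a pad here */
};

/* One-time setup: persistently register the pad-catcher region. */
static struct rpcrdma_pad_sink *rpcrdma_pad_sink_create(struct ibv_pd *pd)
{
	struct rpcrdma_pad_sink *pad;

	pad = calloc(1, sizeof(*pad));
	if (!pad)
		return NULL;
	pad->mr = ibv_reg_mr(pd, pad->bytes, sizeof(pad->bytes),
			     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
	if (!pad->mr) {
		free(pad);
		return NULL;
	}
	return pad;
}

/* A Write chunk segment as advertised to the server in the RPC Call. */
struct rpcrdma_segment {
	uint32_t	handle;		/* R_key */
	uint32_t	length;
	uint64_t	offset;
};

/*
 * Per-I/O: if the data payload is not XDR-aligned, append the shared
 * pad region to the Write chunk as its own segment.  No per-I/O
 * registration, and the pad never lands in an RPC buffer.
 */
static unsigned int rpcrdma_encode_pad_segment(struct rpcrdma_segment *seg,
					       const struct rpcrdma_pad_sink *pad,
					       size_t payload_len)
{
	size_t remainder = payload_len % XDR_UNIT;

	if (!remainder)
		return 0;
	seg->handle = pad->mr->rkey;
	seg->length = XDR_UNIT - remainder;
	seg->offset = (uint64_t)(uintptr_t)pad->bytes;
	return 1;
}

Because the sink is shared and its contents are never read back, it
does not matter which request's pad a server chooses to write there.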