On Mon, Aug 30, 2021 at 4:38 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote: > > > > > On Aug 30, 2021, at 2:18 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > > > On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote: > >> > >> > >>> On Aug 30, 2021, at 1:34 PM, Trond Myklebust > >>> <trondmy@xxxxxxxxxxxxxxx> wrote: > >>> > >>> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote: > >>>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III > >>>> <chuck.lever@xxxxxxxxxx> wrote: > >>>>> > >>>>> Hi Olga- > >>>>> > >>>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia > >>>>>> <olga.kornievskaia@xxxxxxxxx> wrote: > >>>>>> > >>>>>> From: Olga Kornievskaia <kolga@xxxxxxxxxx> > >>>>>> > >>>>>> Given the patch "Always provide aligned buffers to the RPC > >>>>>> read > >>>>>> layers", > >>>>>> RPC over RDMA doesn't need to look at the tail page and add > >>>>>> that > >>>>>> space > >>>>>> to the write chunk. > >>>>>> > >>>>>> For the RFC 8166 compliant server, it must not write an XDR > >>>>>> padding > >>>>>> into the write chunk (even if space was provided). > >>>>>> Historically > >>>>>> (before RFC 8166) Solaris RDMA server has been requiring the > >>>>>> client > >>>>>> to provide space for the XDR padding and thus this client > >>>>>> code > >>>>>> has > >>>>>> existed. > >>>>> > >>>>> I don't understand this change. > >>>>> > >>>>> So, the upper layer doesn't provide XDR padding any more. That > >>>>> doesn't > >>>>> mean Solaris servers still aren't going to want to write into > >>>>> it. > >>>>> The > >>>>> client still has to provide this padding from somewhere. > >>>>> > >>>>> This suggests that "Always provide aligned buffers to the RPC > >>>>> read > >>>>> layers" breaks our interop with Solaris servers. Does it? > >>>> > >>>> No, I don't believe "Always provide aligned buffers to the RPC > >>>> read > >>>> layers" breaks the interoperability. THIS patch would break the > >>>> interop. > >>>> > >>>> If we are not willing to break the interoperability and support > >>>> only > >>>> servers that comply with RFC 8166, this patch is not needed. > >>> > >>> Why? The intention of the first patch is to ensure that we do not > >>> have > >>> buffers that are not word aligned. If Solaris wants to write > >>> padding > >>> after the end of the file, then there is space in the page buffer > >>> for > >>> it to do so. There should be no need for an extra tail in which to > >>> write the padding. > >> > >> The RPC/RDMA protocol is designed for hardware-offloaded direct data > >> placement. That means the padding, which isn't data, must be directed > >> to another buffer. > >> > >> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was > >> ambiguous, so there are implementations that write XDR padding into > >> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST > >> NOT. > >> > >> I believe rpcrdma-version-two makes it a requirement not to use XDR > >> padding in either Read or Write data payload chunks. > >> > >> > > Correct, but in order to satisfy the needs of the Solaris server, > > you've hijacked the tail for use as a data buffer. AFAICS it is not > > being used as a SEND buffer target, but is instead being turned into a > > write chunk target. That is not acceptable! > > The buffer is being used as both. Proper function depends on the > order of RDMA Writes and Receives on the client. > > rpcrdma_encode_write_list() registers up to an extra 3 bytes in > rq_rcv_buf.tail as part of the Write chunk. 
The R_keys for the > segments in the Write chunk are then sent to the server as part > of the RPC Call.

Just clarifying: nothing in the code limits the registration to at most 3 bytes. It allocates/registers the full tail.iov_len, which is typically much larger than 4 bytes (40+ bytes).

> As part of Replying, the server RDMA Writes data into the chunk, > and possibly also RDMA Writes padding. It then does an RDMA Send > containing the RPC Reply. > > The Send data always lands in the Receive buffer _after_ the Write > data. The Receive completion guarantees that previous RDMA Writes > are complete. Receive handling invalidates and unmaps the memory, > and then it is made available to the RPC client. > > > > It means that we now are limited to creating COMPOUNDs where there are > > no more operations following the READ op because if we do so, we end up > > with a situation where the RDMA behaviour breaks. > > I haven't heard reports of a problem like this.

I think what might be referred to here is this: *if* the NFS READ compound also had a VERIFY added after it (which might be an interesting way to check validity), XDR encoding would leave non-empty bytes in the tail for the VERIFY (not just the padding). In RDMA, the code would register a write chunk, then look at the non-empty tail and allocate/register all of those bytes (which are the VERIFY) into the write chunk (basically unused bytes hanging off the write chunk). I think the actual VERIFY call bytes would go into the RDMA Send call buffer together with the READ header. The code might also then create a reply chunk for the VERIFY reply (if it estimates the reply is larger than the inline threshold).

> However, if there is a problem, it would be simple to create a > persistently-registered memory region that is not part of any RPC > buffer that can be used to catch unused Write chunk XDR padding.

When I think of XDR padding, I'm thinking of a number of bytes less than 4. Perhaps I'm confused in my understanding, but tail.iov_len is never a value less than 4. tail.iov_len can contain bytes that follow an opaque element for which a read or write chunk was used (e.g. READ+VERIFY or WRITE+GETATTR). Thus, when RDMA looks at tail.iov_len and allocates that much memory, the allocation does not represent just XDR padding. Since RDMA can't distinguish padding plus a trailing operation from padding alone, can this even be solved?

> >>> This means that the RDMA and TCP cases should end up doing the same > >>> thing for the case of the Solaris server: the padding is written > >>> into > >>> the page buffer. There is nothing written to the tail in either > >>> case. > >> > >> "Always provide" can guarantee that the NFS client makes aligned > >> requests for buffered I/O, but what about NFS direct I/O from user > >> space? The NIC will place the data payload in the application

To clarify: is direct I/O (with an application that uses unaligned buffers) only a problem for a (non-modern) Solaris server that insists on writing XDR padding into the write chunk?

> >> buffer, but there's no guarantee that the NFS READ request will be > >> aligned or that the buffer will be able to sink the extra padding > >> bytes. > >> > >> We would definitely consider it an error if an unaligned RDMA Read > >> leaked the link-layer's 4-byte padding into a sink buffer. > >> > >> So, "Always provide" is nice for the in-kernel NFS client, but I > >> don't believe it allows the way xprtrdma behaves to be changed.
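To put concrete numbers on the distinction I'm drawing above between the pad and the tail: the true XDR pad for the page data is derived from page_len and is never more than 3 bytes, while tail.iov_len also covers whatever trailing XDR follows the opaque data. A rough sketch, for illustration only and not part of the patch (xdr_pad_size() is the existing helper in include/linux/sunrpc/xdr.h that is suggested further down; the function itself is made up):

#include <linux/printk.h>
#include <linux/sunrpc/xdr.h>

/* Illustration only: compare the 0-3 byte XDR pad for the page data
 * with the full tail, which can also carry trailing operations such
 * as a VERIFY after READ or a GETATTR after WRITE and is therefore
 * routinely 40+ bytes.
 */
static void show_pad_vs_tail(const struct xdr_buf *xdrbuf)
{
	size_t pad = xdr_pad_size(xdrbuf->page_len);	/* 0..3 bytes */
	size_t tail_len = xdrbuf->tail[0].iov_len;

	pr_info("xdr pad = %zu bytes, tail = %zu bytes\n", pad, tail_len);
}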
> >> > > > > If you're doing an unaligned READ from user space then you are already > > in a situation where you're doing something that is incompatible with > > block device requirements. > > If there really are any applications that contain O_DIRECT code > > specifically for use with NFS, then we can artificially force the > > buffers to be aligned by reducing the size of the buffer to align to a > > 4 byte boundary. NFS supports returning short reads. > > Or xprtrdma can use the scheme I describe above. I think that > would be simpler and would avoid layering violations. > > That would also possibly address the Nvidia problem, since a > pre-registered MR that handles Write padding would always be a > separate RDMA segment.

If we have separate segment(s) that don't mix user data pages with NFS-allocated pages (for whatever purpose), that would indeed address the Nvidia problem.

> Again, I doubt this is necessary to fix any operational problem > with _supported_ use cases, but let me know if you'd like me to > make this change. > > > >>>>>> Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx> > >>>>>> --- > >>>>>> net/sunrpc/xprtrdma/rpc_rdma.c | 15 --------------- > >>>>>> 1 file changed, 15 deletions(-) > >>>>>> > >>>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c > >>>>>> b/net/sunrpc/xprtrdma/rpc_rdma.c > >>>>>> index c335c1361564..2c4146bcf2a8 100644 > >>>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c > >>>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c > >>>>>> @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt > >>>>>> *r_xprt, struct xdr_buf *xdrbuf, > >>>>>> page_base = 0; > >>>>>> } > >>>>>> > >>>>>> - if (type == rpcrdma_readch) > >>>>>> - goto out; > >>>>>> - > >>>>>> - /* When encoding a Write chunk, some servers need to > >>>>>> see an > >>>>>> - * extra segment for non-XDR-aligned Write chunks. The > >>>>>> upper > >>>>>> - * layer provides space in the tail iovec that may be > >>>>>> used > >>>>>> - * for this purpose. > >>>>>> - */ > >>>>>> - if (type == rpcrdma_writech && r_xprt->rx_ep- > >>>>>>> re_implicit_roundup) > >>>>>> - goto out; > >>>>>> - > >>>>>> - if (xdrbuf->tail[0].iov_len) > >>>>> > >>>>> Instead of checking for a tail, we could check > >>>>> > >>>>> if (xdr_pad_size(xdrbuf->page_len)) > >>>>> > >>>>> and provide some tail space in that case. > >>>> > >>>> I don't believe this is any different than what we have now. If > >>>> the > >>>> page size is non-4byte aligned then, we would still allocate size > >>>> for > >>>> the padding which "SHOULD NOT" be there. But yes it is allowed to > >>>> be > >>>> there. > >>>> > >>>> The problem, as you know from our offline discussion, is > >>>> allocating > >>>> the tail page and including it in the write chunk for the Nvidia > >>>> environment where Nvidia doesn't support use of data (user) pages > >>>> and > >>>> nfs kernel allocated pages in the same segment. > >>>> > >>>> Alternatively, my ask is then to change rpcrdma_convert_iovs() to > >>>> return 2 segs instead of one: one for the pages and another for > >>>> the > >>>> tail.
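To restate the ask I quote just above (have rpcrdma_convert_iovs() describe two segments: one for the pages, one for the tail) in code form: everything below is made up for illustration, none of it exists in xprtrdma. The point is only the shape, i.e. that the page data and the NFS-allocated tail are never described by the same segment, so a provider that cannot mix user pages and kernel pages in one registration (the Nvidia case) never sees them together:

#include <linux/sunrpc/xdr.h>

/* Toy types for illustration only; the real code fills a single
 * rpcrdma_mr_seg array and appends the tail kvec to it with
 * rpcrdma_convert_kvec().
 */
struct toy_write_chunk {
	struct page	**data_pages;	/* the (possibly user) data pages */
	unsigned int	data_len;
	void		*tail_base;	/* the NFS-allocated tail kvec */
	size_t		tail_len;
};

static void toy_split_write_chunk(const struct xdr_buf *xdrbuf,
				  struct toy_write_chunk *out)
{
	/* Describe the page data on its own... */
	out->data_pages = xdrbuf->pages;
	out->data_len   = xdrbuf->page_len;

	/* ...and keep the tail as a separate segment instead of
	 * appending it to the same array as the page segments.
	 */
	out->tail_base = xdrbuf->tail[0].iov_base;
	out->tail_len  = xdrbuf->tail[0].iov_len;
}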
> >>>> > >>>>> > >>>>>> - rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, > >>>>>> &n); > >>>>>> - > >>>>>> -out: > >>>>>> if (unlikely(n > RPCRDMA_MAX_SEGS)) > >>>>>> return -EIO; > >>>>>> return n; > >>>>>> -- > >>>>>> 2.27.0 > >>>>>> > >>>>> > >>>>> -- > >>>>> Chuck Lever > >>>>> > >>>>> > >>>>> > >>> > >>> -- > >>> Trond Myklebust > >>> Linux NFS client maintainer, Hammerspace > >>> trond.myklebust@xxxxxxxxxxxxxxx > >> > >> -- > >> Chuck Lever > >> > >> > >> > > > > -- > > Trond Myklebust > > Linux NFS client maintainer, Hammerspace > > trond.myklebust@xxxxxxxxxxxxxxx > > -- > Chuck Lever > > >
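P.S. Regarding the fallback mentioned above of shrinking an unaligned O_DIRECT read so the buffer aligns to a 4-byte boundary (relying on NFS short-read semantics), I read that as something like the following; purely a sketch, and the helper name is made up:

/* Made-up helper, not existing client code: trim an O_DIRECT read
 * count down to a 4-byte (XDR) boundary; the last 1-3 bytes come
 * back as a short read and the application re-issues the remainder.
 */
static inline size_t trim_count_to_xdr_boundary(size_t count)
{
	return count & ~(size_t)3;
}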