> On Nov 20, 2020, at 12:14 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2020-11-20 at 09:52 -0500, Chuck Lever wrote:
>> 
>>> On Nov 19, 2020, at 7:58 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, 2020-11-19 at 17:58 -0500, Chuck Lever wrote:
>>>> 
>>>>> On Nov 19, 2020, at 9:34 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Thu, 2020-11-19 at 09:31 -0500, Chuck Lever wrote:
>>>>>> 
>>>>>>> On Nov 19, 2020, at 9:30 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Thu, 2020-11-19 at 09:17 -0500, Chuck Lever wrote:
>>>>>>>> 
>>>>>>>>> On Nov 18, 2020, at 5:19 PM, trondmy@xxxxxxxxxx wrote:
>>>>>>>>> 
>>>>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>>>> 
>>>>>>>>> When doing a read() into a page, we also don't care if the nul
>>>>>>>>> padding stays in that last page when the data length is not
>>>>>>>>> 32-bit aligned.
>>>>>>>> 
>>>>>>>> What if the READ payload lands in the middle of a file? The
>>>>>>>> pad on the end will overwrite file content just past where
>>>>>>>> the READ payload lands.
>>>>>>> 
>>>>>>> If the size > buf->page_len, then it gets truncated in
>>>>>>> xdr_align_pages() afaik.
>>>>>> 
>>>>>> I will need to check how RPC/RDMA behaves. It might build a
>>>>>> chunk that includes the pad in this case, which would break
>>>>>> things.
>>>>> 
>>>>> That would be a bug in the existing code too, then. It shouldn't
>>>>> be writing beyond the buffer size we set in the NFS layer.
>>>> 
>>>> Testing now with xfstests, which should include fsx with direct
>>>> I/O of odd sizes. So far I haven't seen any unexpected behavior.
>>>> 
>>>> But I'm not sure what copy you're trying to avoid. This one in
>>>> xdr_align_pages() ?
>>>> 
>>>> 1189         else if (nwords < xdr->nwords) {
>>>> 1190                 /* Truncate page data and move it into the tail */
>>>> 1191                 offset = buf->page_len - len;
>>>> 1192                 copied = xdr_shrink_pagelen(buf, offset);
>>>> 1193                 trace_rpc_xdr_alignment(xdr, offset, copied);
>>>> 1194                 xdr->nwords = XDR_QUADLEN(buf->len - cur);
>>>> 1195         }
>>>> 
>>>> We set up the receive buffer already to avoid this copy. It should
>>>> rarely, if ever, happen. That's the point of
>>>> rpc_prepare_reply_pages().
>>> 
>>> ...and the point of padding here is to avoid unaligned access to
>>> memory. That is completely broken by this whole mechanism, which
>>> causes us to place the real data in the tail in an unaligned buffer
>>> that follows this padding.
>> 
>> I've never seen run-time complaints about unaligned accesses of
>> the tail data. Do you have a reproducer? (And obviously this
>> applies only to NFSv4 COMPOUND results, right?)
> 
> I don't think we can trigger it anywhere right now mainly because we're
> careful never to put any further ops after a read, readdir or readlink,
> even though we probably should append a VERIFY to some of those so as
> to protect data caching (particularly for readlink).
> 
>>> Furthermore, rpc_prepare_reply_pages() only ever places the padding
>>> in the tail _if_ our buffer size is already not 32-bit aligned.
>>> Otherwise, we're engaging in this pointless exercise of making the
>>> tail buffer data unaligned after the fact.
>> 
>> Architecturally, I agree that it would be best if the tail buffer
>> presented the XDR data items aligned on 4 bytes.
>> 
>> But I do not agree that a pad goes in the pages.
>> Some transports do not send an XDR pad for unaligned data payloads.
>> Transports have to have a way of avoiding that pad. If it no longer
>> goes in the tail, then how will they do that?
>> 
>> We could perhaps have a new flag, XDRBUF_IMPLICIT_PAD, that means
>> the transport has to add a pad to xdr->pages on send, or did not add
>> a pad on receive.
> 
> I don't see why we would need any of this.

Right. I was conflating the send and receive sides. You're talking
about only the receive side.

This would be a problem when the structure of the reply doesn't allow
the client to align the pages and tail[] in advance. So,
READ/READLINK/READDIR/LISTXATTRS, on either TCP, or with RDMA and
krb5i and krb5p where XDR pads have to stay inline.

I'm still a little troubled by what happens to the pad during a short
READ.

The common case for READ requests is page-aligned payloads, so I don't
think tail content alignment is a performance issue. Do you concur?

> If the pad is being directly placed in the tail by the RPC transport
> layer, then there is nothing in xdr_read_pages() that will move it or
> change the alignment of the existing data in the tail. The call just
> notes that xdr_align_pages() returned a length that was shorter than
> the one we supplied, and continues to skip the padding in the tail.

That's what I observe today. I'm trying to understand the usage
scenario where the tail content is not word-aligned. Probably can
happen on server-side receives too.

> The one thing that perhaps is broken here is that if we supply a
> length to xdr_read_pages() that is in fact larger than
> xdr_align_size(xdr->buf->page_len), then we should skip additional
> data that was placed in the tail, but that is a bug with the existing
> xdr_read_pages() and is unaffected by these patches.

I think I understand what these patches do well enough to drop my
objection, pending further exercise with a testing barrage.
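
For anyone following along, xdr_align_size() in the hunks below is
just the usual round-up of a byte count to the next XDR quad (4-byte)
boundary, so the READ decoders now account for the pad bytes as part
of the page data instead of expecting them in the tail. Roughly this
(a sketch from memory, not a verbatim copy of
include/linux/sunrpc/xdr.h):

	/* Round @n up to the next XDR quad (4-byte) boundary. */
	static inline size_t xdr_align_size(size_t n)
	{
		const size_t mask = sizeof(__u32) - 1;

		return (n + mask) & ~mask;	/* e.g. 5 -> 8, 8 -> 8 */
	}

So xdr_read_pages(xdr, xdr_align_size(count)) asks for count plus its
pad to stay in the page buffer whenever count is not 32-bit aligned.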
>>>>>>>>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>> fs/nfs/nfs2xdr.c | 2 +-
>>>>>>>>> fs/nfs/nfs3xdr.c | 2 +-
>>>>>>>>> fs/nfs/nfs4xdr.c | 2 +-
>>>>>>>>> 3 files changed, 3 insertions(+), 3 deletions(-)
>>>>>>>>> 
>>>>>>>>> diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c
>>>>>>>>> index db9c265ad9e1..468bfbfe44d7 100644
>>>>>>>>> --- a/fs/nfs/nfs2xdr.c
>>>>>>>>> +++ b/fs/nfs/nfs2xdr.c
>>>>>>>>> @@ -102,7 +102,7 @@ static int decode_nfsdata(struct xdr_stream *xdr, struct nfs_pgio_res *result)
>>>>>>>>>  	if (unlikely(!p))
>>>>>>>>>  		return -EIO;
>>>>>>>>>  	count = be32_to_cpup(p);
>>>>>>>>> -	recvd = xdr_read_pages(xdr, count);
>>>>>>>>> +	recvd = xdr_read_pages(xdr, xdr_align_size(count));
>>>>>>>>>  	if (unlikely(count > recvd))
>>>>>>>>>  		goto out_cheating;
>>>>>>>>>  out:
>>>>>>>>> diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
>>>>>>>>> index d3e1726d538b..8ef7c961d3e2 100644
>>>>>>>>> --- a/fs/nfs/nfs3xdr.c
>>>>>>>>> +++ b/fs/nfs/nfs3xdr.c
>>>>>>>>> @@ -1611,7 +1611,7 @@ static int decode_read3resok(struct xdr_stream *xdr,
>>>>>>>>>  	ocount = be32_to_cpup(p++);
>>>>>>>>>  	if (unlikely(ocount != count))
>>>>>>>>>  		goto out_mismatch;
>>>>>>>>> -	recvd = xdr_read_pages(xdr, count);
>>>>>>>>> +	recvd = xdr_read_pages(xdr, xdr_align_size(count));
>>>>>>>>>  	if (unlikely(count > recvd))
>>>>>>>>>  		goto out_cheating;
>>>>>>>>>  out:
>>>>>>>>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>>>>>>>>> index 755b556e85c3..5baa767106dc 100644
>>>>>>>>> --- a/fs/nfs/nfs4xdr.c
>>>>>>>>> +++ b/fs/nfs/nfs4xdr.c
>>>>>>>>> @@ -5202,7 +5202,7 @@ static int decode_read(struct xdr_stream *xdr, struct rpc_rqst *req,
>>>>>>>>>  		return -EIO;
>>>>>>>>>  	eof = be32_to_cpup(p++);
>>>>>>>>>  	count = be32_to_cpup(p);
>>>>>>>>> -	recvd = xdr_read_pages(xdr, count);
>>>>>>>>> +	recvd = xdr_read_pages(xdr, xdr_align_size(count));
>>>>>>>>>  	if (count > recvd) {
>>>>>>>>>  		dprintk("NFS: server cheating in read reply: "
>>>>>>>>>  			"count %u > recvd %u\n", count, recvd);
>>>>>>>>> -- 
>>>>>>>>> 2.28.0
>>>>> 
>>>>> -- 
>>>>> Trond Myklebust
>>>>> Linux NFS client maintainer, Hammerspace
>>>>> trond.myklebust@xxxxxxxxxxxxxxx
>>>> 
>>>> -- 
>>>> Chuck Lever
>>> 
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@xxxxxxxxxxxxxxx
>> 
>> -- 
>> Chuck Lever
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx

--
Chuck Lever