Re: [PATCH 3/3] NFS: Avoid copy of xdr padding in read()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Nov 19, 2020, at 7:58 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Thu, 2020-11-19 at 17:58 -0500, Chuck Lever wrote:
>> 
>> 
>>> On Nov 19, 2020, at 9:34 AM, Trond Myklebust <
>>> trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, 2020-11-19 at 09:31 -0500, Chuck Lever wrote:
>>>> 
>>>> 
>>>>> On Nov 19, 2020, at 9:30 AM, Trond Myklebust <
>>>>> trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Thu, 2020-11-19 at 09:17 -0500, Chuck Lever wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 18, 2020, at 5:19 PM, trondmy@xxxxxxxxxx wrote:
>>>>>>> 
>>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>> 
>>>>>>> When doing a read() into a page, we also don't care if the
>>>>>>> nul
>>>>>>> padding
>>>>>>> stays in that last page when the data length is not 32-bit
>>>>>>> aligned.
>>>>>> 
>>>>>> What if the READ payload lands in the middle of a file? The
>>>>>> pad on the end will overwrite file content just past where
>>>>>> the READ payload lands.
>>>>> 
>>>>> If the size > buf->page_len, then it gets truncated in
>>>>> xdr_align_pages() afaik.
>>>> 
>>>> I will need to check how RPC/RDMA behaves. It might build a
>>>> chunk that includes the pad in this case, which would break
>>>> things.
>>> 
>>> That would be a bug in the existing code too, then. It shouldn't be
>>> writing beyond the buffer size we set in the NFS layer.
>> 
>> Testing now with xfstests, which should include fsx with direct
>> I/O of odd sizes. So far I haven't seen any unexpected behavior.
>> 
>> But I'm not sure what copy you're trying to avoid. This one in
>> xdr_align_pages() ?
>> 
>> 1189         else if (nwords < xdr->nwords) {
>> 1190                 /* Truncate page data and move it into the tail
>> */
>> 1191                 offset = buf->page_len - len;
>> 1192                 copied = xdr_shrink_pagelen(buf, offset);
>> 1193                 trace_rpc_xdr_alignment(xdr, offset, copied);
>> 1194                 xdr->nwords = XDR_QUADLEN(buf->len - cur);
>> 1195         }
>> 
>> We set up the receive buffer already to avoid this copy. It should
>> rarely, if ever, happen. That's the point of
>> rpc_prepare_reply_pages().
> 
> 
> ...and the point of padding here is to avoid unaligned access to
> memory. That is completely broken by this whole mechanism, which causes
> us to place the real data in the tail in an unaligned buffer that
> follows this padding.

I've never seen run-time complaints about unaligned accesses of
the tail data. Do you have a reproducer? (And obviously this
applies only to NFSv4 COMPOUND results, right?)


> Furthermore, rpc_prepare_reply_pages() only ever places the padding in
> the tail _if_ our buffer size is already not 32-bit aligned. Otherwise,
> we're engaging in this pointless exercise of making the tail buffer
> data unaligned after the fact.

Architecturally, I agree that it would be best if the tail buffer
presented the XDR data items aligned on 4 bytes.

But I do not agree that a pad goes in the pages. Some transports do
not send an XDR pad for unaligned data payloads. Transports have to
have a way of avoiding that pad. If it no longer goes in the tail,
then how will they do that?

We could perhaps have a new flag, XDRBUF_IMPLICIT_PAD, that means
the transport has to add a pad to xdr->pages on send, or did not add
a pad on receive.


>>>>>>> Signed-off-by: Trond Myklebust
>>>>>>> <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>> ---
>>>>>>> fs/nfs/nfs2xdr.c | 2 +-
>>>>>>> fs/nfs/nfs3xdr.c | 2 +-
>>>>>>> fs/nfs/nfs4xdr.c | 2 +-
>>>>>>> 3 files changed, 3 insertions(+), 3 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c
>>>>>>> index db9c265ad9e1..468bfbfe44d7 100644
>>>>>>> --- a/fs/nfs/nfs2xdr.c
>>>>>>> +++ b/fs/nfs/nfs2xdr.c
>>>>>>> @@ -102,7 +102,7 @@ static int decode_nfsdata(struct
>>>>>>> xdr_stream
>>>>>>> *xdr, struct nfs_pgio_res *result)
>>>>>>>         if (unlikely(!p))
>>>>>>>                 return -EIO;
>>>>>>>         count = be32_to_cpup(p);
>>>>>>> -       recvd = xdr_read_pages(xdr, count);
>>>>>>> +       recvd = xdr_read_pages(xdr, xdr_align_size(count));
>>>>>>>         if (unlikely(count > recvd))
>>>>>>>                 goto out_cheating;
>>>>>>> out:
>>>>>>> diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
>>>>>>> index d3e1726d538b..8ef7c961d3e2 100644
>>>>>>> --- a/fs/nfs/nfs3xdr.c
>>>>>>> +++ b/fs/nfs/nfs3xdr.c
>>>>>>> @@ -1611,7 +1611,7 @@ static int decode_read3resok(struct
>>>>>>> xdr_stream *xdr,
>>>>>>>         ocount = be32_to_cpup(p++);
>>>>>>>         if (unlikely(ocount != count))
>>>>>>>                 goto out_mismatch;
>>>>>>> -       recvd = xdr_read_pages(xdr, count);
>>>>>>> +       recvd = xdr_read_pages(xdr, xdr_align_size(count));
>>>>>>>         if (unlikely(count > recvd))
>>>>>>>                 goto out_cheating;
>>>>>>> out:
>>>>>>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>>>>>>> index 755b556e85c3..5baa767106dc 100644
>>>>>>> --- a/fs/nfs/nfs4xdr.c
>>>>>>> +++ b/fs/nfs/nfs4xdr.c
>>>>>>> @@ -5202,7 +5202,7 @@ static int decode_read(struct
>>>>>>> xdr_stream
>>>>>>> *xdr, struct rpc_rqst *req,
>>>>>>>                 return -EIO;
>>>>>>>         eof = be32_to_cpup(p++);
>>>>>>>         count = be32_to_cpup(p);
>>>>>>> -       recvd = xdr_read_pages(xdr, count);
>>>>>>> +       recvd = xdr_read_pages(xdr, xdr_align_size(count));
>>>>>>>         if (count > recvd) {
>>>>>>>                 dprintk("NFS: server cheating in read
>>>>>>> reply: "
>>>>>>>                                 "count %u > recvd %u\n",
>>>>>>> count,
>>>>>>> recvd);
>>>>>>> -- 
>>>>>>> 2.28.0
>>> 
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@xxxxxxxxxxxxxxx
>>> 
>>> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx

--
Chuck Lever







[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux