> On Nov 19, 2020, at 7:58 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > On Thu, 2020-11-19 at 17:58 -0500, Chuck Lever wrote: >> >> >>> On Nov 19, 2020, at 9:34 AM, Trond Myklebust < >>> trondmy@xxxxxxxxxxxxxxx> wrote: >>> >>> On Thu, 2020-11-19 at 09:31 -0500, Chuck Lever wrote: >>>> >>>> >>>>> On Nov 19, 2020, at 9:30 AM, Trond Myklebust < >>>>> trondmy@xxxxxxxxxxxxxxx> wrote: >>>>> >>>>> On Thu, 2020-11-19 at 09:17 -0500, Chuck Lever wrote: >>>>>> >>>>>> >>>>>>> On Nov 18, 2020, at 5:19 PM, trondmy@xxxxxxxxxx wrote: >>>>>>> >>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> >>>>>>> >>>>>>> When doing a read() into a page, we also don't care if the >>>>>>> nul >>>>>>> padding >>>>>>> stays in that last page when the data length is not 32-bit >>>>>>> aligned. >>>>>> >>>>>> What if the READ payload lands in the middle of a file? The >>>>>> pad on the end will overwrite file content just past where >>>>>> the READ payload lands. >>>>> >>>>> If the size > buf->page_len, then it gets truncated in >>>>> xdr_align_pages() afaik. >>>> >>>> I will need to check how RPC/RDMA behaves. It might build a >>>> chunk that includes the pad in this case, which would break >>>> things. >>> >>> That would be a bug in the existing code too, then. It shouldn't be >>> writing beyond the buffer size we set in the NFS layer. >> >> Testing now with xfstests, which should include fsx with direct >> I/O of odd sizes. So far I haven't seen any unexpected behavior. >> >> But I'm not sure what copy you're trying to avoid. This one in >> xdr_align_pages() ? >> >> 1189 else if (nwords < xdr->nwords) { >> 1190 /* Truncate page data and move it into the tail >> */ >> 1191 offset = buf->page_len - len; >> 1192 copied = xdr_shrink_pagelen(buf, offset); >> 1193 trace_rpc_xdr_alignment(xdr, offset, copied); >> 1194 xdr->nwords = XDR_QUADLEN(buf->len - cur); >> 1195 } >> >> We set up the receive buffer already to avoid this copy. It should >> rarely, if ever, happen. That's the point of >> rpc_prepare_reply_pages(). > > > ...and the point of padding here is to avoid unaligned access to > memory. That is completely broken by this whole mechanism, which causes > us to place the real data in the tail in an unaligned buffer that > follows this padding. I've never seen run-time complaints about unaligned accesses of the tail data. Do you have a reproducer? (And obviously this applies only to NFSv4 COMPOUND results, right?) > Furthermore, rpc_prepare_reply_pages() only ever places the padding in > the tail _if_ our buffer size is already not 32-bit aligned. Otherwise, > we're engaging in this pointless exercise of making the tail buffer > data unaligned after the fact. Architecturally, I agree that it would be best if the tail buffer presented the XDR data items aligned on 4 bytes. But I do not agree that a pad goes in the pages. Some transports do not send an XDR pad for unaligned data payloads. Transports have to have a way of avoiding that pad. If it no longer goes in the tail, then how will they do that? We could perhaps have a new flag, XDRBUF_IMPLICIT_PAD, that means the transport has to add a pad to xdr->pages on send, or did not add a pad on receive. >>>>>>> Signed-off-by: Trond Myklebust >>>>>>> <trond.myklebust@xxxxxxxxxxxxxxx> >>>>>>> --- >>>>>>> fs/nfs/nfs2xdr.c | 2 +- >>>>>>> fs/nfs/nfs3xdr.c | 2 +- >>>>>>> fs/nfs/nfs4xdr.c | 2 +- >>>>>>> 3 files changed, 3 insertions(+), 3 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c >>>>>>> index db9c265ad9e1..468bfbfe44d7 100644 >>>>>>> --- a/fs/nfs/nfs2xdr.c >>>>>>> +++ b/fs/nfs/nfs2xdr.c >>>>>>> @@ -102,7 +102,7 @@ static int decode_nfsdata(struct >>>>>>> xdr_stream >>>>>>> *xdr, struct nfs_pgio_res *result) >>>>>>> if (unlikely(!p)) >>>>>>> return -EIO; >>>>>>> count = be32_to_cpup(p); >>>>>>> - recvd = xdr_read_pages(xdr, count); >>>>>>> + recvd = xdr_read_pages(xdr, xdr_align_size(count)); >>>>>>> if (unlikely(count > recvd)) >>>>>>> goto out_cheating; >>>>>>> out: >>>>>>> diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c >>>>>>> index d3e1726d538b..8ef7c961d3e2 100644 >>>>>>> --- a/fs/nfs/nfs3xdr.c >>>>>>> +++ b/fs/nfs/nfs3xdr.c >>>>>>> @@ -1611,7 +1611,7 @@ static int decode_read3resok(struct >>>>>>> xdr_stream *xdr, >>>>>>> ocount = be32_to_cpup(p++); >>>>>>> if (unlikely(ocount != count)) >>>>>>> goto out_mismatch; >>>>>>> - recvd = xdr_read_pages(xdr, count); >>>>>>> + recvd = xdr_read_pages(xdr, xdr_align_size(count)); >>>>>>> if (unlikely(count > recvd)) >>>>>>> goto out_cheating; >>>>>>> out: >>>>>>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c >>>>>>> index 755b556e85c3..5baa767106dc 100644 >>>>>>> --- a/fs/nfs/nfs4xdr.c >>>>>>> +++ b/fs/nfs/nfs4xdr.c >>>>>>> @@ -5202,7 +5202,7 @@ static int decode_read(struct >>>>>>> xdr_stream >>>>>>> *xdr, struct rpc_rqst *req, >>>>>>> return -EIO; >>>>>>> eof = be32_to_cpup(p++); >>>>>>> count = be32_to_cpup(p); >>>>>>> - recvd = xdr_read_pages(xdr, count); >>>>>>> + recvd = xdr_read_pages(xdr, xdr_align_size(count)); >>>>>>> if (count > recvd) { >>>>>>> dprintk("NFS: server cheating in read >>>>>>> reply: " >>>>>>> "count %u > recvd %u\n", >>>>>>> count, >>>>>>> recvd); >>>>>>> -- >>>>>>> 2.28.0 >>> >>> -- >>> Trond Myklebust >>> Linux NFS client maintainer, Hammerspace >>> trond.myklebust@xxxxxxxxxxxxxxx >>> >>> >> >> -- >> Chuck Lever >> >> >> > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@xxxxxxxxxxxxxxx -- Chuck Lever