On Fri, 2020-11-20 at 09:52 -0500, Chuck Lever wrote: > > > > On Nov 19, 2020, at 7:58 PM, Trond Myklebust < > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > On Thu, 2020-11-19 at 17:58 -0500, Chuck Lever wrote: > > > > > > > > > > On Nov 19, 2020, at 9:34 AM, Trond Myklebust < > > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > > > > On Thu, 2020-11-19 at 09:31 -0500, Chuck Lever wrote: > > > > > > > > > > > > > > > > On Nov 19, 2020, at 9:30 AM, Trond Myklebust < > > > > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > > > > > > > > On Thu, 2020-11-19 at 09:17 -0500, Chuck Lever wrote: > > > > > > > > > > > > > > > > > > > > > > On Nov 18, 2020, at 5:19 PM, trondmy@xxxxxxxxxx wrote: > > > > > > > > > > > > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> > > > > > > > > > > > > > > > > When doing a read() into a page, we also don't care if > > > > > > > > the > > > > > > > > nul > > > > > > > > padding > > > > > > > > stays in that last page when the data length is not 32- > > > > > > > > bit > > > > > > > > aligned. > > > > > > > > > > > > > > What if the READ payload lands in the middle of a file? > > > > > > > The > > > > > > > pad on the end will overwrite file content just past > > > > > > > where > > > > > > > the READ payload lands. > > > > > > > > > > > > If the size > buf->page_len, then it gets truncated in > > > > > > xdr_align_pages() afaik. > > > > > > > > > > I will need to check how RPC/RDMA behaves. It might build a > > > > > chunk that includes the pad in this case, which would break > > > > > things. > > > > > > > > That would be a bug in the existing code too, then. It > > > > shouldn't be > > > > writing beyond the buffer size we set in the NFS layer. > > > > > > Testing now with xfstests, which should include fsx with direct > > > I/O of odd sizes. So far I haven't seen any unexpected behavior. > > > > > > But I'm not sure what copy you're trying to avoid. This one in > > > xdr_align_pages() ? > > > > > > 1189 else if (nwords < xdr->nwords) { > > > 1190 /* Truncate page data and move it into the > > > tail > > > */ > > > 1191 offset = buf->page_len - len; > > > 1192 copied = xdr_shrink_pagelen(buf, offset); > > > 1193 trace_rpc_xdr_alignment(xdr, offset, > > > copied); > > > 1194 xdr->nwords = XDR_QUADLEN(buf->len - cur); > > > 1195 } > > > > > > We set up the receive buffer already to avoid this copy. It > > > should > > > rarely, if ever, happen. That's the point of > > > rpc_prepare_reply_pages(). > > > > > > ...and the point of padding here is to avoid unaligned access to > > memory. That is completely broken by this whole mechanism, which > > causes > > us to place the real data in the tail in an unaligned buffer that > > follows this padding. > > I've never seen run-time complaints about unaligned accesses of > the tail data. Do you have a reproducer? (And obviously this > applies only to NFSv4 COMPOUND results, right?) I don't think we can trigger it anywhere right now mainly because we're careful never to put any further ops after a read, readdir or readlink, even though we probably should append a VERIFY to some of those so as to protect data caching (particularly for readlink). > > > Furthermore, rpc_prepare_reply_pages() only ever places the padding > > in > > the tail _if_ our buffer size is already not 32-bit aligned. > > Otherwise, > > we're engaging in this pointless exercise of making the tail buffer > > data unaligned after the fact. > > Architecturally, I agree that it would be best if the tail buffer > presented the XDR data items aligned on 4 bytes. > > But I do not agree that a pad goes in the pages. Some transports do > not send an XDR pad for unaligned data payloads. Transports have to > have a way of avoiding that pad. If it no longer goes in the tail, > then how will they do that? > > We could perhaps have a new flag, XDRBUF_IMPLICIT_PAD, that means > the transport has to add a pad to xdr->pages on send, or did not add > a pad on receive. > I don't see why we would need any of this. If the pad is being directly placed in the tail by the RPC transport layer, then there is nothing in xdr_read_pages() that will move it or change the alignment of the existing code in the tail. The call just notes that xdr_align_pages() returned a length that was shorter than the one we supplied, and continues to skip the padding in the tail. The one thing that perhaps is broken here is that if we supply a length to xdr_read_pages() that is in fact larger than xdr_align_size(xdr- >buf->page_len), then we should skip additional data that was placed in the tail, but that is a bug with the existing xdr_read_pages() and is unaffected by these patches. > > > > > > > > > Signed-off-by: Trond Myklebust > > > > > > > > <trond.myklebust@xxxxxxxxxxxxxxx> > > > > > > > > --- > > > > > > > > fs/nfs/nfs2xdr.c | 2 +- > > > > > > > > fs/nfs/nfs3xdr.c | 2 +- > > > > > > > > fs/nfs/nfs4xdr.c | 2 +- > > > > > > > > 3 files changed, 3 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > > > diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c > > > > > > > > index db9c265ad9e1..468bfbfe44d7 100644 > > > > > > > > --- a/fs/nfs/nfs2xdr.c > > > > > > > > +++ b/fs/nfs/nfs2xdr.c > > > > > > > > @@ -102,7 +102,7 @@ static int decode_nfsdata(struct > > > > > > > > xdr_stream > > > > > > > > *xdr, struct nfs_pgio_res *result) > > > > > > > > if (unlikely(!p)) > > > > > > > > return -EIO; > > > > > > > > count = be32_to_cpup(p); > > > > > > > > - recvd = xdr_read_pages(xdr, count); > > > > > > > > + recvd = xdr_read_pages(xdr, > > > > > > > > xdr_align_size(count)); > > > > > > > > if (unlikely(count > recvd)) > > > > > > > > goto out_cheating; > > > > > > > > out: > > > > > > > > diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c > > > > > > > > index d3e1726d538b..8ef7c961d3e2 100644 > > > > > > > > --- a/fs/nfs/nfs3xdr.c > > > > > > > > +++ b/fs/nfs/nfs3xdr.c > > > > > > > > @@ -1611,7 +1611,7 @@ static int > > > > > > > > decode_read3resok(struct > > > > > > > > xdr_stream *xdr, > > > > > > > > ocount = be32_to_cpup(p++); > > > > > > > > if (unlikely(ocount != count)) > > > > > > > > goto out_mismatch; > > > > > > > > - recvd = xdr_read_pages(xdr, count); > > > > > > > > + recvd = xdr_read_pages(xdr, > > > > > > > > xdr_align_size(count)); > > > > > > > > if (unlikely(count > recvd)) > > > > > > > > goto out_cheating; > > > > > > > > out: > > > > > > > > diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c > > > > > > > > index 755b556e85c3..5baa767106dc 100644 > > > > > > > > --- a/fs/nfs/nfs4xdr.c > > > > > > > > +++ b/fs/nfs/nfs4xdr.c > > > > > > > > @@ -5202,7 +5202,7 @@ static int decode_read(struct > > > > > > > > xdr_stream > > > > > > > > *xdr, struct rpc_rqst *req, > > > > > > > > return -EIO; > > > > > > > > eof = be32_to_cpup(p++); > > > > > > > > count = be32_to_cpup(p); > > > > > > > > - recvd = xdr_read_pages(xdr, count); > > > > > > > > + recvd = xdr_read_pages(xdr, > > > > > > > > xdr_align_size(count)); > > > > > > > > if (count > recvd) { > > > > > > > > dprintk("NFS: server cheating in read > > > > > > > > reply: " > > > > > > > > "count %u > recvd > > > > > > > > %u\n", > > > > > > > > count, > > > > > > > > recvd); > > > > > > > > -- > > > > > > > > 2.28.0 > > > > > > > > -- > > > > Trond Myklebust > > > > Linux NFS client maintainer, Hammerspace > > > > trond.myklebust@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > -- > > > Chuck Lever > > > > > > > > > > > > > -- > > Trond Myklebust > > Linux NFS client maintainer, Hammerspace > > trond.myklebust@xxxxxxxxxxxxxxx > > -- > Chuck Lever > > > -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx