On 11/24/20, 4:20 PM, "Frank van der Linden" <fllinden@xxxxxxxxxx> wrote: NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe. On Tue, Nov 24, 2020 at 08:50:36PM +0000, Kornievskaia, Olga wrote: > > > On 11/24/20, 3:06 PM, "Frank van der Linden" <fllinden@xxxxxxxxxx> wrote: > > On Tue, Nov 24, 2020 at 12:26:32PM -0500, Chuck Lever wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > > > > > By switching to an XFS-backed export, I am able to reproduce the > > ibcomp worker crash on my client with xfstests generic/013. > > > > For the failing LISTXATTRS operation, xdr_inline_pages() is called > > with page_len=12 and buflen=128. Then: > > > > - Because buflen is small, rpcrdma_marshal_req will not set up a > > Reply chunk and the rpcrdma's XDRBUF_SPARSE_PAGES logic does not > > get invoked at all. > > > > - Because page_len is non-zero, rpcrdma_inline_fixup() tries to > > copy received data into rq_rcv_buf->pages, but they're missing. > > > > The result is that the ibcomp worker faults and dies. Sometimes that > > causes a visible crash, and sometimes it results in a transport > > hang without other symptoms. > > > > RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and > > should eventually be fixed or replaced. However, my preference is > > that upper-layer operations should explicitly allocate their receive > > buffers (using GFP_KERNEL) when possible, rather than relying on > > XDRBUF_SPARSE_PAGES. > > > > Reported-by: Olga kornievskaia <kolga@xxxxxxxxxx> > > Suggested-by: Olga kornievskaia <kolga@xxxxxxxxxx> > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> > > --- > > fs/nfs/nfs42proc.c | 17 ++++++++++------- > > fs/nfs/nfs42xdr.c | 1 - > > 2 files changed, 10 insertions(+), 8 deletions(-) > > > > Hi- > > > > I like Olga's proposed approach. What do you think of this patch? > > > > > > diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c > > index 2b2211d1234e..24810305ec1c 100644 > > --- a/fs/nfs/nfs42proc.c > > +++ b/fs/nfs/nfs42proc.c > > @@ -1241,7 +1241,7 @@ static ssize_t _nfs42_proc_listxattrs(struct inode *inode, void *buf, > > .rpc_resp = &res, > > }; > > u32 xdrlen; > > - int ret, np; > > + int ret, np, i; > > > > > > res.scratch = alloc_page(GFP_KERNEL); > > @@ -1253,10 +1253,14 @@ static ssize_t _nfs42_proc_listxattrs(struct inode *inode, void *buf, > > xdrlen = server->lxasize; > > np = xdrlen / PAGE_SIZE + 1; > > > > + ret = -ENOMEM; > > pages = kcalloc(np, sizeof(struct page *), GFP_KERNEL); > > - if (pages == NULL) { > > - __free_page(res.scratch); > > - return -ENOMEM; > > + if (pages == NULL) > > + goto out_free; > > + for (i = 0; i < np; i++) { > > + pages[i] = alloc_page(GFP_KERNEL); > > + if (!pages[i]) > > + goto out_free; > > } > > > > arg.xattr_pages = pages; > > @@ -1271,14 +1275,13 @@ static ssize_t _nfs42_proc_listxattrs(struct inode *inode, void *buf, > > *eofp = res.eof; > > } > > > > +out_free: > > while (--np >= 0) { > > if (pages[np]) > > __free_page(pages[np]); > > } > > - > > - __free_page(res.scratch); > > kfree(pages); > > - > > + __free_page(res.scratch); > > return ret; > > > > } > > diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c > > index 6e060a88f98c..8432bd6b95f0 100644 > > --- a/fs/nfs/nfs42xdr.c > > +++ b/fs/nfs/nfs42xdr.c > > @@ -1528,7 +1528,6 @@ static void nfs4_xdr_enc_listxattrs(struct rpc_rqst *req, > > > > rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count, > > hdr.replen); > > - req->rq_rcv_buf.flags |= XDRBUF_SPARSE_PAGES; > > > > encode_nops(&hdr); > > } > > > > > > I can see why this is the simplest and most pragmatic solution, so it's > fine with me. > > Why doesn't this happen with getxattr? Do we need to convert that too? > > [olga] I don't know if GETXATTR/SETXATTR works. I'm not sure what tests exercise those operations. I just ran into the fact that generic/013 wasn't passing. And I don't see that it's an xattr specific tests. I'm not sure how it ends up triggering is usage of xattr. I'm attaching the test program I used, it should give things a better workout. [olga] I'm not sure if I'm doing something wrong but there are only 2 GETXATTR call on the network trace from running this application and both calls are returning an error (ERR_NOXATTR). Which btw might explain why no problems are seen since no decoding of data is happening. There are lots of SETXATTRs and REMOVEXATTR and there is a LISTXATTR (which btw network trace is marking as malformed so there might something bad there). Anyway... This is my initial report: no real exercise of the GETXATTR code as far as I can tell. - Frank