> On May 7, 2023, at 5:20 PM, NeilBrown <neilb@xxxxxxx> wrote: > > On Fri, 08 Apr 2022, Chuck Lever wrote: >> Trond Myklebust reports an NFSD crash in svc_rdma_sendto(). Further >> investigation shows that the crash occurred while NFSD was handling >> a deferred request. >> >> This patch addresses two inter-related issues that prevent request >> deferral from working correctly for RPC/RDMA requests: >> >> 1. Prevent the crash by ensuring that the original >> svc_rqst::rq_xprt_ctxt value is available when the request is >> revisited. Otherwise svc_rdma_sendto() does not have a Receive >> context available with which to construct its reply. >> >> 2. Possibly since before commit 71641d99ce03 ("svcrdma: Properly >> compute .len and .buflen for received RPC Calls"), >> svc_rdma_recvfrom() did not include the transport header in the >> returned xdr_buf. There should have been no need for svc_defer() >> and friends to save and restore that header, as of that commit. >> This issue is addressed in a backport-friendly way by simply >> having svc_rdma_recvfrom() set rq_xprt_hlen to zero >> unconditionally, just as svc_tcp_recvfrom() does. This enables >> svc_deferred_recv() to correctly reconstruct an RPC message >> received via RPC/RDMA. > > I'm a bit late to this party but .... this patch is bad and I don't know > how best to fix it. > It is bad for two reasons. > 1/ It can leak memory. When a deferral happens, the context is moved > into an svc_deferred_req. There are a couple of places which assume > that an svc_deferred_req can be freed with kfree(), such as > svc_delete_xprt() and svc_revisit(). These will now leak the > context. This is the bit that is hard to fix. > > 2/ It can cause a double-free with UDP (and maybe rdma). > When a request is deferred, the ctxt is moved to the dreq. > When that request is revisited the ctxt is copied back into the rqst. > If the rqst is again deferred then the old dreq is reused and, > importantly, the rq_xprt_ctxt is not cleared. So when the release > function is called the ctxt is freed. > When the request is revisited a second time that ctxt (now pointing > to freed and possibly reused memory) is copied back into the rqst. > When the request completes the ctxt is again freed - double free > oops. > > For UDP there is no value in saving the ctxt in the dreq - the content > of the ctxt, which is an skbuf, has been copied into the dreq. So maybe > the UDB ctxt is a very different beast than the rdma ctxt and needs to > be handled completely differently. > > We can fix the UDP double-free by always doing > rqstp->rq_xprt_ctxt = NULL; > whether a new dreq was needed or not. But that doesn't fix the leaking > of ctxts and is really just a band-aid. A double-free is potentially catastrophic, so I would say this is a reasonable change that can be back-ported without fuss (while noting that more is needed). The RDMA recv_ctxt is persistent for the life of the connection. Releasing one of those just puts it back on a free list. So, maybe the second free could cause free list corruption? > Thoughts? > Do we need to preserve ALL of the svc_rdma_recv_ctxt for deferred > requests? Could we just copy some bits into the dreq (allocate a bit > more space somehow) so that a simple kfree() is still enough? > Or do we need a free_ctxt() handler attached to the xprt? While I take some time to review code and options, did you know there is a deferral fault injector that might possibly help you sort through some of this? I doubt I took the time to try forcing a second deferral of the same request. > Thanks, > NeilBrown > > > > >> >> Reported-by: Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> >> Link: https://lore.kernel.org/linux-nfs/82662b7190f26fb304eb0ab1bb04279072439d4e.camel@xxxxxxxxxxxxxxx/ >> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> >> Cc: <stable@xxxxxxxxxxxxxxx> >> --- >> include/linux/sunrpc/svc.h | 1 + >> net/sunrpc/svc_xprt.c | 3 +++ >> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 2 +- >> 3 files changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h >> index a5dda4987e8b..217711fc9cac 100644 >> --- a/include/linux/sunrpc/svc.h >> +++ b/include/linux/sunrpc/svc.h >> @@ -395,6 +395,7 @@ struct svc_deferred_req { >> size_t addrlen; >> struct sockaddr_storage daddr; /* where reply must come from */ >> size_t daddrlen; >> + void *xprt_ctxt; >> struct cache_deferred_req handle; >> size_t xprt_hlen; >> int argslen; >> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c >> index 0c117d3bfda8..b42cfffa7395 100644 >> --- a/net/sunrpc/svc_xprt.c >> +++ b/net/sunrpc/svc_xprt.c >> @@ -1231,6 +1231,8 @@ static struct cache_deferred_req *svc_defer(struct cache_req *req) >> dr->daddr = rqstp->rq_daddr; >> dr->argslen = rqstp->rq_arg.len >> 2; >> dr->xprt_hlen = rqstp->rq_xprt_hlen; >> + dr->xprt_ctxt = rqstp->rq_xprt_ctxt; >> + rqstp->rq_xprt_ctxt = NULL; >> >> /* back up head to the start of the buffer and copy */ >> skip = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len; >> @@ -1269,6 +1271,7 @@ static noinline int svc_deferred_recv(struct svc_rqst *rqstp) >> rqstp->rq_xprt_hlen = dr->xprt_hlen; >> rqstp->rq_daddr = dr->daddr; >> rqstp->rq_respages = rqstp->rq_pages; >> + rqstp->rq_xprt_ctxt = dr->xprt_ctxt; >> svc_xprt_received(rqstp->rq_xprt); >> return (dr->argslen<<2) - dr->xprt_hlen; >> } >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c >> index cf76a6ad127b..864131a9fc6e 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c >> @@ -831,7 +831,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp) >> goto out_err; >> if (ret == 0) >> goto out_drop; >> - rqstp->rq_xprt_hlen = ret; >> + rqstp->rq_xprt_hlen = 0; >> >> if (svc_rdma_is_reverse_direction_reply(xprt, ctxt)) >> goto out_backchannel; -- Chuck Lever