> On Aug 14, 2017, at 4:38 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 2017-08-14 at 16:23 -0400, Chuck Lever wrote:
>>> On Aug 14, 2017, at 4:07 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
>>> 
>>> On Mon, 2017-08-14 at 15:28 -0400, Chuck Lever wrote:
>>>>> On Aug 14, 2017, at 3:16 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>> 
>>>>> Instead add a mechanism to ensure that the request doesn't disappear
>>>>> from underneath us while copying from the socket. We do this by
>>>>> preventing xprt_release() from freeing the XDR buffers until the
>>>>> flag RPC_TASK_MSG_RECV has been cleared from the request.
>>>>> 
>>>>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>> ---
>>>>>  include/linux/sunrpc/sched.h |  2 ++
>>>>>  include/linux/sunrpc/xprt.h  |  2 ++
>>>>>  net/sunrpc/xprt.c            | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>>>  net/sunrpc/xprtsock.c        | 27 ++++++++++++++++++++++-----
>>>>>  4 files changed, 68 insertions(+), 5 deletions(-)
>>>>> 
>>>>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>>>>> index 15bc1cd6ee5c..e972d9e05426 100644
>>>>> --- a/include/linux/sunrpc/sched.h
>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>> @@ -141,6 +141,8 @@ struct rpc_task_setup {
>>>>>  #define RPC_TASK_ACTIVE         2
>>>>>  #define RPC_TASK_MSG_XMIT       3
>>>>>  #define RPC_TASK_MSG_XMIT_WAIT  4
>>>>> +#define RPC_TASK_MSG_RECV       5
>>>>> +#define RPC_TASK_MSG_RECV_WAIT  6
>>>>>  
>>>>>  #define RPC_IS_RUNNING(t)       test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
>>>>>  #define rpc_set_running(t)      set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
>>>>> 
>>>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>>>> index eab1c749e192..65b9e0224753 100644
>>>>> --- a/include/linux/sunrpc/xprt.h
>>>>> +++ b/include/linux/sunrpc/xprt.h
>>>>> @@ -372,6 +372,8 @@ void			xprt_write_space(struct rpc_xprt *xprt);
>>>>>  void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
>>>>>  struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>>>>>  void			xprt_complete_rqst(struct rpc_task *task, int copied);
>>>>> +void			xprt_pin_rqst(struct rpc_rqst *req);
>>>>> +void			xprt_unpin_rqst(struct rpc_rqst *req);
>>>>>  void			xprt_release_rqst_cong(struct rpc_task *task);
>>>>>  void			xprt_disconnect_done(struct rpc_xprt *xprt);
>>>>>  void			xprt_force_disconnect(struct rpc_xprt *xprt);
>>>>> 
>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>> index 788c1b6138c2..62c379865f7c 100644
>>>>> --- a/net/sunrpc/xprt.c
>>>>> +++ b/net/sunrpc/xprt.c
>>>>> @@ -844,6 +844,47 @@ struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(xprt_lookup_rqst);
>>>>>  
>>>>> +/**
>>>>> + * xprt_pin_rqst - Pin a request on the transport receive list
>>>>> + * @req: Request to pin
>>>>> + *
>>>>> + * Caller must ensure this is atomic with the call to xprt_lookup_rqst()
>>>>> + * so should be holding the xprt transport lock.
>>>>> + */
>>>>> +void xprt_pin_rqst(struct rpc_rqst *req)
>>>>> +{
>>>>> +	set_bit(RPC_TASK_MSG_RECV, &req->rq_task->tk_runstate);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * xprt_unpin_rqst - Unpin a request on the transport receive list
>>>>> + * @req: Request to pin
>>>>> + *
>>>>> + * Caller should be holding the xprt transport lock.
>>>>> + */
>>>>> +void xprt_unpin_rqst(struct rpc_rqst *req)
>>>>> +{
>>>>> +	struct rpc_task *task = req->rq_task;
>>>>> +
>>>>> +	clear_bit(RPC_TASK_MSG_RECV, &task->tk_runstate);
>>>>> +	if (test_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate))
>>>>> +		wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_RECV);
>>>>> +}
>>>>> +
>>>>> +static void xprt_wait_on_pinned_rqst(struct rpc_rqst *req)
>>>>> +__must_hold(&req->rq_xprt->transport_lock)
>>>>> +{
>>>>> +	struct rpc_task *task = req->rq_task;
>>>>> +
>>>>> +	if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
>>>>> +		spin_unlock_bh(&req->rq_xprt->transport_lock);
>>>>> +		set_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
>>>>> +		wait_on_bit(&task->tk_runstate, RPC_TASK_MSG_RECV,
>>>>> +				TASK_UNINTERRUPTIBLE);
>>>>> +		clear_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
>>>>> +		spin_lock_bh(&req->rq_xprt->transport_lock);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>  static void xprt_update_rtt(struct rpc_task *task)
>>>>>  {
>>>>>  	struct rpc_rqst *req = task->tk_rqstp;
>>>>> @@ -1301,6 +1342,7 @@ void xprt_release(struct rpc_task *task)
>>>>>  		list_del(&req->rq_list);
>>>>>  		xprt->last_used = jiffies;
>>>>>  		xprt_schedule_autodisconnect(xprt);
>>>>> +		xprt_wait_on_pinned_rqst(req);
>>>>>  		spin_unlock_bh(&xprt->transport_lock);
>>>> 
>>>> Is it OK to call wait_on_bit(TASK_UNINTERRUPTIBLE) while holding
>>>> a BH spin lock? This could be prone to deadlock.
>>> 
>>> We drop the lock inside xprt_wait_on_pinned_rqst() if we need to
>>> wait.
>>> 
>>> The reason why we want to hold the lock there is to avoid a
>>> use-after-free in xprt_unpin_rqst(). Without the lock, there is a
>>> risk that xprt_release() could complete, and the task get freed
>>> before we have completed wake_up_bit().
>> 
>> Agree with the last bit. I had that challenge with the order of
>> memory invalidation (which waits) and xprt_lookup_rqst/complete_rqst
>> in rpcrdma_reply_handler.
>> 
>> However my concern is that there are many other users of the
>> transport_lock. Does it become more tricky not to introduce a
>> problem by changing any one of the sites that take transport_lock?
> 
> What is your concern? The main reason for taking the transport lock
> in the current code is twofold:
> 
> 1) Perform the lookup of the request on the receive list
> 2) Ensure the request doesn't disappear while we're writing to it.

Yes, I think we see the same problem space. I missed, though, that
xprt_wait_on_pinned_rqst always releases transport_lock before
invoking wait_on_bit.
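To make sure I'm reading it right, here is a sketch of how I
understand a receive path would use the new API. The xprtsock.c hunk
isn't quoted above, so the function and helper names below are my
own, not the patch's:

    /* Illustrative only; assumes <linux/sunrpc/xprt.h>. */
    static int example_copy_into_bufs(struct rpc_rqst *req);  /* hypothetical */

    static void example_read_reply(struct rpc_xprt *xprt, __be32 xid)
    {
            struct rpc_rqst *req;
            int copied;

            /* Look up and pin atomically under the transport lock,
             * so the request cannot be released between the two. */
            spin_lock_bh(&xprt->transport_lock);
            req = xprt_lookup_rqst(xprt, xid);
            if (!req) {
                    spin_unlock_bh(&xprt->transport_lock);
                    return;
            }
            xprt_pin_rqst(req);
            spin_unlock_bh(&xprt->transport_lock);

            /* Safe to fill req->rq_private_buf without the lock:
             * xprt_release() now blocks in xprt_wait_on_pinned_rqst()
             * until we unpin. */
            copied = example_copy_into_bufs(req);

            spin_lock_bh(&xprt->transport_lock);
            if (copied >= 0)
                    xprt_complete_rqst(req->rq_task, copied);
            xprt_unpin_rqst(req);   /* may wake a waiter in xprt_release() */
            spin_unlock_bh(&xprt->transport_lock);
    }

The point being that the copy itself runs without transport_lock
held, while lookup+pin and complete+unpin each stay atomic under it.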
>> Despite the deadlock concern, I don't think it makes sense to call
>> wait_on_bit while holding a spin lock simply because spin locks are
>> supposed to be for things that are quick, like updating a data
>> structure.
>> 
>> wait_on_bit can take microseconds: prepare_to_wait and finish_wait
>> can both take an irqsave spin_lock, for example. bit_wait is the
>> action that is done if the bit is not ready, and that calls
>> schedule(). On a busy system, that kind of context switch can take
>> dozens of microseconds.
> 
> Agreed, but we should only very rarely be hitting the wait case. In
> all successful RPC calls, the receive will have happened long before
> we call xprt_release(). In fact, the only two cases where this can
> happen would be when the RPC call itself is interrupted. Either:
> 
> 1) The task is synchronous and was interrupted by the user.
> 2) The task is terminating early due to a soft timeout.
> 
> In both those cases, we're in a non-performance path. Furthermore,
> they would currently end up spinning against the transport lock.

OK, agreed that there is no performance concern here. I can hit this
case 2 or 3 times out of 5 with generic/029 on NFSv4/RDMA. These days
the worst that happens is the client drops the RDMA connection,
because I've spent a very long time immersed in these code paths
trying to make the two cases you list above work 100% without
deadlock or crash. See commits 04d25b7d5d1b through 431af645cf66, and
commit 68791649a725.

It makes me uncertain that waiting for anything in xprt_release is
safe, even if the transport_lock is not being held. xprtrdma used to
perform synchronous memory invalidation in ->buf_free. It no longer
does, for this reason.

>> IMO it would be a lot nicer if we had an FSM state available that
>> could allow an RPC task another sleep during xprt_release.
> 
> Why would that be preferable, if the argument holds that this only
> happens in these 2 rare error cases?

Because FSM steps are the way this code manages sleeps. That makes it
easier to understand and less brittle. I don't think it would be more
overhead than test_bit().
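Something along these lines is what I'm picturing. Purely
illustrative: rpc_sleep_on() and the xprt->pending wait queue exist
today, but none of these state names do:

    static void call_release_done(struct rpc_task *task);  /* invented state */

    static void call_wait_unpin(struct rpc_task *task)
    {
            struct rpc_rqst *req = task->tk_rqstp;

            if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
                    /* Re-enter this state when woken. The receive
                     * side would then call rpc_wake_up_queued_task()
                     * from xprt_unpin_rqst() instead of
                     * wake_up_bit(). */
                    task->tk_action = call_wait_unpin;
                    rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
                    return;
            }
            task->tk_action = call_release_done;
    }

That keeps the sleep inside the scheduler's own queueing, which is
the machinery the rest of this code already uses.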
--
Chuck Lever