> On Nov 8, 2016, at 3:20 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
> On Tue, Nov 08, 2016 at 03:13:13PM -0500, Chuck Lever wrote:
>> 
>>> On Nov 8, 2016, at 3:03 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>> 
>>> On Sat, Oct 29, 2016 at 12:21:03PM -0400, Chuck Lever wrote:
>>>> Hi Bruce-
>>> 
>>> Sorry for the slow response!
>>> 
>>> ...
>>>> In commit 39a9beab5acb83176e8b9a4f0778749a09341f1f ('rpc: share one xps between
>>>> all backchannels') you added:
>>>> 
>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>> index f5572e3..4f01f63 100644
>>>> --- a/net/sunrpc/svc_xprt.c
>>>> +++ b/net/sunrpc/svc_xprt.c
>>>> @@ -136,6 +136,8 @@ static void svc_xprt_free(struct kref *kref)
>>>>  	/* See comment on corresponding get in xs_setup_bc_tcp(): */
>>>>  	if (xprt->xpt_bc_xprt)
>>>>  		xprt_put(xprt->xpt_bc_xprt);
>>>> +	if (xprt->xpt_bc_xps)
>>>> +		xprt_switch_put(xprt->xpt_bc_xps);
>>>>  	xprt->xpt_ops->xpo_free(xprt);
>>>>  	module_put(owner);
>>>>  }
>>>> 
>>>> svc_xprt_free() is invoked by svc_xprt_put(). svc_xprt_put() is called
>>>> from svc_rdma in soft IRQ context (eg. svc_rdma_wc_receive).
>>> 
>>> Is that necessary?  I wonder why the svcrdma code seems to be taking so
>>> many of its own references on svc_xprts.
>> 
>> The idea is to keep the xprt around while Work Requests (I/O) are running,
>> so that the xprt is guaranteed to be there during work completions. The
>> completion handlers (where svc_xprt_put is often invoked) run in soft IRQ
>> context.
>> 
>> It's simple to change completions to use a Work Queue instead, but testing
>> so far shows that will result in a performance loss. I'm still studying it.
>> 
>> Is there another way to keep the xprt's ref count boosted while I/O is
>> going on?
> 
> Why do you need the svc_xprt in the completion?

1. The svc_xprt contains the svcrdma_xprt, which contains the Send Queue
accounting mechanism. SQ accounting has to be updated for each completion:
the completion indicates that the SQ entry used by this Work Request is
now available, and that other callers waiting for enough SQEs can go ahead.

2. When handling a Receive completion, the incoming RPC message is enqueued
on the svc_xprt via svc_xprt_enqueue, unless RDMA Reads are needed.

3. When handling a Read completion, the incoming RPC message is enqueued
on the svc_xprt via svc_xprt_enqueue.

> Can the xpo_detach method wait for any pending completions?

So the completions would call wake_up on a waitqueue, or use a kernel
completion? That sounds more expensive than managing an atomic reference
count.

I suppose some other reference count could be used to trigger a work item
that would invoke xpo_detach.

But based on your comments, then, svc_xprt_put() was never intended to be
BH-safe.

> --b.
> 
>> 
>> 
>>>> However, xprt_switch_put() takes a spin lock (xps_lock) which is locked
>>>> everywhere without disabling BHs.
>>>> 
>>>> It looks to me like 39a9beab5acb makes svc_xprt_put() no longer BH-safe?
>>>> Not sure if svc_xprt_put() was intended to be BH-safe beforehand.
>>>> 
>>>> Maybe xprt_switch_put() could be invoked in ->xpo_free, but that seems
>>>> like a temporary solution.
>>> 
>>> Since xpo_free is also called from svc_xprt_put that doesn't sound like
>>> it would change anything.  Or do we not trunk over RDMA for some reason?
>> 
>> It's quite desirable to trunk NFS/RDMA on multiple connections, and it
>> should work just like it does for TCP, but so far it's not been tested.
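
To make the BH-safety concern concrete, here is a minimal sketch of the
hazard (the names below are made up for illustration; this is not the
actual sunrpc code). The stand-in lock plays the role of xps_lock: it is
taken with plain spin_lock() in process context, so a soft IRQ that tries
to take it on the same CPU spins forever, because the interrupted holder
cannot run again until the soft IRQ returns.

#include <linux/spinlock.h>

/* Stand-in for xps_lock; illustration only. */
static DEFINE_SPINLOCK(example_lock);

/* Process context: the lock is taken without disabling bottom halves. */
static void process_context_user(void)
{
	spin_lock(&example_lock);
	/* If a soft IRQ fires on this CPU right here... */
	spin_unlock(&example_lock);
}

/* Soft IRQ context: e.g. a completion handler that drops the last
 * reference and ends up taking the same lock. */
static void softirq_context_user(void)
{
	/* ...this spins forever, because the interrupted holder cannot
	 * run again until the soft IRQ returns. */
	spin_lock(&example_lock);
	spin_unlock(&example_lock);
}

spin_lock_bh() on the process-context side, or keeping the final put out
of soft IRQ context, would avoid this.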
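
And a rough sketch of the work-item idea, deferring just the final
svc_xprt_put() to process context rather than moving all completion
processing onto a work queue. The names (svc_rdma_deferred_put,
svc_rdma_put_worker, svc_rdma_put_from_completion) are hypothetical:

#include <linux/slab.h>
#include <linux/workqueue.h>
#include <linux/sunrpc/svc_xprt.h>

struct svc_rdma_deferred_put {
	struct work_struct	work;
	struct svc_xprt		*xprt;
};

static void svc_rdma_put_worker(struct work_struct *work)
{
	struct svc_rdma_deferred_put *dp =
		container_of(work, struct svc_rdma_deferred_put, work);

	/* Runs in process context, so xprt_switch_put() taking
	 * xps_lock with plain spin_lock() is no longer a problem. */
	svc_xprt_put(dp->xprt);
	kfree(dp);
}

/* Called from a Work Completion handler (soft IRQ context) in place
 * of a direct svc_xprt_put(). */
static void svc_rdma_put_from_completion(struct svc_xprt *xprt)
{
	struct svc_rdma_deferred_put *dp;

	dp = kmalloc(sizeof(*dp), GFP_ATOMIC);
	if (!dp)
		return;	/* error handling omitted in this sketch */
	INIT_WORK(&dp->work, svc_rdma_put_worker);
	dp->xprt = xprt;
	schedule_work(&dp->work);
}

The GFP_ATOMIC allocation is the obvious cost; embedding the work_struct
in an existing per-I/O context structure would avoid it.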
--
Chuck Lever