Re: [PATCH v1] SUNRPC: Replace the use of the xprtiod WQ in rpcrdma

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Sat, 17 Sep 2022 02:14:17 +0000

> On Sep 16, 2022, at 6:40 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2022-09-16 at 14:28 -0400, Olga Kornievskaia wrote:
>> On Thu, Sep 15, 2022 at 11:56 AM Olga Kornievskaia <aglo@xxxxxxxxx>
>> wrote:
>>> 
>>> On Tue, Sep 6, 2022 at 12:25 PM Chuck Lever
>>> <chuck.lever@xxxxxxxxxx> wrote:
>>>> 
>>>> While setting up a new lab, I accidentally misconfigured the
>>>> Ethernet port for a system that tried an NFS mount using RoCE.
>>>> This made the NFS server unreachable. The following WARNING
>>>> popped on the NFS client while waiting for the mount attempt to
>>>> time out:
>>> 
>>> I also hit this today (on the 5.18 kernel) while running xfstest
>>> generic/460 using soft iWarp. In my case the port was properly
>>> configured. The test was going. I'm not sure exactly what happened.
>>> I
>>> know I also crashed the server that I was running against. But the
>>> point I would like to make is that this condition is possible to
>>> get
>>> to on a properly configured system.
>> 
>> But I think with this patch. I'm hitting this instead (of course
>> could
>> be something else):
>> 
>> [ 3222.712335] BUG: using smp_processor_id() in preemptible
>> [00000000]
>> code: 192.168.1.124-m/3814
>> [ 3222.714428] caller is xprt_rdma_connect+0x6a/0x120 [rpcrdma]
>> [ 3222.716047] CPU: 0 PID: 3814 Comm: 192.168.1.124-m Not tainted
>> 6.0.0-rc5+ #123
>> [ 3222.717706] Hardware name: VMware, Inc. VMware Virtual
>> Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
>> [ 3222.720310] Call Trace:
>> [ 3222.721032]  <TASK>
>> [ 3222.721587]  dump_stack_lvl+0x33/0x46
>> [ 3222.722501]  check_preemption_disabled+0xc3/0xf0
>> [ 3222.723754]  xprt_rdma_connect+0x6a/0x120 [rpcrdma]
>> [ 3222.725594]  xprt_connect+0x300/0x370 [sunrpc]
>> [ 3222.727369]  ? call_reserveresult+0xa0/0xa0 [sunrpc]
>> [ 3222.729272]  __rpc_execute+0x162/0x870 [sunrpc]
>> [ 3222.731101]  ? rpc_exit+0x40/0x40 [sunrpc]
>> [ 3222.732841]  ? __wake_up+0x10/0x10
>> [ 3222.733657]  rpc_execute+0x148/0x1b0 [sunrpc]
>> [ 3222.735326]  rpc_run_task+0x270/0x2d0 [sunrpc]
>> [ 3222.737182]  nfs4_proc_bind_one_conn_to_session+0x1cc/0x3a0
>> [nfsv4]
>> [ 3222.740472]  ? _nfs4_do_set_security_label+0x2d0/0x2d0 [nfsv4]
>> [ 3222.745034]  ? xprt_get+0xa0/0x120 [sunrpc]
>> [ 3222.747150]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0
>> [nfsv4]
>> [ 3222.749299]  ? __rcu_read_unlock+0x4e/0x250
>> [ 3222.750429]  ? nfs4_proc_bind_one_conn_to_session+0x3a0/0x3a0
>> [nfsv4]
>> [ 3222.752586]  rpc_clnt_iterate_for_each_xprt+0xc6/0x140 [sunrpc]
>> [ 3222.754900]  ? rpc_clnt_xprt_switch_add_xprt+0xa0/0xa0 [sunrpc]
>> [ 3222.757041]  ? preempt_count_sub+0x14/0xc0
>> [ 3222.758097]  nfs4_proc_bind_conn_to_session+0x87/0xb0 [nfsv4]
>> [ 3222.760341]  ? nfs4_proc_secinfo+0x250/0x250 [nfsv4]
>> [ 3222.762257]  nfs4_state_manager+0x34e/0xf60 [nfsv4]
>> [ 3222.764095]  nfs4_run_state_manager+0x1a6/0x2e0 [nfsv4]
>> [ 3222.766778]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
>> [ 3222.768811]  ? _raw_spin_lock_irqsave+0x8d/0xf0
>> [ 3222.770029]  ? _raw_spin_unlock_irqrestore+0x40/0x40
>> [ 3222.771740]  ? __list_del_entry_valid+0x77/0xa0
>> [ 3222.773400]  ? nfs4_state_manager+0xf60/0xf60 [nfsv4]
>> [ 3222.775653]  kthread+0x160/0x190
>> [ 3222.776729]  ? kthread_complete_and_exit+0x20/0x20
>> [ 3222.777865]  ret_from_fork+0x1f/0x30
>> [ 3222.778972]  </TASK>
>> 
>> 
>>> 
>>>> kernel: workqueue: WQ_MEM_RECLAIM
>>>> xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing
>>>> !WQ_MEM_RECLAI>
>>>> kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628
>>>> check_flush_dependency+0xbf/0xca
>>>> kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs
>>>> 8021q garp stp mrp llc rfkill rpcrdma>
>>>> kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-
>>>> 00002-g6229f8c054e5 #13
>>>> kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b
>>>> 06/12/2017
>>>> kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
>>>> kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
>>>> kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48
>>>> 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
>>>> kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
>>>> kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX:
>>>> 0000000000000027
>>>> kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI:
>>>> 00000000ffffffff
>>>> kernel: RBP: ffff978941315840 R08: 0000000000000000 R09:
>>>> 0000000000000000
>>>> kernel: R10: 00000000000008b0 R11: 0000000000000001 R12:
>>>> ffffffffc0ce3731
>>>> kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15:
>>>> ffff978951112eb0
>>>> kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000)
>>>> knlGS:0000000000000000
>>>> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4:
>>>> 00000000003706f0
>>>> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400
>>>> kernel: Call Trace:
>>>> kernel:  <TASK>
>>>> kernel:  __flush_work.isra.0+0xaf/0x188
>>>> kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
>>>> kernel:  ? lock_timer_base+0x38/0x5f
>>>> kernel:  __cancel_work_timer+0xea/0x13d
>>>> kernel:  ? preempt_latency_start+0x2b/0x46
>>>> kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
>>>> kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
>>>> kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
>>>> kernel:  ? _raw_spin_unlock+0x14/0x29
>>>> kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
>>>> kernel:  ? finish_task_switch.isra.0+0x171/0x249
>>>> kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
>>>> kernel:  process_one_work+0x1d8/0x2d4
>>>> kernel:  worker_thread+0x18b/0x24f
>>>> kernel:  ? rescuer_thread+0x280/0x280
>>>> kernel:  kthread+0xf4/0xfc
>>>> kernel:  ? kthread_complete_and_exit+0x1b/0x1b
>>>> kernel:  ret_from_fork+0x22/0x30
>>>> kernel:  </TASK>
>>>> 
>>>> SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue
>>>> that
>>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>>>> prevent a priority inversion. The internal workqueues in the
>>>> RDMA/core are currently non-MEM_RECLAIM.
>>>> 
>>>> Jason Gunthorpe says this about the current state of RDMA/core:
>>>>> If you attempt to do a reconnection/etc from within a RECLAIM
>>>>> context it will deadlock on one of the many allocations that
>>>>> are
>>>>> made to support opening the connection.
>>>>> 
>>>>> The general idea of reclaim is that the entire task context
>>>>> working under the reclaim is marked with an override of the gfp
>>>>> flags to make all allocations under that call chain reclaim
>>>>> safe.
>>>>> 
>>>>> But rdmacm does allocations outside this, eg in the WQs
>>>>> processing
>>>>> the CM packets. So this doesn't work and we will deadlock.
>>>>> 
>>>>> Fixing it is a big deal and needs more than poking
>>>>> WQ_MEM_RECLAIM
>>>>> here and there.
>>>> 
>>>> So we will change the ULP in this case to avoid the use of
>>>> WQ_MEM_RECLAIM where possible. Deadlocks that were possible
>>>> before
>>>> are not fixed, but at least we no longer have a false sense of
>>>> confidence that the stack won't allocate memory during memory
>>>> reclaim.
>>>> 
>>>> While we're adjusting these queue_* call sites, ensure the work
>>>> requests always run on the local CPU so the worker allocates RDMA
>>>> resources that are local to the CPU that queued the work request.
>>>> 
>>>> Suggested-by: Leon Romanovsky <leon@xxxxxxxxxx>
>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> ---
>>>>  net/sunrpc/xprtrdma/transport.c |    4 ++--
>>>>  net/sunrpc/xprtrdma/verbs.c     |   11 ++++-------
>>>>  2 files changed, 6 insertions(+), 9 deletions(-)
>>>> 
>>>> Hi Anna-
>>>> 
>>>> I've had this applied to my test client for a while. I think it's
>>>> ready to apply.
>>>> 
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/transport.c
>>>> b/net/sunrpc/xprtrdma/transport.c
>>>> index bcb37b51adf6..9581641bb8cb 100644
>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>> @@ -494,8 +494,8 @@ xprt_rdma_connect(struct rpc_xprt *xprt,
>>>> struct rpc_task *task)
>>>>                 xprt_reconnect_backoff(xprt,
>>>> RPCRDMA_INIT_REEST_TO);
>>>>         }
>>>>         trace_xprtrdma_op_connect(r_xprt, delay);
>>>> -       queue_delayed_work(xprtiod_workqueue, &r_xprt-
>>>>> rx_connect_worker,
>>>> -                          delay);
>>>> +       queue_delayed_work_on(smp_processor_id(), system_long_wq,
>>>> +                             &r_xprt->rx_connect_worker, delay);
>>>>  }
>>>> 
>>>>  /**
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>>>> b/net/sunrpc/xprtrdma/verbs.c
>>>> index 2fbe9aaeec34..691afc96bcbc 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>> @@ -791,13 +791,10 @@ void rpcrdma_mrs_refresh(struct
>>>> rpcrdma_xprt *r_xprt)
>>>>         /* If there is no underlying connection, it's no use
>>>>          * to wake the refresh worker.
>>>>          */
>>>> -       if (ep->re_connect_status == 1) {
>>>> -               /* The work is scheduled on a WQ_MEM_RECLAIM
>>>> -                * workqueue in order to prevent MR allocation
>>>> -                * from recursing into NFS during direct reclaim.
>>>> -                */
>>>> -               queue_work(xprtiod_workqueue, &buf-
>>>>> rb_refresh_worker);
>>>> -       }
>>>> +       if (ep->re_connect_status != 1)
>>>> +               return;
>>>> +       queue_work_on(smp_processor_id(), system_highpri_wq,
>>>> +                     &buf->rb_refresh_worker);
>>>>  }
>>>> 
>>>>  /**
>>>> 
>>>> 
> 
> Right. smp_processor_id() is only allowed to be called when preemption
> has been disabled. See Documentation/kernel-hacking/hacking.rst and
> Documentation/locking/preempt-locking.rst.
> 
> Why not just use queue_work(), Chuck? That achieves the exact same
> thing without requiring any extra locking.

The intent was to allocate resources on the NUMA node that owns
the device. The code doesn't do that, I can see now.

I will send you a v2 that does the obvious thing.

--
Chuck Lever