Re: [PATCH v1] RDMA/core: Fix check_flush_dependency splat on addr_wq

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Fri, 26 Aug 2022 14:02:55 +0000

> On Aug 26, 2022, at 9:29 AM, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> On Wed, Aug 24, 2022 at 02:09:52PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Aug 24, 2022, at 5:20 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>>> 
>>> On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>>>>> 
>>>>> On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:
>>> 
>>> <...>
>>> 
>>>>>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
>>>>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>>>>>> prevent a priority inversion.
>>>>> 
>>>>> But why do you have WQ_MEM_RECLAIM in xprtiod?
>>>> 
>>>> Because RPC is under a filesystem (NFS). Therefore it has to handle
>>>> writeback demanded by direct reclaim. All of the storage ULPs have
>>>> this constraint, in fact.
>>> 
>>> I don't know, this ib_addr workqueue is used when connection is created.
>> 
>> Reconnection is exactly when we need to ensure that creating
>> a new connection won't trigger more memory allocation, because
>> that will immediately deadlock.
>> 
>> Again, all network storage ULPs have this constraint.
> 
> IMHO this whole concept is broken.
> 
> The RDMA stack does not operate globally under RECLAIM, nor should it.
> 
> If you attempt to do a reconnection/etc from within a RECLAIM context
> it will deadlock on one of the many allocations that are made to
> support opening the connection.
> 
> The general idea of reclaim is that the entire task context working
> under the reclaim is marked with an override of the gfp flags to make
> all allocations under that call chain reclaim safe.
> 
> But rdmacm does allocations outside this, eg in the WQs processing the
> CM packets. So this doesn't work and we will deadlock.
> 
> Fixing it is a big deal and needs more that poking WQ_MEM_RECLAIM here
> and there..
> 
> For instance, this patch is just incorrect, you can't use
> WQ_MEM_RECLAIM on a WQ that is doing allocations and expect anything
> useful to happen:
> 
> addr_resolve()
> addr_resolve_neigh()
>  fetch_ha()
>   ib_nl_fetch_ha()
>     ib_nl_ip_send_msg()
>       	skb = nlmsg_new(len, GFP_KERNEL);
> 
> So regardless of the MEM_RECLAIM the *actual work* being canceled can
> be stuck on an non-reclaim-safe allocation.

I see recent commits that do exactly what I've done for the reason I've done it.

4c4b1996b5db ("IB/hfi1: Fix WQ_MEM_RECLAIM warning")
533d2e8b4d5e ("nvmet-tcp: fix lockdep complaint on nvmet_tcp_wq flush during queue teardown")

I accept that this might be a long chain to pull, but we need a plan
to resolve this. Storage ULPs go to a lot of trouble to pre-allocate
resources to avoid deadlocking in reclaim -- handling reclaim properly
is supposed to be designed into this stack.

--
Chuck Lever