> On Aug 26, 2022, at 9:29 AM, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > On Wed, Aug 24, 2022 at 02:09:52PM +0000, Chuck Lever III wrote: >> >> >>> On Aug 24, 2022, at 5:20 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote: >>> >>> On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote: >>>> >>>> >>>>> On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote: >>>>> >>>>> On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote: >>> >>> <...> >>> >>>>>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that >>>>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to >>>>>> prevent a priority inversion. >>>>> >>>>> But why do you have WQ_MEM_RECLAIM in xprtiod? >>>> >>>> Because RPC is under a filesystem (NFS). Therefore it has to handle >>>> writeback demanded by direct reclaim. All of the storage ULPs have >>>> this constraint, in fact. >>> >>> I don't know, this ib_addr workqueue is used when connection is created. >> >> Reconnection is exactly when we need to ensure that creating >> a new connection won't trigger more memory allocation, because >> that will immediately deadlock. >> >> Again, all network storage ULPs have this constraint. > > IMHO this whole concept is broken. > > The RDMA stack does not operate globally under RECLAIM, nor should it. > > If you attempt to do a reconnection/etc from within a RECLAIM context > it will deadlock on one of the many allocations that are made to > support opening the connection. > > The general idea of reclaim is that the entire task context working > under the reclaim is marked with an override of the gfp flags to make > all allocations under that call chain reclaim safe. > > But rdmacm does allocations outside this, eg in the WQs processing the > CM packets. So this doesn't work and we will deadlock. > > Fixing it is a big deal and needs more that poking WQ_MEM_RECLAIM here > and there.. > > For instance, this patch is just incorrect, you can't use > WQ_MEM_RECLAIM on a WQ that is doing allocations and expect anything > useful to happen: > > addr_resolve() > addr_resolve_neigh() > fetch_ha() > ib_nl_fetch_ha() > ib_nl_ip_send_msg() > skb = nlmsg_new(len, GFP_KERNEL); > > So regardless of the MEM_RECLAIM the *actual work* being canceled can > be stuck on an non-reclaim-safe allocation. I see recent commits that do exactly what I've done for the reason I've done it. 4c4b1996b5db ("IB/hfi1: Fix WQ_MEM_RECLAIM warning") 533d2e8b4d5e ("nvmet-tcp: fix lockdep complaint on nvmet_tcp_wq flush during queue teardown") I accept that this might be a long chain to pull, but we need a plan to resolve this. Storage ULPs go to a lot of trouble to pre-allocate resources to avoid deadlocking in reclaim -- handling reclaim properly is supposed to be designed into this stack. -- Chuck Lever