Re: [PATCH v1] RDMA/core: Fix check_flush_dependency splat on addr_wq

Jason Gunthorpe <jgg@xxxxxxxxxx> · Mon, 29 Aug 2022 14:22:46 -0300

On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:
> 
> 
> > On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > 
> > On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
> >> The connect APIs would be a place to start. In the meantime, though...
> >> 
> >> Two or three years ago I spent some effort to ensure that closing
> >> an RDMA connection leaves a client-side RPC/RDMA transport with no
> >> RDMA resources associated with it. It releases the CQs, QP, and all
> >> the MRs. That makes initial connect and reconnect both behave exactly
> >> the same, and guarantees that a reconnect does not get stuck with
> >> an old CQ that is no longer working or a QP that is in TIMEWAIT.
> >> 
> >> However that does mean that substantial resource allocation is
> >> done on every reconnect.
> > 
> > And if the resource allocations fail then what happens? The storage
> > ULP retries forever and is effectively deadlocked?
> 
> The reconnection attempt fails, and any resources allocated during
> that attempt are released. The ULP waits a bit then tries again
> until it works or is interrupted.
> 
> A deadlock might occur if one of those allocations triggers
> additional reclaim activity.

No, you are deadlocked now.

If a direct reclaim calls back into NFS we are already at the point
where normal allocations fail, and we are accessing the emergency
reserve.

When reclaim does this it marks the entire task with
memalloc_noio_save() which forces GFP_NOIO on every allocation that
task makes, meaning every allocation comes from the emergency reserve
already.

This is why it (barely) works *at all* with RDMA.

If during the writeback the reserve is exhausted and memory allocation
fails, then the IO stack is in trouble - either it fails the writeback
(then what?) or it deadlocks the kernel because it *cannot* make
forward progress without those memory allocations.

The fact that we have cases where the storage thread running under
memalloc_noio_save() becomes contingent on the forward progress of
other contexts that don't have memalloc_noio_save() is a fairly
serious problem that I can't see a solution to.

Even in a simple case like mlx5, the NIC may trigger a host memory
allocation, which is done in another thread as a normal GFP_KERNEL
allocation. That allocation must make progress before a
CQ/QP/MR/etc can be created. So now we are deadlocked again.
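
To illustrate the scoping problem, here is a rough userspace model of
how a per-task NOIO scope masks allocation flags. It is a sketch only:
the model_* names and the MODEL_* flag values are made up for this
example and are not the kernel's definitions. The point it shows is
that the scope follows the task that called the save helper, so a
GFP_KERNEL allocation made by a different thread is not covered.

```c
/* Userspace model only. All MODEL_* values and model_* helpers are
 * hypothetical stand-ins, not the kernel's real definitions. */
#include <assert.h>

#define MODEL_GFP_RECLAIM 0x1u
#define MODEL_GFP_IO      0x2u
#define MODEL_GFP_FS      0x4u
#define MODEL_GFP_KERNEL  (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOIO    (MODEL_GFP_RECLAIM)

#define MODEL_PF_MEMALLOC_NOIO 0x1u

/* Stand-in for the per-task state that carries the scope flag. */
struct model_task {
	unsigned int flags;
};

/* Enter a NOIO scope; return the previous state so scopes can nest. */
static unsigned int model_noio_save(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_MEMALLOC_NOIO;

	t->flags |= MODEL_PF_MEMALLOC_NOIO;
	return old;
}

/* Leave a NOIO scope by restoring the state returned by save. */
static void model_noio_restore(struct model_task *t, unsigned int old)
{
	t->flags = (t->flags & ~MODEL_PF_MEMALLOC_NOIO) | old;
}

/* Flags an allocation by task t actually runs with. */
static unsigned int model_effective_gfp(const struct model_task *t,
					unsigned int gfp)
{
	if (t->flags & MODEL_PF_MEMALLOC_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}

/* Walk through the scenario: storage task in scope, worker not. */
static int model_demo(void)
{
	struct model_task storage = { 0 };
	struct model_task worker = { 0 };
	unsigned int old = model_noio_save(&storage);

	/* Inside the scope, a GFP_KERNEL request is degraded to NOIO. */
	assert(model_effective_gfp(&storage, MODEL_GFP_KERNEL) ==
	       MODEL_GFP_NOIO);
	/* Another thread doing work on our behalf is not in the scope. */
	assert(model_effective_gfp(&worker, MODEL_GFP_KERNEL) ==
	       MODEL_GFP_KERNEL);

	model_noio_restore(&storage, old);
	assert(model_effective_gfp(&storage, MODEL_GFP_KERNEL) ==
	       MODEL_GFP_KERNEL);
	return 0;
}
```

In this model the worker's allocation keeps the IO and FS bits, which
is the case above where the storage thread ends up waiting on a
context that is still allowed to recurse into IO.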

Jason