> On Aug 29, 2022, at 1:22 PM, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:
>>
>>
>>> On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>
>>> On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
>>>> The connect APIs would be a place to start. In the meantime, though...
>>>>
>>>> Two or three years ago I spent some effort to ensure that closing
>>>> an RDMA connection leaves a client-side RPC/RDMA transport with no
>>>> RDMA resources associated with it. It releases the CQs, QP, and all
>>>> the MRs. That makes initial connect and reconnect both behave exactly
>>>> the same, and guarantees that a reconnect does not get stuck with
>>>> an old CQ that is no longer working or a QP that is in TIMEWAIT.
>>>>
>>>> However that does mean that substantial resource allocation is
>>>> done on every reconnect.
>>>
>>> And if the resource allocations fail then what happens? The storage
>>> ULP retries forever and is effectively deadlocked?
>>
>> The reconnection attempt fails, and any resources allocated during
>> that attempt are released. The ULP waits a bit then tries again
>> until it works or is interrupted.
>>
>> A deadlock might occur if one of those allocations triggers
>> additional reclaim activity.
>
> No, you are deadlocked now.

GFP_KERNEL can and will give up eventually, in which case the
connection attempt fails and any previously allocated memory is
released. Something else can then make progress.

Single-page allocations nearly always succeed. It's the larger-order
allocations that can block for long periods, and that's not
necessarily because memory is low -- it can happen when one NUMA
node's memory is heavily fragmented.

> If a direct reclaim calls back into NFS we are already at the point
> where normal allocations fail, and we are accessing the emergency
> reserve.
>
> When reclaim does this it marks the entire task with
> memalloc_noio_save() which forces GFP_NOIO on every allocation that
> task makes, meaning every allocation comes from the emergency reserve
> already.
>
> This is why it (barely) works *at all* with RDMA.
>
> If during the writeback the reserve is exhausted and memory allocation
> fails, then the IO stack is in trouble - either it fails the writeback
> (then what?) or it deadlocks the kernel because it *cannot* make
> forward progress without those memory allocations.
>
> The fact we have cases where the storage thread under the
> memalloc_noio_save() becomes contingent on the forward progress of
> other contexts that don't have memalloc_noio_save() is a fairly
> serious problem I can't see a solution to.

This issue seems to be addressed in the socket stack, so I don't
believe there's _no_ solution for RDMA. Usually the trick is to
communicate the memalloc_noio settings somehow to other allocating
threads. We could use cgroups, for example, to collect these threads
and resources under one GFP umbrella. /eyeroll /ducks

If nothing else we can talk with the MM folks about planning
improvements. We've just gone through this with NFS on the socket
stack.

> Even a simple case like mlx5 may cause the NIC to trigger a host
> memory allocation, which is done in another thread and done as a
> normal GFP_KERNEL. This memory allocation must progress before a
> CQ/QP/MR/etc can be created. So now we are deadlocked again.

That sounds to me like a bug in mlx5. The driver is supposed to
respect the caller's GFP settings.

Again, if the request is small, it's likely to succeed anyway, but
larger requests are not reliable and need to fail quickly so the
system can move on to other fishing spots.
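Roughly what I mean by failing quickly for the larger requests -- an
untested sketch only, with a made-up helper name, not an actual patch:

#include <linux/slab.h>

/*
 * Untested sketch: try a larger-order buffer without letting the
 * allocator stall in reclaim/compaction. __GFP_NORETRY makes the
 * allocator give up quickly, and __GFP_NOWARN suppresses the
 * allocation-failure splat. If it fails, the connect attempt simply
 * fails and is retried later, so something else can make progress
 * in the meantime.
 */
static void *xprt_rdma_alloc_big(size_t size)	/* name made up */
{
	return kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
}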
I would like to at least get rid of the check_flush_dependency splat,
which will fire a lot more often than we will actually get stuck in a
reclaim allocation corner. I'm testing a patch that converts rpcrdma
away from MEM_RECLAIM work queues; it also notes how extensive the
problem actually is.

--
Chuck Lever
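For illustration, dropping WQ_MEM_RECLAIM from the transport's work
queue might look roughly like the untested sketch below (the queue and
function names are made up, not the actual rpcrdma symbols):

#include <linux/workqueue.h>

static struct workqueue_struct *xprtrdma_wq;	/* name made up */

/*
 * Untested sketch: create the transport workqueue without
 * WQ_MEM_RECLAIM. check_flush_dependency() complains when work
 * running on a WQ_MEM_RECLAIM queue waits on a !WQ_MEM_RECLAIM queue
 * (for example the RDMA core or device driver queues), so once the
 * transport queue stops claiming WQ_MEM_RECLAIM that splat no longer
 * fires. Previously the flags here would have included WQ_MEM_RECLAIM.
 */
static int xprt_rdma_alloc_wq(void)
{
	xprtrdma_wq = alloc_workqueue("xprtrdma", WQ_UNBOUND, 0);

	return xprtrdma_wq ? 0 : -ENOMEM;
}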