On Sat, 23 Sep 2023, Trond Myklebust wrote: > On Fri, 2023-09-22 at 13:22 -0400, Olga Kornievskaia wrote: > > On Wed, Sep 20, 2023 at 8:27 PM Trond Myklebust > > <trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > > On Wed, 2023-09-20 at 15:38 -0400, Anna Schumaker wrote: > > > > Hi Trond, > > > > > > > > On Sun, Sep 17, 2023 at 7:12 PM <trondmy@xxxxxxxxxx> wrote: > > > > > > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> > > > > > > > > > > Commit 4dc73c679114 reintroduces the deadlock that was fixed by > > > > > commit > > > > > aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") > > > > > because > > > > > it > > > > > prevents the setup of new threads to handle reboot recovery, > > > > > while > > > > > the > > > > > older recovery thread is stuck returning delegations. > > > > > > > > I'm seeing a possible deadlock with xfstests generic/472 on NFS > > > > v4.x > > > > after applying this patch. The test itself checks for various > > > > swapfile > > > > edge cases, so it seems likely something is going on there. > > > > > > > > Let me know if you need more info > > > > Anna > > > > > > > > > > Did you turn off delegations on your server? If you don't, then > > > swap > > > will deadlock itself under various scenarios. > > > > Is there documentation somewhere that says that delegations must be > > turned off on the server if NFS over swap is enabled? > > I think the question is more generally "Is there documentation for NFS > swap?" The main difference between using NFS for swap and for regular file IO is in the handling of writes, and particularly in the style of memory allocation that is safe while handling a write request (or anything which might block some write request, etc). For buffered IO, memory allocations must be GFP_NOIO or PF_MEMALLOC_NOIO. For swap-out, memory allocations must be GFP_MEMALLOC or PG_MEMALLOC. That is the primary difference - all other differences are minor. This difference might justify documentation suggesting that /proc/sys/vm/min_free_kbytes could usefully be increased, but I don't see that more is needed. The NOIO/MEMALLOC distinction is properly plumbed through nfs, sunrpc, and networking and all "just works". The problem area is that kthread_create() doesn't take a gfpflags_t argument, so it uses GFP_KERNEL allocations to create the new thread. This means that when a write cannot proceed without state management, and state management requests that a threads be started, there is a risk of memory allocation deadlock. I believe the risk is there even for buffered IO, but I'm not 100% certain and in practice I don't think a deadlock has ever been reported. With swap-out it is fairly easy to trigger a deadlock if there is heavy swap-out traffic when state management is needed. The common pattern in the kernel when a thread might be needed to support writeout is to keep the thread running permanently (rather than to add a gfpflags_t to kthread_create), so that is what I added to the nfsv4 state manager. However the state manager thread has a second use - returning delegations. This sometimes needs to run concurrently with state management, so one thread is not enough. What is that context for delegation return? Does it ever block writes? If it doesn't, would it make sense to use a work queue for returning delegations - maybe system_wq? I think it might be best to have the nfsv4 state manager thread always be running whether swap is enabled or not. As I say I think there is at least a theoretical risk of a deadlock even without swap, and having a small test matrix is usually a good thing. Thanks, NeilBrown