On Tue, 2023-09-26 at 08:28 +1000, NeilBrown wrote: > On Sat, 23 Sep 2023, Trond Myklebust wrote: > > On Fri, 2023-09-22 at 13:22 -0400, Olga Kornievskaia wrote: > > > On Wed, Sep 20, 2023 at 8:27 PM Trond Myklebust > > > <trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > > > > On Wed, 2023-09-20 at 15:38 -0400, Anna Schumaker wrote: > > > > > Hi Trond, > > > > > > > > > > On Sun, Sep 17, 2023 at 7:12 PM <trondmy@xxxxxxxxxx> wrote: > > > > > > > > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> > > > > > > > > > > > > Commit 4dc73c679114 reintroduces the deadlock that was > > > > > > fixed by > > > > > > commit > > > > > > aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") > > > > > > because > > > > > > it > > > > > > prevents the setup of new threads to handle reboot > > > > > > recovery, > > > > > > while > > > > > > the > > > > > > older recovery thread is stuck returning delegations. > > > > > > > > > > I'm seeing a possible deadlock with xfstests generic/472 on > > > > > NFS > > > > > v4.x > > > > > after applying this patch. The test itself checks for various > > > > > swapfile > > > > > edge cases, so it seems likely something is going on there. > > > > > > > > > > Let me know if you need more info > > > > > Anna > > > > > > > > > > > > > Did you turn off delegations on your server? If you don't, then > > > > swap > > > > will deadlock itself under various scenarios. > > > > > > Is there documentation somewhere that says that delegations must > > > be > > > turned off on the server if NFS over swap is enabled? > > > > I think the question is more generally "Is there documentation for > > NFS > > swap?" > > The main difference between using NFS for swap and for regular file > IO > is in the handling of writes, and particularly in the style of memory > allocation that is safe while handling a write request (or anything > which might block some write request, etc). > > For buffered IO, memory allocations must be GFP_NOIO or > PF_MEMALLOC_NOIO. > For swap-out, memory allocations must be GFP_MEMALLOC or PG_MEMALLOC. > > That is the primary difference - all other differences are minor. > This > difference might justify documentation suggesting that > /proc/sys/vm/min_free_kbytes could usefully be increased, but I don't > see that more is needed. > > The NOIO/MEMALLOC distinction is properly plumbed through nfs, > sunrpc, > and networking and all "just works". The problem area is that > kthread_create() doesn't take a gfpflags_t argument, so it uses > GFP_KERNEL allocations to create the new thread. > > This means that when a write cannot proceed without state management, > and state management requests that a threads be started, there is a > risk > of memory allocation deadlock. > I believe the risk is there even for buffered IO, but I'm not 100% > certain and in practice I don't think a deadlock has ever been > reported. > With swap-out it is fairly easy to trigger a deadlock if there is > heavy > swap-out traffic when state management is needed. > > The common pattern in the kernel when a thread might be needed to > support writeout is to keep the thread running permanently (rather > than > to add a gfpflags_t to kthread_create), so that is what I added to > the > nfsv4 state manager. > > However the state manager thread has a second use - returning > delegations. This sometimes needs to run concurrently with state > management, so one thread is not enough. > > What is that context for delegation return? Does it ever block > writes? > If it doesn't, would it make sense to use a work queue for returning > delegations - maybe system_wq? These are potentially long-lived processes because there may be lock recovery involved, and because of the conflict with state recovery, so it does not make sense to put them on a workqueue. > > I think it might be best to have the nfsv4 state manager thread > always > be running whether swap is enabled or not. As I say I think there is > at > least a theoretical risk of a deadlock even without swap, and having > a > small test matrix is usually a good thing. > That's a "prove me wrong" argument. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx