On Fri, Sep 22, 2023 at 3:05 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2023-09-22 at 13:22 -0400, Olga Kornievskaia wrote:
> > On Wed, Sep 20, 2023 at 8:27 PM Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, 2023-09-20 at 15:38 -0400, Anna Schumaker wrote:
> > > > Hi Trond,
> > > >
> > > > On Sun, Sep 17, 2023 at 7:12 PM <trondmy@xxxxxxxxxx> wrote:
> > > > >
> > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > >
> > > > > Commit 4dc73c679114 reintroduces the deadlock that was fixed by
> > > > > commit aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock")
> > > > > because it prevents the setup of new threads to handle reboot
> > > > > recovery, while the older recovery thread is stuck returning
> > > > > delegations.
> > > >
> > > > I'm seeing a possible deadlock with xfstests generic/472 on NFS v4.x
> > > > after applying this patch. The test itself checks for various
> > > > swapfile edge cases, so it seems likely something is going on there.
> > > >
> > > > Let me know if you need more info.
> > > > Anna
> > >
> > > Did you turn off delegations on your server? If you don't, then swap
> > > will deadlock itself under various scenarios.
> >
> > Is there documentation somewhere that says that delegations must be
> > turned off on the server if NFS over swap is enabled?
>
> I think the question is more generally "Is there documentation for NFS
> swap?"
>
> > If the client can't handle delegations + swap, then shouldn't this be
> > solved by (1) checking if we are in NFS over swap and then proactively
> > setting 'don't want delegation' on open and/or (2) proactively
> > returning the delegation if received, so that we don't get into the
> > deadlock?
>
> We could do that for NFSv4.1 and NFSv4.2, but the protocol will not
> allow us to do that for NFSv4.0.
>
> > I think this is similar to Anna's.
> > With this patch, I'm running into a problem running against an ONTAP
> > server using xfstests (no problems without the patch). During the run,
> > the two stuck threads are:
> >
> > [root@unknown000c291be8aa aglo]# cat /proc/3724/stack
> > [<0>] nfs4_run_state_manager+0x1c0/0x1f8 [nfsv4]
> > [<0>] kthread+0x100/0x110
> > [<0>] ret_from_fork+0x10/0x20
> >
> > [root@unknown000c291be8aa aglo]# cat /proc/3725/stack
> > [<0>] nfs_wait_bit_killable+0x1c/0x88 [nfs]
> > [<0>] nfs4_wait_clnt_recover+0xb4/0xf0 [nfsv4]
> > [<0>] nfs4_client_recover_expired_lease+0x34/0x88 [nfsv4]
> > [<0>] _nfs4_do_open.isra.0+0x94/0x408 [nfsv4]
> > [<0>] nfs4_do_open+0x9c/0x238 [nfsv4]
> > [<0>] nfs4_atomic_open+0x100/0x118 [nfsv4]
> > [<0>] nfs4_file_open+0x11c/0x240 [nfsv4]
> > [<0>] do_dentry_open+0x140/0x528
> > [<0>] vfs_open+0x30/0x38
> > [<0>] do_open+0x14c/0x360
> > [<0>] path_openat+0x104/0x250
> > [<0>] do_filp_open+0x84/0x138
> > [<0>] file_open_name+0x134/0x190
> > [<0>] __do_sys_swapoff+0x58/0x6e8
> > [<0>] __arm64_sys_swapoff+0x18/0x28
> > [<0>] invoke_syscall.constprop.0+0x7c/0xd0
> > [<0>] do_el0_svc+0xb4/0xd0
> > [<0>] el0_svc+0x50/0x228
> > [<0>] el0t_64_sync_handler+0x134/0x150
> > [<0>] el0t_64_sync+0x17c/0x180
>
> Oh crap... Yes, that is a bug. Can you please apply the attached patch
> on top of the original, and see if that fixes the problem?

I can't consistently reproduce the problem. Out of several xfstests runs,
a couple got stuck in that state. So when I apply that patch and run, I
can't tell whether I'm no longer hitting the bug or just not hitting the
right condition. Since I don't know exactly what caused it, I'm trying to
find something I can hit consistently. Any ideas? This stack trace seems
to imply a recovery open, but I'm not doing any server reboots or
connection drops.

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
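[Editor's note: for anyone else hitting the "turn off delegations on your
server" advice above, a sketch of one way to do that on a Linux knfsd
server follows. The server in this thread is ONTAP, which needs its own
vendor-specific knob, so this is an illustration only, not what was used
here. On Linux, knfsd grants NFSv4 delegations through the file-lease
mechanism, so disabling leases before nfsd starts prevents delegations
from being handed out.]

```shell
# Sketch, assuming a Linux knfsd server with root access.
# fs.leases-enable must be 0 before nfsd starts for it to take effect.

systemctl stop nfs-server           # make sure nfsd is not running yet

sysctl -w fs.leases-enable=0        # disable file leases, and with them
                                    # NFSv4 delegations from knfsd

# Persist across reboots (file name is an arbitrary example):
echo 'fs.leases-enable = 0' > /etc/sysctl.d/90-nfs-no-delegations.conf

systemctl start nfs-server
```

Note that this disables leases system-wide on the server, so local
applications using fcntl(F_SETLEASE) on that host are affected as well.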