On Fri, Sep 22, 2023 at 3:05 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2023-09-22 at 13:22 -0400, Olga Kornievskaia wrote:
> > On Wed, Sep 20, 2023 at 8:27 PM Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, 2023-09-20 at 15:38 -0400, Anna Schumaker wrote:
> > > > Hi Trond,
> > > >
> > > > On Sun, Sep 17, 2023 at 7:12 PM <trondmy@xxxxxxxxxx> wrote:
> > > > >
> > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > >
> > > > > Commit 4dc73c679114 reintroduces the deadlock that was fixed by
> > > > > commit aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock")
> > > > > because it prevents the setup of new threads to handle reboot
> > > > > recovery, while the older recovery thread is stuck returning
> > > > > delegations.
> > > >
> > > > I'm seeing a possible deadlock with xfstests generic/472 on NFS v4.x
> > > > after applying this patch. The test itself checks for various
> > > > swapfile edge cases, so it seems likely something is going on there.
> > > >
> > > > Let me know if you need more info.
> > > > Anna
> > >
> > > Did you turn off delegations on your server? If you don't, then swap
> > > will deadlock itself under various scenarios.
> >
> > Is there documentation somewhere that says that delegations must be
> > turned off on the server if NFS over swap is enabled?
>
> I think the question is more generally "Is there documentation for NFS
> swap?"
>
> > If the client can't handle delegations + swap, then shouldn't this be
> > solved by (1) checking if we are in NFS over swap and then proactively
> > setting 'don't want delegation' on open and/or (2) proactively
> > returning the delegation if received, so that we don't get into the
> > deadlock?
>
> We could do that for NFSv4.1 and NFSv4.2, but the protocol will not
> allow us to do that for NFSv4.0.
>
> > I think this is similar to Anna's.
> > With this patch, I'm running into a problem running against an ONTAP
> > server using xfstests (no problems without the patch). During the run,
> > the two stuck threads are:
> >
> > [root@unknown000c291be8aa aglo]# cat /proc/3724/stack
> > [<0>] nfs4_run_state_manager+0x1c0/0x1f8 [nfsv4]
> > [<0>] kthread+0x100/0x110
> > [<0>] ret_from_fork+0x10/0x20
> >
> > [root@unknown000c291be8aa aglo]# cat /proc/3725/stack
> > [<0>] nfs_wait_bit_killable+0x1c/0x88 [nfs]
> > [<0>] nfs4_wait_clnt_recover+0xb4/0xf0 [nfsv4]
> > [<0>] nfs4_client_recover_expired_lease+0x34/0x88 [nfsv4]
> > [<0>] _nfs4_do_open.isra.0+0x94/0x408 [nfsv4]
> > [<0>] nfs4_do_open+0x9c/0x238 [nfsv4]
> > [<0>] nfs4_atomic_open+0x100/0x118 [nfsv4]
> > [<0>] nfs4_file_open+0x11c/0x240 [nfsv4]
> > [<0>] do_dentry_open+0x140/0x528
> > [<0>] vfs_open+0x30/0x38
> > [<0>] do_open+0x14c/0x360
> > [<0>] path_openat+0x104/0x250
> > [<0>] do_filp_open+0x84/0x138
> > [<0>] file_open_name+0x134/0x190
> > [<0>] __do_sys_swapoff+0x58/0x6e8
> > [<0>] __arm64_sys_swapoff+0x18/0x28
> > [<0>] invoke_syscall.constprop.0+0x7c/0xd0
> > [<0>] do_el0_svc+0xb4/0xd0
> > [<0>] el0_svc+0x50/0x228
> > [<0>] el0t_64_sync_handler+0x134/0x150
> > [<0>] el0t_64_sync+0x17c/0x180
>
> Oh crap... Yes, that is a bug. Can you please apply the attached patch
> on top of the original, and see if that fixes the problem?

I can't consistently reproduce the problem. Out of several xfstests runs,
a couple got stuck in that state. So when I apply that patch and run, I
can't tell whether I'm no longer hitting the bug or just not hitting the
right condition. Since I don't know exactly what caused it, I'm trying to
find something I can hit consistently. Any ideas? This stack trace seems
to imply a recovery open, but I'm not doing any server reboots or
connection drops.

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
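[Editor's note: for anyone else hitting the "turn off delegations on your
server" advice above, a sketch of one way to do that on a Linux knfsd
server follows. The server in this thread is ONTAP, which needs its own
vendor-specific knob, so this is an illustration only, not what was used
here. On Linux, knfsd grants NFSv4 delegations through the file-lease
mechanism, so disabling leases before nfsd starts prevents delegations
from being handed out.]

```shell
# Sketch, assuming a Linux knfsd server with root access.
# fs.leases-enable must be 0 before nfsd starts for it to take effect.

systemctl stop nfs-server           # make sure nfsd is not running yet

sysctl -w fs.leases-enable=0        # disable file leases, and with them
                                    # NFSv4 delegations from knfsd

# Persist across reboots (file name is an arbitrary example):
echo 'fs.leases-enable = 0' > /etc/sysctl.d/90-nfs-no-delegations.conf

systemctl start nfs-server
```

Note that this disables leases system-wide on the server, so local
applications using fcntl(F_SETLEASE) on that host are affected as well.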