Re: [PATCH 2/2] NFSv4: Fix a state manager thread deadlock regression

Anna Schumaker <schumaker.anna@xxxxxxxxx> · Tue, 26 Sep 2023 10:55:07 -0400

On Sun, Sep 24, 2023 at 1:08 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2023-09-22 at 17:06 -0400, Trond Myklebust wrote:
> > On Fri, 2023-09-22 at 17:00 -0400, Olga Kornievskaia wrote:
> > > On Fri, Sep 22, 2023 at 3:05 PM Trond Myklebust
> > > >
> > > > Oh crap... Yes, that is a bug. Can you please apply the attached
> > > > patch
> > > > on top of the original, and see if that fixes the problem?
> > >
> > > I can't consistently reproduce the problem. Out of several xfstests
> > > runs a couple got stuck in that state. So when I apply that patch
> > > and
> > > run, I can't tell if i'm no longer hitting or if I'm just not
> > > hitting
> > > the right condition.
> > >
> > > Given I don't exactly know what's caused it I'm trying to find
> > > something I can hit consistently. Any ideas? I mean this stack
> > > trace
> > > seems to imply a recovery open but I'm not doing any server reboots
> > > or
> > > connection drops.
> > >
> > > >
> >
> > If I'm right about the root cause, then just turning off delegations
> > on
> > the server, setting up a NFS swap partition and then running some
> > ordinary file open/close workload against the exact same server would
> > probably suffice to trigger your stack trace 100% reliably.
> >
> > I'll see if I can find time to test it over the weekend.
>
> >
>
> Yep... Creating a 4G empty file on /mnt/nfs/swap/swapfile, running
> mkswap  and then swapon followed by a simple bash line of
>         echo "foo" >/mnt/nfs/foobar
>
> will immediately lead to a hang. When I look at the stack for the bash
> process, I see the following dump, which matches yours:
>
> [root@vmw-test-1 ~]# cat /proc/1120/stack
> [<0>] nfs_wait_bit_killable+0x11/0x60 [nfs]
> [<0>] nfs4_wait_clnt_recover+0x54/0x90 [nfsv4]
> [<0>] nfs4_client_recover_expired_lease+0x29/0x60 [nfsv4]
> [<0>] nfs4_do_open+0x170/0xa90 [nfsv4]
> [<0>] nfs4_atomic_open+0x94/0x100 [nfsv4]
> [<0>] nfs_atomic_open+0x2d9/0x670 [nfs]
> [<0>] path_openat+0x3c3/0xd40
> [<0>] do_filp_open+0xb4/0x160
> [<0>] do_sys_openat2+0x81/0xe0
> [<0>] __x64_sys_openat+0x81/0xa0
> [<0>] do_syscall_64+0x68/0xa0
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> With the fix I sent you:
>
> [root@vmw-test-1 ~]# mount -t nfs -overs=4.2 vmw-test-2:/export /mnt/nfs
> [root@vmw-test-1 ~]# mkswap /mnt/nfs/swap/swapfile
> mkswap: /mnt/nfs/swap/swapfile: warning: wiping old swap signature.
> Setting up swapspace version 1, size = 4 GiB (4294963200 bytes)
> no label, UUID=1360b0a3-833a-4ba7-b467-8a59d3723013
> [root@vmw-test-1 ~]# swapon /mnt/nfs/swap/swapfile
> [root@vmw-test-1 ~]# ps -efww | grep manage
> root        1214       2  0 13:04 ?        00:00:00 [192.168.76.251-manager]
> root        1216    1147  0 13:04 pts/0    00:00:00 grep --color=auto manage
> [root@vmw-test-1 ~]# echo "foo" >/mnt/nfs/foobar
> [root@vmw-test-1 ~]# cat /mnt/nfs/foobar
> foo
>
> So that returns behaviour to normal in my testing, and I no longer see
> the hangs.
>
> Let me send out a PATCHv2...

I'm liking patch v2 much better! I was testing with a Linux server
with default configuration, and it's nice that the xfstests swapfile
tests aren't hanging anymore :)

Anna

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>