Re: TEST_STATEID issues with NFS4.1 and FreeNAS server

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Wed, 20 May 2020 18:14:16 +0000

On Wed, 2020-05-20 at 19:06 +0100, Richard Purdie wrote:
> On Wed, 2020-05-20 at 18:01 +0000, Trond Myklebust wrote:
> > Hi Richard,
> > 
> > On Wed, 2020-05-20 at 18:47 +0100, Richard Purdie wrote:
> > > Hi,
> > > 
> > > We have a cluster of machines where we're observing file accesses
> > > hanging over NFS. The clients showing the problems are Fedora and
> > > SUSE distros with the 5.6.11 kernel, e.g.:
> > > 
> > > Linux version 5.6.11-1-default (geeko@buildhost) (gcc version 9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE Linux)) #1 SMP Wed May 6 10:42:09 UTC 2020 (91c024a)
> > > 
> > > In the example below we see a git clone hang: it's having trouble
> > > reading a .pack file off the NFS share, and the git process is in
> > > D state. I've included part of dmesg below with sysrq-w output.
> > > 
> > > Mount options:
> > > 
> > > rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
> > > 
> > > mountstats shows:
> > >  
> > > READ:
> > > 	632014263 ops (62%) 	629809108 errors (99%) 
> > > TEST_STATEID:
> > >  	363257078 ops (36%) 	363257078 errors (100%)
> > > 
> > > which is a clue as to what is happening. I grabbed some data with
> > > tcpdump and it shows the READ getting NFS4ERR_BAD_STATEID, followed
> > > by a TEST_STATEID which gets NFS4ERR_NOTSUPP. This repeats in an
> > > infinite loop.
> > > 
> > > The server is FreeNAS 11.3, which does not have:
> > > https://github.com/HardenedBSD/hardenedBSD-stable/commit/63f6f19b0756b18f2e68d82cbe037f21f9a8c500
> > > applied so it will return NFS4ERR_NOTSUPP to TEST_STATEID.
> > > 
> > > I think something may be needed to stop Linux getting into an
> > > infinite loop with this, regardless of whether the spec says
> > > TEST_STATEID can get a NFS4ERR_NOTSUPP or not?
> > > 
> > > I freely admit I know little about much of this so I'm open to
> > > pointers. If we did remount as 4.0 we probably wouldn't see the
> > > issue as it would avoid the TEST_STATEID code.
> > 
> > TEST_STATEID is listed in RFC5661 Section 17 as REQUIRED to
> > implement for NFSv4.1. We will not be able to support a server that
> > violates that requirement.
> 
> Understood, I suspected as much.
> 
> Locking systems into an infinite loop doesn't seem like a good user
> experience though. Is there a way to handle that more gracefully?
> 

As I implied above, this is a 'server from hell' scenario that we
really can't be expected to support at all. I suggest downgrading to
NFSv4.0 until you can get a fix for the server.
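
To make the downgrade concrete, here is a minimal sketch (Python, purely
illustrative) that rewrites a client option string like the one quoted
above so that vers=4.1 becomes vers=4.0. The helper name
downgrade_to_nfs40 is hypothetical and not part of any NFS tooling, and
the share would still have to be unmounted and mounted afresh with the
resulting options.

```python
def downgrade_to_nfs40(options: str) -> str:
    """Return a copy of an NFS mount option string pinned to NFSv4.0.

    Hypothetical helper for illustration only: it edits the text of the
    option list and does not touch any mounted filesystem.
    """
    parts = []
    saw_version = False
    for opt in options.split(","):
        key, _, _ = opt.partition("=")
        if key in ("vers", "nfsvers"):
            # Replace whatever version was requested with 4.0.
            parts.append(f"{key}=4.0")
            saw_version = True
        elif key == "minorversion":
            # Dropped: vers=4.0 already states the minor version.
            continue
        else:
            parts.append(opt)
    if not saw_version:
        # No explicit version was given, so request 4.0 explicitly.
        parts.append("vers=4.0")
    return ",".join(parts)


# Option string taken from the report quoted above.
current = (
    "rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,"
    "proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none"
)
print(downgrade_to_nfs40(current))
```

All other options are passed through unchanged, so the only difference
on the client should be the protocol minor version it negotiates.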

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx