On Wed, 2020-05-20 at 19:06 +0100, Richard Purdie wrote:
> On Wed, 2020-05-20 at 18:01 +0000, Trond Myklebust wrote:
> > Hi Richard,
> >
> > On Wed, 2020-05-20 at 18:47 +0100, Richard Purdie wrote:
> > > Hi,
> > >
> > > We have a cluster of machines where we're observing file accesses
> > > hanging over NFS. The clients showing the problems are Fedora and
> > > SUSE distros with the 5.6.11 kernel, e.g.:
> > >
> > > Linux version 5.6.11-1-default (geeko@buildhost) (gcc version 9.3.1
> > > 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE
> > > Linux)) #1 SMP Wed May 6 10:42:09 UTC 2020 (91c024a)
> > >
> > > In the example below we see a git clone hang, its having trouble
> > > reading a .pack file off the NFS share, the git process is in D
> > > state. I've included part of dmesg below with sysrq-w output.
> > >
> > > Mount options:
> > >
> > > rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
> > >
> > > mountstats shows:
> > >
> > > READ:
> > >     632014263 ops (62%) 629809108 errors (99%)
> > > TEST_STATEID:
> > >     363257078 ops (36%) 363257078 errors (100%)
> > >
> > > which is a clue on what is happening. I grabbed some data with
> > > tcpdump and it shows the READ getting NFS4ERR_BAD_STATEID, there
> > > is then a TEST_STATEID which gets NFS4ERR_NOTSUPP. This repeats
> > > infinitely in a loop.
> > >
> > > The server is FreeNAS11.3 which does not have:
> > > https://github.com/HardenedBSD/hardenedBSD-stable/commit/63f6f19b0756b18f2e68d82cbe037f21f9a8c500
> > > applied so it will return NFS4ERR_NOTSUPP to TEST_STATEID.
> > >
> > > I think something may be needed to stop Linux getting into an
> > > infinite loop with this, regardless of whether the spec says
> > > TEST_STATEID can get a NFS4ERR_NOTSUPP or not?
> > >
> > > I freely admit I know little about much of this so I'm open to
> > > pointers. If we did remount as 4.0 we probably wouldn't see the
> > > issue as it would avoid the TEST_STATEID code.
> >
> > TEST_STATEID is listed in RFC5661 Section 17 as REQUIRED to
> > implement for NFSv4.1. We will not be able to support a server that
> > violates that requirement.
>
> Understood, I suspected as much.
>
> Locking systems into an infinite loop doesn't seem like a good user
> experience though. Is there a way to handle that more gracefully?
>

As I implied above, this is a 'server from hell' scenario that we
really can't be expected to support at all. I suggest downgrading to
NFSv4.0 until you can get a fix for the server.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
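
For readers following the thread, the READ / TEST_STATEID cycle seen in the tcpdump can be modelled with a small toy simulation. This is only a sketch of the observed exchange, not the Linux NFS client code: the names `Status`, `BrokenServer`, `read_with_recovery` and the `max_rounds` cap are all hypothetical, and the cap exists only so the simulation terminates. The thread does not say the real client has any such limit, and the toy issues one TEST_STATEID per failed READ, which is simpler than the ratio shown in mountstats.

```python
from enum import Enum


class Status(Enum):
    OK = 0
    BAD_STATEID = 1   # stands in for NFS4ERR_BAD_STATEID
    NOTSUPP = 2       # stands in for NFS4ERR_NOTSUPP


class BrokenServer:
    """Hypothetical server that behaves as described in the thread."""

    def read(self):
        # Every READ is rejected because the stateid is considered bad.
        return Status.BAD_STATEID

    def test_stateid(self):
        # TEST_STATEID is not implemented, so the client learns nothing.
        return Status.NOTSUPP


def read_with_recovery(server, max_rounds):
    """Retry READ, probing the stateid after each failure.

    max_rounds is an assumption added so the toy loop ends; without it
    the loop below would spin forever against BrokenServer.
    """
    counts = {"READ": 0, "TEST_STATEID": 0}
    for _ in range(max_rounds):
        counts["READ"] += 1
        if server.read() is Status.OK:
            return True, counts
        counts["TEST_STATEID"] += 1
        # Whatever TEST_STATEID returns, the toy client retries the READ.
        # With NOTSUPP it has no new information, so nothing changes.
        server.test_stateid()
    return False, counts


if __name__ == "__main__":
    ok, counts = read_with_recovery(BrokenServer(), max_rounds=5)
    print(ok, counts)
```

Every round ends in the same state it started in, which is why both counters in mountstats keep climbing while the git process stays in D state.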