On Wed, 2020-05-20 at 18:01 +0000, Trond Myklebust wrote:
> Hi Richard,
>
> On Wed, 2020-05-20 at 18:47 +0100, Richard Purdie wrote:
> > Hi,
> >
> > We have a cluster of machines where we're observing file accesses
> > hanging over NFS. The clients showing the problems are Fedora and
> > SUSE distros with the 5.6.11 kernel, e.g.:
> >
> > Linux version 5.6.11-1-default (geeko@buildhost) (gcc version 9.3.1
> > 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE
> > Linux)) #1 SMP Wed May 6 10:42:09 UTC 2020 (91c024a)
> >
> > In the example below we see a git clone hang, its having trouble
> > reading a .pack file off the NFS share, the git process is in D
> > state. I've included part of dmesg below with sysrq-w output.
> >
> > Mount options:
> >
> > rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
> >
> > mountstats shows:
> >
> > READ:
> >     632014263 ops (62%) 629809108 errors (99%)
> > TEST_STATEID:
> >     363257078 ops (36%) 363257078 errors (100%)
> >
> > which is a clue on what is happening. I grabbed some data with
> > tcpdump and it shows the READ getting NFS4ERR_BAD_STATEID, there is
> > then a TEST_STATEID which gets NFS4ERR_NOTSUPP. This repeats
> > infinitely in a loop.
> >
> > The server is FreeNAS11.3 which does not have:
> > https://github.com/HardenedBSD/hardenedBSD-stable/commit/63f6f19b0756b18f2e68d82cbe037f21f9a8c500
> > applied so it will return NFS4ERR_NOTSUPP to TEST_STATEID.
> >
> > I think something may be needed to stop Linux getting into an
> > infinite loop with this, regardless of whether the spec says
> > TEST_STATEID can get a NFS4ERR_NOTSUPP or not?
> >
> > I freely admit I know little about much of this so I'm open to
> > pointers. If we did remount as 4.0 we probably wouldn't see the issue
> > as it would avoid the TEST_STATEID code.
>
> TEST_STATEID is listed in RFC5661 Section 17 as REQUIRED to implement
> for NFSv4.1. We will not be able to support a server that violates that
> requirement.

Understood, I suspected as much.

Locking systems into an infinite loop doesn't seem like a good user
experience though. Is there a way to handle that more gracefully?

Cheers,

Richard