Hi Richard,

On Wed, 2020-05-20 at 18:47 +0100, Richard Purdie wrote:
> Hi,
>
> We have a cluster of machines where we're observing file accesses
> hanging over NFS. The clients showing the problems are Fedora and SUSE
> distros with the 5.6.11 kernel, e.g.:
>
> Linux version 5.6.11-1-default (geeko@buildhost) (gcc version 9.3.1 20200406
> [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE Linux))
> #1 SMP Wed May 6 10:42:09 UTC 2020 (91c024a)
>
> In the example below we see a git clone hang, its having trouble
> reading a .pack file off the NFS share, the git process is in D state.
> I've included part of dmesg below with sysrq-w output.
>
> Mount options:
>
> rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
>
> mountstats shows:
>
> READ:
>     632014263 ops (62%) 629809108 errors (99%)
> TEST_STATEID:
>     363257078 ops (36%) 363257078 errors (100%)
>
> which is a clue on what is happening. I grabbed some data with tcpdump
> and it shows the READ getting NFS4ERR_BAD_STATEID, there is then a
> TEST_STATEID which gets NFS4ERR_NOTSUPP. This repeats infinitely in a
> loop.
>
> The server is FreeNAS11.3 which does not have:
> https://github.com/HardenedBSD/hardenedBSD-stable/commit/63f6f19b0756b18f2e68d82cbe037f21f9a8c500
> applied so it will return NFS4ERR_NOTSUPP to TEST_STATEID.
>
> I think something may be needed to stop Linux getting into an infinite
> loop with this, regardless of whether the spec says TEST_STATEID can
> get a NFS4ERR_NOTSUPP or not?
>
> I freely admit I know little about much of this so I'm open to
> pointers. If we did remount as 4.0 we probably wouldn't see the issue
> as it would avoid the TEST_STATEID code.

TEST_STATEID is listed in RFC5661 Section 17 as REQUIRED to implement
for NFSv4.1. We will not be able to support a server that violates that
requirement.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx