Re: multiple NFS4ERR_STALE_STATEID on 3.12 (wheezy)

Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> · Tue, 18 Feb 2014 12:14:46 -0500

On Feb 18, 2014, at 11:30, Manuel Sabban <manuel.sabban@xxxxxxxxxxxxxxxxxxxx> wrote:

> Hi,
> 
> We have approximately one hundred desktop computers running a 3.12.6
> kernel on Debian wheezy. NFS is used for home directories. Mount options are
> "rw,nosuid,nodev,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,
> soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,local_lock=none".
> 
> The NFS server we use is the ZFS appliance from Oracle
> (http://www.oracle.com/us/products/servers-storage/storage/nas/zfs7420/overview/index.html). The
> server sometimes pauses for anywhere from several minutes to several
> hours (a known bug in our configuration, acknowledged by Oracle), and
> we suspect that these pauses trigger the behaviour described below.
> 
> What we understand is that when the server is back online, the client
> tries to write something to the NFS share and the server returns a
> STALE_STATEID error. The client then tries again, with the same result,
> again and again. In the example below, this happens at a rate of 3300
> packets per second.
> 
> At this point, the client hangs, and the traces we had enabled
> filled the trace file with lines like these:
> kworker/1:0-11993 [001] .... 1171115.807948: nfs4_read: error=-10023 (STALE_STATEID) fileid=00:1f:283 fhandle=0xb1863420 offset=0 count=12288
> kworker/1:0-11993 [001] .... 1171115.808543: nfs4_read: error=-10023 (STALE_STATEID) fileid=00:1f:283 fhandle=0xb1863420 offset=0 count=12288
> kworker/1:0-11993 [001] .... 1171115.809111: nfs4_read: error=-10023 (STALE_STATEID) fileid=00:1f:283 fhandle=0xb1863420 offset=0 count=12288
> 
> The network dump shows the same NFS4ERR_STALE_STATEID error. The
> computer then has to be hard rebooted.
> 
> How can this behaviour be avoided?
> 
> You will find debugging traces and network dump at
> http://perso.telecom-paristech.fr/~sabban/debugNFS/tsilinuxb96

So, the exact sequence in the Wireshark dump is a successful RENEW followed by a READ that fails with NFS4ERR_STALE_STATEID. I’m guessing that they still haven’t fixed the RENEW bug that we reported several years ago: if the lease has expired, then RENEW should return NFS4ERR_STALE_CLIENTID, not NFS4_OK…

Yes, clients do rely on this behaviour...
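
To illustrate why the RENEW reply matters, here is a toy model of the sequence above. It is a simplified sketch, not the actual Linux client code: the Server class, client_read() and the max_attempts cap are hypothetical names made up for this example, and a single setclientid() call stands in for the whole state recovery procedure. Only the numeric error codes are the NFSv4 protocol values.

```python
# Toy model of the RENEW / READ retry sequence. All class and function
# names here are hypothetical; only the error codes are protocol values.
NFS4_OK = 0
NFS4ERR_STALE_CLIENTID = 10022
NFS4ERR_STALE_STATEID = 10023


class Server:
    """Toy server that has lost the client's state (e.g. after a long pause)."""

    def __init__(self, renew_is_buggy):
        self.renew_is_buggy = renew_is_buggy
        self.known_clients = set()  # empty: the old lease is gone

    def renew(self, clientid):
        if clientid in self.known_clients:
            return NFS4_OK
        # A buggy server claims the lease is fine although it has no record of it.
        return NFS4_OK if self.renew_is_buggy else NFS4ERR_STALE_CLIENTID

    def setclientid(self, clientid):
        # Stand-in for the full recovery (SETCLIENTID, reclaiming opens, ...).
        self.known_clients.add(clientid)

    def read(self, clientid):
        if clientid in self.known_clients:
            return NFS4_OK
        return NFS4ERR_STALE_STATEID


def client_read(server, clientid, max_attempts=5):
    """Return (final_status, number_of_READs_sent).

    The cap on attempts exists only so that the demonstration terminates.
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status = server.read(clientid)
        if status != NFS4ERR_STALE_STATEID:
            return status, attempt
        # Probe the lease; only a failing RENEW triggers state recovery.
        if server.renew(clientid) == NFS4ERR_STALE_CLIENTID:
            server.setclientid(clientid)
    return status, max_attempts


print("buggy RENEW:  ", client_read(Server(renew_is_buggy=True), "client-1"))
print("correct RENEW:", client_read(Server(renew_is_buggy=False), "client-1"))
```

With the buggy RENEW, every READ in this model fails with NFS4ERR_STALE_STATEID until the cap is reached, because the client is never told that its lease is gone. With the correct RENEW, the second READ succeeds after recovery.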

_________________________________
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
