On Feb 18, 2014, at 11:30, Manuel Sabban <manuel.sabban@xxxxxxxxxxxxxxxxxxxx> wrote:

> Hi,
>
> We have approximately one hundred desktop computers running the 3.12.6
> kernel on Debian wheezy. NFS is used for home directories. Mount options are
> "rw,nosuid,nodev,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,
> soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,local_lock=none".
>
> The NFS server we use is the ZFS appliance from Oracle
> (http://www.oracle.com/us/products/servers-storage/storage/nas/zfs7420/overview/index.html).
> The server sometimes pauses for anywhere from several minutes to several
> hours (because of a known bug, acknowledged by Oracle, that affects our
> configuration), and we suspect that these pauses trigger the behaviour
> described below.
>
> What we understand is that when the server is back online, the client
> tries to write something to the NFS share and the server returns a
> STALE_STATEID error. The client then tries again, with the same result,
> and again, and again... In the example below, this happens at a rate of
> 3300 packets per second.
>
> At this point, the client hangs, and the enabled traces
> showed a trace file full of:
>
> kworker/1:0-11993 [001] .... 1171115.807948: nfs4_read: error=-10023 (STALE_STATEID) fileid=00:1f:283 fhandle=0xb1863420 offset=0 count=12288
> kworker/1:0-11993 [001] .... 1171115.808543: nfs4_read: error=-10023 (STALE_STATEID) fileid=00:1f:283 fhandle=0xb1863420 offset=0 count=12288
> kworker/1:0-11993 [001] .... 1171115.809111: nfs4_read: error=-10023 (STALE_STATEID) fileid=00:1f:283 fhandle=0xb1863420 offset=0 count=12288
>
> The network dump shows the same NFS4ERR_STALE_STATEID error. The
> computer then has to be hard-rebooted.
>
> How can this behaviour be avoided?
>
> You will find debugging traces and a network dump at
> http://perso.telecom-paristech.fr/~sabban/debugNFS/tsilinuxb96

So, the exact sequence in the Wireshark dump is a successful RENEW followed by a READ with STALE_STATEID.

I'm guessing that they still haven't fixed the RENEW bug that we reported several years ago: if the lease has expired, then RENEW should return NFS4ERR_STALE_CLIENTID, not NFS4_OK...

Yes, clients do rely on this behaviour...

_________________________________
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
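To make the failure mode concrete, here is a toy model in Python of the decision described in the reply. It assumes, as the reply implies, that the client treats a successful RENEW as proof that its lease and stateids are still valid, and only re-establishes state when RENEW fails with NFS4ERR_STALE_CLIENTID. The names (ToyServer, client_read, renew_bug, generation) are hypothetical, and the error values are the NFSv4.0 codes (10023 matches the -10023 in the traces above). It is a sketch of the logic only, not the Linux client or the ZFS appliance code.

```python
# Toy model of the RENEW/READ exchange discussed in this thread.
# Hypothetical names throughout; NOT the Linux NFS client or the
# ZFS appliance implementation, only the decision logic in the reply.

NFS4_OK = 0
NFS4ERR_STALE_CLIENTID = 10022
NFS4ERR_STALE_STATEID = 10023


class ToyServer:
    """NFSv4.0-style server whose lease for our client has already expired."""

    def __init__(self, renew_bug: bool) -> None:
        self.renew_bug = renew_bug  # True: RENEW answers NFS4_OK on an expired lease
        self.lease_valid = False    # the lease expired during the server pause
        self.generation = 1         # state from generation 0 has been discarded

    def renew(self) -> int:
        if self.lease_valid or self.renew_bug:
            return NFS4_OK
        return NFS4ERR_STALE_CLIENTID  # what the reply says RENEW should return

    def setclientid(self) -> None:
        # Establish a fresh lease; any older stateids stay invalid.
        self.generation += 1
        self.lease_valid = True

    def open(self) -> int:
        return self.generation  # new stateid tied to the current lease

    def read(self, stateid: int) -> int:
        if self.lease_valid and stateid == self.generation:
            return NFS4_OK
        return NFS4ERR_STALE_STATEID


def client_read(server: ToyServer, stateid: int = 0, max_attempts: int = 1000):
    """Retry READ, using RENEW to decide whether state recovery is needed.

    Returns (attempts_made, last_status). max_attempts exists only so the
    sketch terminates; the client in the report kept retrying until it
    was hard-rebooted.
    """
    status = NFS4ERR_STALE_STATEID
    for attempt in range(1, max_attempts + 1):
        status = server.read(stateid)
        if status == NFS4_OK:
            return attempt, status
        if status == NFS4ERR_STALE_STATEID and server.renew() == NFS4ERR_STALE_CLIENTID:
            # Lease is gone: recover by establishing a new client ID and
            # reopening the file to obtain a fresh stateid.
            server.setclientid()
            stateid = server.open()
        # Otherwise RENEW claimed the lease is fine, so just retry the READ.
    return max_attempts, status


print(client_read(ToyServer(renew_bug=False)))  # (2, 0)
print(client_read(ToyServer(renew_bug=True)))   # (1000, 10023)
```

With a correct server, the first READ fails, RENEW reports NFS4ERR_STALE_CLIENTID, the client recovers, and the second READ succeeds. With the buggy RENEW, nothing ever tells the client that its state is gone, so it keeps resending the same READ until the (artificial) attempt limit runs out.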