Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

James Pearson <jcpearson@xxxxxxxxx> · Wed, 25 Mar 2020 10:30:36 +0000

We're seeing a number of Linux (CentOS 7.5) clients getting 'nfs:
server isilon not responding, still trying' from various exports from
an Isilon filer.

I appreciate we're using a vendor's (out-of-date) Linux kernel and a
third-party filer, but if anyone can give me any pointers on how to
debug this issue, I would be grateful (we also have a support case
open with the Isilon vendor).

Running tshark on an affected client (this capture was taken several
hours after the issue started), we see the following exchange repeating
every five seconds:

  1   12:18:11 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68
  2   12:18:11 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 1) RENEW Status: NFS4ERR_STALE_CLIENTID
  4   12:18:16 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68
  5   12:18:16 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 4) RENEW Status: NFS4ERR_STALE_CLIENTID
  7   12:18:21 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68
  8   12:18:21 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 7) RENEW Status: NFS4ERR_STALE_CLIENTID
...
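
For a longer capture, a small script can confirm whether a SETCLIENTID
call ever follows the errors. This is only a sketch: it assumes the
tshark summary layout shown above with each packet on one line, and the
function and variable names are made up for illustration.

```python
import re

# Assumed layout: frame, time, source -> destination, "NFS", length, info.
LINE_RE = re.compile(
    r"^\s*(?P<frame>\d+)\s+(?P<time>\S+)\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)"
    r"\s+NFS\s+(?P<length>\d+)\s+(?P<info>.*)$"
)


def summarize(lines):
    """Count RENEW calls, STALE_CLIENTID replies and SETCLIENTID calls."""
    counts = {"renew_calls": 0, "stale_replies": 0, "setclientid_calls": 0}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip "..." and any non-NFS lines
        tokens = match.group("info").split()
        if "NFS4ERR_STALE_CLIENTID" in tokens:
            counts["stale_replies"] += 1
        elif "Call" in tokens and "SETCLIENTID" in tokens:
            counts["setclientid_calls"] += 1
        elif "Call" in tokens and "RENEW" in tokens:
            counts["renew_calls"] += 1
    return counts


# The six packets from the capture above, one packet per line.
SAMPLE = [
    "  1   12:18:11 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68",
    "  2   12:18:11 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 1) RENEW Status: NFS4ERR_STALE_CLIENTID",
    "  4   12:18:16 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68",
    "  5   12:18:16 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 4) RENEW Status: NFS4ERR_STALE_CLIENTID",
    "  7   12:18:21 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68",
    "  8   12:18:21 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 7) RENEW Status: NFS4ERR_STALE_CLIENTID",
    "...",
]

counts = summarize(SAMPLE)
print(counts)
```

On these six packets every RENEW call is answered with
NFS4ERR_STALE_CLIENTID and no SETCLIENTID call appears.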

My knowledge of NFSv4 is sketchy, but from my (partial) reading of
RFC 7530, shouldn't the client be sending a SETCLIENTID in response to
an NFS4ERR_STALE_CLIENTID? That doesn't appear to be happening here.

However, the server hasn't rebooted since the client mounted the file
system, so I'm not sure what might be going on.
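
To make the question concrete, here is a rough model of the behaviour
being asked about, based on a partial reading of the server failure and
recovery text in RFC 7530. It is a simplified illustration, not the
Linux client's actual code; the function name and the returned labels
are hypothetical.

```python
def expected_next_ops(renew_status):
    """Return the operations a client would be expected to send next.

    Simplified model: only the two statuses relevant here are handled.
    """
    if renew_status == "NFS4_OK":
        # Lease renewed; the client just keeps renewing periodically.
        return ["RENEW"]
    if renew_status == "NFS4ERR_STALE_CLIENTID":
        # The server no longer recognises the client ID, so the client
        # establishes and confirms a new one, then reclaims its state.
        return ["SETCLIENTID", "SETCLIENTID_CONFIRM", "reclaim state"]
    raise ValueError(f"status not covered by this sketch: {renew_status}")


# What the capture shows the client sending after each error reply.
observed_next_op = "RENEW"
expected = expected_next_ops("NFS4ERR_STALE_CLIENTID")
print("expected:", expected[0], "observed:", observed_next_op)
```

In the capture the client sends another RENEW with the same client ID
after each error, rather than the SETCLIENTID this model expects.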

We are upgrading clients to the latest CentOS (RHEL) 7.7 to see if
that 'fixes' the issue, but we would appreciate any other pointers.

Thanks

James Pearson