* Trond Myklebust (Trond.Myklebust@xxxxxxxxxx) wrote:
> > Using the same kernel, same mountpoint as before, we're currently
> > experiencing a loop involving NFS4ERR_STALE_CLIENTID.
...
> The problem seems like a split-brain issue on the server... On the one
> hand, it is happily telling us that our lease is OK when we RENEW. Then
> when we try to use said lease in an OPEN, it is replying with
> STALE_CLIENTID.

Thank you for the quick update, especially at the weekend.

I'm wondering whether the STALE_CLIENTID issue could be a by-product of
the BAD_STATEID issue from earlier. We have observed the BAD_STATEID
loop several times, but the CLIENTID problem only seemed to occur when
all 40+ nodes were showing problems. After we killed off sufficient
processes, some of the machines recovered of their own accord. So your
conclusion that there is a server issue sounds reasonable.

On any such possible backoff: the previous case involved quite small
requests in quite a tight loop, which seemed to cause the server grief.
This morning, a machine with a 10GbE interface had a BAD_STATEID issue,
but this time involving some much larger writes[1], resulting in
1.6Gbit/sec from that machine alone. Thankfully only one other machine
was affected, one with 1GbE interfaces, bringing the total up to
2.5Gbit/sec. It is this ability for a group of clients to make matters
worse that is just as bad as any fault with Solaris.

(In a similar vein, it can be just as frustrating trying to get a client
to stop looping like this: it is often impossible to kill the process
that triggered the problem. For these, we had to resort to deleting the
files using NFSv3, which was working quite happily.)

Thank you again,

..david

[1] Capture: ftp://ftp.kw.bbc.co.uk/davidf/priv/waquahso.pcap