> -----Original Message-----
> From: David Flynn [mailto:davidf@xxxxxxxxxxxx]
> Sent: Saturday, October 29, 2011 8:02 PM
> To: Myklebust, Trond
> Cc: David Flynn; linux-nfs@xxxxxxxxxxxxxxx; Chuck Lever
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
> * Trond Myklebust (Trond.Myklebust@xxxxxxxxxx) wrote:
> > > Using the same kernel, same mountpoint as before, we're currently
> > > experiencing a loop involving NFS4ERR_STALE_CLIENTID.
> ...
> > The problem seems like a split-brain issue on the server... On the one
> > hand, it is happily telling us that our lease is OK when we RENEW.
> > Then when we try to use said lease in an OPEN, it is replying with
> > STALE_CLIENTID.
>
> Thank you for the quick update, especially at the weekend. I'm wondering if
> it is possible that the STALE_CLIENTID issue is a by-product of the
> BAD_STATEID issue from earlier. We have observed several times the
> BAD_STATEID loop, but the CLIENTID problem only seemed to occur when all
> 40+ nodes were all showing problems.
>
> After killing off sufficient processes, some of the machines then
> recovered of their own accord. So your conclusion that there is a server issue
> sounds reasonable.
>
> On any such possible backoff, the previous case was with quite small
> requests in quite a tight loop that seemed to cause the server grief.
> This morning, a machine with a 10GbE interface had a BAD_STATEID issue but
> involving some much larger writes[1], resulting in 1.6Gbit/sec from that
> machine alone. Thankfully there was only a second machine with 1GbE
> interfaces bringing the total up to 2.5Gbit/sec.
>
> It is this ability for a group of clients to make matters worse that is just as bad
> as any fault with Solaris.

Sure, but gone are the days when NFS had "reference implementations"
that everyone had to interoperate with. NFSv4 is a fully documented
protocol which describes how both clients and servers are supposed to
work.
If either the client or the server fails to work according to that
documentation, then bad things will happen. While I could litter the
client code with lots of little tricks to be "defensive" in the face of
buggy servers, that isn't going to solve anything: the server will still
be buggy, and the client will still be faced with a situation that it
cannot resolve.

> (In a similar vein, it can be just as frustrating trying to get a client to stop
> looping like this - it is often impossible to kill the process that triggered the
> problem; for these, we had to resort to deleting the files using NFSv3 (which
> was working quite happily))

'kill -9' should in principle work to kill off the process. Was that
failing to work?

Cheers
  Trond