> -----Original Message-----
> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> Sent: Saturday, October 29, 2011 8:23 PM
> To: Myklebust, Trond
> Cc: David Flynn; linux-nfs@xxxxxxxxxxxxxxx
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
> On Oct 29, 2011, at 2:22 PM, Myklebust, Trond wrote:
>
> >> -----Original Message-----
> >> From: David Flynn [mailto:davidf@xxxxxxxxxxxx]
> >> Sent: Saturday, October 29, 2011 8:02 PM
> >> To: Myklebust, Trond
> >> Cc: David Flynn; linux-nfs@xxxxxxxxxxxxxxx; Chuck Lever
> >> Subject: Re: NFS4ERR_STALE_CLIENTID loop
> >>
> >> * Trond Myklebust (Trond.Myklebust@xxxxxxxxxx) wrote:
> >>>> Using the same kernel, same mountpoint as before, we're currently
> >>>> experiencing a loop involving NFS4ERR_STALE_CLIENTID.
> >> ...
> >>> The problem seems like a split-brain issue on the server... On the
> >>> one hand, it is happily telling us that our lease is OK when we
> >>> RENEW. Then when we try to use said lease in an OPEN, it is
> >>> replying with STALE_CLIENTID.
> >>
> >> Thank you for the quick update, especially at the weekend. I'm
> >> wondering if the STALE_CLIENTID issue could be a by-product of the
> >> earlier BAD_STATEID issue. We have observed the BAD_STATEID loop
> >> several times, but the CLIENTID problem only seemed to occur when
> >> all 40+ nodes were showing problems.
> >>
> >> After killing off sufficient processes, some of the machines then
> >> recovered of their own accord. So your conclusion that there is a
> >> server issue sounds reasonable.
> >>
> >> On the question of backoff: the previous case involved quite small
> >> requests in quite a tight loop, which seemed to cause the server
> >> grief. This morning, a machine with a 10GbE interface hit a
> >> BAD_STATEID issue, but this time involving some much larger
> >> writes[1], resulting in 1.6Gbit/sec from that machine alone.
> >> Thankfully there was only a second machine, with 1GbE interfaces,
> >> bringing the total up to 2.5Gbit/sec.
> >>
> >> It is this ability for a group of clients to make matters worse
> >> that is just as bad as any fault with Solaris.
> >
> > Sure, but gone are the days when NFS had "reference implementations"
> > that everyone had to interoperate with. NFSv4 is a fully documented
> > protocol which describes how both clients and servers are supposed
> > to work. If either the client or the server fails to work according
> > to that documentation, then bad things will happen.
> >
> > While I could litter the client code with lots of little tricks to
> > be "defensive" in the face of buggy servers, that isn't going to
> > solve anything: the server will still be buggy, and the client will
> > still be faced with a situation that it cannot resolve.
> >
> >> (In a similar vein, it can be just as frustrating trying to get a
> >> client to stop looping like this - it is often impossible to kill
> >> the process that triggered the problem; for these, we had to resort
> >> to deleting the files using NFSv3, which was working quite happily.)
> >
> > 'kill -9' should in principle work to kill off the process. Was that
> > failing to work?
>
> The trick is knowing which process to kill. Generally you have to kill
> the state manager thread in this case.

No. Not in this case: the state manager was doing RENEW, then exiting
because it was told all is A_OK by the server. The open() system call
would be the one that was looping.
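For illustration, here is a minimal userspace sketch of that interaction.
This is NOT the real fs/nfs code; nfs4_do_open_rpc(), nfs4_renew_lease()
and recover_clientid() are hypothetical stand-ins for the client's OPEN,
RENEW and SETCLIENTID paths. It shows why the open() caller, not the
state manager, ends up spinning when the server answers RENEW with OK
but keeps answering OPEN with STALE_CLIENTID:

#include <stdio.h>

#define NFS4_OK                   0
#define NFS4ERR_STALE_CLIENTID    10022  /* error value from RFC 3530 */

/* Split-brain server: RENEW claims the lease is fine... */
static int nfs4_renew_lease(void) { return NFS4_OK; }

/* ...but OPEN keeps rejecting the same clientid. */
static int nfs4_do_open_rpc(void) { return NFS4ERR_STALE_CLIENTID; }

/* State manager: re-establish the clientid, confirm it with RENEW. */
static int recover_clientid(void)
{
	/* SETCLIENTID/SETCLIENTID_CONFIRM would happen here. */
	return nfs4_renew_lease();  /* server says NFS4_OK -> "recovered" */
}

/*
 * The open() path: on STALE_CLIENTID, kick the state manager and retry.
 * Because RENEW reports the recovered lease as valid, the client has no
 * signal that recovery actually failed, so the retry loop never exits.
 * max_retries is demo-only; the real client retries indefinitely.
 */
static int nfs4_open(int max_retries)
{
	while (max_retries-- > 0) {
		int status = nfs4_do_open_rpc();

		if (status != NFS4ERR_STALE_CLIENTID)
			return status;
		if (recover_clientid() != NFS4_OK)
			return -1;  /* recovery itself failed: give up */
		fprintf(stderr, "OPEN got STALE_CLIENTID, "
			"RENEW says lease OK -- retrying\n");
	}
	return NFS4ERR_STALE_CLIENTID;  /* still looping */
}

int main(void)
{
	return nfs4_open(3) == NFS4_OK ? 0 : 1;
}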
Trond