On Oct 29, 2011, at 2:22 PM, Myklebust, Trond wrote:

>> -----Original Message-----
>> From: David Flynn [mailto:davidf@xxxxxxxxxxxx]
>> Sent: Saturday, October 29, 2011 8:02 PM
>> To: Myklebust, Trond
>> Cc: David Flynn; linux-nfs@xxxxxxxxxxxxxxx; Chuck Lever
>> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>>
>> * Trond Myklebust (Trond.Myklebust@xxxxxxxxxx) wrote:
>>>> Using the same kernel, same mountpoint as before, we're currently
>>>> experiencing a loop involving NFS4ERR_STALE_CLIENTID.
>> ...
>>> The problem seems like a split-brain issue on the server... On the one
>>> hand, it is happily telling us that our lease is OK when we RENEW.
>>> Then when we try to use said lease in an OPEN, it is replying with
>>> STALE_CLIENTID.
>>
>> Thank you for the quick update, especially at the weekend. I'm wondering if
>> it is possible that the STALE_CLIENTID issue is a by-product of the
>> BAD_STATEID issue from earlier. We have observed several times the
>> BAD_STATEID loop, but the CLIENTID problem only seemed to occur when all
>> 40+ nodes were showing problems.
>>
>> After killing off sufficient processes, some of the machines then
>> recovered of their own accord. So your conclusion that there is a server issue
>> sounds reasonable.
>>
>> On any such possible backoff, the previous case was with quite small
>> requests in quite a tight loop that seemed to cause the server grief.
>> This morning, a machine with a 10GbE interface had a BAD_STATEID issue but
>> involving some much larger writes[1], resulting in 1.6Gbit/sec from that
>> machine alone. Thankfully there was only a second machine with 1GbE
>> interfaces bringing the total up to 2.5Gbit/sec.
>>
>> It is this ability for a group of clients to make matters worse that is just as bad
>> as any fault with Solaris.
>
> Sure, but gone are the days when NFS had "reference implementations"
> that everyone had to interoperate with. NFSv4 is a fully documented
> protocol which describes how both clients and servers are supposed to
> work. If either the client or server fails to work according to that
> documentation, then bad things will happen.
>
> While I could litter the client code with lots of little tricks to be
> "defensive" in the face of buggy servers, that isn't going to solve
> anything: the server will still be buggy, and the client will still be
> faced with a situation that it cannot resolve.
>
>> (In a similar vein, it can be just as frustrating trying to get a client to stop
>> looping like this - it is often impossible to kill the process that triggered the
>> problem; for these, we had to resort to deleting the files using NFSv3 (which
>> was working quite happily))
>
> 'kill -9' should in principle work to kill off the process. Was that
> failing to work?

The trick is knowing which process to kill. Generally you have to kill the state manager thread in this case.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
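
For anyone trying to work out which thread that is, here is a rough sketch of one way to look for it from userspace. It assumes (this is not stated anywhere in the thread) that the Linux client names its state manager kernel thread "<server address>-manager" and that the kernel truncates task names to 15 characters, so a long server address may lose part or all of the "-manager" suffix. The helper names `expected_manager_comm` and `find_state_managers` are made up for this illustration.

```python
import os

# Assumption: Linux task names hold 15 visible characters (TASK_COMM_LEN is 16
# including the trailing NUL), so longer names appear truncated in /proc.
TASK_COMM_LEN = 16


def expected_manager_comm(server_addr):
    """Return the task name we would expect for the state manager thread.

    Assumes the thread is named "<server address>-manager"; that naming is an
    assumption of this sketch, not something confirmed in the thread.
    """
    return (server_addr + "-manager")[:TASK_COMM_LEN - 1]


def find_state_managers(server_addr, proc_root="/proc"):
    """List PIDs whose task name matches the expected state manager name."""
    want = expected_manager_comm(server_addr)
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().rstrip("\n")
        except OSError:
            continue  # the task exited, or its comm file is unreadable
        if comm == want:
            matches.append(int(entry))
    return sorted(matches)


if __name__ == "__main__":
    # 10.0.0.5 is a placeholder server address, not one from this thread.
    print(find_state_managers("10.0.0.5"))
```

The sketch only reports candidate PIDs and deliberately sends no signal. Because of the truncation, a match on a long address is ambiguous, so it is worth checking each candidate by hand before running 'kill -9' against it.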