> -----Original Message-----
> From: David Flynn [mailto:davidf@xxxxxxxxxxxx]
> Sent: Saturday, October 29, 2011 9:53 PM
> To: Myklebust, Trond
> Cc: Chuck Lever; J. Bruce Fields; David Flynn; linux-nfs@xxxxxxxxxxxxxxx
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
> * Myklebust, Trond (Trond.Myklebust@xxxxxxxxxx) wrote:
> > > -----Original Message-----
> > > From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> > > On Oct 29, 2011, at 2:47 PM, J. Bruce Fields wrote:
> > > > Yes, and it's not something I care that strongly about, really,
> > > > my only observation is that this sort of failure (an
> > > > implementation bug on one side or another resulting in a loop)
> > > > seems to have been common (based on no hard data, just my vague
> > > > memories of list threads), and the results fairly obnoxious
> > > > (possibly even for unrelated hosts on the network).
> > > >
> > > > So if there's some simple way to fail more gracefully it might
> > > > be helpful.
> > >
> > > For what it's worth, I agree that client implementations should
> > > attempt to behave more gracefully in the face of server problems,
> > > be they the result of bugs or the result of other issues specific
> > > to that server. Problems like this make NFSv4 as a protocol look
> > > bad.
> >
> > I can't see what a client can do in this situation except possibly
> > just give up after a while and throw a SERVER_BROKEN error (which
> > means data loss). That still won't make NFSv4 look good...
>
> Indeed, it is quite the dilemma.
>
> I agree that giving up and guaranteeing unattended data loss is bad
> (data loss at the behest of an operator is ok, after all they can
> always fence a broken machine).
>
> Looking at some of the logs again, even going back to the very
> original case, it appears to be about 600us between retries
> (RTT=400us). Is there any way to make that less aggressive, e.g. 1s?
> -- that'd reduce the impact by three orders of magnitude. What would
> be the down-side? How often do you expect to get a BAD_STATEID error?

BAD_STATEID is a different matter, and is one that we should have
resolved in the NFS client in the upstream kernel. At least on newer
clients, we should be trying to reopen the file and re-establish all
locks when we get a BAD_STATEID.
Can you please remind us which kernel you are using?

That said... Even on new clients, the recovery attempt may fail due to
the STALE_CLIENTID bug. That will still hit us when we call OPEN in
order to get a new stateid.

Cheers
  Trond
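
For reference, the retry-interval arithmetic from the quoted message can be sketched as below. This is only an illustration of the numbers (about 600us between retries versus a proposed 1s), plus one possible capped exponential backoff schedule; it is not what the Linux NFS client does. The function names, the 1 ms starting delay and the 12-attempt count are assumptions made up for this example.

```python
def retry_reduction(old_interval_s: float, new_interval_s: float) -> float:
    """Factor by which the retry rate drops when the interval is lengthened."""
    return new_interval_s / old_interval_s


def backoff_delays(initial_s: float, cap_s: float, attempts: int) -> list[float]:
    """Capped exponential backoff: double the delay each attempt, never exceed cap_s.

    Hypothetical helper for illustration; parameters are not taken from any
    real NFS client configuration.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, cap_s))
        delay *= 2
    return delays


if __name__ == "__main__":
    # 600us observed between retries, 1s proposed in the thread.
    factor = retry_reduction(600e-6, 1.0)
    print(f"retry rate reduced by a factor of about {factor:.0f}")

    # Assumed schedule: start at 1 ms, double each time, cap at 1 s.
    print(backoff_delays(0.001, 1.0, 12))
```

Going from 600us to 1s is a factor of roughly 1,700, which is the "three orders of magnitude" mentioned above. A backoff that starts small and caps at 1s would keep the first few recovery attempts fast while still bounding the load a looping client puts on the server.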