RE: NFS4ERR_STALE_CLIENTID loop

"Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> · Sat, 29 Oct 2011 11:22:01 -0700

> -----Original Message-----
> From: David Flynn [mailto:davidf@xxxxxxxxxxxx]
> Sent: Saturday, October 29, 2011 8:02 PM
> To: Myklebust, Trond
> Cc: David Flynn; linux-nfs@xxxxxxxxxxxxxxx; Chuck Lever
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
> 
> * Trond Myklebust (Trond.Myklebust@xxxxxxxxxx) wrote:
> > > Using the same kernel, same mountpoint as before, we're currently
> > > experiencing a loop involving NFS4ERR_STALE_CLIENTID.
> ...
> > The problem seems like a split-brain issue on the server... On the one
> > hand, it is happily telling us that our lease is OK when we RENEW.
> > Then when we try to use said lease in an OPEN, it is replying with
> > STALE_CLIENTID.
> 
> Thank you for the quick update, especially at the weekend.  I'm wondering if
> it is possible that the STALE_CLIENTID issue is a by-product of the
> BAD_STATEID issue from earlier.  We have observed several times the
> BAD_STATEID loop, but the CLIENTID problem only seemed to occur when all
> 40+ nodes were all showing problems.
> 
> After killing off sufficient processes, some of the machines then
> recovered of their own accord.  So your conclusion that there is a server
> issue sounds reasonable.
> 
> On any such possible backoff, the previous case was with quite small
> requests in quite a tight loop that seemed to cause the server grief.
> This morning, a machine with a 10GbE interface had a BAD_STATEID issue but
> involving some much larger writes[1], resulting in 1.6Gbit/sec from that
> machine alone.  Thankfully there was only a second machine with 1GbE
> interfaces bringing the total up to 2.5Gbit/sec.
> 
> It is this ability for a group of clients to make matters worse that is just
> as bad as any fault with Solaris.

Sure, but gone are the days when NFS had "reference implementations"
that everyone had to interoperate with. NFSv4 is a fully documented
protocol which describes how both clients and servers are supposed to
work. If either the client or the server fails to work according to that
documentation, then bad things will happen.

While I could litter the client code with lots of little tricks to be
"defensive" in the face of buggy servers, that isn't going to solve
anything: the server will still be buggy, and the client will still be
faced with a situation that it cannot resolve.
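
As a rough illustration of that point, here is a toy model of the exchange
described in this thread: RENEW succeeds, yet OPEN with the same clientid is
rejected. The class and function names, the retry limit and the delays are
all made up for the sketch; this is not the Linux client code, and real
clientid recovery is not modelled. Backing off lowers the request rate, but
the OPEN still never succeeds.

```python
# Toy model only: every name and number here is hypothetical.
NFS4_OK = "NFS4_OK"
NFS4ERR_STALE_CLIENTID = "NFS4ERR_STALE_CLIENTID"


class SplitBrainServer:
    """Server that gives the inconsistent answers described in the thread."""

    def renew(self, clientid):
        # The lease is reported as valid...
        return NFS4_OK

    def open(self, clientid, path):
        # ...but using that same lease in an OPEN is rejected.
        return NFS4ERR_STALE_CLIENTID


def open_with_backoff(server, clientid, path, max_attempts=5, base_delay=0.5):
    """Retry OPEN with exponential backoff.

    Returns (last status, attempts made, total simulated wait in seconds).
    The wait is only accumulated, never slept, so the sketch runs instantly.
    """
    delay = base_delay
    waited = 0.0
    for attempt in range(1, max_attempts + 1):
        status = server.open(clientid, path)
        if status == NFS4_OK:
            break
        if server.renew(clientid) != NFS4_OK:
            # A real client would re-establish its clientid at this point.
            # With this server the branch is never taken.
            break
        waited += delay
        delay *= 2
    return status, attempt, waited


result = open_with_backoff(SplitBrainServer(), clientid=1, path="/export/file")
print(result)
```

The only thing the backoff changes is how often the server is asked; the
outcome of each OPEN is the same on every attempt.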

> (In a similar vein, it can be just as frustrating trying to get a client to
> stop looping like this - it is often impossible to kill the process that
> triggered the problem; for these, we had to resort to deleting the files using
> NFSv3 (which was working quite happily))

'kill -9' should in principle work to kill off the process. Was that
failing to work?

Cheers
   Trond