Re: NFS4ERR_STALE_CLIENTID loop

David Flynn <davidf@xxxxxxxxxxxx> · Sat, 29 Oct 2011 18:02:27 +0000

* Trond Myklebust (Trond.Myklebust@xxxxxxxxxx) wrote:
> > Using the same kernel, same mountpoint as before, we're currently
> > experiencing a loop involving NFS4ERR_STALE_CLIENTID.
...
> The problem seems like a split-brain issue on the server... On the one
> hand, it is happily telling us that our lease is OK when we RENEW. Then
> when we try to use said lease in an OPEN, it is replying with
> STALE_CLIENTID.

Thank you for the quick update, especially at the weekend.  I'm
wondering if it is possible that the STALE_CLIENTID issue is a by-product
of the BAD_STATEID issue from earlier.  We have observed the
BAD_STATEID loop several times, but the CLIENTID problem only seemed to
occur when all 40+ nodes were showing problems.

After killing off sufficient processes, some of the machines then
recovered of their own accord.  So your conclusion that there is a
server issue sounds reasonable.

On any possible client backoff: the previous case involved quite small
requests in quite a tight loop, which seemed to cause the server grief.
This morning, a machine with a 10GbE interface had a BAD_STATEID issue
but involving some much larger writes[1], resulting in 1.6Gbit/sec from
that machine alone.  Thankfully only one other machine, with 1GbE
interfaces, was looping too, bringing the total up to 2.5Gbit/sec.
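
To illustrate the kind of backoff being discussed, here is a minimal
sketch of capped exponential backoff around a retried operation.  It is
not the kernel client's actual recovery code; the names backoff_delays
and retry_with_backoff, the status strings, and the base/cap/attempts
values are all hypothetical, chosen only for illustration.

```python
import time


def backoff_delays(base=0.1, cap=15.0, attempts=8):
    """Yield `attempts` delays that double each time, limited to `cap` seconds.

    All parameter values are illustrative assumptions, not kernel defaults.
    """
    delay = base
    for _ in range(attempts):
        yield delay
        delay = min(cap, delay * 2)


def retry_with_backoff(op, is_retryable, sleep=time.sleep, **kwargs):
    """Call op(); while its result is retryable, sleep and call it again.

    Gives up after the delays are exhausted and returns the last result.
    `op` and `is_retryable` are hypothetical callables supplied by the caller.
    """
    result = op()
    for delay in backoff_delays(**kwargs):
        if not is_retryable(result):
            return result
        sleep(delay)  # wait longer after each failure instead of looping tightly
        result = op()
    return result
```

With a doubling delay, a client that keeps receiving the same error sends
fewer and fewer requests over time, rather than a constant stream, so a
group of looping clients puts a bounded load on the server.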

It is this ability for a group of clients to make matters worse that
is just as bad as any fault with Solaris.

(In a similar vein, it can be just as frustrating trying to get a client
to stop looping like this - it is often impossible to kill the process
that triggered the problem; for these, we had to resort to deleting
the files using NFSv3 (which was working quite happily))

Thank you again,
..david

[1] Capture: ftp://ftp.kw.bbc.co.uk/davidf/priv/waquahso.pcap
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html