Re: NFS: nfs4_reclaim_open_state: Lock reclaim failed! log spew

Olga Kornievskaia <aglo@xxxxxxxxx> · Thu, 17 Nov 2016 14:58:12 -0500

On Thu, Nov 17, 2016 at 2:32 PM, bfields@xxxxxxxxxxxx
<bfields@xxxxxxxxxxxx> wrote:
> On Thu, Nov 17, 2016 at 05:45:52PM +0000, Trond Myklebust wrote:
>> On Thu, 2016-11-17 at 11:31 -0500, J. Bruce Fields wrote:
>> > On Wed, Nov 16, 2016 at 02:55:05PM -0600, Jason L Tibbitts III wrote:
>> > >
>> > > I'm replying to a rather old message, but the issue has just now
>> > > popped
>> > > back up again.
>> > >
>> > > To recap, a client stops being able to access _any_ mount on a
>> > > particular server, and "NFS: nfs4_reclaim_open_state: Lock reclaim
>> > > failed!" appears several hundred times per second in the kernel
>> > > log.
>> > > The load goes up by one for every process attempting to access any
>> > > mount
>> > > from that particular server.  Mounts to other servers are fine, and
>> > > other clients can mount things from that one server without
>> > > problems.
>> > >
>> > > When I kill every process keeping that particular mount active and
>> > > then
>> > > umount it, I see:
>> > >
>> > > NFS: nfs4_reclaim_open_state: unhandled error -10068
>> >
>> > NFS4ERR_RETRY_UNCACHED_REP.
>> >
>> > So, you're using NFSv4.1 or 4.2, and the server thinks that the
>> > client
>> > has reused a (slot, sequence number) pair, but the server doesn't
>> > have a
>> > cached response to return.
>> >
>> > Hard to know how that happened, and it's not shown in the below.
>> > Sounds like a bug, though.
>>
>> ...or a Ctrl-C....
>
> How does that happen?
>

If I may chime in...

Bruce, when an application is interrupted with Ctrl-C after the
client's session slot has sent out an RPC but before the reply has been
processed, the client doesn't know whether the server processed that
sequence id or not. In that case, the client doesn't increment the
sequence number. Normally the client would handle such an error by
retrying (and resetting the slots), but I think that during a recovery
operation the client handles errors differently (by just erroring
out). I believe the reasoning is that we don't want to be stuck trying
to recover from the recovery from the recovery, etc...
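
To make the slot behavior above concrete, here is a minimal toy model
in Python. It is only a sketch, not the Linux client or any real
server: the ServerSlot and ClientSlot classes, the "interrupted" flag,
and the reply strings are all hypothetical names made up for
illustration. The two error values are the NFSv4.1 protocol numbers
(10068 matches the -10068 in Jason's log). It assumes the server did
not cache the reply for the interrupted request.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NFS4_OK = 0
NFS4ERR_SEQ_MISORDERED = 10063
NFS4ERR_RETRY_UNCACHED_REP = 10068


@dataclass
class ServerSlot:
    """Hypothetical server-side view of one session slot."""
    last_seq: int = 0                    # last sequence id executed
    cached_reply: Optional[str] = None   # reply kept for replays, if any

    def handle(self, seq: int, cache_this: bool) -> Tuple[int, Optional[str]]:
        if seq == self.last_seq + 1:
            # New request: execute it and optionally cache the reply.
            self.last_seq = seq
            reply = f"reply-{seq}"
            self.cached_reply = reply if cache_this else None
            return NFS4_OK, reply
        if seq == self.last_seq:
            # Same (slot, sequence id) seen again: treat it as a replay.
            if self.cached_reply is not None:
                return NFS4_OK, self.cached_reply
            return NFS4ERR_RETRY_UNCACHED_REP, None
        return NFS4ERR_SEQ_MISORDERED, None


@dataclass
class ClientSlot:
    """Hypothetical client-side view of the same slot."""
    seq: int = 1  # sequence id to use for the next request

    def call(self, server: ServerSlot, cache_this: bool = False,
             interrupted: bool = False) -> Optional[int]:
        status, _reply = server.handle(self.seq, cache_this)
        if interrupted:
            # Ctrl-C: the reply is never processed, so the client cannot
            # tell whether the server executed the request. The sequence
            # id is left unchanged.
            return None
        if status == NFS4_OK:
            self.seq += 1
        return status


server, client = ServerSlot(), ClientSlot()
first = client.call(server)                    # normal call, seq 1
lost = client.call(server, interrupted=True)   # seq 2 executed, reply lost
reused = client.call(server)                   # seq 2 sent again
print(first, lost, reused)
```

In this model the third call reuses sequence id 2, which the server has
already executed without caching a reply, so it answers with
NFS4ERR_RETRY_UNCACHED_REP.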

Jason,

The UNCACHED_REP error is really not interesting, as it's a consequence
of your client having already failed with an "unable to reclaim the
locks" error. I'm surprised that the application doesn't fail at this
point with EIO. But that aside, I think I've seen this kind of behavior
when the client's callback channel goes down (the client doesn't reply
to the CB_RECALLs, and the server then revokes state).

> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html