Commit 0ced63d1, which went into 3.0-rc1, fixes this condition (a sketch of
the decision it changes is at the very end of this message):

    NFSv4: Handle expired stateids when the lease is still valid

    Currently, if the server returns NFS4ERR_EXPIRED in reply to a READ or
    WRITE, but the RENEW test determines that the lease is still active, we
    fail to recover and end up looping forever in a READ/WRITE + RENEW
    death spiral.

    Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
    Cc: stable@xxxxxxxxxx

-->Andy

On Thu, Dec 22, 2011 at 2:25 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On Dec 22, 2011, at 1:36 PM, Chuck Lever wrote:
>
>> Hi Paul, long time!
>>
>> On Dec 22, 2011, at 1:31 PM, Paul Anderson wrote:
>>
>>> Issue: an extremely high rate of packets like the following (tcpdump):
>>>
>>> 16:31:09.308678 IP (tos 0x0, ttl 64, id 41517, offset 0, flags [DF],
>>> proto TCP (6), length 144)
>>>     r20.xxx.edu.1362383749 > nfsb.xxx.edu.nfs: 88 null
>>> 16:31:09.308895 IP (tos 0x0, ttl 64, id 22578, offset 0, flags [DF],
>>> proto TCP (6), length 100)
>>>     nfsb.xxx.edu.nfs > r20.xxx.edu.1362383749: reply ok 44 null
>>> 16:31:09.308926 IP (tos 0x0, ttl 64, id 41518, offset 0, flags [DF],
>>> proto TCP (6), length 192)
>>>     r20.xxx.edu.1379160965 > nfsb.xxx.edu.nfs: 136 null
>>>
>>> All Linux kernels are from kernel.org, version 2.6.38.5 with the
>>> addition of Mosix.  All userland is Ubuntu 10.04 LTS.
>>>
>>> Scenario: the compute cluster is composed of 50-60 compute nodes and
>>> 10 or so head nodes that act as compute/login and high-rate NFS
>>> servers, mostly for sequential processing of high-volume genetic
>>> sequencing data (one recent job read 50-70 TiB and wrote 50 TiB).  We
>>> see this problem regularly (two servers are being hit this way right
>>> now), and it apparently clears only when the server is rebooted.
>>>
>>> Something in our use of the cluster appears to be triggering what
>>> looks like a race condition in the NFSv4 client/server communication.
>>> This issue prevents us from using NFS reliably in our cluster.
>>> Although we do very high I/O at times, that alone does not appear to
>>> be the trigger.  It is possibly related to SLURM starting 200-300
>>> jobs at once, where each job hits a common NFS file server for the
>>> program binaries, for example.  In our cluster testing, this appears
>>> to reliably cause about half the jobs to fail while loading the
>>> program itself - they hang in D state indefinitely, but are killable.
>>>
>>> Looking at dozens of clients, we can run tcpdump and see messages
>>> similar to the above being sent at a high rate from the
>>> gigabit-connected compute nodes - the main indication being a
>>> context-switch rate of 20-30K per second.  The 10-gigabit-connected
>>> server is functioning, but sees context-switch rates of 200-300K per
>>> second - an exceptional rate that appears to slow down NFS service
>>> for all other users.  I have not done any extensive packet capture to
>>> determine actual traffic rates, but am pretty sure it is limited by
>>> wire speed and CPU.
>>>
>>> The client nodes in this scenario are not actively being used - some
>>> show zero processes in D state, others show dozens of jobs stuck in D
>>> state (these jobs are unkillable) - while the NFSv4 server shows nfsd
>>> threads running flat out.
>>>
>>> Mount commands look like this:
>>>
>>> for h in $servers; do
>>>     mount -t nfs4 -o rw,soft,intr,nodev,nosuid,async ${h}:/ /net/$h
>>> done
>>>
>>> The NFSv4 servers all run the stock Ubuntu 10.04 setup - no tuning
>>> has been done.
>>>
>>> We can trivially get packet captures with more packets, but they are
>>> all similar - 15-20 client nodes all pounding one NFS server node.
>>
>> We'd need to see full-frame raw captures ("tcpdump -s0 -w /tmp/raw").
>> Let's see a few megabytes.
>>
>> On the face of it, it looks like it could be a state reclaim loop, but
>> I can't say until I see a full network capture.
>
> Paul sent me a pair of pcap traces off-list.
>
> This is yet another state reclaim loop.  The server is returning
> NFS4ERR_EXPIRED to a READ request, but the client's subsequent RENEW
> gets NFS4_OK, so it doesn't do any state recovery.  It simply keeps
> retrying the READ.
>
> Paul, have you tried upgrading your clients to the latest kernel.org
> release?  Was there a recent network partition or server reboot that
> could have triggered these events?
>
> One reason this might happen is if the client is using an expired
> state ID for the READ, and then uses a valid client ID during the
> RENEW request.  This can happen if the client failed, some time
> earlier, to completely reclaim state after a server outage.
>
> Bruce, is there any way to look at the state ID and client ID tokens
> to verify they are for the same lease?
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
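[As an aside on Chuck's last question: with a capture in hand, one rough way
to check whether the stateid in the looping READ and the clientid in the
RENEW belong to the same lease is to compare the raw bytes.  The helper
below is a hypothetical sketch, not anything shipped in nfs-utils, and it
assumes the Linux knfsd layout (a 4-byte seqid, then a 12-byte opaque field
whose leading 8 bytes carry the clientid).  That is an implementation
detail, not a protocol guarantee, so treat a mismatch only as a hint.]

/*
 * Rough capture-analysis helper: given the 16-byte stateid from the
 * looping READ and the 8-byte clientid from the RENEW (both copied out
 * of wireshark as hex), check whether the clientid appears at the start
 * of the stateid's opaque "other" field.
 *
 * ASSUMPTION: Linux knfsd encodes the clientid in the first 8 bytes of
 * the stateid's 12-byte opaque field.  Other servers, or other kernel
 * versions, may lay it out differently.
 *
 * Usage: ./stateid_check <32 hex chars of stateid> <16 hex chars of clientid>
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int hex2bytes(const char *hex, unsigned char *out, size_t n)
{
    if (strlen(hex) != n * 2)
        return -1;
    for (size_t i = 0; i < n; i++) {
        unsigned int byte;
        if (sscanf(hex + 2 * i, "%2x", &byte) != 1)
            return -1;
        out[i] = (unsigned char)byte;
    }
    return 0;
}

int main(int argc, char **argv)
{
    unsigned char stateid[16], clientid[8];

    if (argc != 3 ||
        hex2bytes(argv[1], stateid, sizeof(stateid)) ||
        hex2bytes(argv[2], clientid, sizeof(clientid))) {
        fprintf(stderr, "usage: %s <stateid:32 hex> <clientid:16 hex>\n",
                argv[0]);
        return 2;
    }

    /* stateid[0..3] is the seqid; stateid[4..15] is the opaque "other". */
    if (memcmp(stateid + 4, clientid, sizeof(clientid)) == 0) {
        printf("clientid matches the stateid's opaque field: same lease, "
               "so the stateid itself is what expired\n");
        return 0;
    }
    printf("clientid does NOT match the stateid's opaque field: the READ "
           "may be using state from an older lease\n");
    return 1;
}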
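[And, against the commit quoted at the top of this message, a minimal
user-space sketch of the decision it changes.  This is not the kernel code;
the types and helpers (server_read(), recover_stateid(), and so on) are
made-up stand-ins that only show why retrying with the same, already-expired
stateid loops forever when the RENEW result alone decides whether recovery
is needed.]

/*
 * Minimal, self-contained sketch (not kernel code) of NFS4ERR_EXPIRED
 * handling.  All names here are hypothetical stand-ins.
 *
 * Build: cc -o expired expired.c && ./expired
 */
#include <stdbool.h>
#include <stdio.h>

enum nfs_status { NFS4_OK, NFS4ERR_EXPIRED };

struct client_state {
    int stateid;          /* stand-in for the READ/WRITE stateid  */
    bool lease_valid;     /* what a RENEW against the server says */
};

/* Simulated server: the lease is fine, but stateid 1 has expired. */
static enum nfs_status server_read(int stateid)
{
    return (stateid == 1) ? NFS4ERR_EXPIRED : NFS4_OK;
}

static enum nfs_status server_renew(void) { return NFS4_OK; }

/* Stand-in for re-establishing open/lock state with the server. */
static void recover_stateid(struct client_state *cs)
{
    cs->stateid = 2;      /* pretend we re-opened and got a fresh stateid */
}

static void do_read(struct client_state *cs, bool recover_expired_stateid)
{
    for (int attempt = 1; attempt <= 5; attempt++) {
        if (server_read(cs->stateid) == NFS4_OK) {
            printf("READ ok on attempt %d (stateid %d)\n",
                   attempt, cs->stateid);
            return;
        }
        /* Got NFS4ERR_EXPIRED: test the lease first, as the client does. */
        cs->lease_valid = (server_renew() == NFS4_OK);
        if (cs->lease_valid && !recover_expired_stateid) {
            /* Pre-fix behaviour: the lease looks fine, so just retry the
             * READ with the same, already-expired stateid -> endless loop. */
            continue;
        }
        /* Post-fix behaviour: the stateid itself must be recovered even
         * though the lease is still valid. */
        recover_stateid(cs);
    }
    printf("gave up after 5 attempts (stateid %d still expired)\n",
           cs->stateid);
}

int main(void)
{
    struct client_state a = { .stateid = 1, .lease_valid = true };
    struct client_state b = { .stateid = 1, .lease_valid = true };

    printf("Without stateid recovery (the reported loop):\n");
    do_read(&a, false);
    printf("With stateid recovery (behaviour after the fix):\n");
    do_read(&b, true);
    return 0;
}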