Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)

On Dec 8, 2010, at 1:28 PM, Chuck Lever wrote:

> 
> On Dec 8, 2010, at 12:31 PM, Mark Hills wrote:
> 
>> On Wed, 8 Dec 2010, Chuck Lever wrote:
>> 
>>> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
>>> 
>>>> On Wed, 1 Dec 2010, Chuck Lever wrote:
>>>> 
>>>>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>>>>> 
>>>> This returns the issue to giving up after the 9999 millisecond timeout 
>>>> above, which I think is making the client overly sensitive.
>>>> 
>>>> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
>>>> We are using mount.nfs for a v3 mount over TCP.
>>>> 
>>>> Locating this timeout wasn't easy. There's no '9999' in any of the 
>>>> nfs-utils or glibc's rpc code.
>>>> 
>>>> It turns out that the timeout originates in 
>>>> nfs-utils/support/nfs/rpc_socket.c as
>>>> 
>>>> #define NFSRPC_TIMEOUT_TCP	(10)
>>>> 
>>>> This is used to populate a struct timeval *timeout which is used and 
>>>> updated by nfs_connect_nb(). Then the remaining timeout of 9999 
>>>> milliseconds is used in an RPC clnt_call operation.
>>>> 
>>>> 10 seconds over TCP is simply too short to be reliable here; it's not a 
>>>> long period of time for a fatal timeout. I will attempt to test on our 
>>>> cluster with an extended timeout.
>>> 
>>> This is a TCP connect timeout, not an RPC retransmit timeout.  Waiting 10 
>>> seconds for a TCP connect is really not that obnoxious.  The reason this 
>>> is short is that we want mount.nfs to realize quickly that it's 
>>> making no progress on stuck servers.
>> 
>> If my understanding of the code is correct, the 10 second timeout is not 
>> just for the connect, but also for making an RPC call and receiving the 
>> response (the struct timeval is carried through). Hence the diagnosis in 
>> the previous mail.
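
A minimal sketch of that shared-budget pattern, using hypothetical
names rather than the actual nfs-utils source: the connect subtracts
whatever time it consumed, and the RPC call then runs with only the
remainder.

  /* Hypothetical sketch; connect_within() and rpc_call_within() are
   * assumed stand-ins for nfs_connect_nb() and clnt_call(). */
  #include <stdbool.h>
  #include <sys/time.h>

  #define SKETCH_TIMEOUT_TCP 10  /* seconds, like NFSRPC_TIMEOUT_TCP */

  bool connect_within(int fd, struct timeval *budget);  /* subtracts elapsed time */
  bool rpc_call_within(int fd, struct timeval *budget); /* total call timeout */

  bool probe_service(int fd)
  {
          struct timeval budget = { .tv_sec = SKETCH_TIMEOUT_TCP, .tv_usec = 0 };

          if (!connect_within(fd, &budget))
                  return false;  /* the connect alone exhausted the 10 seconds */

          /* A 1 ms connect leaves roughly 9999 ms in the budget, which
           * becomes the total timeout for the RPC; a mountd that takes
           * longer than that to reply fails the mount even though the
           * connect itself was instant. */
          return rpc_call_within(fd, &budget);
  }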
>> 
>>>> I presume that the bug is that the timeout is too short, and not that 
>>>> the timeout should cause a retry in a layer above -- because 
>>>> re-sending the request over reliable TCP would not make sense.
>>> 
>>> The upper layer wouldn't resend the request, since the underlying 
>>> transport never connected.
>>> 
>>>> For UDP the retries are handled inside the RPC call, clntudp_call(). 
>>>> In which case, what is the reason the UDP timeout differs from TCP? 
>>>> And is it also too small?
>>> 
>>> RPC over UDP doesn't have to connect to the server.  I wonder what would 
>>> happen if you specified "mountproto=udp" when mounting exports on this 
>>> overloaded server.
>> 
>> I think we'd still see the same issue over UDP, perhaps even more brutally 
>> because of the shorter 3 second UDP timeout.
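
For reference, a rough sketch of the classic SunRPC UDP pattern being
described (MOUNT is program 100005; the NULLPROC ping and the timeout
values here are illustrative): clntudp_create() takes the retransmit
interval and clnt_call() takes the total timeout, so the retries happen
inside the call itself rather than via a reconnect.

  #include <rpc/rpc.h>
  #include <netinet/in.h>
  #include <stddef.h>

  void ping_mountd_udp(struct sockaddr_in *addr)
  {
          struct timeval retrans = { 3, 0 };  /* resend every 3 seconds */
          struct timeval total   = { 9, 0 };  /* give up after 9 seconds */
          int sock = RPC_ANYSOCK;

          CLIENT *clnt = clntudp_create(addr, 100005, 3, retrans, &sock);
          if (clnt == NULL)
                  return;

          /* A NULLPROC ping; clnt_call() retransmits internally on the
           * 3-second interval until the 9-second total expires. */
          (void)clnt_call(clnt, NULLPROC, (xdrproc_t)xdr_void, NULL,
                          (xdrproc_t)xdr_void, NULL, total);
          clnt_destroy(clnt);
  }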
>> 
>> To clarify, we've deviated from the problem described in the original 
>> email; it's not opening a TCP connection to mountd that is the problem, 
>> but mountd's ability to send a message in response.
> 
> So, you've verified (with, say, a network trace) that the clients are connecting successfully, but mountd is not sending a reply to the MNT request within ten seconds?

I've opened https://bugzilla.linux-nfs.org/show_bug.cgi?id=196 to track the mount.nfs part of this issue.

It seems to me, however, that under normal circumstances nfs-utils 1.2.1 on Fedora 12 should be doing a kernel mount, and mount.nfs should not be attempting to send a MNT request.  Would it be possible to enable some trace debugging to capture the sequence of kernel events during a failed mount?

On the clients:

  # rpcdebug -m nfs -s mount

To disable:

  # rpcdebug -m nfs -c

Debugging messages will appear in /var/log/messages.

A full network trace could also be useful.  It's needed only while your jobs are starting, so we can capture the network traffic at mount time.

  # tcpdump -s 1600 -w /tmp/raw

Doing this on your clients would likely be more reliable than on the server.
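
When you read the capture back, keep in mind that mountd listens on a
dynamically assigned port, so a fixed port filter can miss the MNT
traffic itself; filtering on the portmapper and NFS ports will at least
show the GETPORT exchange, for example:

  # tcpdump -n -r /tmp/raw port 111 or port 2049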

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





