On Dec 8, 2010, at 1:28 PM, Chuck Lever wrote:

> On Dec 8, 2010, at 12:31 PM, Mark Hills wrote:
>
>> On Wed, 8 Dec 2010, Chuck Lever wrote:
>>
>>> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
>>>
>>>> On Wed, 1 Dec 2010, Chuck Lever wrote:
>>>>
>>>>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>>>>>
>>>> This returns the issue to giving up after the 9999 millisecond timeout
>>>> above, which I think is making the client overly sensitive.
>>>>
>>>> The affected clients run the Fedora 12 distribution: nfs-utils-1.2.1-4.fc12.
>>>> We are using mount.nfs for a v3 mount over TCP.
>>>>
>>>> Locating this timeout wasn't easy. There's no '9999' in any of the
>>>> nfs-utils or glibc's rpc code.
>>>>
>>>> It turns out that the timeout originates in
>>>> nfs-utils/support/nfs/rpc_socket.c as
>>>>
>>>>   #define NFSRPC_TIMEOUT_TCP (10)
>>>>
>>>> This is used to populate a struct timeval *timeout which is used and
>>>> updated by nfs_connect_nb(). Then the remaining timeout of 9999
>>>> milliseconds is used in an RPC clnt_call operation.
>>>>
>>>> 10 seconds over TCP is simply too short to be reliable here; it's not a
>>>> long period of time for a fatal timeout. I will attempt to test on our
>>>> cluster with an extended timeout.
>>>
>>> This is a TCP connect timeout, not an RPC retransmit timeout. Waiting 10
>>> seconds for a TCP connect is really not that obnoxious. The reason this
>>> is short is that we want mount.nfs to realize quickly that it's making
>>> no progress on stuck servers.
>>
>> If my understanding of the code is correct, the 10 second timeout is not
>> just for the connect, but also for making an RPC call and receiving the
>> response (the struct timeval is carried through). Hence the diagnosis in
>> the previous mail.
>>
>>>> I presume that the bug is that the timeout is too short, and not that
>>>> the timeout should cause a retry in a layer above, because re-sending
>>>> the request over reliable TCP would not make sense.
>>>
>>> The upper layer wouldn't resend the request, since the underlying
>>> transport never connected.
>>>
>>>> For UDP the retries are handled inside the RPC call, clntudp_call().
>>>> In which case, what is the reason the UDP timeout differs from TCP?
>>>> And is it also too small?
>>>
>>> RPC over UDP doesn't have to connect to the server. I wonder what would
>>> happen if you specified "mountproto=udp" when mounting exports on this
>>> overloaded server.
>>
>> I think we'd still see the same issue over UDP, perhaps even more
>> brutally because of the shorter 3 second UDP timeout.
>>
>> To clarify, we've deviated from the problem described in the original
>> email; it's not opening a TCP connection to the mountd that is the
>> problem, but mountd's ability to send a message in response.
>
> So, you've verified (with, say, a network trace) that the clients are
> connecting successfully, but mountd is not sending a reply to the MNT
> request within ten seconds?

I've opened https://bugzilla.linux-nfs.org/show_bug.cgi?id=196 to track
the mount.nfs part of this issue.

It seems to me, however, that under normal circumstances nfs-utils 1.2.1
on Fedora 12 should be doing a kernel mount, and mount.nfs should not be
attempting to send a MNT request.

Would it be possible to enable some trace debugging to capture the
sequence of kernel events during a failed mount? On the clients:

  # rpcdebug -m nfs -s mount

To disable:

  # rpcdebug -m nfs -c

Debugging messages will appear in /var/log/messages.

A full network trace could also be useful.
It's just needed while your jobs are starting, so we can capture the
network traffic at mount time.

  # tcpdump -s 1600 -w /tmp/raw

Doing this on your clients would likely be more reliable than on the
server.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
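
[To make the timeout behaviour discussed above concrete, here is a minimal
illustrative sketch. It is not the nfs-utils source; the helper names
consume_budget() and mnt_call_with_budget() are made up for this example.
It only shows the general pattern Mark describes: a single struct timeval
budget (10 seconds for TCP) is reduced by the time already spent
connecting, and whatever remains bounds the clnt_call that follows, so a
slow mountd reply can exhaust the budget even when the connect itself was
fast.]

  /*
   * Illustration only; not the nfs-utils source.  A single timeout
   * budget is shared between the TCP connect and the RPC call itself.
   */
  #include <stddef.h>
  #include <sys/time.h>
  #include <rpc/rpc.h>

  /* Subtract the time already spent (since 'start') from 'budget'. */
  static void consume_budget(struct timeval *budget,
                             const struct timeval *start)
  {
          struct timeval now, elapsed;

          gettimeofday(&now, NULL);
          timersub(&now, start, &elapsed);
          if (timercmp(&elapsed, budget, <))
                  timersub(budget, &elapsed, budget);
          else
                  timerclear(budget);             /* budget exhausted */
  }

  /* Hypothetical caller: connect first, spend what is left on the call. */
  static enum clnt_stat mnt_call_with_budget(CLIENT *client,
                                             struct timeval connect_start)
  {
          struct timeval budget = { 10, 0 };      /* cf. NFSRPC_TIMEOUT_TCP */

          consume_budget(&budget, &connect_start);

          /*
           * If connecting took ~1 ms, roughly 9999 ms remain for the server
           * to reply.  A NULL-procedure ping is used here for simplicity;
           * in the real mount path the MNT request would be sent instead.
           */
          return clnt_call(client, 0 /* NULLPROC */,
                           (xdrproc_t)xdr_void, NULL,
                           (xdrproc_t)xdr_void, NULL, budget);
  }

[As Mark describes, rpc_socket.c implements the same idea by having
nfs_connect_nb() update the caller's struct timeval, which is why the
value that finally reaches clnt_call() is 9999 ms rather than the full
10 seconds.]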