On Apr 15, 2008, at 11:22 PM, Tom Tucker wrote:

> On Tue, 2008-04-15 at 22:58 -0400, J. Bruce Fields wrote:
>> On Tue, Apr 15, 2008 at 09:43:10PM -0500, Tom Tucker wrote:
>>>
>>> On Tue, 2008-04-15 at 11:12 -0400, J. Bruce Fields wrote:
>>>> On Mon, Apr 14, 2008 at 11:48:33PM -0500, Tom Tucker wrote:
>>>>>
>>>>> Maybe this is a TCP_BACKLOG issue?
>>>>
>>>> So, looking around.... There seems to be a global limit in
>>>> /proc/sys/net/ipv4/tcp_max_syn_backlog (default 1024?); might be
>>>> worth seeing what happens if that's increased, e.g., with
>>>>
>>>> echo 2048 >/proc/sys/net/ipv4/tcp_max_syn_backlog
>>>
>>> I think this represents the collective total for all listening
>>> endpoints. I think we're only talking about mountd.
>>
>> Yes.
>>
>>> Shooting from the hip...
>>>
>>> My gray-haired recollection is that the single-connection default
>>> is a backlog of 10 (SYNs received, not accepted connections).
>>> Additional SYNs received on this endpoint will be dropped...
>>> clients will retry the SYN as part of normal TCP retransmit...
>>>
>>> It might be that the CLOSE_WAITs in the log are _normal_. That is,
>>> they reflect completed mount requests that are in the normal close
>>> path. If they never go away, then that's not normal. Is this the
>>> case?
>>
>> What he said was:
>>
>> "those fall over after some time and stay in CLOSE_WAIT state
>> until I restart the nfs-kernel-server."
>>
>> Carsten, are you positive that the same sockets were in CLOSE_WAIT
>> the whole time you were watching? And how long was it before you
>> gave up and restarted?
>>
>>> Suppose the 10 is roughly correct. The remaining "jilted" clients
>>> will retransmit their SYN after a randomized exponential backoff.
>>> I think you can imagine that trying 1300+ connections of which
>>> only 10 succeed, and then retrying 1300-10 based on a randomized
>>> exponential backoff, might get you some pretty bad performance.
>>
>> Right, could be, but:
>>
>> ...
>>>> Oh, but: Grepping the glibc rpc code, it looks like it calls
>>>> listen with second argument SOMAXCONN == 128. You can confirm
>>>> that by strace'ing rpc.mountd -F and looking for the listen call.
>>>>
>>>> And that socket's shared between all the mountd processes, so I
>>>> guess that's the real limit. I don't see an easy way to adjust
>>>> that. You'd also need to increase /proc/sys/net/core/somaxconn
>>>> first.
>>>>
>>>> But none of this explains why we'd see connections stuck in
>>>> CLOSE_WAIT indefinitely?
>>
>> So the limit appears to be more like 128, and (based on my quick
>> look at the code) that appears to be baked into the glibc rpc code.
>>
>> Maybe you could code around that in mountd. Looks like the relevant
>> code is in nfs-utils/support/include/rpcmisc.c:rpc_init().
>
> If you really need to start 1300 mounts all at once then something
> needs to change. BTW, even after you get past mountd, the server is
> going to get pounded with SYN and RPC_NOP.

Would it be worth trying UDP, just as an experiment?  Force UDP for
the mountd protocol by specifying the "mountproto=udp" option.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
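
For reference, the checks and knobs discussed above boil down to a few
shell commands. This is only a sketch: the sysctl values are
illustrative rather than tuned recommendations, and the strace
invocation assumes rpc.mountd is run by hand in the foreground (-F),
as suggested earlier in the thread. All of these need root.

  # Confirm the backlog rpc.mountd actually passes to listen(2)
  strace -f -e trace=listen rpc.mountd -F

  # Watch whether the same sockets stay stuck in CLOSE_WAIT
  netstat -tan | grep CLOSE_WAIT

  # Raise the global SYN backlog and the per-socket accept-queue cap
  # (example values only)
  echo 2048 > /proc/sys/net/ipv4/tcp_max_syn_backlog
  echo 1024 > /proc/sys/net/core/somaxconn

Note that raising the sysctls alone won't get past whatever backlog
rpc.mountd itself passes to listen(): the kernel clamps the accept
queue to the smaller of that argument and net.core.somaxconn.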
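
On the client side, the mountproto=udp experiment would look roughly
like the command below; the server name, export path, and mount point
are placeholders, not values from this thread.

  # "server:/export" and /mnt are placeholders.  mountproto=udp only
  # affects the MOUNT protocol (the conversation with rpc.mountd);
  # the NFS traffic itself is still governed by the separate proto=
  # option.
  mount -t nfs -o mountproto=udp server:/export /mnt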