On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> On Tue, 16 Nov 2010 13:20:26 -0500
> "J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:
>
> > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > I am looking into an issue of hanging clients to a set of NFS servers, on
> > > a large HPC cluster.
> > >
> > > My investigation took me to the RPC code, svc_create_socket().
> > >
> > > 	if (protocol == IPPROTO_TCP) {
> > > 		if ((error = kernel_listen(sock, 64)) < 0)
> > > 			goto bummer;
> > > 	}
> > >
> > > A fixed backlog of 64 connections at the server seems like it could be too
> > > low on a cluster like this, particularly when the protocol opens and
> > > closes the TCP connection.
> > >
> > > I wondered what the rationale is behind this number, particularly as it
> > > is a fixed value. Perhaps there is a reason why this has no effect on
> > > nfsd, or is this a FAQ for people on large systems?
> > >
> > > The servers show overflow of a listening queue, which I imagine is
> > > related.
> > >
> > >   $ netstat -s
> > >   [...]
> > >   TcpExt:
> > >     6475 times the listen queue of a socket overflowed
> > >     6475 SYNs to LISTEN sockets ignored
> > >
> > > The affected servers are old, kernel 2.6.9. But this limit of 64 is
> > > consistent across that and the latest kernel source.
> >
> > Looks like the last time that was touched was 8 years ago, by Neil (below, from
> > historical git archive).
> >
> > I'd be inclined to just keep doubling it until people don't complain,
> > unless it's very expensive.  (How much memory (or whatever else) does a
> > pending connection tie up?)
>
> Surely we should "keep multiplying by 13" as that is what I did :-)
>
> There is a sysctl 'somaxconn' which limits what a process can ask for in the
> listen() system call, but as we bypass this syscall it doesn't directly
> affect nfsd.
> It defaults to SOMAXCONN == 128 but can be raised arbitrarily by the sysadmin.
>
> There is another sysctl 'max_syn_backlog' which looks like a system-wide
> limit to the connect backlog.
> This defaults to 256.  The comment says it is
> adjusted between 128 and 1024 based on memory size, though that isn't clear
> in the code (to me at least).

This comment?:

/*
 * Maximum number of SYN_RECV sockets in queue per LISTEN socket.
 * One SYN_RECV socket costs about 80bytes on a 32bit machine.
 * It would be better to replace it with a global counter for all sockets
 * but then some measure against one socket starving all other sockets
 * would be needed.
 *
 * It was 128 by default. Experiments with real servers show, that
 * it is absolutely not enough even at 100conn/sec. 256 cures most
 * of problems. This value is adjusted to 128 for very small machines
 * (<=32Mb of memory) and to 1024 on normal or better ones (>=256Mb).
 * Note : Dont forget somaxconn that may limit backlog too.
 */
int sysctl_max_syn_backlog = 256;

Looks like net/ipv4/tcp.c:tcp_init() does the memory-based calculation.

80 bytes sounds small.

> So we could:
>   - hard code a new number
>   - make this another sysctl configurable
>   - auto-adjust it so that it "just works".
>
> I would prefer the latter if it is possible.  Possibly we could adjust it
> based on the number of nfsd threads, like we do for receive buffer space.
> Maybe something arbitrary like:
>    min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)
>
> which would get the current 64 at 24 threads, and can easily push up to 128
> and beyond with more threads.
>
> Or is that too arbitrary?

I kinda like the idea of piggybacking on an existing constant like
sysctl_max_syn_backlog.  Somebody else hopefully keeps it set to
something reasonable, and as a last resort it gives you a knob to
twiddle.

But number of threads would work OK too.

At a minimum we should make sure we solve the original problem....
Mark, have you had a chance to check whether increasing that number to
128 or more is enough to solve your problem?
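
For concreteness, the thread-based formula quoted above could be sketched
roughly as follows.  This is a standalone illustration only, not kernel
code: the function name nfsd_listen_backlog() and its parameters are made
up here, and somaxconn is passed in as a plain int rather than read from
sock_net(sk)->core.sysctl_somaxconn.

```c
/*
 * Hypothetical helper (not an existing kernel function): compute a
 * listen backlog from the number of nfsd threads, capped at the
 * somaxconn limit, following the quoted
 *	min(16 + 2 * number of threads, somaxconn)
 */
static int nfsd_listen_backlog(int nthreads, int somaxconn)
{
	int want = 16 + 2 * nthreads;	/* grows with the thread count */

	return want < somaxconn ? want : somaxconn;
}
```

With somaxconn at its default of 128, that gives 64 at 24 threads and
reaches the 128 cap from 56 threads up, so going beyond 128 would still
depend on the sysadmin raising somaxconn.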

--b.