On Fri, Nov 19, 2010 at 06:17:25PM -0500, Trond Myklebust wrote:

> On Fri, 2010-11-19 at 14:58 -0800, Simon Kirby wrote:
> > On Fri, Nov 19, 2010 at 05:17:19PM -0500, Trond Myklebust wrote:
> > > So what were all the
> > >
> > > 'lockd: server 10.10.52.xxx not responding, still trying'
> > >
> > > messages all about? There were quite a few of them for a number of
> > > different servers in the moments leading up to the hang. Could it be a
> > > problem with the switch these clients are attached to?
> >
> > If it were a switch problem, would we see port 2049 socket backlogs with
> > netstat -tan or ss -tan? I haven't seen this at all when the problem
> > occurs. All of the sockets are idle (and usually it seems to close them
> > all except the one server that all of the slots are stuck on). tcpdump
> > shows no problems, just very slow requests rates that match the rpc/nfs
> > debugging.
>
> No retransmits that might indicate dropped packets at the switch? How
> fast are the tcp ACKs from the server being returned?

That tcpdump I sent included the ACKs, which all looked normal.
Unfortunately, we haven't seen the problem again yet. Is your "Fix an
infinite loop in call_refresh/call_refreshresult" patch possibly
related?

> > If the rpc slots are stuck full, would that cause lockd to print those
> > timeouts?
>
> Yes. That would be the only kind of event that would trigger these
> messages.

And in this case, rpcinfo -t and -u should look normal, I would assume,
unless there is a switch/network issue? Still waiting for it to occur
again to try those commands.

Simon-