Re: Timeout issue (similar to bugs 11061 and 11154), bisected

Trond Myklebust <trond.myklebust@xxxxxxxxxx> · Mon, 16 Feb 2009 08:04:19 -0500

On Mon, 2009-02-16 at 13:11 +0200, Arto Jantunen wrote:
> (I'm not subscribed, so please CC me on any replies)
> 
> I seem to have hit an NFS bug while upgrading a machine from Debian
> Etch to Debian Lenny. I have an NFS server running FreeBSD 7.0 RC1 and
> a bunch of clients running Linux. The ones running kernel 2.6.18 work
> perfectly, as do the ones running 2.6.24. The one I upgraded to 2.6.26
> fails. After 5-15 minutes of working normally the mount dies and I get
> the usual "nfs: server <server> not responding, still trying" in
> dmesg. The only way I have found to get the mount back is umount -f &&
> mount; waiting does not bring it back.
> 
> I have tested a number of different kernel versions, and everything
> from 2.6.25 up to last week's git tree fails in the same way.
> Bisecting tracks the problem to commit
> e06799f958bf7f9f8fae15f0c6f519953fb0257c.
> 
> I originally thought that it was the same as bug 11154, but the
> patches attached to that bug do not fix this issue.
> 
> Any thoughts, patches, ideas?

That looks like the known problem with the NFS server failing to close
connections in a timely manner. There is a fix for this in

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=69b6ba3712b796a66595cfaf0a5ab4dfe1cf964a

There is also a client-side patch that makes the client more robust
when it hits a buggy server, by causing it to do the equivalent of a
linger2 timeout. That patch is not yet merged into mainline; however,
I've attached it below, together with a follow-up patch that makes the
timeout configurable.
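
For readers unfamiliar with the term, the following is a minimal
userspace sketch of what a linger2 timeout is. It is not the attached
patch, which works inside the kernel's RPC transport and whose contents
are not shown here. It assumes a Linux host, where the TCP_LINGER2
socket option sets how long an orphaned socket may sit in FIN_WAIT2
waiting for the peer to finish closing the connection. The helper name
set_fin_wait2_timeout and the 5-second value are made up for
illustration.

```python
import socket


def set_fin_wait2_timeout(sock, seconds):
    """Set TCP_LINGER2 on a TCP socket and return the value the kernel reports.

    Hypothetical helper for illustration; Linux-only, because TCP_LINGER2
    is a Linux-specific socket option.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_LINGER2, seconds)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_LINGER2)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Ask the kernel to give up on a half-closed connection after 5 s
        # instead of waiting for the system-wide default.
        print(set_fin_wait2_timeout(s, 5))
```

The point of the analogy is that the side that closes first stops
waiting indefinitely for a peer that never completes its half of the
close, which is the failure mode described above.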

Cheers
  Trond

Attachment: linux-2.6.28-100-add_tcp_linger.dif (application/dif)
Attachment: linux-2.6.28-101-add_tcp_linger_sysctl.dif (application/dif)