nfs client hangs after lazy unmount

Pete Holland <pholland27@xxxxxxxxx> · Tue, 6 Dec 2011 11:36:51 -0800

Hello, I'm trying to debug a problem we are seeing in our systems and
was hoping to get some insight if possible.
We operate a fairly large (~500) cluster of Linux VMs running a
modified version of OpenWRT 10.03 (Linux kernel version 2.6.32.27).
These machines run Samba (version 3.5.11), which shares files from an
NFS (over TCP) filesystem.  The SMB daemons (and backing mounts) are
brought up and down continuously throughout the day, and in order to
make sure the take-down is timely, we kill -TERM the parent SMB
process and lazily unmount (umount -l) the NFS share.  Most of the
time, this works just fine.  On occasion (usually about once or twice
per day) we end up with an SMB process that is stuck in (as best as I
can tell) nfs_getattr.  At this point the CPU shows 100% iowait.  Our
only recourse appears to be rebooting the system when this happens.
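
For anyone wanting to check for the same symptom, here is a rough
sketch (not something we run in production) that lists tasks in
uninterruptible sleep (state D) together with the kernel wait channel
from /proc/<pid>/wchan.  It assumes a Python interpreter is available
on the affected machine and that the kernel exposes wchan; the function
names are made up for illustration.

```python
import os

def parse_stat(text):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state

def read_file(path):
    """Return the stripped file contents, or '' if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

def blocked_tasks(proc="/proc"):
    """Yield (pid, comm, wchan) for tasks in uninterruptible sleep."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        stat = read_file(os.path.join(proc, entry, "stat"))
        if not stat:
            continue  # task exited between listdir() and open()
        pid, comm, state = parse_stat(stat)
        if state == "D":
            yield pid, comm, read_file(os.path.join(proc, entry, "wchan"))

if __name__ == "__main__":
    for pid, comm, wchan in blocked_tasks():
        print(pid, comm, wchan or "?")
```

A stuck smbd would be expected to show up here in state D; whether
wchan resolves to a useful symbol depends on the kernel configuration.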

Other relevant facts:
1.  I backported the -local_lock=flock patches into this kernel.
2.  The mount options of the NFS mount are
rw,noatime,noexec,fg,hard,intr,tcp,rsize=32768,wsize=32768,local_lock=flock
3.  This behavior seems like it may be new (or at least it has
definitely gotten worse) since we upgraded from the 8.09 OpenWRT
release, which was using the 2.6.25.20 kernel.
4.  nfsstat doesn't indicate any network issues with retransmits or the like.
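
Regarding item 4, the retransmit check can be sketched as follows.
This is only an illustration: it reads the client-side "rpc" line from
/proc/net/rpc/nfs, which as far as I know is where nfsstat gets its
client RPC counters (calls, then retransmits).  The function name is
hypothetical.

```python
def rpc_retrans(text):
    """Return (calls, retrans) from the 'rpc' line of /proc/net/rpc/nfs.

    Returns None if no 'rpc' line is present.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "rpc":
            return int(fields[1]), int(fields[2])
    return None

if __name__ == "__main__":
    try:
        with open("/proc/net/rpc/nfs") as f:
            counters = rpc_retrans(f.read())
    except OSError:
        counters = None  # NFS client stats not available on this host
    if counters is None:
        print("no client RPC statistics found")
    else:
        calls, retrans = counters
        print("calls=%d retrans=%d" % (calls, retrans))
```

A retransmit count that stays flat while a process is hung would be
consistent with what nfsstat reports on our systems.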

I can provide SysRq-T trace of the SMB process as well as a System.map
file of my kernel build if that helps.  I have a few of the systems
currently in this state so I may be able to perform some live
debugging.  Let me know if there is anything else I can provide.

Thank you for any help.

- Pete