Re: ext4-nfsd interaction causes sporadic hang on rwsem_down_write_failed

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Mon, 12 Nov 2018 01:39:47 -0500

On Mon, Nov 12, 2018 at 04:38:34AM +0000, Kevin Liu wrote:
> Hi,
> 
> I recently submitted an NFS bug
> (https://bugzilla.kernel.org/show_bug.cgi?id=201655) where nfsd randomly
> locks up on rwsem_down_write_failed:

> So, starting with ext4, I was wondering if you had an idea of what the
> cause might be or where the fault truly lies.

Sorry, this isn't something I've seen before.  And it's not at all
obvious from the information in Bugzilla what's causing the deadlock.

The down_read() appears to be in mm/memory.c, in
__access_remote_vm(), called from proc_pid_cmdline_read().
access_remote_vm() is apparently trying to get a shared lock on
&mm->mmap_sem.  How this would get involved with the inode_lock() is
not immediately obvious.

Things I would suggest:

1) Try running your kernel console log through
./scripts/decode_stacktrace.sh so we can be sure we've correctly
assessed where the kernel is grabbing which lock.  Enabling
CONFIG_DEBUG_INFO and CONFIG_DEBUG_INFO_REDUCED will be helpful.

2) Try turning on CONFIG_LOCKDEP and see if this reports some
potential deadlock.

3) Try using sysrq-d to find all held locks (running the resulting
kernel console output through decode_stacktrace.sh will also be
helpful).
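
Before rebuilding, it may be worth confirming which of the options
mentioned above are already set in the kernel's .config.  The following
is only a rough sketch: the helper name check_config() and the WANTED
tuple are made up for illustration, and it assumes the usual .config
syntax, where enabled options appear as "CONFIG_FOO=y" and disabled
ones as "# CONFIG_FOO is not set".

```python
import re

# Options named in the suggestions above (an illustrative list, not exhaustive).
WANTED = ("CONFIG_DEBUG_INFO", "CONFIG_DEBUG_INFO_REDUCED", "CONFIG_LOCKDEP")


def check_config(config_text, wanted=WANTED):
    """Return {option: value}, with None for options that are not set.

    Hypothetical helper: it only scans for "CONFIG_FOO=value" lines and
    treats everything else (including "# CONFIG_FOO is not set") as unset.
    """
    found = {}
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return {opt: found.get(opt) for opt in wanted}


def report(config_text, wanted=WANTED):
    """Format one 'OPTION: value' line per wanted option."""
    result = check_config(config_text, wanted)
    return [f"{opt}: {val if val is not None else 'not set'}"
            for opt, val in result.items()]
```

To use it, read the .config from the build tree (the exact path depends
on your setup) and pass its contents to report(); any option shown as
"not set" would need to be enabled before rebuilding.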

Cheers,

					- Ted