Re: VFS rename hang

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Thu, 9 Apr 2020 20:42:12 +0000

On Thu, 2020-04-09 at 16:15 -0400, Olga Kornievskaia wrote:
> Hi Trond,
> 
> On Thu, Apr 9, 2020 at 3:16 PM Trond Myklebust <
> trondmy@xxxxxxxxxxxxxxx> wrote:
> > Hi Olga,
> > 
> > On Thu, 2020-04-09 at 13:14 -0400, Olga Kornievskaia wrote:
> > > Hi folks,
> > > 
> > > This is a rename on an NFS mount but the stack trace is not in
> > > NFS,
> > > but I'm curious if any body ran into this. Thanks.
> > > 
> > > Apr  7 13:34:53 scspr1865142002 kernel:      Not tainted 5.5.7 #1
> > > Apr  7 13:34:53 scspr1865142002 kernel: "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Apr  7 13:34:53 scspr1865142002 kernel: dt              D    0
> > > 24788
> > > 24323 0x00000080
> > > Apr  7 13:34:53 scspr1865142002 kernel: Call Trace:
> > > Apr  7 13:34:53 scspr1865142002 kernel: ? __schedule+0x2ca/0x6e0
> > > Apr  7 13:34:53 scspr1865142002 kernel: schedule+0x4a/0xb0
> > > Apr  7 13:34:53 scspr1865142002 kernel:
> > > schedule_preempt_disabled+0xa/0x10
> > > Apr  7 13:34:53 scspr1865142002 kernel:
> > > __mutex_lock.isra.11+0x233/0x4e0
> > > Apr  7 13:34:53 scspr1865142002 kernel: ?
> > > strncpy_from_user+0x47/0x160
> > > Apr  7 13:34:53 scspr1865142002 kernel: lock_rename+0x28/0xd0
> > > Apr  7 13:34:53 scspr1865142002 kernel: do_renameat2+0x1e7/0x4f0
> > > Apr  7 13:34:53 scspr1865142002 kernel:
> > > __x64_sys_rename+0x1c/0x20
> > > Apr  7 13:34:53 scspr1865142002 kernel: do_syscall_64+0x5b/0x200
> > > Apr  7 13:34:53 scspr1865142002 kernel:
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > Apr  7 13:34:53 scspr1865142002 kernel: RIP: 0033:0x7f747a10ac77
> > > Apr  7 13:34:53 scspr1865142002 kernel: Code: Bad RIP value.
> > > Apr  7 13:34:53 scspr1865142002 kernel: RSP:
> > > 002b:00007f7479f92948
> > > EFLAGS: 00000206 ORIG_RAX: 0000000000000052
> > > Apr  7 13:34:53 scspr1865142002 kernel: RAX: ffffffffffffffda
> > > RBX:
> > > 00000000023604c0 RCX: 00007f747a10ac77
> > > Apr  7 13:34:53 scspr1865142002 kernel: RDX: 0000000000000000
> > > RSI:
> > > 00007f7479f94a80 RDI: 00007f7479f96b80
> > > Apr  7 13:34:53 scspr1865142002 kernel: RBP: 0000000000000005
> > > R08:
> > > 00007f7479f9d700 R09: 00007f7479f9d700
> > > Apr  7 13:34:53 scspr1865142002 kernel: R10: 645f72656464616c
> > > R11:
> > > 0000000000000206 R12: 0000000000000001
> > > Apr  7 13:34:53 scspr1865142002 kernel: R13: 00007f7479f98c80
> > > R14:
> > > 00007f7479f9ad80 R15: 00007f7479f94a80
> > 
> > It looks like the rename locking (i.e. taking the inode mutex on
> > the
> > source and target directory) is hung. That likely indicates that
> > something else is leaking or holding onto one or more of the
> > directory
> > mutexes. Is some other thread/process perhaps also hung on the same
> > directory?
> 
> Thanks for the reply. I see several hung application processes with
> the same stack. Question now is there some NFS rename that's maybe
> hanging because server isn't replying (but I would think in that case
> I'd get a stack with a hung somewhere in NFS and there isn't one).
> This is also with nconnect so not sure if that has any effect on
> this.

This is happening on the client side, right? I wouldn't worry too much
about the fact that these stack traces end in the VFS layer. It may
still be trying to lock inodes from NFS. However, yes, I would normally
expect there to be at least some other process or thread stuck in the
NFS layer.

If this is NFSv4, is there perhaps a recovery thread stuck trying to
run? You might want to grep for nfs4_begin_drain_session in
/proc/*/stack

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx