Re: [bug report] kernel 5.18.0-rc4 oops with invalid wait context

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Wed, 11 May 2022 16:09:52 +0000

On Wed, 2022-05-11 at 14:57 +0000, Chuck Lever III wrote:
> 
> 
> > On May 10, 2022, at 10:24 AM, Chuck Lever III
> > <chuck.lever@xxxxxxxxxx> wrote:
> > 
> > 
> > 
> > > On May 3, 2022, at 3:11 PM, dai.ngo@xxxxxxxxxx wrote:
> > > 
> > > Hi,
> > > 
> > > I just noticed there were couple of oops in my Oracle6 in
> > > nfs4.dev network.
> > > I'm not sure who ran which tests (be useful to know) that caused
> > > these oops.
> > > 
> > > Here is the stack traces:
> > > 
> > > [286123.154006] BUG: sleeping function called from invalid
> > > context at kernel/locking/rwsem.c:1585
> > > [286123.155126] in_atomic(): 1, irqs_disabled(): 0, non_block: 0,
> > > pid: 3983, name: nfsd
> > > [286123.155872] preempt_count: 1, expected: 0
> > > [286123.156443] RCU nest depth: 0, expected: 0
> > > [286123.156771] 1 lock held by nfsd/3983:
> > > [286123.156786]  #0: ffff888006762520 (&clp->cl_lock){+.+.}-
> > > {2:2}, at: nfsd4_release_lockowner+0x306/0x850 [nfsd]
> > > [286123.156949] Preemption disabled at:
> > > [286123.156961] [<0000000000000000>] 0x0
> > > [286123.157520] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded Not
> > > tainted 5.18.0-rc4+ #1
> > > [286123.157539] Hardware name: innotek GmbH
> > > VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > > [286123.157552] Call Trace:
> > > [286123.157565]  <TASK>
> > > [286123.157581]  dump_stack_lvl+0x57/0x7d
> > > [286123.157609]  __might_resched.cold+0x222/0x26b
> > > [286123.157644]  down_read_nested+0x68/0x420
> > > [286123.157671]  ? down_write_nested+0x130/0x130
> > > [286123.157686]  ? rwlock_bug.part.0+0x90/0x90
> > > [286123.157705]  ? rcu_read_lock_sched_held+0x81/0xb0
> > > [286123.157749]  xfs_file_fsync+0x3b9/0x820
> > > [286123.157776]  ? lock_downgrade+0x680/0x680
> > > [286123.157798]  ? xfs_filemap_pfn_mkwrite+0x10/0x10
> > > [286123.157823]  ? nfsd_file_put+0x100/0x100 [nfsd]
> > > [286123.157921]  nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
> > > [286123.158007]  nfsd_file_put+0x79/0x100 [nfsd]
> > > [286123.158088]  check_for_locks+0x152/0x200 [nfsd]
> > > [286123.158191]  nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
> > > [286123.158393]  ? nfsd4_locku+0xd10/0xd10 [nfsd]
> > > [286123.158488]  ? rcu_read_lock_bh_held+0x90/0x90
> > > [286123.158525]  nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
> > > [286123.158699]  nfsd_dispatch+0x4ed/0xc30 [nfsd]
> > > [286123.158974]  ? rcu_read_lock_bh_held+0x90/0x90
> > > [286123.159010]  svc_process_common+0xd8e/0x1b20 [sunrpc]
> > > [286123.159043]  ? svc_generic_rpcbind_set+0x450/0x450 [sunrpc]
> > > [286123.159043]  ? nfsd_svc+0xc50/0xc50 [nfsd]
> > > [286123.159043]  ? svc_sock_secure_port+0x27/0x40 [sunrpc]
> > > [286123.159043]  ? svc_recv+0x1100/0x2390 [sunrpc]
> > > [286123.159043]  svc_process+0x361/0x4f0 [sunrpc]
> > > [286123.159043]  nfsd+0x2d6/0x570 [nfsd]
> > > [286123.159043]  ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
> > > [286123.159043]  kthread+0x29f/0x340
> > > [286123.159043]  ? kthread_complete_and_exit+0x20/0x20
> > > [286123.159043]  ret_from_fork+0x22/0x30
> > > [286123.159043]  </TASK>
> > > [286123.187052] BUG: scheduling while atomic:
> > > nfsd/3983/0x00000002
> > > [286123.187551] INFO: lockdep is turned off.
> > > [286123.187918] Modules linked in: nfsd auth_rpcgss nfs_acl lockd
> > > grace sunrpc
> > > [286123.188527] Preemption disabled at:
> > > [286123.188535] [<0000000000000000>] 0x0
> > > [286123.189255] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded
> > > Tainted: G        W         5.18.0-rc4+ #1
> > > [286123.190233] Hardware name: innotek GmbH
> > > VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > > [286123.190910] Call Trace:
> > > [286123.190910]  <TASK>
> > > [286123.190910]  dump_stack_lvl+0x57/0x7d
> > > [286123.190910]  __schedule_bug.cold+0x133/0x143
> > > [286123.190910]  __schedule+0x16c9/0x20a0
> > > [286123.190910]  ? schedule_timeout+0x314/0x510
> > > [286123.190910]  ? __sched_text_start+0x8/0x8
> > > [286123.190910]  ? internal_add_timer+0xa4/0xe0
> > > [286123.190910]  schedule+0xd7/0x1f0
> > > [286123.190910]  schedule_timeout+0x319/0x510
> > > [286123.190910]  ? rcu_read_lock_held_common+0xe/0xa0
> > > [286123.190910]  ? usleep_range_state+0x150/0x150
> > > [286123.190910]  ? lock_acquire+0x331/0x490
> > > [286123.190910]  ? init_timer_on_stack_key+0x50/0x50
> > > [286123.190910]  ? do_raw_spin_lock+0x116/0x260
> > > [286123.190910]  ? rwlock_bug.part.0+0x90/0x90
> > > [286123.190910]  io_schedule_timeout+0x26/0x80
> > > [286123.190910]  wait_for_completion_io_timeout+0x16a/0x340
> > > [286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
> > > [286123.190910]  ? wait_for_completion+0x330/0x330
> > > [286123.190910]  submit_bio_wait+0x135/0x1d0
> > > [286123.190910]  ? submit_bio_wait_endio+0x40/0x40
> > > [286123.190910]  ? xfs_iunlock+0xd5/0x300
> > > [286123.190910]  ? bio_init+0x295/0x470
> > > [286123.190910]  blkdev_issue_flush+0x69/0x80
> > > [286123.190910]  ? blk_unregister_queue+0x1e0/0x1e0
> > > [286123.190910]  ? bio_kmalloc+0x90/0x90
> > > [286123.190910]  ? xfs_iunlock+0x1b4/0x300
> > > [286123.190910]  xfs_file_fsync+0x354/0x820
> > > [286123.190910]  ? lock_downgrade+0x680/0x680
> > > [286123.190910]  ? xfs_filemap_pfn_mkwrite+0x10/0x10
> > > [286123.190910]  ? nfsd_file_put+0x100/0x100 [nfsd]
> > > [286123.190910]  nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
> > > [286123.190910]  nfsd_file_put+0x79/0x100 [nfsd]
> > > [286123.190910]  check_for_locks+0x152/0x200 [nfsd]
> > > [286123.190910]  nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
> > > [286123.190910]  ? nfsd4_locku+0xd10/0xd10 [nfsd]
> > > [286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
> > > [286123.190910]  nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
> > > [286123.190910]  nfsd_dispatch+0x4ed/0xc30 [nfsd]
> > > [286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
> > > [286123.190910]  svc_process_common+0xd8e/0x1b20 [sunrpc]
> > > [286123.190910]  ? svc_generic_rpcbind_set+0x450/0x450 [sunrpc]
> > > [286123.190910]  ? nfsd_svc+0xc50/0xc50 [nfsd]
> > > [286123.190910]  ? svc_sock_secure_port+0x27/0x40 [sunrpc]
> > > [286123.190910]  ? svc_recv+0x1100/0x2390 [sunrpc]
> > > [286123.190910]  svc_process+0x361/0x4f0 [sunrpc]
> > > [286123.190910]  nfsd+0x2d6/0x570 [nfsd]
> > > [286123.190910]  ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
> > > [286123.190910]  kthread+0x29f/0x340
> > > [286123.190910]  ? kthread_complete_and_exit+0x20/0x20
> > > [286123.190910]  ret_from_fork+0x22/0x30
> > > [286123.190910]  </TASK>
> > > 
> > > The problem is the process tries to sleep while holding the
> > > cl_lock by nfsd4_release_lockowner. I think the problem was
> > > introduced with the filemap_flush in nfsd_file_put since
> > > 'b6669305d35a nfsd: Reduce the number of calls to
> > > nfsd_file_gc()'.
> > > The filemap_flush is later replaced by nfsd_file_flush by
> > > '6b8a94332ee4f nfsd: Fix a write performance regression'.
> > 
> > That seems plausible, given the traces above.
> > 
> > But it begs the question: why was a vfs_fsync() needed by
> > RELEASE_LOCKOWNER in this case? I've tried to reproduce the
> > problem, and even added a might_sleep() call in nfsd_file_flush()
> > but haven't been able to reproduce.
> 
> Trond, I'm assuming you switched to a synchronous flush here to
> capture writeback errors. There's no other requirement for
> waiting for the flush to complete, right?

It's because the file is unhashed, so it is about to be closed and
garbage collected as soon as the refcount goes to zero.

> 
> To enable nfsd_file_put() to be invoked in atomic contexts again,
> would the following be a reasonable short term fix:
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 2c1b027774d4..96c8d07788f4 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -304,7 +304,7 @@ nfsd_file_put(struct nfsd_file *nf)
>  {
>         set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>         if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
> -               nfsd_file_flush(nf);
> +               filemap_flush(nf->nf_file->f_mapping);
>                 nfsd_file_put_noref(nf);
>         } else {
>                 nfsd_file_put_noref(nf);
> 
> 

filemap_flush() sleeps, and so does nfsd_file_put_noref() (it can call
filp_close() + fput()), so this kind of change isn't going to work no
matter how you massage it.

Is there any reason why you need a reference to the nfs_file there?
Wouldn't holding the fp->fi_lock be sufficient, since you're already in
an atomic context? As long as one of the entries in fp->fi_fds[] is
non-zero then you should be safe.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx