Re: [bug report] kernel 5.18.0-rc4 oops with invalid wait context

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Wed, 11 May 2022 17:31:23 +0000

On Wed, 2022-05-11 at 16:27 +0000, Chuck Lever III wrote:
> 
> 
> > On May 11, 2022, at 12:09 PM, Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Wed, 2022-05-11 at 14:57 +0000, Chuck Lever III wrote:
> > > 
> > > 
> > > > On May 10, 2022, at 10:24 AM, Chuck Lever III
> > > > <chuck.lever@xxxxxxxxxx> wrote:
> > > > 
> > > > 
> > > > 
> > > > > On May 3, 2022, at 3:11 PM, dai.ngo@xxxxxxxxxx wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I just noticed there were couple of oops in my Oracle6 in
> > > > > nfs4.dev network.
> > > > > I'm not sure who ran which tests (be useful to know) that
> > > > > caused
> > > > > these oops.
> > > > > 
> > > > > Here is the stack traces:
> > > > > 
> > > > > [286123.154006] BUG: sleeping function called from invalid
> > > > > context at kernel/locking/rwsem.c:1585
> > > > > [286123.155126] in_atomic(): 1, irqs_disabled(): 0,
> > > > > non_block: 0,
> > > > > pid: 3983, name: nfsd
> > > > > [286123.155872] preempt_count: 1, expected: 0
> > > > > [286123.156443] RCU nest depth: 0, expected: 0
> > > > > [286123.156771] 1 lock held by nfsd/3983:
> > > > > [286123.156786]  #0: ffff888006762520 (&clp->cl_lock){+.+.}-
> > > > > {2:2}, at: nfsd4_release_lockowner+0x306/0x850 [nfsd]
> > > > > [286123.156949] Preemption disabled at:
> > > > > [286123.156961] [<0000000000000000>] 0x0
> > > > > [286123.157520] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded Not
> > > > > tainted 5.18.0-rc4+ #1
> > > > > [286123.157539] Hardware name: innotek GmbH
> > > > > VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > > > > [286123.157552] Call Trace:
> > > > > [286123.157565]  <TASK>
> > > > > [286123.157581]  dump_stack_lvl+0x57/0x7d
> > > > > [286123.157609]  __might_resched.cold+0x222/0x26b
> > > > > [286123.157644]  down_read_nested+0x68/0x420
> > > > > [286123.157671]  ? down_write_nested+0x130/0x130
> > > > > [286123.157686]  ? rwlock_bug.part.0+0x90/0x90
> > > > > [286123.157705]  ? rcu_read_lock_sched_held+0x81/0xb0
> > > > > [286123.157749]  xfs_file_fsync+0x3b9/0x820
> > > > > [286123.157776]  ? lock_downgrade+0x680/0x680
> > > > > [286123.157798]  ? xfs_filemap_pfn_mkwrite+0x10/0x10
> > > > > [286123.157823]  ? nfsd_file_put+0x100/0x100 [nfsd]
> > > > > [286123.157921]  nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
> > > > > [286123.158007]  nfsd_file_put+0x79/0x100 [nfsd]
> > > > > [286123.158088]  check_for_locks+0x152/0x200 [nfsd]
> > > > > [286123.158191]  nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
> > > > > [286123.158393]  ? nfsd4_locku+0xd10/0xd10 [nfsd]
> > > > > [286123.158488]  ? rcu_read_lock_bh_held+0x90/0x90
> > > > > [286123.158525]  nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
> > > > > [286123.158699]  nfsd_dispatch+0x4ed/0xc30 [nfsd]
> > > > > [286123.158974]  ? rcu_read_lock_bh_held+0x90/0x90
> > > > > [286123.159010]  svc_process_common+0xd8e/0x1b20 [sunrpc]
> > > > > [286123.159043]  ? svc_generic_rpcbind_set+0x450/0x450
> > > > > [sunrpc]
> > > > > [286123.159043]  ? nfsd_svc+0xc50/0xc50 [nfsd]
> > > > > [286123.159043]  ? svc_sock_secure_port+0x27/0x40 [sunrpc]
> > > > > [286123.159043]  ? svc_recv+0x1100/0x2390 [sunrpc]
> > > > > [286123.159043]  svc_process+0x361/0x4f0 [sunrpc]
> > > > > [286123.159043]  nfsd+0x2d6/0x570 [nfsd]
> > > > > [286123.159043]  ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
> > > > > [286123.159043]  kthread+0x29f/0x340
> > > > > [286123.159043]  ? kthread_complete_and_exit+0x20/0x20
> > > > > [286123.159043]  ret_from_fork+0x22/0x30
> > > > > [286123.159043]  </TASK>
> > > > > [286123.187052] BUG: scheduling while atomic:
> > > > > nfsd/3983/0x00000002
> > > > > [286123.187551] INFO: lockdep is turned off.
> > > > > [286123.187918] Modules linked in: nfsd auth_rpcgss nfs_acl
> > > > > lockd
> > > > > grace sunrpc
> > > > > [286123.188527] Preemption disabled at:
> > > > > [286123.188535] [<0000000000000000>] 0x0
> > > > > [286123.189255] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded
> > > > > Tainted: G        W         5.18.0-rc4+ #1
> > > > > [286123.190233] Hardware name: innotek GmbH
> > > > > VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > > > > [286123.190910] Call Trace:
> > > > > [286123.190910]  <TASK>
> > > > > [286123.190910]  dump_stack_lvl+0x57/0x7d
> > > > > [286123.190910]  __schedule_bug.cold+0x133/0x143
> > > > > [286123.190910]  __schedule+0x16c9/0x20a0
> > > > > [286123.190910]  ? schedule_timeout+0x314/0x510
> > > > > [286123.190910]  ? __sched_text_start+0x8/0x8
> > > > > [286123.190910]  ? internal_add_timer+0xa4/0xe0
> > > > > [286123.190910]  schedule+0xd7/0x1f0
> > > > > [286123.190910]  schedule_timeout+0x319/0x510
> > > > > [286123.190910]  ? rcu_read_lock_held_common+0xe/0xa0
> > > > > [286123.190910]  ? usleep_range_state+0x150/0x150
> > > > > [286123.190910]  ? lock_acquire+0x331/0x490
> > > > > [286123.190910]  ? init_timer_on_stack_key+0x50/0x50
> > > > > [286123.190910]  ? do_raw_spin_lock+0x116/0x260
> > > > > [286123.190910]  ? rwlock_bug.part.0+0x90/0x90
> > > > > [286123.190910]  io_schedule_timeout+0x26/0x80
> > > > > [286123.190910]  wait_for_completion_io_timeout+0x16a/0x340
> > > > > [286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
> > > > > [286123.190910]  ? wait_for_completion+0x330/0x330
> > > > > [286123.190910]  submit_bio_wait+0x135/0x1d0
> > > > > [286123.190910]  ? submit_bio_wait_endio+0x40/0x40
> > > > > [286123.190910]  ? xfs_iunlock+0xd5/0x300
> > > > > [286123.190910]  ? bio_init+0x295/0x470
> > > > > [286123.190910]  blkdev_issue_flush+0x69/0x80
> > > > > [286123.190910]  ? blk_unregister_queue+0x1e0/0x1e0
> > > > > [286123.190910]  ? bio_kmalloc+0x90/0x90
> > > > > [286123.190910]  ? xfs_iunlock+0x1b4/0x300
> > > > > [286123.190910]  xfs_file_fsync+0x354/0x820
> > > > > [286123.190910]  ? lock_downgrade+0x680/0x680
> > > > > [286123.190910]  ? xfs_filemap_pfn_mkwrite+0x10/0x10
> > > > > [286123.190910]  ? nfsd_file_put+0x100/0x100 [nfsd]
> > > > > [286123.190910]  nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
> > > > > [286123.190910]  nfsd_file_put+0x79/0x100 [nfsd]
> > > > > [286123.190910]  check_for_locks+0x152/0x200 [nfsd]
> > > > > [286123.190910]  nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
> > > > > [286123.190910]  ? nfsd4_locku+0xd10/0xd10 [nfsd]
> > > > > [286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
> > > > > [286123.190910]  nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
> > > > > [286123.190910]  nfsd_dispatch+0x4ed/0xc30 [nfsd]
> > > > > [286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
> > > > > [286123.190910]  svc_process_common+0xd8e/0x1b20 [sunrpc]
> > > > > [286123.190910]  ? svc_generic_rpcbind_set+0x450/0x450
> > > > > [sunrpc]
> > > > > [286123.190910]  ? nfsd_svc+0xc50/0xc50 [nfsd]
> > > > > [286123.190910]  ? svc_sock_secure_port+0x27/0x40 [sunrpc]
> > > > > [286123.190910]  ? svc_recv+0x1100/0x2390 [sunrpc]
> > > > > [286123.190910]  svc_process+0x361/0x4f0 [sunrpc]
> > > > > [286123.190910]  nfsd+0x2d6/0x570 [nfsd]
> > > > > [286123.190910]  ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
> > > > > [286123.190910]  kthread+0x29f/0x340
> > > > > [286123.190910]  ? kthread_complete_and_exit+0x20/0x20
> > > > > [286123.190910]  ret_from_fork+0x22/0x30
> > > > > [286123.190910]  </TASK>
> > > > > 
> > > > > The problem is the process tries to sleep while holding the
> > > > > cl_lock by nfsd4_release_lockowner. I think the problem was
> > > > > introduced with the filemap_flush in nfsd_file_put since
> > > > > 'b6669305d35a nfsd: Reduce the number of calls to
> > > > > nfsd_file_gc()'.
> > > > > The filemap_flush is later replaced by nfsd_file_flush by
> > > > > '6b8a94332ee4f nfsd: Fix a write performance regression'.
> > > > 
> > > > That seems plausible, given the traces above.
> > > > 
> > > > But it begs the question: why was a vfs_fsync() needed by
> > > > RELEASE_LOCKOWNER in this case? I've tried to reproduce the
> > > > problem, and even added a might_sleep() call in
> > > > nfsd_file_flush()
> > > > but haven't been able to reproduce.
> > > 
> > > Trond, I'm assuming you switched to a synchronous flush here to
> > > capture writeback errors. There's no other requirement for
> > > waiting for the flush to complete, right?
> > 
> > It's because the file is unhashed, so it is about to be closed and
> > garbage collected as soon as the refcount goes to zero.
> > 
> > > 
> > > To enable nfsd_file_put() to be invoked in atomic contexts again,
> > > would the following be a reasonable short term fix:
> > > 
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index 2c1b027774d4..96c8d07788f4 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -304,7 +304,7 @@ nfsd_file_put(struct nfsd_file *nf)
> > >  {
> > >         set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > >         if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
> > > -               nfsd_file_flush(nf);
> > > +               filemap_flush(nf->nf_file->f_mapping);
> > >                 nfsd_file_put_noref(nf);
> > >         } else {
> > >                 nfsd_file_put_noref(nf);
> > > 
> > > 
> > 
> > filemap_flush() sleeps, and so does nfsd_file_put_noref() (it can
> > call
> > filp_close() + fput()), so this kind of change isn't going to work
> > no
> > matter how you massage it.
> 
> Er. Wouldn't that mean we would have seen "sleep while spinlock is
> held" BUGs since nfsd4_release_lockowner() was added? It's done
> at least an fput() while holding clp->cl_lock since it was added,
> I think.

There is nothing magical about using WB_SYNC_NONE as far as the
writeback code is concerned. write_cache_pages() will still happily
call lock_page() and sleep on that lock if it is contended. The
writepage/writepages code will happily try to allocate memory if
necessary.

The only difference is that it won't sleep waiting for the PG_writeback
bit.

So, no, you can't safely call filemap_flush() from a spin locked
context, and
yes, the bug has been there from day 1. It was not introduced by me.

Also, as I said, nfsd_file_put_noref() is not safe to call from a spin
locked context. Again, this was not introduced any time recently.

> 
> > Is there any reason why you need a reference to the nfs_file there?
> > Wouldn't holding the fp->fi_lock be sufficient, since you're
> > already in
> > an atomic context? As long as one of the entries in fp->fi_fds[] is
> > non-zero then you should be safe.
> 
> Sure, check_for_locks() seems to be the only problematic caller.
> Other callers appear to be careful to call nfsd_file_put() only
> after releasing spinlocks.
> 
> I would like a fix that can be backported without fuss. I was
> thinking changing check_for_locks() might get a little too
> hairy for that, but I'll check it out.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx