Re: [bug report] kernel 5.18.0-rc4 oops with invalid wait context

dai.ngo@xxxxxxxxxx · Wed, 11 May 2022 18:24:34 -0700

On 5/11/22 11:57 AM, Chuck Lever III wrote:

On May 11, 2022, at 1:31 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:

On Wed, 2022-05-11 at 16:27 +0000, Chuck Lever III wrote:

On May 11, 2022, at 12:09 PM, Trond Myklebust
<trondmy@xxxxxxxxxxxxxxx> wrote:

On Wed, 2022-05-11 at 14:57 +0000, Chuck Lever III wrote:

On May 10, 2022, at 10:24 AM, Chuck Lever III
<chuck.lever@xxxxxxxxxx> wrote:



On May 3, 2022, at 3:11 PM, dai.ngo@xxxxxxxxxx wrote:

Hi,

I just noticed there were couple of oops in my Oracle6 in
nfs4.dev network.
I'm not sure who ran which tests (be useful to know) that
caused
these oops.

Here is the stack traces:

[286123.154006] BUG: sleeping function called from invalid
context at kernel/locking/rwsem.c:1585
[286123.155126] in_atomic(): 1, irqs_disabled(): 0,
non_block: 0,
pid: 3983, name: nfsd
[286123.155872] preempt_count: 1, expected: 0
[286123.156443] RCU nest depth: 0, expected: 0
[286123.156771] 1 lock held by nfsd/3983:
[286123.156786]  #0: ffff888006762520 (&clp->cl_lock){+.+.}-
{2:2}, at: nfsd4_release_lockowner+0x306/0x850 [nfsd]
[286123.156949] Preemption disabled at:
[286123.156961] [<0000000000000000>] 0x0
[286123.157520] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded Not
tainted 5.18.0-rc4+ #1
[286123.157539] Hardware name: innotek GmbH
VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[286123.157552] Call Trace:
[286123.157565]  <TASK>
[286123.157581]  dump_stack_lvl+0x57/0x7d
[286123.157609]  __might_resched.cold+0x222/0x26b
[286123.157644]  down_read_nested+0x68/0x420
[286123.157671]  ? down_write_nested+0x130/0x130
[286123.157686]  ? rwlock_bug.part.0+0x90/0x90
[286123.157705]  ? rcu_read_lock_sched_held+0x81/0xb0
[286123.157749]  xfs_file_fsync+0x3b9/0x820
[286123.157776]  ? lock_downgrade+0x680/0x680
[286123.157798]  ? xfs_filemap_pfn_mkwrite+0x10/0x10
[286123.157823]  ? nfsd_file_put+0x100/0x100 [nfsd]
[286123.157921]  nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
[286123.158007]  nfsd_file_put+0x79/0x100 [nfsd]
[286123.158088]  check_for_locks+0x152/0x200 [nfsd]
[286123.158191]  nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
[286123.158393]  ? nfsd4_locku+0xd10/0xd10 [nfsd]
[286123.158488]  ? rcu_read_lock_bh_held+0x90/0x90
[286123.158525]  nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
[286123.158699]  nfsd_dispatch+0x4ed/0xc30 [nfsd]
[286123.158974]  ? rcu_read_lock_bh_held+0x90/0x90
[286123.159010]  svc_process_common+0xd8e/0x1b20 [sunrpc]
[286123.159043]  ? svc_generic_rpcbind_set+0x450/0x450
[sunrpc]
[286123.159043]  ? nfsd_svc+0xc50/0xc50 [nfsd]
[286123.159043]  ? svc_sock_secure_port+0x27/0x40 [sunrpc]
[286123.159043]  ? svc_recv+0x1100/0x2390 [sunrpc]
[286123.159043]  svc_process+0x361/0x4f0 [sunrpc]
[286123.159043]  nfsd+0x2d6/0x570 [nfsd]
[286123.159043]  ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
[286123.159043]  kthread+0x29f/0x340
[286123.159043]  ? kthread_complete_and_exit+0x20/0x20
[286123.159043]  ret_from_fork+0x22/0x30
[286123.159043]  </TASK>
[286123.187052] BUG: scheduling while atomic:
nfsd/3983/0x00000002
[286123.187551] INFO: lockdep is turned off.
[286123.187918] Modules linked in: nfsd auth_rpcgss nfs_acl
lockd
grace sunrpc
[286123.188527] Preemption disabled at:
[286123.188535] [<0000000000000000>] 0x0
[286123.189255] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded
Tainted: G        W         5.18.0-rc4+ #1
[286123.190233] Hardware name: innotek GmbH
VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[286123.190910] Call Trace:
[286123.190910]  <TASK>
[286123.190910]  dump_stack_lvl+0x57/0x7d
[286123.190910]  __schedule_bug.cold+0x133/0x143
[286123.190910]  __schedule+0x16c9/0x20a0
[286123.190910]  ? schedule_timeout+0x314/0x510
[286123.190910]  ? __sched_text_start+0x8/0x8
[286123.190910]  ? internal_add_timer+0xa4/0xe0
[286123.190910]  schedule+0xd7/0x1f0
[286123.190910]  schedule_timeout+0x319/0x510
[286123.190910]  ? rcu_read_lock_held_common+0xe/0xa0
[286123.190910]  ? usleep_range_state+0x150/0x150
[286123.190910]  ? lock_acquire+0x331/0x490
[286123.190910]  ? init_timer_on_stack_key+0x50/0x50
[286123.190910]  ? do_raw_spin_lock+0x116/0x260
[286123.190910]  ? rwlock_bug.part.0+0x90/0x90
[286123.190910]  io_schedule_timeout+0x26/0x80
[286123.190910]  wait_for_completion_io_timeout+0x16a/0x340
[286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
[286123.190910]  ? wait_for_completion+0x330/0x330
[286123.190910]  submit_bio_wait+0x135/0x1d0
[286123.190910]  ? submit_bio_wait_endio+0x40/0x40
[286123.190910]  ? xfs_iunlock+0xd5/0x300
[286123.190910]  ? bio_init+0x295/0x470
[286123.190910]  blkdev_issue_flush+0x69/0x80
[286123.190910]  ? blk_unregister_queue+0x1e0/0x1e0
[286123.190910]  ? bio_kmalloc+0x90/0x90
[286123.190910]  ? xfs_iunlock+0x1b4/0x300
[286123.190910]  xfs_file_fsync+0x354/0x820
[286123.190910]  ? lock_downgrade+0x680/0x680
[286123.190910]  ? xfs_filemap_pfn_mkwrite+0x10/0x10
[286123.190910]  ? nfsd_file_put+0x100/0x100 [nfsd]
[286123.190910]  nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
[286123.190910]  nfsd_file_put+0x79/0x100 [nfsd]
[286123.190910]  check_for_locks+0x152/0x200 [nfsd]
[286123.190910]  nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
[286123.190910]  ? nfsd4_locku+0xd10/0xd10 [nfsd]
[286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
[286123.190910]  nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
[286123.190910]  nfsd_dispatch+0x4ed/0xc30 [nfsd]
[286123.190910]  ? rcu_read_lock_bh_held+0x90/0x90
[286123.190910]  svc_process_common+0xd8e/0x1b20 [sunrpc]
[286123.190910]  ? svc_generic_rpcbind_set+0x450/0x450
[sunrpc]
[286123.190910]  ? nfsd_svc+0xc50/0xc50 [nfsd]
[286123.190910]  ? svc_sock_secure_port+0x27/0x40 [sunrpc]
[286123.190910]  ? svc_recv+0x1100/0x2390 [sunrpc]
[286123.190910]  svc_process+0x361/0x4f0 [sunrpc]
[286123.190910]  nfsd+0x2d6/0x570 [nfsd]
[286123.190910]  ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
[286123.190910]  kthread+0x29f/0x340
[286123.190910]  ? kthread_complete_and_exit+0x20/0x20
[286123.190910]  ret_from_fork+0x22/0x30
[286123.190910]  </TASK>

The problem is the process tries to sleep while holding the
cl_lock by nfsd4_release_lockowner. I think the problem was
introduced with the filemap_flush in nfsd_file_put since
'b6669305d35a nfsd: Reduce the number of calls to
nfsd_file_gc()'.
The filemap_flush is later replaced by nfsd_file_flush by
'6b8a94332ee4f nfsd: Fix a write performance regression'.
That seems plausible, given the traces above.

But it begs the question: why was a vfs_fsync() needed by
RELEASE_LOCKOWNER in this case? I've tried to reproduce the
problem, and even added a might_sleep() call in
nfsd_file_flush()
but haven't been able to reproduce.
Trond, I'm assuming you switched to a synchronous flush here to
capture writeback errors. There's no other requirement for
waiting for the flush to complete, right?
It's because the file is unhashed, so it is about to be closed and
garbage collected as soon as the refcount goes to zero.

To enable nfsd_file_put() to be invoked in atomic contexts again,
would the following be a reasonable short term fix:

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 2c1b027774d4..96c8d07788f4 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -304,7 +304,7 @@ nfsd_file_put(struct nfsd_file *nf)
  {
         set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
         if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
-               nfsd_file_flush(nf);
+               filemap_flush(nf->nf_file->f_mapping);
                 nfsd_file_put_noref(nf);
         } else {
                 nfsd_file_put_noref(nf);


filemap_flush() sleeps, and so does nfsd_file_put_noref() (it can
call
filp_close() + fput()), so this kind of change isn't going to work
no
matter how you massage it.
Er. Wouldn't that mean we would have seen "sleep while spinlock is
held" BUGs since nfsd4_release_lockowner() was added? It's done
at least an fput() while holding clp->cl_lock since it was added,
I think.

There is nothing magical about using WB_SYNC_NONE as far as the
writeback code is concerned. write_cache_pages() will still happily
call lock_page() and sleep on that lock if it is contended. The
writepage/writepages code will happily try to allocate memory if
necessary.

The only difference is that it won't sleep waiting for the PG_writeback
bit.

So, no, you can't safely call filemap_flush() from a spin locked
context, and
yes, the bug has been there from day 1. It was not introduced by me.

Also, as I said, nfsd_file_put_noref() is not safe to call from a spin
locked context. Again, this was not introduced any time recently.
OK. I'm trying to figure out how bad the problem is and how
far back to apply the fix.

I agree that the arrangement of the code path means
nfsd4_release_lockowner() has always called fput() or
filemap_flush() while cl_lock was held.

But again, I'm not aware of recent instances of this particular
splat. So I'm wondering if a recent change has made this issue
easier to hit. We might not be able to answer that until we
find out how the bake-a-thon testers managed to trigger the
issue on Dai's server.

I'm able to reproduce this problem with unmodified 5.18.0-rc6
by enhancing the pynfs st_releaselockowner.py to remove the
file before sending the RELEASE_LOCKOWNER.

Here is the diff of the test:

[root@nfsvmf25 pynfs]# git diff nfs4.0/servertests/st_releaselockowner.py
diff --git a/nfs4.0/servertests/st_releaselockowner.py b/nfs4.0/servertests/st_releaselockowner.py
index ccd10fff3ffa..a32f8ed70634 100644
--- a/nfs4.0/servertests/st_releaselockowner.py
+++ b/nfs4.0/servertests/st_releaselockowner.py
@@ -19,8 +19,12 @@ def testFile(t, env):
     res = c.unlock_file(1, fh, res.lockid)
     check(res)
     
+    ops = c.use_obj(c.homedir) + [op.remove(t.word())]
+    res = c.compound(ops)
+    check(res)
 
     # Release lockowner
     owner = lock_owner4(c.clientid, b"lockowner_RLOWN1")
     res = c.compound([op.release_lockowner(owner)])
     check(res)
[root@nfsvmf25 pynfs]#

Here is the stack traces of the problem:

May 11 18:11:54 localhost kernel: BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1513
May 11 18:11:54 localhost kernel: in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4234, name: nfsd
May 11 18:11:54 localhost kernel: preempt_count: 1, expected: 0
May 11 18:11:54 localhost kernel: RCU nest depth: 0, expected: 0
May 11 18:11:54 localhost kernel: INFO: lockdep is turned off.
May 11 18:11:54 localhost kernel: Preemption disabled at:
May 11 18:11:54 localhost kernel: [<0000000000000000>] 0x0
May 11 18:11:54 localhost kernel: CPU: 0 PID: 4234 Comm: nfsd Kdump: loaded Tainted: G        W         5.18.0-rc6 #1
May 11 18:11:54 localhost kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
May 11 18:11:54 localhost kernel: Call Trace:
May 11 18:11:54 localhost kernel: <TASK>
May 11 18:11:54 localhost kernel: dump_stack_lvl+0x57/0x7d
May 11 18:11:54 localhost kernel: __might_resched.cold+0x222/0x26b
May 11 18:11:54 localhost kernel: down_write+0x61/0x130
May 11 18:11:54 localhost kernel: ? down_write_killable_nested+0x150/0x150
May 11 18:11:54 localhost kernel: btrfs_inode_lock+0x38/0x70
May 11 18:11:54 localhost kernel: btrfs_sync_file+0x291/0x1020
May 11 18:11:54 localhost kernel: ? rcu_read_lock_bh_held+0x90/0x90
May 11 18:11:54 localhost kernel: ? lock_acquire+0x2f1/0x490
May 11 18:11:54 localhost kernel: ? start_ordered_ops.constprop.0+0xe0/0xe0
May 11 18:11:54 localhost kernel: ? find_any_file+0x64/0x100 [nfsd]
May 11 18:11:54 localhost kernel: ? nfsd_file_put+0x100/0x100 [nfsd]
May 11 18:11:54 localhost kernel: nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
May 11 18:11:54 localhost kernel: nfsd_file_put+0x79/0x100 [nfsd]
May 11 18:11:54 localhost kernel: check_for_locks+0x152/0x200 [nfsd]
May 11 18:11:54 localhost kernel: nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
May 11 18:11:54 localhost kernel: ? lock_downgrade+0x680/0x680
May 11 18:11:54 localhost kernel: ? nfsd4_locku+0xd10/0xd10 [nfsd]
May 11 18:11:54 localhost kernel: ? rcu_read_lock_bh_held+0x90/0x90
May 11 18:11:54 localhost kernel: nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
May 11 18:11:54 localhost kernel: ? rcu_read_lock_held_common+0xe/0xa0
May 11 18:11:54 localhost kernel: nfsd_dispatch+0x4ed/0xc30 [nfsd]
May 11 18:11:54 localhost kernel: ? svc_reserve+0xb5/0x130 [sunrpc]
May 11 18:11:54 localhost kernel: svc_process_common+0xd8e/0x1b20 [sunrpc]
May 11 18:11:54 localhost kernel: ? svc_generic_rpcbind_set+0x450/0x450 [sunrpc]
May 11 18:11:54 localhost kernel: ? nfsd_svc+0xc50/0xc50 [nfsd]
May 11 18:11:54 localhost kernel: ? svc_sock_secure_port+0x36/0x40 [sunrpc]
May 11 18:11:54 localhost kernel: ? svc_recv+0x1100/0x2390 [sunrpc]
May 11 18:11:54 localhost kernel: svc_process+0x361/0x4f0 [sunrpc]
May 11 18:11:54 localhost kernel: nfsd+0x2d6/0x570 [nfsd]
May 11 18:11:54 localhost kernel: ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
May 11 18:11:54 localhost kernel: kthread+0x29f/0x340
May 11 18:11:54 localhost kernel: ? kthread_complete_and_exit+0x20/0x20
May 11 18:11:54 localhost kernel: ret_from_fork+0x22/0x30
May 11 18:11:54 localhost kernel: </TASK>

-Dai



Is there any reason why you need a reference to the nfs_file there?
Wouldn't holding the fp->fi_lock be sufficient, since you're
already in
an atomic context? As long as one of the entries in fp->fi_fds[] is
non-zero then you should be safe.
Sure, check_for_locks() seems to be the only problematic caller.
Other callers appear to be careful to call nfsd_file_put() only
after releasing spinlocks.

I would like a fix that can be backported without fuss. I was
thinking changing check_for_locks() might get a little too
hairy for that, but I'll check it out.
Notably: check_for_locks() needs to drop fi_lock before taking
flc_lock because the OPEN path can take flc_lock first, then
fi_lock, via nfsd_break_deleg_cb(). Holding a reference to the
nfsd_file guarantees that the inode won't go away while
check_for_locks() examines the flc_posix list without holding
fi_lock.

So my first take on this was we need nfsd4_release_lockowner()
to drop cl_lock before check_for_locks() is called.


--
Chuck Lever