> On May 11, 2022, at 2:57 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> 
> 
> 
>> On May 11, 2022, at 1:31 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>> 
>> On Wed, 2022-05-11 at 16:27 +0000, Chuck Lever III wrote:
>>> 
>>> 
>>>> On May 11, 2022, at 12:09 PM, Trond Myklebust
>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Wed, 2022-05-11 at 14:57 +0000, Chuck Lever III wrote:
>>>>> 
>>>>> 
>>>>>> On May 10, 2022, at 10:24 AM, Chuck Lever III
>>>>>> <chuck.lever@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On May 3, 2022, at 3:11 PM, dai.ngo@xxxxxxxxxx wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I just noticed there were a couple of oopses on my Oracle6 in the
>>>>>>> nfs4.dev network. I'm not sure who ran which tests (it would be
>>>>>>> useful to know) that caused these oopses.
>>>>>>> 
>>>>>>> Here are the stack traces:
>>>>>>> 
>>>>>>> [286123.154006] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1585
>>>>>>> [286123.155126] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3983, name: nfsd
>>>>>>> [286123.155872] preempt_count: 1, expected: 0
>>>>>>> [286123.156443] RCU nest depth: 0, expected: 0
>>>>>>> [286123.156771] 1 lock held by nfsd/3983:
>>>>>>> [286123.156786] #0: ffff888006762520 (&clp->cl_lock){+.+.}-{2:2}, at: nfsd4_release_lockowner+0x306/0x850 [nfsd]
>>>>>>> [286123.156949] Preemption disabled at:
>>>>>>> [286123.156961] [<0000000000000000>] 0x0
>>>>>>> [286123.157520] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded Not tainted 5.18.0-rc4+ #1
>>>>>>> [286123.157539] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
>>>>>>> [286123.157552] Call Trace:
>>>>>>> [286123.157565] <TASK>
>>>>>>> [286123.157581] dump_stack_lvl+0x57/0x7d
>>>>>>> [286123.157609] __might_resched.cold+0x222/0x26b
>>>>>>> [286123.157644] down_read_nested+0x68/0x420
>>>>>>> [286123.157671] ? down_write_nested+0x130/0x130
>>>>>>> [286123.157686] ? rwlock_bug.part.0+0x90/0x90
>>>>>>> [286123.157705] ? rcu_read_lock_sched_held+0x81/0xb0
>>>>>>> [286123.157749] xfs_file_fsync+0x3b9/0x820
>>>>>>> [286123.157776] ? lock_downgrade+0x680/0x680
>>>>>>> [286123.157798] ? xfs_filemap_pfn_mkwrite+0x10/0x10
>>>>>>> [286123.157823] ? nfsd_file_put+0x100/0x100 [nfsd]
>>>>>>> [286123.157921] nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
>>>>>>> [286123.158007] nfsd_file_put+0x79/0x100 [nfsd]
>>>>>>> [286123.158088] check_for_locks+0x152/0x200 [nfsd]
>>>>>>> [286123.158191] nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
>>>>>>> [286123.158393] ? nfsd4_locku+0xd10/0xd10 [nfsd]
>>>>>>> [286123.158488] ? rcu_read_lock_bh_held+0x90/0x90
>>>>>>> [286123.158525] nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
>>>>>>> [286123.158699] nfsd_dispatch+0x4ed/0xc30 [nfsd]
>>>>>>> [286123.158974] ? rcu_read_lock_bh_held+0x90/0x90
>>>>>>> [286123.159010] svc_process_common+0xd8e/0x1b20 [sunrpc]
>>>>>>> [286123.159043] ? svc_generic_rpcbind_set+0x450/0x450 [sunrpc]
>>>>>>> [286123.159043] ? nfsd_svc+0xc50/0xc50 [nfsd]
>>>>>>> [286123.159043] ? svc_sock_secure_port+0x27/0x40 [sunrpc]
>>>>>>> [286123.159043] ? svc_recv+0x1100/0x2390 [sunrpc]
>>>>>>> [286123.159043] svc_process+0x361/0x4f0 [sunrpc]
>>>>>>> [286123.159043] nfsd+0x2d6/0x570 [nfsd]
>>>>>>> [286123.159043] ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
>>>>>>> [286123.159043] kthread+0x29f/0x340
>>>>>>> [286123.159043] ? kthread_complete_and_exit+0x20/0x20
>>>>>>> [286123.159043] ret_from_fork+0x22/0x30
>>>>>>> [286123.159043] </TASK>
>>>>>>> 
>>>>>>> [286123.187052] BUG: scheduling while atomic: nfsd/3983/0x00000002
>>>>>>> [286123.187551] INFO: lockdep is turned off.
>>>>>>> [286123.187918] Modules linked in: nfsd auth_rpcgss nfs_acl lockd grace sunrpc
>>>>>>> [286123.188527] Preemption disabled at:
>>>>>>> [286123.188535] [<0000000000000000>] 0x0
>>>>>>> [286123.189255] CPU: 1 PID: 3983 Comm: nfsd Kdump: loaded Tainted: G        W         5.18.0-rc4+ #1
>>>>>>> [286123.190233] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
>>>>>>> [286123.190910] Call Trace:
>>>>>>> [286123.190910] <TASK>
>>>>>>> [286123.190910] dump_stack_lvl+0x57/0x7d
>>>>>>> [286123.190910] __schedule_bug.cold+0x133/0x143
>>>>>>> [286123.190910] __schedule+0x16c9/0x20a0
>>>>>>> [286123.190910] ? schedule_timeout+0x314/0x510
>>>>>>> [286123.190910] ? __sched_text_start+0x8/0x8
>>>>>>> [286123.190910] ? internal_add_timer+0xa4/0xe0
>>>>>>> [286123.190910] schedule+0xd7/0x1f0
>>>>>>> [286123.190910] schedule_timeout+0x319/0x510
>>>>>>> [286123.190910] ? rcu_read_lock_held_common+0xe/0xa0
>>>>>>> [286123.190910] ? usleep_range_state+0x150/0x150
>>>>>>> [286123.190910] ? lock_acquire+0x331/0x490
>>>>>>> [286123.190910] ? init_timer_on_stack_key+0x50/0x50
>>>>>>> [286123.190910] ? do_raw_spin_lock+0x116/0x260
>>>>>>> [286123.190910] ? rwlock_bug.part.0+0x90/0x90
>>>>>>> [286123.190910] io_schedule_timeout+0x26/0x80
>>>>>>> [286123.190910] wait_for_completion_io_timeout+0x16a/0x340
>>>>>>> [286123.190910] ? rcu_read_lock_bh_held+0x90/0x90
>>>>>>> [286123.190910] ? wait_for_completion+0x330/0x330
>>>>>>> [286123.190910] submit_bio_wait+0x135/0x1d0
>>>>>>> [286123.190910] ? submit_bio_wait_endio+0x40/0x40
>>>>>>> [286123.190910] ? xfs_iunlock+0xd5/0x300
>>>>>>> [286123.190910] ? bio_init+0x295/0x470
>>>>>>> [286123.190910] blkdev_issue_flush+0x69/0x80
>>>>>>> [286123.190910] ? blk_unregister_queue+0x1e0/0x1e0
>>>>>>> [286123.190910] ? bio_kmalloc+0x90/0x90
>>>>>>> [286123.190910] ? xfs_iunlock+0x1b4/0x300
>>>>>>> [286123.190910] xfs_file_fsync+0x354/0x820
>>>>>>> [286123.190910] ? lock_downgrade+0x680/0x680
>>>>>>> [286123.190910] ? xfs_filemap_pfn_mkwrite+0x10/0x10
>>>>>>> [286123.190910] ? nfsd_file_put+0x100/0x100 [nfsd]
>>>>>>> [286123.190910] nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
>>>>>>> [286123.190910] nfsd_file_put+0x79/0x100 [nfsd]
>>>>>>> [286123.190910] check_for_locks+0x152/0x200 [nfsd]
>>>>>>> [286123.190910] nfsd4_release_lockowner+0x4cf/0x850 [nfsd]
>>>>>>> [286123.190910] ? nfsd4_locku+0xd10/0xd10 [nfsd]
>>>>>>> [286123.190910] ? rcu_read_lock_bh_held+0x90/0x90
>>>>>>> [286123.190910] nfsd4_proc_compound+0xd15/0x25a0 [nfsd]
>>>>>>> [286123.190910] nfsd_dispatch+0x4ed/0xc30 [nfsd]
>>>>>>> [286123.190910] ? rcu_read_lock_bh_held+0x90/0x90
>>>>>>> [286123.190910] svc_process_common+0xd8e/0x1b20 [sunrpc]
>>>>>>> [286123.190910] ? svc_generic_rpcbind_set+0x450/0x450 [sunrpc]
>>>>>>> [286123.190910] ? nfsd_svc+0xc50/0xc50 [nfsd]
>>>>>>> [286123.190910] ? svc_sock_secure_port+0x27/0x40 [sunrpc]
>>>>>>> [286123.190910] ? svc_recv+0x1100/0x2390 [sunrpc]
>>>>>>> [286123.190910] svc_process+0x361/0x4f0 [sunrpc]
>>>>>>> [286123.190910] nfsd+0x2d6/0x570 [nfsd]
>>>>>>> [286123.190910] ? nfsd_shutdown_threads+0x2a0/0x2a0 [nfsd]
>>>>>>> [286123.190910] kthread+0x29f/0x340
>>>>>>> [286123.190910] ? kthread_complete_and_exit+0x20/0x20
>>>>>>> [286123.190910] ret_from_fork+0x22/0x30
>>>>>>> [286123.190910] </TASK>
>>>>>>> 
>>>>>>> The problem is that the process tries to sleep while
>>>>>>> nfsd4_release_lockowner() holds the cl_lock. I think the problem
>>>>>>> was introduced with the filemap_flush() in nfsd_file_put(), added
>>>>>>> by commit b6669305d35a ('nfsd: Reduce the number of calls to
>>>>>>> nfsd_file_gc()'). The filemap_flush() was later replaced by
>>>>>>> nfsd_file_flush() in commit 6b8a94332ee4f ('nfsd: Fix a write
>>>>>>> performance regression').
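>>>>>>> 
>>>>>>> For reference, the put path in question looks roughly like this
>>>>>>> (my paraphrase of fs/nfsd/filecache.c after 6b8a94332ee4f, not a
>>>>>>> verbatim copy; nfsd_file_flush() is essentially a
>>>>>>> vfs_fsync(nf->nf_file, 1), which can sleep):
>>>>>>> 
>>>>>>> /* paraphrased from 5.18, not verbatim */
>>>>>>> void
>>>>>>> nfsd_file_put(struct nfsd_file *nf)
>>>>>>> {
>>>>>>> 	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>>>>>> 	if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
>>>>>>> 		/* unhashed: the file is about to be closed and
>>>>>>> 		 * garbage collected; both calls can sleep */
>>>>>>> 		nfsd_file_flush(nf);
>>>>>>> 		nfsd_file_put_noref(nf);
>>>>>>> 	} else {
>>>>>>> 		nfsd_file_put_noref(nf);
>>>>>>> 		if (nf->nf_file)
>>>>>>> 			nfsd_file_schedule_laundrette();
>>>>>>> 	}
>>>>>>> 	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>>>>>>> 		nfsd_file_gc();
>>>>>>> }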
>>>>>> 
>>>>>> That seems plausible, given the traces above.
>>>>>> 
>>>>>> But it raises the question: why was a vfs_fsync() needed by
>>>>>> RELEASE_LOCKOWNER in this case? I've tried to reproduce the
>>>>>> problem, and even added a might_sleep() call in
>>>>>> nfsd_file_flush(), but haven't been able to reproduce it.
>>>>> 
>>>>> Trond, I'm assuming you switched to a synchronous flush here to
>>>>> capture writeback errors. There's no other requirement for
>>>>> waiting for the flush to complete, right?
>>>> 
>>>> It's because the file is unhashed, so it is about to be closed and
>>>> garbage collected as soon as the refcount goes to zero.
>>>> 
>>>>> 
>>>>> To enable nfsd_file_put() to be invoked in atomic contexts again,
>>>>> would the following be a reasonable short-term fix:
>>>>> 
>>>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>>>> index 2c1b027774d4..96c8d07788f4 100644
>>>>> --- a/fs/nfsd/filecache.c
>>>>> +++ b/fs/nfsd/filecache.c
>>>>> @@ -304,7 +304,7 @@ nfsd_file_put(struct nfsd_file *nf)
>>>>>  {
>>>>>  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>>>>  	if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) {
>>>>> -		nfsd_file_flush(nf);
>>>>> +		filemap_flush(nf->nf_file->f_mapping);
>>>>>  		nfsd_file_put_noref(nf);
>>>>>  	} else {
>>>>>  		nfsd_file_put_noref(nf);
>>>>> 
>>>> 
>>>> filemap_flush() sleeps, and so does nfsd_file_put_noref() (it can
>>>> call filp_close() + fput()), so this kind of change isn't going to
>>>> work no matter how you massage it.
>>> 
>>> Er. Wouldn't that mean we would have seen "sleep while spinlock is
>>> held" BUGs ever since nfsd4_release_lockowner() was added? It has
>>> done at least an fput() while holding clp->cl_lock since it was
>>> added, I think.
>> 
>> 
>> There is nothing magical about using WB_SYNC_NONE as far as the
>> writeback code is concerned. write_cache_pages() will still happily
>> call lock_page() and sleep on that lock if it is contended. The
>> writepage/writepages code will happily try to allocate memory if
>> necessary.
>> 
>> The only difference is that it won't sleep waiting for the
>> PG_writeback bit.
>> 
>> So, no, you can't safely call filemap_flush() from a spin-locked
>> context, and yes, the bug has been there from day 1. It was not
>> introduced by me.
>> 
>> Also, as I said, nfsd_file_put_noref() is not safe to call from a
>> spin-locked context. Again, this was not introduced recently.
> 
> OK. I'm trying to figure out how bad the problem is and how
> far back to apply the fix.
> 
> I agree that the arrangement of the code path means
> nfsd4_release_lockowner() has always called fput() or
> filemap_flush() while cl_lock was held.
> 
> But again, I'm not aware of recent instances of this particular
> splat. So I'm wondering if a recent change has made this issue
> easier to hit. We might not be able to answer that until we
> find out how the bake-a-thon testers managed to trigger the
> issue on Dai's server.
> 
> 
>>>> Is there any reason why you need a reference to the nfsd_file
>>>> there? Wouldn't holding the fp->fi_lock be sufficient, since
>>>> you're already in an atomic context? As long as one of the
>>>> entries in fp->fi_fds[] is non-zero, you should be safe.
>>> 
>>> Sure, check_for_locks() seems to be the only problematic caller.
>>> Other callers appear to be careful to call nfsd_file_put() only
>>> after releasing spinlocks.
>>> 
>>> I would like a fix that can be backported without fuss. I was
>>> thinking changing check_for_locks() might get a little too
>>> hairy for that, but I'll check it out.
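>>> 
>>> Something along these lines is what I have in mind (untested
>>> sketch only; find_any_file_locked() would be a new helper, and
>>> note that this nests flc_lock inside fi_lock):
>>> 
>>> /* Untested sketch: return an open file without taking a
>>>  * reference; caller must hold fi_lock. */
>>> static struct nfsd_file *
>>> find_any_file_locked(struct nfs4_file *f)
>>> {
>>> 	lockdep_assert_held(&f->fi_lock);
>>> 
>>> 	if (f->fi_fds[O_RDWR])
>>> 		return f->fi_fds[O_RDWR];
>>> 	if (f->fi_fds[O_WRONLY])
>>> 		return f->fi_fds[O_WRONLY];
>>> 	if (f->fi_fds[O_RDONLY])
>>> 		return f->fi_fds[O_RDONLY];
>>> 	return NULL;
>>> }
>>> 
>>> static bool
>>> check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
>>> {
>>> 	struct file_lock_context *flctx;
>>> 	struct file_lock *fl;
>>> 	struct nfsd_file *nf;
>>> 	bool status = false;
>>> 
>>> 	/* Peek at the open file under fi_lock instead of taking
>>> 	 * a reference, so there is no nfsd_file_put() to sleep in. */
>>> 	spin_lock(&fp->fi_lock);
>>> 	nf = find_any_file_locked(fp);
>>> 	if (nf) {
>>> 		flctx = locks_inode(nf->nf_file)->i_flctx;
>>> 		if (flctx && !list_empty_careful(&flctx->flc_posix)) {
>>> 			spin_lock(&flctx->flc_lock);
>>> 			list_for_each_entry(fl, &flctx->flc_posix, fl_list) {
>>> 				if (fl->fl_owner == (fl_owner_t)lowner) {
>>> 					status = true;
>>> 					break;
>>> 				}
>>> 			}
>>> 			spin_unlock(&flctx->flc_lock);
>>> 		}
>>> 	}
>>> 	spin_unlock(&fp->fi_lock);
>>> 	return status;
>>> }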
> 
> Notably: check_for_locks() needs to drop fi_lock before taking
> flc_lock because the OPEN path can take flc_lock first, then

Bzzt. Not OPEN, but rather nfsd_file_acquire().

> fi_lock, via nfsd_break_deleg_cb(). Holding a reference to the
> nfsd_file guarantees that the inode won't go away while
> check_for_locks() examines the flc_posix list without holding
> fi_lock.
> 
> So my first take on this was that we need nfsd4_release_lockowner()
> to drop cl_lock before check_for_locks() is called.
> 
> 
> --
> Chuck Lever

--
Chuck Lever