----- Original Message -----
> From: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>
> To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
> Cc: "Chuck Lever" <chuck.lever@xxxxxxxxxx>, "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> Sent: Friday, July 23, 2021 4:00:12 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> Sorry, took me a while to get to this because I'm not sure what to
> suggest:
>
> On Mon, Jul 05, 2021 at 04:44:29AM -0500, Timothy Pearson wrote:
>> We've been dealing with a fairly nasty NFS-related problem off and on
>> for the past couple of years. The host is a large POWER server with
>> several external SAS arrays attached, using BTRFS for cold storage of
>> large amounts of data. The main symptom is that under heavy sustained
>> NFS write traffic using certain file types (see below) a core will
>> suddenly lock up, continually spewing a backtrace similar to the one
>> I've pasted below.
>
> Is this the very *first* backtrace you get in the log? Is there
> anything suspicious right before it? (Other than the warnings you
> mention in the next message?)

Nothing in the logs before that.

>> While this immediately halts all NFS traffic to
>> the affected client (which is never the same client as the machine
>> doing the large file transfer), the larger issue is that over the next
>> few minutes / hours the entire host will gradually degrade in
>> responsiveness until it grinds to a complete halt. Once the core
>> stall occurs we have been unable to find any way to restore the
>> machine to full functionality or avoid the degradation and eventual
>> hang short of a hard power down and restart.
>>
>> Tens of GB of compressed data in a single file seems to be fairly good at
>> triggering the problem, whereas raw disk images or other regularly patterned
>> data tend not to trigger it. The underlying hardware is functioning perfectly
>> with no problems noted, and moving the files without NFS avoids the bug.
>>
>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
>> transfer process on the client as soon as other clients start to show a
>> slowdown. This hack avoids the bug entirely provided the host is allowed to
>> catch back up prior to resuming (SIGCONT) the file transfer process. From
>> this, it seems something is going very wrong within the NFS stack under high
>> storage I/O pressure and high storage write latency (timeout?) -- it should
>> simply pause transfers while the storage subsystem catches up, not lock up a
>> core and force a host restart. Interestingly, sometimes it does exactly what
>> it is supposed to and pauses to wait for the storage subsystem, but around
>> 20% of the time it just triggers this bug and stalls a core.
>>
>> This bug has been present since at least 4.14 and is still present in the
>> latest kernel, 5.12.14.
>>
>> As the machine is in production, it is difficult to gather further information
>> or test patches; however, we would be able to apply patches to the kernel that
>> would potentially restore stability with enough advance scheduling.
>>
>> Sample backtrace below:
>>
>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
>> [16846.426202] rcu: 32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
>> [16846.426273] NMI backtrace for cpu 32
>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
>> [16846.426406] Call Trace:
>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
>> [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
>> [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
>> [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
>> [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
>> [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
>> [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
>> [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
>> [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
>> [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
>> [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
>> [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
>> [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
>> [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
>> [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
>> [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
>
> I think this has to be the rpc_delay in the -NFS4ERR_DELAY case of
> nfsd4_cb_sequence_done. I don't know what would cause a lockup there.
> Maybe the rpc_task we've passed in is corrupted somehow?
>
> We could try to add some instrumentation in that case. I don't think
> that should be a common error case. I guess the client has probably hit
> one of the NFS4ERR_DELAY cases in
> fs/nfs/callback_proc.c:nfs4_callback_sequence?
>
> There might be a way to get some more information out of this with some
> tracing that you could run in production, but I'm not sure what to
> suggest off the top of my head.

Appreciate the response regardless!
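
If instrumentation in that error arm would help, we could carry a small
diagnostic patch with enough advance scheduling. As a rough, untested sketch
of what I have in mind (written against my reading of the -NFS4ERR_DELAY arm
in fs/nfsd/nfs4callback.c:nfsd4_cb_sequence_done as of 5.12; the exact shape
of the switch may differ in other trees):

	case -NFS4ERR_DELAY:
		/*
		 * Diagnostic only: log each time the backchannel hits
		 * NFS4ERR_DELAY so the stall can be correlated with this
		 * path. Rate-limited in case we are requeueing in a loop.
		 */
		pr_warn_ratelimited("nfsd: cb_sequence got NFS4ERR_DELAY, requeueing task %p\n",
				    task);
		if (!rpc_restart_call(task))
			goto out;
		rpc_delay(task, 2 * HZ);
		return false;

For tracing we could run without a rebuild, the existing
sunrpc:rpc_task_sleep and sunrpc:rpc_task_wakeup tracepoints might at least
show whether the same rpc_task is being requeued in a tight loop before the
stall hits.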