On Mon, Aug 09, 2021 at 01:15:33PM -0400, hedrick@xxxxxxxxxxx wrote:
> There seems to be a soft lockup message on the console, but that’s all I can find.
> 
> I’m currently considering whether it’s best to move to NFS 4.0, which seems not to cause the issue, or 4.2 with delegations disabled. This is the primary server for the department. If it fails, everything fails: VMs become read-only, user jobs fail, etc.
> 
> We ran for a year before this showed up, so I’m pretty sure going to 4.0 will fix it.

I thought you also upgraded the kernel at the same time?  (What were the two kernels involved?)  So we don't know whether it's a new kernel bug, or an NFSv4.2-specific bug, or something else.

> But I have use cases for ACLs that will only work with 4.2.

NFSv4 ACLs on the Linux server are the same in 4.0 and 4.2.

> Since the problem seems to be in the callback mechanism, and as far as I can tell that’s only used for delegations, I assume turning off delegations will fix it.

It could be. Though I asked mainly as a way to help narrow down where the problem is.

--b.

> We’ve also had a history of problems with 4.2 on clients. That’s why we backed off to 4.0 initially. Clients were seeing hangs.
> 
> It’s discouraging to hear that even the most recent kernel has problems.
> 
> > On Aug 9, 2021, at 1:06 PM, Timothy Pearson <tpearson@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > 
> > Did you see anything else at all on the terminal? The inability to log in is sadly familiar; our boxes are configured to dump a trace over serial every 120 seconds or so if they lock up (/proc/sys/kernel/hung_task_timeout_secs), and I'm not sure you'd see anything past the callback messages without that active.
> > 
> > FWIW we ended up (mostly) working around the problem by moving the critical systems (which are all NFSv3) to a new server, but that's a stopgap measure as we were looking to deploy NFSv4 on a broader scale. My gut feeling is the failure occurs under heavy load where too many NFSv4 requests from a single client are pending due to a combination of storage and network saturation, but it's proven very difficult to debug -- even splitting the v3 hosts from the larger NFS server (reducing traffic + storage load) seems to have temporarily stabilized things.
> > 
> > ----- Original Message -----
> >> From: hedrick@xxxxxxxxxxx
> >> To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
> >> Cc: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>, "Chuck Lever" <chuck.lever@xxxxxxxxxx>, "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> >> Sent: Monday, August 9, 2021 11:31:25 AM
> >> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
> > 
> >> Incidentally, we’re a computer science department. We have such a variety of students and researchers that it’s impossible to know what they are all doing. Historically, if there’s a bug in anything, we’ll see it, and usually enough for it to be fatal.
> >> 
> >> Question: is backing off to 4.0 or disabling delegations likely to have more of an impact on performance?
> >> 
> >>> On Aug 9, 2021, at 12:17 PM, hedrick@xxxxxxxxxxx wrote:
> >>> 
> >>> I just found this because we’ve been dealing with hangs of our primary NFS server. This is Ubuntu 20.04, which is 5.10.
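
(A side note on the delegation question: delegations can be turned off on the server without dropping the clients back to 4.0, because knfsd only hands out delegations when file leases are enabled. A minimal sketch, assuming your kernel honors the fs.leases-enable sysctl for this; delegations that were already granted stay around until the clients return them:

    # on the NFS server: stop handing out new delegations
    echo 0 > /proc/sys/fs/leases-enable

    # or pin an individual client to 4.0 instead, via the mount option
    mount -t nfs -o vers=4.0 server:/export /mnt

Disabling delegations should quiet the recall callbacks on any v4 minor version; pinning clients to vers=4.0 additionally avoids the v4.1+ session backchannel.)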
> >>> 
> >>> Right before the hang:
> >>> 
> >>> Aug 8 21:50:46 communis.lcsr.rutgers.edu kernel: [294852.644801] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000b260cf95 xid e3faa54e
> >>> Aug 8 21:51:54 communis.lcsr.rutgers.edu kernel: [294921.252531] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000b260cf95 xid f0faa54e
> >>> 
> >>> I looked at the code, and this seems to be an NFS4.1 callback. We just started seeing the problem after upgrading most of our hosts in a way that caused them to move from NFS 4.0 to 4.2. I assume 4.2 is using the 4.1 callback. Rather than disabling delegations, we’re moving back to NFS 4.0 on the clients (except ESXi).
> >>> 
> >>> We’re using ZFS, so this isn’t just btrfs.
> >>> 
> >>> I’m afraid I don’t have any backtrace. I was going to get more information, but it happened late at night and we were unable to get into the system to gather information. Just had to reboot.
> >>> 
> >>>> On Jul 5, 2021, at 5:47 AM, Timothy Pearson <tpearson@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>> 
> >>>> Forgot to add -- sometimes, right before the core stall and backtrace, we see messages similar to the following:
> >>>> 
> >>>> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2e0c9b7a
> >>>> [16825.414070] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2f0c9b7a
> >>>> [16825.414360] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 300c9b7a
> >>>> 
> >>>> We're not sure if they are related or not.
> >>>> 
> >>>> ----- Original Message -----
> >>>>> From: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
> >>>>> To: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>, "Chuck Lever" <chuck.lever@xxxxxxxxxx>
> >>>>> Cc: "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> >>>>> Sent: Monday, July 5, 2021 4:44:29 AM
> >>>>> Subject: CPU stall, eventual host hang with BTRFS + NFS under heavy load
> >>>> 
> >>>>> We've been dealing with a fairly nasty NFS-related problem off and on for the past couple of years. The host is a large POWER server with several external SAS arrays attached, using BTRFS for cold storage of large amounts of data. The main symptom is that under heavy sustained NFS write traffic using certain file types (see below) a core will suddenly lock up, continually spewing a backtrace similar to the one I've pasted below. While this immediately halts all NFS traffic to the affected client (which is never the same client as the machine doing the large file transfer), the larger issue is that over the next few minutes / hours the entire host will gradually degrade in responsiveness until it grinds to a complete halt. Once the core stall occurs we have been unable to find any way to restore the machine to full functionality or avoid the degradation and eventual hang short of a hard power down and restart.
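
(On not being able to capture anything next time: the serial-console setup Timothy describes is just the standard hung-task knobs plus magic sysrq, none of which need a working login once the console is scrolling. A rough sketch -- the exact values are only illustrative:

    # print stack traces for tasks stuck in D state longer than ~2 minutes
    echo 120 > /proc/sys/kernel/hung_task_timeout_secs
    # enable magic sysrq, so "echo w > /proc/sysrq-trigger" (or sysrq-w
    # sent over the serial line) dumps blocked tasks on demand
    echo 1 > /proc/sys/kernel/sysrq
    # raise the console loglevel so those traces reach the serial console
    echo 7 > /proc/sys/kernel/printk

That might at least get a usable trace out before the hard reboot.)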
> >>>>> 
> >>>>> Tens of GB of compressed data in a single file seems to be fairly good at triggering the problem, whereas raw disk images or other regularly patterned data tend not to be. The underlying hardware is functioning perfectly with no problems noted, and moving the files without NFS avoids the bug.
> >>>>> 
> >>>>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file transfer process on the client as soon as other clients start to show a slowdown. This hack avoids the bug entirely provided the host is allowed to catch back up prior to resuming (SIGCONT) the file transfer process. From this, it seems something is going very wrong within the NFS stack under high storage I/O pressure and high storage write latency (timeout?) -- it should simply pause transfers while the storage subsystem catches up, not lock up a core and force a host restart. Interestingly, sometimes it does exactly what it is supposed to and pauses to wait for the storage subsystem, but around 20% of the time it just triggers this bug and stalls a core.
> >>>>> 
> >>>>> This bug has been present since at least 4.14 and is still present in the latest 5.12.14 version.
> >>>>> 
> >>>>> As the machine is in production, it is difficult to gather further information or test patches; however, with enough advance scheduling we would be able to apply patches to the kernel that would potentially restore stability.
> >>>>> 
> >>>>> Sample backtrace below:
> >>>>> 
> >>>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>>> [16846.426202] rcu: 32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> >>>>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> >>>>> [16846.426273] NMI backtrace for cpu 32
> >>>>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> >>>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> >>>>> [16846.426406] Call Trace:
> >>>>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> >>>>> [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> >>>>> [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> >>>>> [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> >>>>> [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> >>>>> [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> >>>>> [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> >>>>> [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> >>>>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> >>>>> [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> >>>>> [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> >>>>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> >>>>> [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> >>>>> [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> >>>>> [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> >>>>> [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> >>>>> [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> >>>>> [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> >>>>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
> >>>>> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> >>>>> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> >>>>> [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]
> >>>>> [16846.427687] [c000200010823c60] [c000000000170bf0] process_one_work+0x290/0x580
> >>>>> [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
> >>>>> [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
> >>>>> [16846.427865] [c000200010823e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> >>>>> [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s! [kworker/u130:25:10624]
> >>>>> [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod target_core_user uio target_core_pscsi target_core_file target_core_iblock target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
> >>>>> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy xfrm_algo mdio libphy aacraid igb raid_cl$
> >>>>> [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> >>>>> [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
> >>>>> [16873.870023] NIP: c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
> >>>>> [16873.870073] REGS: c0002000108237d0 TRAP: 0900  Not tainted (5.12.14)
> >>>>> [16873.870109] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24004842  XER: 00000000
> >>>>> [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
> >>>>> GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
> >>>>> GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
> >>>>> GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
> >>>>> GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
> >>>>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >>>>> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
> >>>>> GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
> >>>>> GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
> >>>>> [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
> >>>>> [16873.870507] LR [c0080000081a0708] rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]
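
(For reference, the SIGSTOP/SIGCONT workaround described above is simple to drive by hand from the client doing the big transfer. A rough sketch; $COPY_PID and the decision of when to pause are purely illustrative, since "other clients starting to slow down" is something only the operator can judge:

    # $COPY_PID: PID of the client-side transfer process (illustrative name)
    # park the big write so the server can drain its backlog
    kill -STOP "$COPY_PID"

    # ...once the other clients recover...
    kill -CONT "$COPY_PID"

A stopgap rather than a fix, but it matches the observation that the stall only shows up while storage writeback is saturated.)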