On 7/3/24 5:11 PM, Bharata B Rao wrote: > Many soft and hard lockups are seen with upstream kernel when running a > bunch of tests that include FIO and LTP filesystem test on 10 NVME > disks. The lockups can appear anywhere between 2 to 48 hours. Originally > this was reported on a large customer VM instance with passthrough NVME > disks on older kernels(v5.4 based). However, similar problems were > reproduced when running the tests on bare metal with latest upstream > kernel (v6.10-rc3). Other lockups with different signatures are seen but > in this report, only those related to MM area are being discussed. > Also note that the subsequent description is related to the lockups in > bare metal upstream (and not VM). > > The general observation is that the problem usually surfaces when the > system free memory goes very low and page cache/buffer consumption hits > the ceiling. Most of the times the two contended locks are lruvec and > inode->i_lock spinlocks. > > - Could this be a scalability issue in LRU list handling and/or page > cache invalidation typical to a large system configuration? Seems to me it could be (except that ZONE_DMA corner case) a general scalability issue in that you tweak some part of the kernel and the contention moves elsewhere. At least in MM we have per-node locks so this means 256 CPUs per lock? It used to be that there were not that many (cores/threads) per a physical CPU and its NUMA node, so many cpus would mean also more NUMA nodes where the locks contention would distribute among them. I think you could try fakenuma to create these nodes artificially and see if it helps for the MM part. But if the contention moves to e.g. an inode lock, I'm not sure what to do about that then. > - Are there any MM/FS tunables that could help here? > > Hardware configuration > ====================== > Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads) > Memory: 1.5 TB > 10 NVME - 3.5TB each > available: 2 nodes (0-1) > node 0 cpus: 0-127,256-383 > node 0 size: 773727 MB > node 1 cpus: 128-255,384-511 > node 1 size: 773966 MB > > Workload details > ================ > Workload includes concurrent runs of FIO and a few FS tests from LTP. > > FIO is run with a size of 1TB on each NVME partition with different > combinations of ioengine/blocksize/mode parameters and buffered-IO. > Selected FS tests from LTP are run on 256GB partitions of all NVME > disks. This is the typical NVME partition layout. > > nvme2n1 259:4 0 3.5T 0 disk > ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1 > └─nvme2n1p2 259:7 0 3.2T 0 part > > Though many different runs exist in the workload, the combination that > results in the problem is buffered-IO run with sync engine. > > fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ > -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \ > -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest > > Watchdog threshold was reduced to 5s to reproduce the problem early and > all CPU backtrace enabled. > > Problem details and analysis > ============================ > One of the hard lockups which was observed and analyzed in detail is this: > > kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284 > kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <NMI> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_hardlockup_check+0x19e/0x360 > <SNIP> > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: </NMI> > kernel: <TASK> > kernel: ? __pfx_lru_add_fn+0x10/0x10 > kernel: _raw_spin_lock_irqsave+0x42/0x50 > kernel: folio_lruvec_lock_irqsave+0x62/0xb0 > kernel: folio_batch_move_lru+0x79/0x2a0 > kernel: folio_add_lru+0x6d/0xf0 > kernel: filemap_add_folio+0xba/0xe0 > kernel: __filemap_get_folio+0x137/0x2e0 > kernel: ext4_da_write_begin+0x12c/0x270 > kernel: generic_perform_write+0xbf/0x200 > kernel: ext4_buffered_write_iter+0x67/0xf0 > kernel: ext4_file_write_iter+0x70/0x780 > kernel: vfs_write+0x301/0x420 > kernel: ksys_write+0x67/0xf0 > kernel: __x64_sys_write+0x19/0x20 > kernel: x64_sys_call+0x1689/0x20d0 > kernel: do_syscall_64+0x6b/0x110 > kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP: > 0033:0x7fe21c314887 > > With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock > acquisition. We measured the lruvec spinlock start, end and hold > time(htime) using sched_clock(), along with a BUG() if the hold time was > more than 10s. The below case shows that lruvec spin lock was held for ~25s. > > kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime > 27963324369895, htime 25889317166 > kernel: ------------[ cut here ]------------ > kernel: kernel BUG at include/linux/memcontrol.h:1677! > kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W > 6.10.0-rc3-qspindbg #10 > kernel: RIP: 0010:shrink_active_list+0x40a/0x520 > > And the corresponding trace point for the above: > kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate: > classzone=0 order=0 nr_requested=1 nr_scanned=156946361 > nr_skipped=156946360 nr_taken=1 lru=active_file > > This shows that isolate_lru_folios() is scanning through a huge number > (~150million) of folios (order=0) with lruvec spinlock held. This is > happening because a large number of folios are being skipped to isolate > a few ZONE_DMA folios. Though the number of folios to be scanned is > bounded (32), there exists a genuine case where this can become > unbounded, i.e. in case where folios are skipped. > > Meminfo output shows that the free memory is around ~2% and page/buffer > cache grows very high when the lockup happens. > > MemTotal: 1584835956 kB > MemFree: 27805664 kB > MemAvailable: 1568099004 kB > Buffers: 1386120792 kB > Cached: 151894528 kB > SwapCached: 30620 kB > Active: 1043678892 kB > Inactive: 494456452 kB > > Often times, the perf output at the time of the problem shows heavy > contention on lruvec spin lock. Similar contention is also observed with > inode i_lock (in clear_shadow_entry path) > > 98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath > | > --98.96%--native_queued_spin_lock_slowpath > | > --98.96%--_raw_spin_lock_irqsave > folio_lruvec_lock_irqsave > | > --98.78%--folio_batch_move_lru > | > --98.63%--deactivate_file_folio > mapping_try_invalidate > invalidate_mapping_pages > invalidate_bdev > blkdev_common_ioctl > blkdev_ioctl > __x64_sys_ioctl > x64_sys_call > do_syscall_64 > entry_SYSCALL_64_after_hwframe > > Some experiments tried > ====================== > 1) When MGLRU was enabled many soft lockups were observed, no hard > lockups were seen for 48 hours run. Below is once such soft lockup. > > kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] > kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L > 6.10.0-rc3-mglru-irqstrc #24 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <IRQ> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_timer_fn+0x223/0x2b0 > kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 > <SNIP> > kernel: </IRQ> > kernel: <TASK> > kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: _raw_spin_lock+0x38/0x50 > kernel: clear_shadow_entry+0x3d/0x100 > kernel: ? __pfx_workingset_update_node+0x10/0x10 > kernel: mapping_try_invalidate+0x117/0x1d0 > kernel: invalidate_mapping_pages+0x10/0x20 > kernel: invalidate_bdev+0x3c/0x50 > kernel: blkdev_common_ioctl+0x5f7/0xa90 > kernel: blkdev_ioctl+0x109/0x270 > kernel: x64_sys_call+0x1215/0x20d0 > kernel: do_syscall_64+0x7e/0x130 > > This happens to be contending on inode i_lock spinlock. > > Below preemptirqsoff trace points to preemption being disabled for more > than 10s and the lock in picture is lruvec spinlock. > > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > # -------------------------------------------------------------------- > # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > HP:0 #P:512) > # ----------------- > # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: deactivate_file_folio > # => ended at: deactivate_file_folio > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-2701523 128...1. 0us$: deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382681us : deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382683us : tracer_preempt_on > <-deactivate_file_folio > fio-2701523 128.N.1. 10382691us : <stack trace> > => deactivate_file_folio > => mapping_try_invalidate > => invalidate_mapping_pages > => invalidate_bdev > => blkdev_common_ioctl > => blkdev_ioctl > => __x64_sys_ioctl > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > 2) Increased low_watermark_threshold to 10% to prevent system from > entering into extremely low memory situation. Although hard lockups > weren't seen, but soft lockups (clear_shadow_entry()) were still seen. > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a > socket can be further partitioned into smaller NUMA nodes. With NPS=4, > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in > the system. This was done to check if having more number of kswapd > threads working on lesser number of folios per node would make a > difference. However here too, multiple soft lockups were seen (in > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. > > Any insights/suggestion into these lockups and suggestions are welcome! > > Regards, > Bharata. >