Hard and soft lockups with FIO and LTP runs on a large system

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Many soft and hard lockups are seen with upstream kernel when running a bunch of tests that include FIO and LTP filesystem test on 10 NVME disks. The lockups can appear anywhere between 2 to 48 hours. Originally this was reported on a large customer VM instance with passthrough NVME disks on older kernels(v5.4 based). However, similar problems were reproduced when running the tests on bare metal with latest upstream kernel (v6.10-rc3). Other lockups with different signatures are seen but in this report, only those related to MM area are being discussed. Also note that the subsequent description is related to the lockups in bare metal upstream (and not VM).

The general observation is that the problem usually surfaces when the system free memory goes very low and page cache/buffer consumption hits the ceiling. Most of the times the two contended locks are lruvec and inode->i_lock spinlocks.

- Could this be a scalability issue in LRU list handling and/or page cache invalidation typical to a large system configuration?
- Are there any MM/FS tunables that could help here?

Hardware configuration
======================
Dual socket  AMD EPYC 128 Core processor (256 cores, 512 threads)
Memory: 1.5 TB
10 NVME - 3.5TB each
available: 2 nodes (0-1)
node 0 cpus: 0-127,256-383
node 0 size: 773727 MB
node 1 cpus: 128-255,384-511
node 1 size: 773966 MB

Workload details
================
Workload includes concurrent runs of FIO and a few FS tests from LTP.

FIO is run with a size of 1TB on each NVME partition with different combinations of ioengine/blocksize/mode parameters and buffered-IO. Selected FS tests from LTP are run on 256GB partitions of all NVME disks. This is the typical NVME partition layout.

nvme2n1      259:4   0   3.5T  0 disk
├─nvme2n1p1  259:6   0   256G  0 part /data_nvme2n1p1
└─nvme2n1p2  259:7   0   3.2T  0 part

Though many different runs exist in the workload, the combination that results in the problem is buffered-IO run with sync engine.

fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30  --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest

Watchdog threshold was reduced to 5s to reproduce the problem early and all CPU backtrace enabled.

Problem details and analysis
============================
One of the hard lockups which was observed and analyzed in detail is this:

kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel:  <NMI>
kernel:  ? show_regs+0x69/0x80
kernel:  ? watchdog_hardlockup_check+0x19e/0x360
<SNIP>
kernel:  ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel:  </NMI>
kernel:  <TASK>
kernel:  ? __pfx_lru_add_fn+0x10/0x10
kernel: _raw_spin_lock_irqsave+0x42/0x50
kernel: folio_lruvec_lock_irqsave+0x62/0xb0
kernel: folio_batch_move_lru+0x79/0x2a0
kernel: folio_add_lru+0x6d/0xf0
kernel: filemap_add_folio+0xba/0xe0
kernel: __filemap_get_folio+0x137/0x2e0
kernel: ext4_da_write_begin+0x12c/0x270
kernel: generic_perform_write+0xbf/0x200
kernel: ext4_buffered_write_iter+0x67/0xf0
kernel: ext4_file_write_iter+0x70/0x780
kernel: vfs_write+0x301/0x420
kernel: ksys_write+0x67/0xf0
kernel: __x64_sys_write+0x19/0x20
kernel: x64_sys_call+0x1689/0x20d0
kernel: do_syscall_64+0x6b/0x110
kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP: 0033:0x7fe21c314887

With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock acquisition. We measured the lruvec spinlock start, end and hold time(htime) using sched_clock(), along with a BUG() if the hold time was more than 10s. The below case shows that lruvec spin lock was held for ~25s.

kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime 27963324369895, htime 25889317166
kernel: ------------[ cut here ]------------
kernel: kernel BUG at include/linux/memcontrol.h:1677!
kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W 6.10.0-rc3-qspindbg #10
kernel: RIP: 0010:shrink_active_list+0x40a/0x520

And the corresponding trace point for the above:
kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate: classzone=0 order=0 nr_requested=1 nr_scanned=156946361 nr_skipped=156946360 nr_taken=1 lru=active_file

This shows that isolate_lru_folios() is scanning through a huge number (~150million) of folios (order=0) with lruvec spinlock held. This is happening because a large number of folios are being skipped to isolate a few ZONE_DMA folios. Though the number of folios to be scanned is bounded (32), there exists a genuine case where this can become unbounded, i.e. in case where folios are skipped.

Meminfo output shows that the free memory is around ~2% and page/buffer cache grows very high when the lockup happens.

MemTotal:       1584835956 kB
MemFree:        27805664 kB
MemAvailable:   1568099004 kB
Buffers:        1386120792 kB
Cached:         151894528 kB
SwapCached:        30620 kB
Active:         1043678892 kB
Inactive:       494456452 kB

Often times, the perf output at the time of the problem shows heavy contention on lruvec spin lock. Similar contention is also observed with inode i_lock (in clear_shadow_entry path)

98.98%  fio    [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
   |
    --98.96%--native_queued_spin_lock_slowpath
       |
        --98.96%--_raw_spin_lock_irqsave
                  folio_lruvec_lock_irqsave
                  |
                   --98.78%--folio_batch_move_lru
                       |
                        --98.63%--deactivate_file_folio
                                  mapping_try_invalidate
                                  invalidate_mapping_pages
                                  invalidate_bdev
                                  blkdev_common_ioctl
                                  blkdev_ioctl
                                  __x64_sys_ioctl
                                  x64_sys_call
                                  do_syscall_64
                                  entry_SYSCALL_64_after_hwframe

Some experiments tried
======================
1) When MGLRU was enabled many soft lockups were observed, no hard lockups were seen for 48 hours run. Below is once such soft lockup.

kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L 6.10.0-rc3-mglru-irqstrc #24
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel:  <IRQ>
kernel:  ? show_regs+0x69/0x80
kernel:  ? watchdog_timer_fn+0x223/0x2b0
kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
<SNIP>
kernel:  </IRQ>
kernel:  <TASK>
kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel:  ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel:  _raw_spin_lock+0x38/0x50
kernel:  clear_shadow_entry+0x3d/0x100
kernel:  ? __pfx_workingset_update_node+0x10/0x10
kernel:  mapping_try_invalidate+0x117/0x1d0
kernel:  invalidate_mapping_pages+0x10/0x20
kernel:  invalidate_bdev+0x3c/0x50
kernel:  blkdev_common_ioctl+0x5f7/0xa90
kernel:  blkdev_ioctl+0x109/0x270
kernel:  x64_sys_call+0x1215/0x20d0
kernel:  do_syscall_64+0x7e/0x130

This happens to be contending on inode i_lock spinlock.

Below preemptirqsoff trace points to preemption being disabled for more than 10s and the lock in picture is lruvec spinlock.

    # tracer: preemptirqsoff
    #
    # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
    # --------------------------------------------------------------------
# latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:512)
    #    -----------------
    #    | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
    #    -----------------
    #  => started at: deactivate_file_folio
    #  => ended at:   deactivate_file_folio
    #
    #
    #                    _------=> CPU#
    #                   / _-----=> irqs-off/BH-disabled
    #                  | / _----=> need-resched
    #                  || / _---=> hardirq/softirq
    #                  ||| / _--=> preempt-depth
    #                  |||| / _-=> migrate-disable
    #                  ||||| /     delay
    #  cmd     pid     |||||| time  |   caller
    #     \   /        ||||||  \    |    /
fio-2701523 128...1. 0us$: deactivate_file_folio <-deactivate_file_folio fio-2701523 128.N.1. 10382681us : deactivate_file_folio <-deactivate_file_folio fio-2701523 128.N.1. 10382683us : tracer_preempt_on <-deactivate_file_folio
         fio-2701523 128.N.1. 10382691us : <stack trace>
     => deactivate_file_folio
     => mapping_try_invalidate
     => invalidate_mapping_pages
     => invalidate_bdev
     => blkdev_common_ioctl
     => blkdev_ioctl
     => __x64_sys_ioctl
     => x64_sys_call
     => do_syscall_64
     => entry_SYSCALL_64_after_hwframe

2) Increased low_watermark_threshold to 10% to prevent system from entering into extremely low memory situation. Although hard lockups weren't seen, but soft lockups (clear_shadow_entry()) were still seen.

3) AMD has a BIOS setting called NPS (Nodes per socket), using which a socket can be further partitioned into smaller NUMA nodes. With NPS=4, there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in the system. This was done to check if having more number of kswapd threads working on lesser number of folios per node would make a difference. However here too, multiple soft lockups were seen (in clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed.

Any insights/suggestion into these lockups and suggestions are welcome!

Regards,
Bharata.




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux