On 17-Jul-24 3:12 PM, Vlastimil Babka wrote:
On 7/3/24 5:11 PM, Bharata B Rao wrote:
Many soft and hard lockups are seen with the upstream kernel when running
a bunch of tests that include FIO and the LTP filesystem tests on 10 NVMe
disks. The lockups can appear anywhere between 2 and 48 hours. Originally
this was reported on a large customer VM instance with passthrough NVMe
disks on older kernels (v5.4-based). However, similar problems were
reproduced when running the tests on bare metal with the latest upstream
kernel (v6.10-rc3). Other lockups with different signatures are seen, but
in this report only those related to the MM area are discussed. Also note
that the subsequent description pertains to the lockups seen on bare
metal with the upstream kernel (and not in the VM).
The general observation is that the problem usually surfaces when the
system's free memory goes very low and page cache/buffer consumption hits
the ceiling. Most of the time, the two contended locks are the lruvec and
inode->i_lock spinlocks.

- Could this be a scalability issue in LRU list handling and/or page
cache invalidation typical of a large system configuration?
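(A possible way to quantify which spinlocks dominate, assuming lock
contention instrumentation is available: either CONFIG_LOCK_STAT=y and
/proc/lock_stat, or a recent perf built with BPF support that provides
the lock contention subcommand. A minimal sketch:)

  # BPF-based, system-wide lock contention sampling for 10 seconds
  perf lock contention -ab sleep 10

  # or, with CONFIG_LOCK_STAT=y:
  echo 0 > /proc/lock_stat   # reset the counters before the test
  cat /proc/lock_stat        # inspect the most contended locks afterwards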
Seems to me it could be (except for that ZONE_DMA corner case) a general
scalability issue, in the sense that you tweak some part of the kernel
and the contention moves elsewhere. At least in MM we have per-node
locks, so this means 256 CPUs per lock? It used to be that there were not
that many cores/threads per physical CPU and its NUMA node, so many CPUs
also meant more NUMA nodes, and the lock contention would distribute
among them. I think you could try fakenuma to create these nodes
artificially and see if it helps for the MM part. But if the contention
moves to e.g. an inode lock, I'm not sure what to do about that then.
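(For the fakenuma part, a minimal sketch, assuming an x86_64 kernel built
with CONFIG_NUMA_EMU: the numa=fake= boot parameter splits the system
into N emulated NUMA nodes, each of which gets its own lruvec lock and
kswapd thread.)

  numa=fake=8          # append to the kernel command line and reboot
  numactl --hardware   # confirm that 8 nodes are visible after boot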
See below...
<SNIP>
3) AMD has a BIOS setting called NPS (Nodes Per Socket), using which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
the system. This was done to check whether having more kswapd threads,
each working on fewer folios per node, would make a difference. However,
here too multiple soft lockups were seen (in clear_shadow_entry(), as in
the MGLRU case). No hard lockups were observed.
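(For reference, the node layout resulting from the NPS setting, and the
per-node kswapd threads that the above reasoning relies on, can be
confirmed with standard tools; a minimal sketch for this two-socket
system:)

  numactl --hardware               # should report 8 NUMA nodes with NPS=4
  ps -e -o comm | grep '^kswapd'   # one kswapd<N> thread per node with memory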
These are some soft lockups seen in NPS4 mode:
watchdog: BUG: soft lockup - CPU#315 stuck for 11s! [kworker/315:1H:5153]
CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted
6.10.0-rc3-enbprftw #12
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:handle_softirqs+0x70/0x2f0
Call Trace:
<IRQ>
__irq_exit_rcu+0x68/0x90
irq_exit_rcu+0x12/0x20
sysvec_apic_timer_interrupt+0x85/0xb0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1f/0x30
RIP: 0010:iommu_dma_map_page+0xca/0x2c0
dma_map_page_attrs+0x20d/0x2a0
nvme_prep_rq.part.0+0x63d/0x940 [nvme]
nvme_queue_rq+0x82/0x210 [nvme]
blk_mq_dispatch_rq_list+0x289/0x6d0
__blk_mq_sched_dispatch_requests+0x142/0x5f0
blk_mq_sched_dispatch_requests+0x36/0x70
blk_mq_run_work_fn+0x73/0x90
process_one_work+0x185/0x3d0
worker_thread+0x2ce/0x3e0
kthread+0xe5/0x120
ret_from_fork+0x3d/0x60
ret_from_fork_asm+0x1a/0x30
watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820]
CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L
6.10.0-rc3-enbprftw #12
RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300
Call Trace:
<IRQ>
</IRQ>
<TASK>
_raw_spin_lock+0x2d/0x40
clear_shadow_entry+0x3d/0x100
mapping_try_invalidate+0x11b/0x1e0
invalidate_mapping_pages+0x14/0x20
invalidate_bdev+0x40/0x50
blkdev_common_ioctl+0x5f7/0xa90
blkdev_ioctl+0x10d/0x270
__x64_sys_ioctl+0x99/0xd0
x64_sys_call+0x1219/0x20d0
do_syscall_64+0x51/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fc92fc3ec6b
</TASK>
The above one (clear_shadow_entry) has since been fixed by Yu Zhao and
the fix is in the mm tree.
We had seen a couple of scenarios with zone lock contention in the page
free and slab free code paths, as reported here:
https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@xxxxxxx/
Would you have any insights on these?
Regards,
Bharata.