Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

Marcin Wanat <private@xxxxxxxxxxxxxx> · Mon, 27 May 2024 10:22:02 +0200

On 22.05.2024 12:13, Marcin Wanat wrote:
On 22.05.2024 07:37, Zhaoyang Huang wrote:
On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@xxxxxxxxxxxxxx> 
wrote:

On 21.05.2024 03:00, Zhaoyang Huang wrote:
On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang 
<huangzhaoyang@xxxxxxxxx> wrote:

On Tue, May 21, 2024 at 3:42 AM Marcin Wanat 
<private@xxxxxxxxxxxxxx> wrote:

On 15.04.2024 03:50, Zhaoyang Huang wrote:
I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT 
affected.
However, with long-term kernels 6.1.XX and 6.6.XX,
(tested at least 10 different versions), this lockup always appears
after 2-30 days, similar to the report in the original thread.
The more load (for example, copying a lot of local files while
serving 20Gbps traffic), the higher the chance that the bug will 
appear.

I haven't been able to reproduce this during synthetic tests,
but it always occurs in production on 6.1.X and 6.6.X within 2-30 
days.
If anyone can provide a patch, I can test it on multiple machines
over the next few days.
Could you please try this one which could be applied on 6.6 
directly. Thank you!
URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1- 
zhaoyang.huang@xxxxxxxxxx/

Unfortunately, I am unable to cleanly apply this patch against the
latest 6.6.31
Please try below one which works on my v6.6 based android. Thank you
for your test in advance :D

mm/huge_memory.c | 22 ++++++++++++++--------
  1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c

I have compiled 6.6.31 with this patch and will test it on multiple 
machines over the next 30 days. I will provide an update after 30 days 
if everything is fine or sooner if any of the hosts experience the same 
soft lockup again.

First server with 6.6.31 and this patch hang today. Soft lockup changed 
to hard lockup:

[26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
[26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit 
ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit 
nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables 
nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency 
intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200 
intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi 
intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si 
drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal 
ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr 
drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul 
crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio 
i40e libata megaraid_sas dca ghash_clmulni_intel wmi
[26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G 
     W          6.6.31.el9 #3
[26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS 
V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
[26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0 
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3 
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX: 
0000000000000000
[26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9ad6c6f67050
[26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09: 
0000000000000067
[26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000046
[26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15: 
ffffe1138aa98000
[26887.389707] FS:  0000000000000000(0000) GS:ffff9ade20340000(0000) 
knlGS:0000000000000000
[26887.389708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4: 
00000000007706e0
[26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[26887.389713] PKRU: 55555554
[26887.389714] Call Trace:
[26887.389717]  <NMI>
[26887.389720]  ? watchdog_hardlockup_check+0xac/0x150
[26887.389725]  ? __perf_event_overflow+0x102/0x1d0
[26887.389729]  ? handle_pmi_common+0x189/0x3e0
[26887.389735]  ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389738]  ? flush_tlb_one_kernel+0xa/0x20
[26887.389742]  ? native_set_fixmap+0x65/0x80
[26887.389745]  ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389751]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389755]  ? intel_pmu_handle_irq+0x10b/0x230
[26887.389756]  ? perf_event_nmi_handler+0x28/0x50
[26887.389759]  ? nmi_handle+0x58/0x150
[26887.389764]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389768]  ? default_do_nmi+0x6b/0x170
[26887.389770]  ? exc_nmi+0x12c/0x1a0
[26887.389772]  ? end_repeat_nmi+0x16/0x1f
[26887.389777]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389780]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389784]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389787]  </NMI>
[26887.389788]  <TASK>
[26887.389789]  __raw_spin_lock_irqsave+0x3d/0x50
[26887.389793]  folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389798]  __page_cache_release+0x68/0x230
[26887.389801]  ? remove_migration_ptes+0x5c/0x80
[26887.389807]  __folio_put+0x24/0x60
[26887.389808]  __split_huge_page+0x368/0x520
[26887.389812]  split_huge_page_to_list+0x4b3/0x570
[26887.389816]  deferred_split_scan+0x1c8/0x290
[26887.389819]  do_shrink_slab+0x12f/0x2d0
[26887.389824]  shrink_slab_memcg+0x133/0x1d0
[26887.389829]  shrink_node_memcgs+0x18e/0x1d0
[26887.389832]  shrink_node+0xa7/0x370
[26887.389836]  balance_pgdat+0x332/0x6f0
[26887.389842]  kswapd+0xf0/0x190
[26887.389845]  ? balance_pgdat+0x6f0/0x6f0
[26887.389848]  kthread+0xee/0x120
[26887.389851]  ? kthread_complete_and_exit+0x20/0x20
[26887.389853]  ret_from_fork+0x2d/0x50
[26887.389857]  ? kthread_complete_and_exit+0x20/0x20
[26887.389859]  ret_from_fork_asm+0x11/0x20
[26887.389864]  </TASK>
[26887.389865] Kernel panic - not syncing: Hard LOCKUP
[26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G 
     W          6.6.31.el9 #3
[26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS 
V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
[26887.389870] Call Trace:
[26887.389871]  <NMI>
[26887.389872]  dump_stack_lvl+0x44/0x60
[26887.389877]  panic+0x241/0x330
[26887.389881]  nmi_panic+0x2f/0x40
[26887.389883]  watchdog_hardlockup_check+0x119/0x150
[26887.389886]  __perf_event_overflow+0x102/0x1d0
[26887.389889]  handle_pmi_common+0x189/0x3e0
[26887.389893]  ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389896]  ? flush_tlb_one_kernel+0xa/0x20
[26887.389899]  ? native_set_fixmap+0x65/0x80
[26887.389902]  ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389906]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389909]  intel_pmu_handle_irq+0x10b/0x230
[26887.389911]  perf_event_nmi_handler+0x28/0x50
[26887.389913]  nmi_handle+0x58/0x150
[26887.389916]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389920]  default_do_nmi+0x6b/0x170
[26887.389922]  exc_nmi+0x12c/0x1a0
[26887.389923]  end_repeat_nmi+0x16/0x1f
[26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0 
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3 
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX: 
0000000000000000
[26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9ad6c6f67050
[26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09: 
0000000000000067
[26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000046
[26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15: 
ffffe1138aa98000
[26887.389940]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389943]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389946]  </NMI>
[26887.389947]  <TASK>
[26887.389947]  __raw_spin_lock_irqsave+0x3d/0x50
[26887.389950]  folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389953]  __page_cache_release+0x68/0x230
[26887.389955]  ? remove_migration_ptes+0x5c/0x80
[26887.389958]  __folio_put+0x24/0x60
[26887.389960]  __split_huge_page+0x368/0x520
[26887.389963]  split_huge_page_to_list+0x4b3/0x570
[26887.389967]  deferred_split_scan+0x1c8/0x290
[26887.389971]  do_shrink_slab+0x12f/0x2d0
[26887.389974]  shrink_slab_memcg+0x133/0x1d0
[26887.389978]  shrink_node_memcgs+0x18e/0x1d0
[26887.389982]  shrink_node+0xa7/0x370
[26887.389985]  balance_pgdat+0x332/0x6f0
[26887.389991]  kswapd+0xf0/0x190
[26887.389994]  ? balance_pgdat+0x6f0/0x6f0
[26887.389997]  kthread+0xee/0x120
[26887.389998]  ? kthread_complete_and_exit+0x20/0x20
[26887.390000]  ret_from_fork+0x2d/0x50
[26887.390003]  ? kthread_complete_and_exit+0x20/0x20
[26887.390004]  ret_from_fork_asm+0x11/0x20
[26887.390009]  </TASK>