NFS Server Issues with RDMA in Kernel 6.13.6

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Lucas added an attachment on Kernel.org Bugzilla:

Created attachment 307819
NFS over RDMA - Watchdog detected hard LOCKUP on cpu

Hi

I am experiencing stability and performance issues when using NFS (kernel 6.13.6) over rdma protocol.
All what I need to do to trigger the issue is connect client and start read / write operations.

Fastest way to reproduce issue is by running fio job:

fio --name=test --rw=randwrite --bs=4k --filename=/mnt/nfs/test.io --size=40G --direct=1 --numjobs=18 --iodepth=24 --exitall --group_reporting --ioengine=libaio --time_based --runtime=300

Dmesg says: "watchdog: Watchdog detected hard LOCKUP on cpu "

[  976.676922] watchdog: Watchdog detected hard LOCKUP on cpu 182
[  976.676931] Modules linked in: xfs(E) brd(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) br_netfilter(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) xt_recent(E) null_blk(E) nvme_fabrics(E) nvme(E) nvme_core(E) overlay(E) ip6t_REJECT(E) nf_reject_ipv6(E) xt_hl(E) ip6t_rt(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_LOG(E) nf_log_syslog(E) xt_multiport(E) nft_limit(E) xt_limit(E) xt_addrtype(E) xt_tcpudp(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_compat(E) nf_tables(E) binfmt_misc(E) nfnetlink(E) nls_iso8859_1(E) rpcrdma(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) sunrpc(E) rdma_ucm(E) ib_iser(E) libiscsi(E) ipmi_ssif(E) scsi_transport_iscsi(E) rdma_cm(E) ib_umad(E) kvm_amd(E) ib_ipoib(E) iw_cm(E) kvm(E) ib_cm(E) rapl(E) bridge(E) stp(E) llc(E) joydev(E) input_leds(E) ccp(E) ee1004(E) ptdma(E) k10temp(E) acpi_ipmi(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E)
  mac_hid(E) sch_fq_codel(E) bonding(E)
[  976.677035]  efi_pstore(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) blake2b_generic(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) mlx5_ib(E) ib_uverbs(E) ib_core(E) ast(E) drm_client_lib(E) drm_shmem_helper(E) hid_generic(E) mlx5_core(E) mpt3sas(E) rndis_host(E) igb(E) raid_class(E) drm_kms_helper(E) usbhid(E) cdc_ether(E) dca(E) mlxfw(E) crct10dif_pclmul(E) crc32_pclmul(E) polyval_clmulni(E) polyval_generic(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) sha1_ssse3(E) usbnet(E) psample(E) i2c_algo_bit(E) scsi_transport_sas(E) ahci(E) drm(E) mii(E) hid(E) libahci(E) i2c_piix4(E) tls(E) i2c_smbus(E) pci_hyperv_intf(E) aesni_intel(E) crypto_simd(E) cryptd(E)
[  976.677112] CPU: 182 UID: 0 PID: 20143 Comm: nfsd Kdump: loaded Tainted: G            E      6.13.6+ #1
[  976.677118] Tainted: [E]=UNSIGNED_MODULE
[  976.677120] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS 2.8 01/26/2024
[  976.677123] RIP: 0010:native_queued_spin_lock_slowpath+0x244/0x320
[  976.677138] Code: ff ff 41 83 c6 01 41 c1 e5 10 41 c1 e6 12 45 09 ee 44 89 f0 c1 e8 10 66 41 87 44 24 02 89 c2 c1 e2 10 75 5f 31 d2 eb 02 f3 90 <41> 8b 04 24 66 85 c0 75 f5 89 c1 66 31 c9 44 39 f1 0f 84 97 00 00
[  976.677141] RSP: 0018:ffffa16f2de8b948 EFLAGS: 00000002
[  976.677145] RAX: 0000000001d00001 RBX: ffff8bfa8e937bc0 RCX: 0000000000000001
[  976.677147] RDX: ffff8bfa8bd37bc0 RSI: 0000000003340000 RDI: ffff8bfb9fffbe08
[  976.677149] RBP: ffffa16f2de8b970 R08: 0000000000000000 R09: 0000000000000000
[  976.677151] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8bfb9fffbe08
[  976.677153] R13: ffff8c3a8e037bc0 R14: 0000000002dc0000 R15: 00000000000000cc
[  976.677155] FS:  0000000000000000(0000) GS:ffff8bfa8e900000(0000) knlGS:0000000000000000
[  976.677158] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  976.677160] CR2: 00007fb9b1da66b0 CR3: 00000003e84bc000 CR4: 0000000000350ef0
[  976.677162] Call Trace:
[  976.677166]  <NMI>
[  976.677172]  ? show_regs+0x71/0x90
[  976.677182]  ? watchdog_hardlockup_check+0x1ac/0x380
[  976.677189]  ? srso_return_thunk+0x5/0x5f
[  976.677194]  ? watchdog_overflow_callback+0x69/0x80
[  976.677198]  ? __perf_event_overflow+0x153/0x450
[  976.677206]  ? srso_return_thunk+0x5/0x5f
[  976.677211]  ? perf_event_overflow+0x19/0x30
[  976.677215]  ? x86_pmu_handle_irq+0x189/0x210
[  976.677225]  ? srso_return_thunk+0x5/0x5f
[  976.677228]  ? flush_tlb_one_kernel+0xe/0x40
[  976.677234]  ? srso_return_thunk+0x5/0x5f
[  976.677237]  ? set_pte_vaddr_p4d+0x58/0x80
[  976.677244]  ? srso_return_thunk+0x5/0x5f
[  976.677247]  ? set_pte_vaddr+0x89/0xc0
[  976.677250]  ? cc_platform_has+0x30/0x40
[  976.677256]  ? srso_return_thunk+0x5/0x5f
[  976.677259]  ? native_set_fixmap+0x6b/0xa0
[  976.677262]  ? srso_return_thunk+0x5/0x5f
[  976.677265]  ? ghes_copy_tofrom_phys+0x7c/0x130
[  976.677274]  ? srso_return_thunk+0x5/0x5f
[  976.677277]  ? __ghes_peek_estatus.isra.0+0x4e/0xd0
[  976.677282]  ? amd_pmu_handle_irq+0x48/0xc0
[  976.677287]  ? perf_event_nmi_handler+0x2d/0x60
[  976.677290]  ? nmi_handle+0x67/0x190
[  976.677295]  ? default_do_nmi+0x45/0x150
[  976.677301]  ? exc_nmi+0x13e/0x1e0
[  976.677304]  ? end_repeat_nmi+0xf/0x53
[  976.677313]  ? native_queued_spin_lock_slowpath+0x244/0x320
[  976.677317]  ? native_queued_spin_lock_slowpath+0x244/0x320
[  976.677322]  ? native_queued_spin_lock_slowpath+0x244/0x320
[  976.677325]  </NMI>
[  976.677326]  <TASK>
[  976.677329]  _raw_spin_lock_irqsave+0x5c/0x80
[  976.677334]  alloc_iova+0x92/0x290
[  976.677341]  ? current_time+0x2d/0x120
[  976.677348]  alloc_iova_fast+0x1fb/0x400
[  976.677351]  ? srso_return_thunk+0x5/0x5f
[  976.677354]  ? touch_atime+0x1f/0x110
[  976.677360]  iommu_dma_alloc_iova+0xa2/0x190
[  976.677365]  iommu_dma_map_sg+0x447/0x4e0
[  976.677373]  __dma_map_sg_attrs+0x139/0x1b0
[  976.677380]  dma_map_sgtable+0x21/0x50
[  976.677386]  rdma_rw_ctx_init+0x6c/0x820 [ib_core]
[  976.677525]  ? common_perm_cond+0x4d/0x210
[  976.677532]  ? srso_return_thunk+0x5/0x5f
[  976.677538]  ? xfs_vn_getattr+0xe2/0x3c0 [xfs]
[  976.677700]  svc_rdma_rw_ctx_init+0x49/0xf0 [rpcrdma]
[  976.677725]  svc_rdma_build_writes+0xa5/0x210 [rpcrdma]
[  976.677746]  ? __pfx_svc_rdma_pagelist_to_sg+0x10/0x10 [rpcrdma]
[  976.677767]  ? svc_rdma_send_write_list+0xf4/0x290 [rpcrdma]
[  976.677790]  svc_rdma_xb_write+0x7d/0xb0 [rpcrdma]
[  976.677811]  svc_rdma_send_write_list+0x144/0x290 [rpcrdma]
[  976.677834]  ? nfsd_cache_update+0x57/0x2c0 [nfsd]
[  976.677889]  svc_rdma_sendto+0x99/0x510 [rpcrdma]
[  976.677912]  ? svcauth_unix_release+0x1e/0x80 [sunrpc]
[  976.677968]  svc_send+0x49/0x140 [sunrpc]
[  976.678013]  svc_process+0x166/0x200 [sunrpc]
[  976.678058]  svc_recv+0x8a1/0xaa0 [sunrpc]
[  976.678101]  ? __pfx_nfsd+0x10/0x10 [nfsd]
[  976.678144]  nfsd+0xa7/0x110 [nfsd]
[  976.678183]  kthread+0xe4/0x120
[  976.678188]  ? __pfx_kthread+0x10/0x10
[  976.678192]  ret_from_fork+0x46/0x70
[  976.678197]  ? __pfx_kthread+0x10/0x10
[  976.678200]  ret_from_fork_asm+0x1a/0x30
[  976.678210]  </TASK>



Full log attached.

File: dmesg-6.13.6.log (text/plain)
Size: 407.10 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307819
---
NFS over RDMA - Watchdog detected hard LOCKUP on cpu

You can reply to this message to join the discussion.
-- 
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)





[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux