On Wed, 29 May 2024 at 17:18, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote: > > Ouch, > not what I was hoping for :-(. I’m gonna think about it myself. You would still need to disable XSK redirect from a CPUMAP XDP program. The rx_queue_index would be of the originating CPU (if we still fix the rx_queue_index but it can’t be left as is anyhow), and XSK redirect would be able to feed to the same queue from different CPUs, if I understand the architecture correctly. > > I’ll try to add locking and see the performance hit. It’s only the XSK redirect target that would need a lock? May be add a flag to the socket or umem that allows multicpu access and introduces a lock? I really wanna get this working. For my use case, it would be the optimal solution. I might even go for a private kernel if there’s no way to get it in production kernel. You would need to add a spinlock in __xsk_rcv_zc(), __xsk_rcv() (for skb mode), and xsk_flush(). Basically to all functions that access the ring, i.e. uses functions that start with xskq_ that need to be protected. To get anything like this into the standard kernel, the spinlocks would have to be disabled (since they are costly) when only one kthread/napi id is using the socket, but enabled when two or more kthreads decide to redirect to the same socket. Since sockets can be added dynamically to the XSKMAP and the presence of multiple sockets in the XSKMAP does not mean that they will all be accessed by a napi instance, this complicates the implementation a lot. I really do not want to hurt the performance of the common case that we have today. But a private patch seems doable. Another option would be to find a way to do a lock-less multi-producer / single-consumer queue using the current interface on the consumer / user-space side. Do not know if this is possible though. It was tricky enough to get the current SPSC ring correct on all architectures. A MPSC ring will be harder. Adding spin-locks would be a lot simpler. > > Thanks, > Yuval. > > > On 29 May 2024, at 16:25, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote: > > > > CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe. > > > > On Wed, 29 May 2024 at 13:07, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote: > >> > >> On Wed, 29 May 2024 at 11:08, Yuval El-Hanany <YuvalE@xxxxxxxxxxx> wrote: > >>> > >>> Hi, > >>> I got kernel panic on Kernel 5.15.117 (with the shared umem kernel patch). The scenario was simple. Instead of redirecting to a different cpu using CPUMAP, I’ve tried redirecting traffic from a CPU core to a different CPU core using XSK sockets. I used 2 cores. When each redirected to the other, all worked well. When redirecting traffic from both cores to one core, I got a kernel panic almost immediately under load. The flush_list seems to be per cpu, but somehow it’s messed up when two cores access the it? > >>> > >>> Thanks, > >>> Yuval. > >> > >> Thanks Yuval. Will take a look at it. > > > > I think the sad conclusion here is that I need to revert the patch > > that I sent to you. The problem is that the rings between user-space > > and kernel-space are single producer / single consumer and allowing > > two or more NAPI threads to access the same ring by performing a > > redirect to the same socket at the same time will break this. The > > flush is likely solvable, but the addition of entries to the same ring > > would require introducing locking (or a completely new ring type) > > which is going to be way too expensive. With the old code, checking > > that the queue_index of the queue it got the packet on being equal to > > the socket's queue_index would disallow this and not trigger this > > problem. > > > > I have no idea why I did not think about this earlier. My sincere > > apologies for that. I will scratch my head for a while more, but I am > > not hopeful that I will come up with a good solution for this. > > > >>> Two different dumps. > >>> > >>> Dump 1: > >>> > >>> 2024-05-29T01:[ 306.997548] BUG: kernel NULL pointer dereference, address: 0000000000000008 > >>> [ 307.088372] #PF: supervisor read access in kernel mode > >>> [ 307.149079] #PF: error_code(0x0000) - not-present page > >>> [ 307.209774] PGD 10f131067 P4D 10f131067 PUD 102642067 PMD 0 > >>> [ 307.276608] Oops: 0000 [#1] SMP > >>> [ 307.313712] CPU: 3 PID: 1919 Comm: sp1 Tainted: P OE 5.15.117-1-ULP-NG #1 > >>> [ 307.408219] Hardware name: Radware Radware/Default string, BIOS 5.25 (785A.015) 05/11/2023 > >>> [ 307.505779] RIP: 0010:xsk_flush+0xb/0x40 > >>> [ 307.552099] Code: a0 03 00 00 01 b8 e4 ff ff ff eb dc 49 83 85 a0 03 00 00 01 b8 e4 ff ff ff eb cd 0f 1f 40 00 48 8b 87 40 03 00 00 55 48 89 e5 <8b> 50 08 48 8b 40 10 89 10 48 8b 87 68 03 00 00 48 8b 80 80 00 00 > >>> [ 307.773694] RSP: 0000:ffffb7ae01037c80 EFLAGS: 00010287 > >>> [ 307.835401] RAX: 0000000000000000 RBX: ffffa0a88f8ab768 RCX: ffffa0a88f8abac0 > >>> [ 307.919670] RDX: ffffa0a88f8abac0 RSI: 0000000000000004 RDI: ffffa0a88f8ab768 > >>> [ 308.003922] RBP: ffffb7ae01037c80 R08: ffffa0a10b3e0000 R09: 000000000000769f > >>> [ 308.088172] R10: ffffa0a1035ca000 R11: 000000000d7f9180 R12: ffffa0a88f8ab768 > >>> [ 308.172405] R13: ffffa0a88f8ebac0 R14: ffffa0a2ef135300 R15: 0000000000000155 > >>> [ 308.256635] FS: 00007ffff7e97a80(0000) GS:ffffa0a88f8c0000(0000) knlGS:0000000000000000 > >>> [ 308.352186] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [ 308.420043] CR2: 0000000000000008 CR3: 000000010cf6e000 CR4: 0000000000750ee0 > >>> [ 308.504309] PKRU: 55555554 > >>> [ 308.536296] Call Trace: > >>> [ 308.565209] <TASK> > >>> [ 308.590026] ? show_regs+0x56/0x60 > >>> [ 308.630218] ? __die_body+0x1a/0x60 > >>> [ 308.671433] ? __die+0x25/0x30 > >>> [ 308.707529] ? page_fault_oops+0xc0/0x440 > >>> [ 308.754897] ? do_sys_poll+0x47c/0x5e0 > >>> [ 308.799188] ? do_user_addr_fault+0x319/0x6e0 > >>> [ 308.850659] ? exc_page_fault+0x6c/0x130 > >>> [ 308.896992] ? asm_exc_page_fault+0x27/0x30 > >>> [ 308.946398] ? xsk_flush+0xb/0x40 > >>> [ 308.985546] __xsk_map_flush+0x3a/0x80 > >>> [ 309.029824] xdp_do_flush+0x13/0x20 > >>> [ 309.071043] i40e_finalize_xdp_rx+0x44/0x50 [i40e] > >>> 56:07-08:00 NOTI[ 309.127653] i40e_clean_rx_irq_zc+0x132/0x500 [i40e] > >>> [ 309.202736] i40e_napi_poll+0x119/0x1270 [i40e] > >>> CE slb: real se[ 309.256285] ? xsk_sendmsg+0xf4/0x100 > >>> [ 309.315969] ? sock_sendmsg+0x2e/0x40 > >>> [ 309.359244] __napi_poll+0x23/0x160 > >>> [ 309.400482] net_rx_action+0x232/0x290 > >>> [ 309.444778] __do_softirq+0xd0/0x270 > >>> [ 309.487012] irq_exit_rcu+0x74/0xa0 > >>> [ 309.528241] common_interrupt+0x83/0xa0 > >>> [ 309.573577] asm_common_interrupt+0x27/0x40 > >>> [ 309.623017] RIP: 0033:0x5cb685 > >>> [ 309.659115] Code: b0 fb ff ff 48 8b 8d b8 fb ff ff 44 89 f8 23 44 91 08 74 09 31 d2 89 d1 e9 22 fe ff ff 83 85 88 fb ff ff 01 8b 95 88 fb ff ff <39> 15 79 31 1d 04 0f 87 c3 fd ff ff 4c 8b 9d b8 fb ff ff e9 a2 f8 > >>> [ 309.880853] RSP: 002b:00007fffffffd820 EFLAGS: 00000206 > >>> [ 309.942608] RAX: 0000000000000000 RBX: 0000000000000047 RCX: 00000000050912bc > >>> [ 310.026922] RDX: 0000000000000033 RSI: 0000000000000000 RDI: 0000000000000001 > >>> [ 310.111228] RBP: 00007fffffffdd50 R08: 0000000000000001 R09: 0000000000000008 > >>> [ 310.195548] R10: 0000000002f68e00 R11: 00000000050912bc R12: 0000000000000047 > >>> [ 310.279877] R13: 0000000000000022 R14: 0000000004f08560 R15: 00000000ff55557f > >>> [ 310.364168] </TASK> > >>> [ 310.390014] Modules linked in: uio i40e ec(PO) evdev ncs_acpi(O) bonding sr_mod cdrom usb_storage dwmac_intel stmmac i2c_i801 pcs_xpcs phylink i2c_smbus [last unloaded: i40e] > >>> rvice r5, IP 4.4[ 310.573779] CR2: 0000000000000008 > >>> [ 310.629359] ---[ end trace 78cf9f96f477759d ]--- > >>> .4.5:80 operational, affected virt 2.2.2.10 > >>> > >>> Dump 2: > >>> > >>> 2024-05-29T01:56:07-08:00 NOTICE slb: real serve[ 311.250217] RIP: 0010:xsk_flush+0xb/0x40 > >>> [ 311.300334] Code: a0 03 00 00 01 b8 e4 ff ff ff eb dc 49 83 85 a0 03 00 00 01 b8 e4 ff ff ff eb cd 0f 1f 40 00 48 8b 87 40 03 00 00 55 48 89 e5 <8b> 50 08 48 8b 40 10 89 10 48 8b 87 68 03 00 00 48 8b 80 80 00 00 > >>> [ 311.522070] RSP: 0000:ffffb7ae01037c80 EFLAGS: 00010287 > >>> [ 311.583811] RAX: 0000000000000000 RBX: ffffa0a88f8ab768 RCX: ffffa0a88f8abac0 > >>> [ 311.668118] RDX: ffffa0a88f8abac0 RSI: 0000000000000004 RDI: ffffa0a88f8ab768 > >>> [ 311.752405] RBP: ffffb7ae01037c80 R08: ffffa0a10b3e0000 R09: 000000000000769f > >>> [ 311.836716] R10: ffffa0a1035ca000 R11: 000000000d7f9180 R12: ffffa0a88f8ab768 > >>> [ 311.921040] R13: ffffa0a88f8ebac0 R14: ffffa0a2ef135300 R15: 0000000000000155 > >>> [ 312.005358] FS: 00007ffff7e97a80(0000) GS:ffffa0a88f8c0000(0000) knlGS:0000000000000000 > >>> [ 312.100975] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [ 312.168886] CR2: 0000000000000008 CR3: 000000010cf6e000 CR4: 0000000000750ee0 > >>> [ 312.253232] PKRU: 55555554 > >>> [ 312.285245] Kernel panic - not syncing: Fatal exception in interrupt > >>> [ 312.360323] Kernel Offset: 0x3a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > >>> [ 312.570739] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- > >>> > >>> > >>> > >>> 2024-05-29T01:00:47-08:00 NOTICE slb: real service r16, IP 4.4.4.16:80 operational, affected [ 390.602501] BUG: kernel NULL pointer dereference, address: 0000000000000008 > >>> [ 390.699875] #PF: supervisor read access in kernel mode > >>> [ 390.760620] #PF: error_code(0x0000) - not-present page > >>> [ 390.821361] PGD 10d319067 P4D 10d319067 PUD 10df94067 PMD 0 > >>> [ 390.888266] Oops: 0000 [#1] SMP > >>> [ 390.925409] CPU: 3 PID: 1922 Comm: sp1 Tainted: P OE 5.15.117-1-ULP-NG #1 > >>> [ 391.020040] Hardware name: Radware Radware/Default string, BIOS 5.25 (785A.015) 05/11/2023 > >>> [ 391.117759] RIP: 0010:xsk_flush+0xb/0x40 > >>> [ 391.164147] Code: a0 03 00 00 01 b8 e4 ff ff ff eb dc 49 83 85 a0 03 00 00 01 b8 e4 ff ff ff eb cd 0f 1f 40 00 48 8b 87 40 03 00 00 55 48 89 e5 <8b> 50 08 48 8b 40 10 89 10 48 8b 87 68 03 00 00 48 8b 80 80 00 00 > >>> [ 391.386008] RSP: 0018:ffffae5c80244d48 EFLAGS: 00010287 > >>> [ 391.447793] RAX: 0000000000000000 RBX: ffff8bb80f8ab768 RCX: ffff8bb80f8abac0 > >>> [ 391.532161] RDX: ffff8bb80f8abac0 RSI: 0000000000000004 RDI: ffff8bb80f8ab768 > >>> [ 391.616518] RBP: ffffae5c80244d48 R08: ffff8bb0845f0000 R09: 000000000000967a > >>> [ 391.700883] R10: ffff8bb0833fdc00 R11: 000000000ca5a180 R12: ffff8bb80f8ab768 > >>> [ 391.785233] R13: ffff8bb80f8ebac0 R14: ffff8bb081571b00 R15: 0000000000000132 > >>> [ 391.869615] FS: 00007ffff7e97a80(0000) GS:ffff8bb80f8c0000(0000) knlGS:0000000000000000 > >>> [ 391.965273] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [ 392.033224] CR2: 0000000000000008 CR3: 000000010d84b000 CR4: 0000000000750ee0 > >>> [ 392.117580] PKRU: 55555554 > >>> [ 392.149594] Call Trace: > >>> [ 392.178536] <IRQ> > >>> [ 392.202345] ? show_regs+0x56/0x60 > >>> [ 392.242571] ? __die_body+0x1a/0x60 > >>> [ 392.283806] ? __die+0x25/0x30 > >>> [ 392.319919] ? page_fault_oops+0xc0/0x440 > >>> [ 392.367315] ? i40e_msix_clean_rings+0x27/0x40 [i40e] > >>> [ 392.427035] ? do_user_addr_fault+0x319/0x6e0 > >>> [ 392.478538] ? handle_irq_event+0x41/0x60 > >>> [ 392.525934] ? exc_page_fault+0x6c/0x130 > >>> [ 392.572304] ? asm_exc_page_fault+0x27/0x30 > >>> [ 392.621771] ? xsk_flush+0xb/0x40 > >>> [ 392.660954] ? xsk_flush+0x34/0x40 > >>> [ 392.701174] __xsk_map_flush+0x3a/0x80 > >>> [ 392.745497] xdp_do_flush+0x13/0x20 > >>> [ 392.786749] i40e_finalize_xdp_rx+0x44/0x50 [i40e] > >>> virt 2.2.2.10 > >>> [ 392.843370] i40e_clean_rx_irq_zc+0x132/0x500 [i40e] > >>> [ 392.918465] i40e_napi_poll+0x119/0x1270 [i40e] > >>> > >>> 2024-05-29T01:0[ 392.972027] ? scheduler_tick+0x9f/0xd0 > >>> [ 393.033806] ? tick_sched_do_timer+0x40/0x40 > >>> [ 393.084300] __napi_poll+0x23/0x160 > >>> [ 393.125556] net_rx_action+0x232/0x290 > >>> [ 393.169869] __do_softirq+0xd0/0x270 > >>> [ 393.212141] irq_exit_rcu+0x74/0xa0 > >>> [ 393.253387] common_interrupt+0x62/0xa0 > >>> [ 393.298734] </IRQ> > >>> [ 393.323571] <TASK> > >>> [ 393.348409] asm_common_interrupt+0x27/0x40 > >>> [ 393.397871] RIP: 0010:__fget_light+0x43/0x100 > >>> [ 393.449396] Code: b0 0b 00 00 8b 08 83 f9 01 75 36 48 8b 40 20 8b 18 39 df 73 1f 89 fa 48 39 da 48 19 db 48 8b 40 08 21 df 48 8d 04 f8 48 8b 00 <48> 85 c0 74 05 85 70 44 74 02 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d > >>> [ 393.671279] RSP: 0018:ffffae5c81057a98 EFLAGS: 00000202 > >>> [ 393.733054] RAX: ffff8bb083541100 RBX: ffffffffffffffff RCX: 0000000000000001 > >>> [ 393.817413] RDX: 000000000000002f RSI: 0000000000004000 RDI: 000000000000002f > >>> [ 393.901777] RBP: ffffae5c81057ac0 R08: 000000010000002f R09: 0000000000000000 > >>> [ 393.986133] R10: ffffae5c81057f08 R11: 0000000000000000 R12: 0000000000000000 > >>> [ 394.070506] R13: ffffae5c81057b54 R14: ffffae5c81057b4c R15: ffffae5c81057b4c > >>> [ 394.154884] __fdget+0xe/0x10 > >>> [ 394.189993] do_sys_poll+0x1fd/0x5e0 > >>> [ 394.232287] ? step_into+0x11f/0x750 > >>> [ 394.274570] ? common_interrupt+0x8e/0xa0 > >>> [ 394.321987] ? common_interrupt+0x8e/0xa0 > >>> [ 394.369406] ? asm_common_interrupt+0x27/0x40 > >>> [ 394.420929] ? xsk_flush+0x34/0x40 > >>> [ 394.461166] ? xsk_tx_peek_release_desc_batch+0x24c/0x2e0 > >>> [ 394.525014] ? i40e_clean_rx_irq_zc+0x342/0x500 [i40e] > >>> [ 394.585763] ? i40e_xsk_wakeup+0xa2/0xc0 [i40e] > >>> [ 394.639332] ? xsk_xmit+0x6d/0x6c0 > >>> [ 394.679569] ? i40e_napi_poll+0xc39/0x1270 [i40e] > >>> [ 394.735202] ? xsk_sendmsg+0xf4/0x100 > >>> [ 394.778498] ? sock_sendmsg+0x2e/0x40 > >>> [ 394.821808] ? __sys_sendto+0x13a/0x170 > >>> [ 394.867166] ? __snd_timer_user_ioctl+0x9e0/0xaa0 > >>> [ 394.922812] ? net_rx_action+0x232/0x290 > >>> [ 394.969206] __x64_sys_poll+0xa0/0x130 > >>> [ 395.013534] ? __x64_sys_poll+0xa0/0x130 > >>> [ 395.059930] do_syscall_64+0x34/0xb0 > >>> [ 395.102213] entry_SYSCALL_64_after_hwframe+0x44/0xae > >>> [ 395.161948] RIP: 0033:0x7ffff76f0d47 > >>> [ 395.204222] Code: 00 00 00 5b 49 8b 45 10 5d 41 5c 41 5d 41 5e c3 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 07 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 > >>> [ 395.426119] RSP: 002b:00007fffffffe6c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000007 > >>> [ 395.515625] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff76f0d47 > >>> [ 395.599996] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007fffffffe6f8 > >>> [ 395.684371] RBP: 00007ffff7e85f80 R08: 00000000173eed80 R09: 0000000000000051 > >>> [ 395.768741] R10: 000001193c2583df R11: 0000000000000246 R12: 0000000000000000 > >>> [ 395.853108] R13: 00007ffff7e85fb0 R14: 0000000000000000 R15: 0000000000000000 > >>> [ 395.937488] </TASK> > >>> [ 395.963347] Modules linked in: uio i40e ec(PO) evdev ncs_acpi(O) bonding sr_mod cdrom usb_storage dwmac_intel stmmac i2c_i801 pcs_xpcs phylink i2c_smbus [last unloaded: i40e] > >>> 0:47-08:00 NOTIC[ 396.147236] CR2: 0000000000000008 > >>> [ 396.202839] ---[ end trace 9b309a97c006510f ]--- > >>> E slb: real server r16, IP 4.4.4.16 operational > >>> > >>> 2024-05-29T01:00:47-08:00 NOTICE slb: real [ 396.344228] RIP: 0010:xsk_flush+0xb/0x40 > >>> [ 396.406975] Code: a0 03 00 00 01 b8 e4 ff ff ff eb dc 49 83 85 a0 03 00 00 01 b8 e4 ff ff ff eb cd 0f 1f 40 00 48 8b 87 40 03 00 00 55 48 89 e5 <8b> 50 08 48 8b 40 10 89 10 48 8b 87 68 03 00 00 48 8b 80 80 00 00 > >>> [ 396.628859] RSP: 0018:ffffae5c80244d48 EFLAGS: 00010287 > >>> [ 396.690632] RAX: 0000000000000000 RBX: ffff8bb80f8ab768 RCX: ffff8bb80f8abac0 > >>> [ 396.774985] RDX: ffff8bb80f8abac0 RSI: 0000000000000004 RDI: ffff8bb80f8ab768 > >>> [ 396.859346] RBP: ffffae5c80244d48 R08: ffff8bb0845f0000 R09: 000000000000967a > >>> [ 396.943694] R10: ffff8bb0833fdc00 R11: 000000000ca5a180 R12: ffff8bb80f8ab768 > >>> [ 397.028064] R13: ffff8bb80f8ebac0 R14: ffff8bb081571b00 R15: 0000000000000132 > >>> [ 397.112433] FS: 00007ffff7e97a80(0000) GS:ffff8bb80f8c0000(0000) knlGS:0000000000000000 > >>> [ 397.208103] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [ 397.276061] CR2: 0000000000000008 CR3: 000000010d84b000 CR4: 0000000000750ee0 > >>> [ 397.360414] PKRU: 55555554 > >>> [ 397.392419] Kernel panic - not syncing: Fatal exception in interrupt > >>> [ 397.467554] Kernel Offset: 0x34600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > >>> [ 397.678387] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- > > > >