Hi, so I’ve dug a bit deeper. It seems the receive queue is not maintained across CPUs and this may make the whole practice of redirecting to XSK unsafe if I read the code correctly. It also seems this persists to kernel 6.6.8. Am I missing something? The code in cpumap runs XDP programs BUT doesn’t fill the rxq (see also the TODO comment in the code). The XDP frame is converted to XDP buffer but the rxq and txq are not filled. It seems they remain as dangling pointers. When __xsk_map_redirect is called and in turn xsk_rcv the rxq is consulted both to check if the umem is shared (in the new patch) as well as decide whether zero copy is used or not. Since rxq is not valid, I’m not certain what the outcome is. Once I bypassed umem check, I no longer crash the kernel but I'm definitely not seeing traffic running. Thanks, Yuval. Here are the excerpts from the code (5.15.117): cpumap.c:212 <cpu_map_bpf_prog_run_xdp> ===== Note the comment in the code. ===== for (i = 0; i < n; i++) { struct xdp_frame *xdpf = frames[i]; u32 act; int err; rxq.dev = xdpf->dev_rx; rxq.mem = xdpf->mem; /* TODO: report queue_index to xdp_rxq_info */ xdp_convert_frame_to_buff(xdpf, &xdp); act = bpf_prog_run_xdp(rcpu->prog, &xdp); xdp.h:175 ===== rxq & txq in the xdp_buff are not filled or even reset. ===== static inline void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp) { xdp->data_hard_start = frame->data - frame->headroom - sizeof(*frame); xdp->data = frame->data; xdp->data_end = frame->data + frame->len; xdp->data_meta = frame->data - frame->metasize; xdp->frame_sz = frame->frame_sz; } xsk.c:269 ===== xsk_rcv consults rxq in xdp_buff which is not filled! ===== static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) { int err; u32 len; err = xsk_rcv_check(xs, xdp); if (err) return err; if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL) { len = xdp->data_end - xdp->data; return __xsk_rcv_zc(xs, xdp, len); } err = __xsk_rcv(xs, xdp); if (!err) xdp_return_buff(xdp); return err; } int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp) { struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list); int err; err = xsk_rcv(xs, xdp); if (err) return err; if (!xs->flush_node.prev) list_add(&xs->flush_node, flush_list); return 0; } > On 24 May 2024, at 18:49, Yuval El-Hanany <YuvalE@xxxxxxxxxxx> wrote: > > CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe. > > Hi All, > I would really appreciate some guidance here. As this issue spans both network XDP and CPU map, it’s a bit more difficult for me to follow in the kernel code. I attempted to trace the issue in the source, but it’s a bit complex to enter existing code base. > > It will be easier if you can tell me the following: > 1. Is the receive queue metadata maintained correctly through the CPU switch in CPU map? In the description of CPU map in XDP (https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap), it is mentioned that some metadata is not maintained across the CPU switch, but that would make the umem test problematic. Should I assume the receive queue is maintained? > 2. Does XSK umem user code have access to the receive queue metadata memory? It seems to be in a different pointer, but it could be that it points to the packet headroom in the umem accessible memory. > > If the receive queue is not maintained across CPU switch then I think the shared umem test should run along the lines of using some default receive queue after switching CPUs. Otherwise, it should probably drop packets with invalid receive queues. > > Thanks, > Yuval. > >> On 20 May 2024, at 18:24, Yuval El-Hanany <YuvalE@xxxxxxxxxxx> wrote: >> >> Hi, >> >> Short Version >> I’ve encountered an issue redirecting traffic to another CPU while using shared UMEM. The frame arrived at the redirected CPU with invalid receive queue. This caused a kernel crash on the patch that supports redirecting to packets to any queue with the same umem due to a test based on the receive queue. >> >> Long Version >> I’ve created a test program that redirects traffic between two interfaces (replacing the MAC address). See the code below. It starts two separate processes on two different cores. When directing traffic directly to an XSK socket all works well. I’m running on kernel 5.15.117 with Magnus patch for shared umem. >> >> The next step was an XDP program that redirects each packet to the other CPU, and the xdp CPU program redirect the packets to an XDP socket. Initially this resulted in a kernel crash. The reason is that the receive queue is not received correctly once the packet switches CPU. >> >> I’ve patched the kernel so it does sanity check and does not crash but defaults to queue 0 just for the umem test. >> >> The kernel no longer crashes but I don’t get the traffic to the XSK sockets, and I log the receive queues and they do not seem random: >> On first interface: >> 0xC1323D50 >> 0xC1343D19 >> 0xC1343D50 >> >> On second interface: >> 0xC134BD50 >> 0xC135BD50 >> >> Very few bits differences between each of the “receive queues”. I couldn’t really find something in my program with those or similar values, or in the packets so it makes it less likely it comes from something I do in userspace. Seems to suggest some flags field? The values are not always the same across reboots but pretty close. >> >> Am I doing something wrong? It seems pretty basic. Most of the code in the XDP program was just added for debugging. The only oddity perhaps is using the shared umem. >> >> Thanks, >> Yuval. >> >> I’m attaching the program and values from my "debug map”. In the debug map: >> 0. Values 0/1 - Which type of program ran (with/without CPU redirection). >> 1. Values 2XY - packet redirected from core X to core Y (1st stage). >> 2. Values 4XY - packet on core X returns action Y (1st stage). >> (These suggest that 1st stage worked correctly). >> 3. Values 1XXYY - packet receive queue XX on core YY (XX = 10 is invalid receive queue outside 2..5). >> 4. 2XXXX - packet receive queue on any core (XXXX - 4999 imeans queue greater than 5000). >> 5. Random entry - a receive queue. >> >> Debug map values without CPU redirection: >> +------+------+--------------+ >> | Port | Key | Received | >> +------+------+--------------+ >> | 0 | 1032 | 504 | >> | 0 | 1024 | 37005992 | >> | 0 | 1034 | 37544052 | >> | 0 | 0 | 74546875 | >> | 1 | 1034 | 35765961 | >> | 1 | 0 | 71492773 | >> | 1 | 1024 | 35729107 | >> | 1 | 1032 | 277 | >> +------+------+--------------+ >> >> Debug map value with CPU redirection (no other change): >> +------+------+--------------+ >> | Port | Key | Received | >> +------+------+--------------+ >> | 0 | 1 | 3400 | >> | 0 | 223 | 2947 | >> | 0 | 232 | 454 | >> | 0 | 424 | 454 | >> | 0 | 434 | 2947 | >> | 0 | 11002 | 454 | >> | 0 | 11003 | 2947 | >> | 0 | 20000 | 106 | >> | 0 | 20016 | 7 | >> | 0 | 20024 | 1058 | >> | 0 | 20582 | 2 | >> | 0 | 24999 | 2231 | >> | 0 | 3241295184 | 339 | >> | 0 | 3241426201 | 3 | >> | 0 | 3241426256 | 1891 | >> | 1 | 1 | 1013 | >> | 1 | 223 | 610 | >> | 1 | 232 | 404 | >> | 1 | 424 | 404 | >> | 1 | 434 | 610 | >> | 1 | 11002 | 404 | >> | 1 | 11003 | 610 | >> | 1 | 20000 | 17 | >> | 1 | 20024 | 309 | >> | 1 | 24999 | 689 | >> | 1 | 3241459024 | 309 | >> | 1 | 3241524560 | 381 | >> +------+------+--------------+ >> >> The XDP code (#if 0 is for the no CPU redirection): >> /* SPDX-License-Identifier: GPL-2.0 */ >> >> #include <linux/if_ether.h> >> #include <linux/bpf.h> >> #include <linux/in.h> >> >> #include <bpf/bpf_helpers.h> >> #include <bpf/bpf_endian.h> >> >> #include "xdp/parsing_helpers.h" >> >> /* Redirect to SP queues map. */ >> struct >> { >> __uint (type, BPF_MAP_TYPE_XSKMAP); >> __type (key, __u32); >> __type (value, __u32); >> __uint (max_entries, 128); >> } xsks_map SEC(".maps"); >> >> /* Redirect to CPUs map. */ >> struct >> { >> __uint (type, BPF_MAP_TYPE_CPUMAP); >> __uint (key_size, sizeof (__u32)); >> __uint (value_size, sizeof (struct bpf_cpumap_val)); >> __uint (max_entries, 32); >> } cpu_map SEC(".maps"); >> >> /* Statistics and debug maps. */ >> struct >> { >> __uint (type, BPF_MAP_TYPE_PERCPU_ARRAY); >> __type (key, __u32); >> __type (value, __u32); >> __uint (max_entries, 64); >> } xdp_stats_map SEC(".maps"); >> >> struct >> { >> __uint (type, BPF_MAP_TYPE_HASH); >> __type (key, __u32); >> __type (value, __u32); >> __uint (max_entries, 64); >> } xdp_debug_proto SEC(".maps"); >> >> static __u32 oneval = 1; >> >> int my_parse_ethhdr (struct hdr_cursor* nh, void* dataend, struct ethhdr** eth) >> { >> __u16 ethtype; >> *eth = nh->pos; >> if ((void*) &(*eth) [1] > dataend) >> { >> return (-1); >> } >> >> ethtype = (*eth)->h_proto; >> nh->pos = &(*eth) [1]; >> if (ethtype == bpf_ntohs (ETH_P_8021Q)) >> { >> struct vlan_hdr* vlan = (struct vlan_hdr*) &(*eth) [1]; >> if ((void*) &vlan [1] > dataend) >> { >> return (-1); >> } >> ethtype = vlan->h_vlan_encapsulated_proto; >> /* inc_debug_map (10000+(bpf_ntohs (vlan->h_vlan_TCI) & 0x0FFF)); */ >> /* inc_debug_map (vlan->h_vlan_TCI); */ >> nh->pos = &vlan [1]; >> } >> >> return (ethtype); >> } >> >> static void inc_debug_map (__u32 key) >> { >> #ifdef WITH_STATS >> bpf_map_update_elem (&xdp_debug_proto, >> &key, (__u32*) &oneval, BPF_NOEXIST); >> __u32* rec = bpf_map_lookup_elem (&xdp_debug_proto, (__u32*) &key); >> if (rec != NULL) >> { >> ++(*rec); >> } >> #endif >> } >> >> >> >> SEC ("xdp") >> int rdwr_xsk_prog (struct xdp_md *ctx) >> { >> enum xdp_action action; >> int index = ctx->rx_queue_index; >> void* dataend = (void *)(long)ctx->data_end; >> void* data = (void *)(long)ctx->data; >> int stage = 0; >> >> struct hdr_cursor nh; >> >> __u32 *pkt_count; >> struct ethhdr* eth; >> >> #ifdef WITH_STATS >> pkt_count = bpf_map_lookup_elem(&xdp_stats_map, &index); >> if (pkt_count != NULL) >> { >> ++*pkt_count; >> } >> #endif >> >> /* Parse Ethernet & VLAN */ >> nh.pos = data; >> int ethtype = my_parse_ethhdr (&nh, dataend, ð); >> if (ethtype != bpf_ntohs (0x0800)) >> { >> return (XDP_PASS); >> } >> >> /* Direct XSK redirect. */ >> #if 0 >> inc_debug_map (stage); >> action = bpf_redirect_map (&xsks_map, index, XDP_PASS); >> inc_debug_map (1000 + index*10 + action); >> #endif >> /* CPU map redirect. */ >> __u32 cpuid = bpf_get_smp_processor_id (); >> __u32 targetcpu = cpuid ^ 0x01; >> ++stage; >> inc_debug_map (stage); >> inc_debug_map (200 + cpuid * 10 + targetcpu); >> action = bpf_redirect_map (&cpu_map, targetcpu, XDP_PASS); >> inc_debug_map (400 + targetcpu * 10 + action); >> return (action); >> } >> >> SEC ("xdp/cpumap") >> int rdwr_cpu_prog(struct xdp_md *ctx) >> { >> __u64* dataend64 = (__u64*) (long) ctx->data_end; >> __u64* data64 = (__u64*) (long) ctx->data; >> __u32 rxqueue = ctx->rx_queue_index; >> if (rxqueue < 5000) >> { >> inc_debug_map (20000 + rxqueue); >> } >> else >> { >> inc_debug_map (rxqueue); >> inc_debug_map (24999); >> } >> >> rxqueue = (rxqueue >= 2 && rxqueue < 6) ? rxqueue : 10; >> __u32 cpuid = bpf_get_smp_processor_id (); >> cpuid = (cpuid < 12) ? cpuid : 20; >> inc_debug_map (10000 + 100*rxqueue + cpuid); >> >> return bpf_redirect_map (&xsks_map, bpf_get_smp_processor_id (), 0); >> } >> >