On Mon, 27 May 2024 at 10:11, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote: > > On Mon, 27 May 2024 at 09:34, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote: > > > > Yeah, > > as far down as I dug in the code, it seems it’s the only one that’s used. The other members are reg_state & napi_id. I’m unfamiliar with their exact purposes, but doesn’t sound like they’re necessary down the line. One other concern is txq IMO. It is not initialised to any value in xdp_buff. While it may not be relevant considering it’s a received packet, it seems wrong to keep a dangling pointer both in terms of security (leaking previous values) and robustness. Maybe just set it to NULL? > > > > I’ve been racking my brain as to what is the accepted method to pass the receive queue. Passing it on the packet headroom seemed like a reasonable solution, but it would mean memory access to the packet memory on packet arrival on the second CPU which may be harmful to performance. The queue itself is just a pointer queue so can’t piggyback on it unless using a different kind of queue. Would be great to hear how it should be done. > > Let me get into this piece of code and scratch my head for a while and > I will get back to you. Do not have a solution on top of my head. Will > loop in Maciej too. Yuval, Could you please send me the patch that you are using? I cannot remember what I sent you :-). > /Magnus > > > Thanks, > > Yuval. > > > > > On 27 May 2024, at 8:48, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote: > > > > > > CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe. > > > > > > On Sun, 26 May 2024 at 14:48, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote: > > >> > > >> Correction, > > >> the rxq pointer is init’d globally for all frames before the frame loops, but only the device and the memory information are set from the XDP frame. So still an issue with the queue_index, but the tests on the device and memory info in xsk_rcv should be fine. > > >> > > >> Cheers, > > >> Yuval. > > > > > > Hi Yuval, > > > > > > Now I am back. Yes, this seems to be a bug so thank you for spotting > > > it. This is nothing I have ever tested since I thought you would do an > > > XSKMAP redirect to the "right" core first. But there is nothing > > > hindering that someone would do a CPUMAP redirect first then an XSKMAP > > > redirect, so should be supported. So in your analysis, is it only > > > queue_index that needs to be populated at CPUMAP redirect time for > > > this to work? > > > > > > Thanks: Magnus > > > > > >>> On 26 May 2024, at 12:02, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote: > > >>> > > >>> Hi, > > >>> so I’ve dug a bit deeper. It seems the receive queue is not maintained across CPUs and this may make the whole practice of redirecting to XSK unsafe if I read the code correctly. It also seems this persists to kernel 6.6.8. Am I missing something? > > >>> > > >>> The code in cpumap runs XDP programs BUT doesn’t fill the rxq (see also the TODO comment in the code). The XDP frame is converted to XDP buffer but the rxq and txq are not filled. It seems they remain as dangling pointers. When __xsk_map_redirect is called and in turn xsk_rcv the rxq is consulted both to check if the umem is shared (in the new patch) as well as decide whether zero copy is used or not. Since rxq is not valid, I’m not certain what the outcome is. Once I bypassed umem check, I no longer crash the kernel but I'm definitely not seeing traffic running. > > >>> > > >>> Thanks, > > >>> Yuval. > > >>> > > >>> Here are the excerpts from the code (5.15.117): > > >>> > > >>> cpumap.c:212 <cpu_map_bpf_prog_run_xdp> ===== Note the comment in the code. ===== > > >>> for (i = 0; i < n; i++) { > > >>> struct xdp_frame *xdpf = frames[i]; > > >>> u32 act; > > >>> int err; > > >>> > > >>> rxq.dev = xdpf->dev_rx; > > >>> rxq.mem = xdpf->mem; > > >>> /* TODO: report queue_index to xdp_rxq_info */ > > >>> > > >>> xdp_convert_frame_to_buff(xdpf, &xdp); > > >>> > > >>> act = bpf_prog_run_xdp(rcpu->prog, &xdp); > > >>> > > >>> xdp.h:175 ===== rxq & txq in the xdp_buff are not filled or even reset. ===== > > >>> static inline > > >>> void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp) > > >>> { > > >>> xdp->data_hard_start = frame->data - frame->headroom - sizeof(*frame); > > >>> xdp->data = frame->data; > > >>> xdp->data_end = frame->data + frame->len; > > >>> xdp->data_meta = frame->data - frame->metasize; > > >>> xdp->frame_sz = frame->frame_sz; > > >>> } > > >>> > > >>> xsk.c:269 ===== xsk_rcv consults rxq in xdp_buff which is not filled! ===== > > >>> static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) > > >>> { > > >>> int err; > > >>> u32 len; > > >>> > > >>> err = xsk_rcv_check(xs, xdp); > > >>> if (err) > > >>> return err; > > >>> > > >>> if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL) { > > >>> len = xdp->data_end - xdp->data; > > >>> return __xsk_rcv_zc(xs, xdp, len); > > >>> } > > >>> > > >>> err = __xsk_rcv(xs, xdp); > > >>> if (!err) > > >>> xdp_return_buff(xdp); > > >>> return err; > > >>> } > > >>> > > >>> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp) > > >>> { > > >>> struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list); > > >>> int err; > > >>> > > >>> err = xsk_rcv(xs, xdp); > > >>> if (err) > > >>> return err; > > >>> > > >>> if (!xs->flush_node.prev) > > >>> list_add(&xs->flush_node, flush_list); > > >>> > > >>> return 0; > > >>> } > > >>> > > >>>> On 24 May 2024, at 18:49, Yuval El-Hanany <YuvalE@xxxxxxxxxxx> wrote: > > >>>> > > >>>> CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe. > > >>>> > > >>>> Hi All, > > >>>> I would really appreciate some guidance here. As this issue spans both network XDP and CPU map, it’s a bit more difficult for me to follow in the kernel code. I attempted to trace the issue in the source, but it’s a bit complex to enter existing code base. > > >>>> > > >>>> It will be easier if you can tell me the following: > > >>>> 1. Is the receive queue metadata maintained correctly through the CPU switch in CPU map? In the description of CPU map in XDP (https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap), it is mentioned that some metadata is not maintained across the CPU switch, but that would make the umem test problematic. Should I assume the receive queue is maintained? > > >>>> 2. Does XSK umem user code have access to the receive queue metadata memory? It seems to be in a different pointer, but it could be that it points to the packet headroom in the umem accessible memory. > > >>>> > > >>>> If the receive queue is not maintained across CPU switch then I think the shared umem test should run along the lines of using some default receive queue after switching CPUs. Otherwise, it should probably drop packets with invalid receive queues. > > >>>> > > >>>> Thanks, > > >>>> Yuval. > > >>>> > > >>>>> On 20 May 2024, at 18:24, Yuval El-Hanany <YuvalE@xxxxxxxxxxx> wrote: > > >>>>> > > >>>>> Hi, > > >>>>> > > >>>>> Short Version > > >>>>> I’ve encountered an issue redirecting traffic to another CPU while using shared UMEM. The frame arrived at the redirected CPU with invalid receive queue. This caused a kernel crash on the patch that supports redirecting to packets to any queue with the same umem due to a test based on the receive queue. > > >>>>> > > >>>>> Long Version > > >>>>> I’ve created a test program that redirects traffic between two interfaces (replacing the MAC address). See the code below. It starts two separate processes on two different cores. When directing traffic directly to an XSK socket all works well. I’m running on kernel 5.15.117 with Magnus patch for shared umem. > > >>>>> > > >>>>> The next step was an XDP program that redirects each packet to the other CPU, and the xdp CPU program redirect the packets to an XDP socket. Initially this resulted in a kernel crash. The reason is that the receive queue is not received correctly once the packet switches CPU. > > >>>>> > > >>>>> I’ve patched the kernel so it does sanity check and does not crash but defaults to queue 0 just for the umem test. > > >>>>> > > >>>>> The kernel no longer crashes but I don’t get the traffic to the XSK sockets, and I log the receive queues and they do not seem random: > > >>>>> On first interface: > > >>>>> 0xC1323D50 > > >>>>> 0xC1343D19 > > >>>>> 0xC1343D50 > > >>>>> > > >>>>> On second interface: > > >>>>> 0xC134BD50 > > >>>>> 0xC135BD50 > > >>>>> > > >>>>> Very few bits differences between each of the “receive queues”. I couldn’t really find something in my program with those or similar values, or in the packets so it makes it less likely it comes from something I do in userspace. Seems to suggest some flags field? The values are not always the same across reboots but pretty close. > > >>>>> > > >>>>> Am I doing something wrong? It seems pretty basic. Most of the code in the XDP program was just added for debugging. The only oddity perhaps is using the shared umem. > > >>>>> > > >>>>> Thanks, > > >>>>> Yuval. > > >>>>> > > >>>>> I’m attaching the program and values from my "debug map”. In the debug map: > > >>>>> 0. Values 0/1 - Which type of program ran (with/without CPU redirection). > > >>>>> 1. Values 2XY - packet redirected from core X to core Y (1st stage). > > >>>>> 2. Values 4XY - packet on core X returns action Y (1st stage). > > >>>>> (These suggest that 1st stage worked correctly). > > >>>>> 3. Values 1XXYY - packet receive queue XX on core YY (XX = 10 is invalid receive queue outside 2..5). > > >>>>> 4. 2XXXX - packet receive queue on any core (XXXX - 4999 imeans queue greater than 5000). > > >>>>> 5. Random entry - a receive queue. > > >>>>> > > >>>>> Debug map values without CPU redirection: > > >>>>> +------+------+--------------+ > > >>>>> | Port | Key | Received | > > >>>>> +------+------+--------------+ > > >>>>> | 0 | 1032 | 504 | > > >>>>> | 0 | 1024 | 37005992 | > > >>>>> | 0 | 1034 | 37544052 | > > >>>>> | 0 | 0 | 74546875 | > > >>>>> | 1 | 1034 | 35765961 | > > >>>>> | 1 | 0 | 71492773 | > > >>>>> | 1 | 1024 | 35729107 | > > >>>>> | 1 | 1032 | 277 | > > >>>>> +------+------+--------------+ > > >>>>> > > >>>>> Debug map value with CPU redirection (no other change): > > >>>>> +------+------+--------------+ > > >>>>> | Port | Key | Received | > > >>>>> +------+------+--------------+ > > >>>>> | 0 | 1 | 3400 | > > >>>>> | 0 | 223 | 2947 | > > >>>>> | 0 | 232 | 454 | > > >>>>> | 0 | 424 | 454 | > > >>>>> | 0 | 434 | 2947 | > > >>>>> | 0 | 11002 | 454 | > > >>>>> | 0 | 11003 | 2947 | > > >>>>> | 0 | 20000 | 106 | > > >>>>> | 0 | 20016 | 7 | > > >>>>> | 0 | 20024 | 1058 | > > >>>>> | 0 | 20582 | 2 | > > >>>>> | 0 | 24999 | 2231 | > > >>>>> | 0 | 3241295184 | 339 | > > >>>>> | 0 | 3241426201 | 3 | > > >>>>> | 0 | 3241426256 | 1891 | > > >>>>> | 1 | 1 | 1013 | > > >>>>> | 1 | 223 | 610 | > > >>>>> | 1 | 232 | 404 | > > >>>>> | 1 | 424 | 404 | > > >>>>> | 1 | 434 | 610 | > > >>>>> | 1 | 11002 | 404 | > > >>>>> | 1 | 11003 | 610 | > > >>>>> | 1 | 20000 | 17 | > > >>>>> | 1 | 20024 | 309 | > > >>>>> | 1 | 24999 | 689 | > > >>>>> | 1 | 3241459024 | 309 | > > >>>>> | 1 | 3241524560 | 381 | > > >>>>> +------+------+--------------+ > > >>>>> > > >>>>> The XDP code (#if 0 is for the no CPU redirection): > > >>>>> /* SPDX-License-Identifier: GPL-2.0 */ > > >>>>> > > >>>>> #include <linux/if_ether.h> > > >>>>> #include <linux/bpf.h> > > >>>>> #include <linux/in.h> > > >>>>> > > >>>>> #include <bpf/bpf_helpers.h> > > >>>>> #include <bpf/bpf_endian.h> > > >>>>> > > >>>>> #include "xdp/parsing_helpers.h" > > >>>>> > > >>>>> /* Redirect to SP queues map. */ > > >>>>> struct > > >>>>> { > > >>>>> __uint (type, BPF_MAP_TYPE_XSKMAP); > > >>>>> __type (key, __u32); > > >>>>> __type (value, __u32); > > >>>>> __uint (max_entries, 128); > > >>>>> } xsks_map SEC(".maps"); > > >>>>> > > >>>>> /* Redirect to CPUs map. */ > > >>>>> struct > > >>>>> { > > >>>>> __uint (type, BPF_MAP_TYPE_CPUMAP); > > >>>>> __uint (key_size, sizeof (__u32)); > > >>>>> __uint (value_size, sizeof (struct bpf_cpumap_val)); > > >>>>> __uint (max_entries, 32); > > >>>>> } cpu_map SEC(".maps"); > > >>>>> > > >>>>> /* Statistics and debug maps. */ > > >>>>> struct > > >>>>> { > > >>>>> __uint (type, BPF_MAP_TYPE_PERCPU_ARRAY); > > >>>>> __type (key, __u32); > > >>>>> __type (value, __u32); > > >>>>> __uint (max_entries, 64); > > >>>>> } xdp_stats_map SEC(".maps"); > > >>>>> > > >>>>> struct > > >>>>> { > > >>>>> __uint (type, BPF_MAP_TYPE_HASH); > > >>>>> __type (key, __u32); > > >>>>> __type (value, __u32); > > >>>>> __uint (max_entries, 64); > > >>>>> } xdp_debug_proto SEC(".maps"); > > >>>>> > > >>>>> static __u32 oneval = 1; > > >>>>> > > >>>>> int my_parse_ethhdr (struct hdr_cursor* nh, void* dataend, struct ethhdr** eth) > > >>>>> { > > >>>>> __u16 ethtype; > > >>>>> *eth = nh->pos; > > >>>>> if ((void*) &(*eth) [1] > dataend) > > >>>>> { > > >>>>> return (-1); > > >>>>> } > > >>>>> > > >>>>> ethtype = (*eth)->h_proto; > > >>>>> nh->pos = &(*eth) [1]; > > >>>>> if (ethtype == bpf_ntohs (ETH_P_8021Q)) > > >>>>> { > > >>>>> struct vlan_hdr* vlan = (struct vlan_hdr*) &(*eth) [1]; > > >>>>> if ((void*) &vlan [1] > dataend) > > >>>>> { > > >>>>> return (-1); > > >>>>> } > > >>>>> ethtype = vlan->h_vlan_encapsulated_proto; > > >>>>> /* inc_debug_map (10000+(bpf_ntohs (vlan->h_vlan_TCI) & 0x0FFF)); */ > > >>>>> /* inc_debug_map (vlan->h_vlan_TCI); */ > > >>>>> nh->pos = &vlan [1]; > > >>>>> } > > >>>>> > > >>>>> return (ethtype); > > >>>>> } > > >>>>> > > >>>>> static void inc_debug_map (__u32 key) > > >>>>> { > > >>>>> #ifdef WITH_STATS > > >>>>> bpf_map_update_elem (&xdp_debug_proto, > > >>>>> &key, (__u32*) &oneval, BPF_NOEXIST); > > >>>>> __u32* rec = bpf_map_lookup_elem (&xdp_debug_proto, (__u32*) &key); > > >>>>> if (rec != NULL) > > >>>>> { > > >>>>> ++(*rec); > > >>>>> } > > >>>>> #endif > > >>>>> } > > >>>>> > > >>>>> > > >>>>> > > >>>>> SEC ("xdp") > > >>>>> int rdwr_xsk_prog (struct xdp_md *ctx) > > >>>>> { > > >>>>> enum xdp_action action; > > >>>>> int index = ctx->rx_queue_index; > > >>>>> void* dataend = (void *)(long)ctx->data_end; > > >>>>> void* data = (void *)(long)ctx->data; > > >>>>> int stage = 0; > > >>>>> > > >>>>> struct hdr_cursor nh; > > >>>>> > > >>>>> __u32 *pkt_count; > > >>>>> struct ethhdr* eth; > > >>>>> > > >>>>> #ifdef WITH_STATS > > >>>>> pkt_count = bpf_map_lookup_elem(&xdp_stats_map, &index); > > >>>>> if (pkt_count != NULL) > > >>>>> { > > >>>>> ++*pkt_count; > > >>>>> } > > >>>>> #endif > > >>>>> > > >>>>> /* Parse Ethernet & VLAN */ > > >>>>> nh.pos = data; > > >>>>> int ethtype = my_parse_ethhdr (&nh, dataend, ð); > > >>>>> if (ethtype != bpf_ntohs (0x0800)) > > >>>>> { > > >>>>> return (XDP_PASS); > > >>>>> } > > >>>>> > > >>>>> /* Direct XSK redirect. */ > > >>>>> #if 0 > > >>>>> inc_debug_map (stage); > > >>>>> action = bpf_redirect_map (&xsks_map, index, XDP_PASS); > > >>>>> inc_debug_map (1000 + index*10 + action); > > >>>>> #endif > > >>>>> /* CPU map redirect. */ > > >>>>> __u32 cpuid = bpf_get_smp_processor_id (); > > >>>>> __u32 targetcpu = cpuid ^ 0x01; > > >>>>> ++stage; > > >>>>> inc_debug_map (stage); > > >>>>> inc_debug_map (200 + cpuid * 10 + targetcpu); > > >>>>> action = bpf_redirect_map (&cpu_map, targetcpu, XDP_PASS); > > >>>>> inc_debug_map (400 + targetcpu * 10 + action); > > >>>>> return (action); > > >>>>> } > > >>>>> > > >>>>> SEC ("xdp/cpumap") > > >>>>> int rdwr_cpu_prog(struct xdp_md *ctx) > > >>>>> { > > >>>>> __u64* dataend64 = (__u64*) (long) ctx->data_end; > > >>>>> __u64* data64 = (__u64*) (long) ctx->data; > > >>>>> __u32 rxqueue = ctx->rx_queue_index; > > >>>>> if (rxqueue < 5000) > > >>>>> { > > >>>>> inc_debug_map (20000 + rxqueue); > > >>>>> } > > >>>>> else > > >>>>> { > > >>>>> inc_debug_map (rxqueue); > > >>>>> inc_debug_map (24999); > > >>>>> } > > >>>>> > > >>>>> rxqueue = (rxqueue >= 2 && rxqueue < 6) ? rxqueue : 10; > > >>>>> __u32 cpuid = bpf_get_smp_processor_id (); > > >>>>> cpuid = (cpuid < 12) ? cpuid : 20; > > >>>>> inc_debug_map (10000 + 100*rxqueue + cpuid); > > >>>>> > > >>>>> return bpf_redirect_map (&xsks_map, bpf_get_smp_processor_id (), 0); > > >>>>> } > > >>>>> > > >>>> > > >>> > > >> > >