Re: XDP CPU maps & shared umem

Yuval El-Hanany <YuvalE@xxxxxxxxxxx> · Wed, 29 May 2024 07:51:43 +0000

Yeah,
        I totally agree. I didn’t know if there’s room for the xdp_frame to add more space, that’s why I opted to save space and piggyback there but I didn’t like it. I wasn’t too concerned about the bit fields as it’s 16 bit aligned. Sure I can rework the patch. You’re not concerned about having a shared rx_queue? You have a more comprehensive view of the code.

        I didn’t know the xdp_frame is stored in the headroom. If it’s stored at the head of the packet, this means the construction of the xdp_buff in the loop causes memory access to the packets memory which are not stored sequentially. Perhaps it’s worth adding prefetch? I did packet prefetch in my user code and it did wonders when using xsk. The XDP programs are not anticipated to be huge data processing code so the prefetch has potential to make a difference.

        Cheers,
                Yuval.

(Once again plain text resend…)

> On 29 May 2024, at 9:57, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>
> CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe.
>
> On Tue, 28 May 2024 at 18:03, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote:
>>
>> Hi,
>>        this is my temporary patch to get the receive queue across the CPU-s. I’m not sure I recommend it. I avoided increasing the frame size by piggy backing on the memory model type which places data in non relevant structures. It also limits the receive queue index to u16 (24 bits would also be possible). It does work and traffic is running (I’ll probably have a followup question on performance I expected better but I'll do some more testing). Another issue - it updates the rxq in xdp_buff. In the context of cpumap it’s okay as rxq is only used immediately in the loop and discarded after the loop. I haven’t found anywhere it’s wrong. But theoretically rxq may have a longer timespan, and may be shared across multiple buffers. Modifying it, may affect other buffers that belong to different receive queue index.
>>
>>        (Sorry for the double message, previous one was not plaintext so it was rejected by group server)
>>
>>        Cheers,
>>                Yuval.
>>
>> diff -urN a/include/net/xdp.h b/include/net/xdp.h
>> --- a/include/net/xdp.h 2023-06-05 07:21:27.000000000 +0000
>> +++ b/include/net/xdp.h 2024-05-28 07:04:00.549507316 +0000
>> @@ -48,7 +48,8 @@
>> #define XDP_XMIT_FLAGS_MASK    XDP_XMIT_FLUSH
>>
>> struct xdp_mem_info {
>> -       u32 type; /* enum xdp_mem_type, but known size type */
>> +       u32 type : 16; /* enum xdp_mem_type, but known size type */
>> +    u32 rx_queue_index : 16;
>
> How about just sticking an u32 queue_index in the struct xdp_frame?
> The xdp_frame is stored at the start of the 256 byte XDP metadata
> section in front of the packet. I believe there is room for an extra 4
> bytes there without spilling over to a new cache line. Then in the
> xdp_frame from xdp_buff creation function, you always copy over the
> queue_index to the xdp_frame. But when converting the xdp_frame to an
> xdp_buff, you only do this for the "XDP program on a CPUMAP" case when
> the action is XDP_REDIRECT. There is even a /* TODO */ in this
> function for copying over the queue_index.
>
> What do you think? Want to take a stab at it and submit a patch?
>
> BTW, bitfields are slow so you do not want to use that. And
> queue_index has nothing to do with memory type so is not a logical
> place for it.
>
>>        u32 id;
>> };
>>
>> @@ -180,6 +181,10 @@
>>        xdp->data_end = frame->data + frame->len;
>>        xdp->data_meta = frame->data - frame->metasize;
>>        xdp->frame_sz = frame->frame_sz;
>> +       if (xdp->rxq != NULL)
>> +       {
>> +               xdp->rxq->queue_index = frame->mem.rx_queue_index;
>> +       }
>> }
>>
>> static inline
>> @@ -206,6 +211,8 @@
>>        xdp_frame->headroom = headroom - sizeof(*xdp_frame);
>>        xdp_frame->metasize = metasize;
>>        xdp_frame->frame_sz = xdp->frame_sz;
>> +       xdp_frame->mem.rx_queue_index =
>> +               (xdp->rxq != NULL) ? xdp->rxq->queue_index : 0;
>>
>>        return 0;
>> }
>> @@ -226,6 +233,7 @@
>>
>>        /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
>>        xdp_frame->mem = xdp->rxq->mem;
>> +       xdp_frame->mem.rx_queue_index = xdp->rxq->queue_index;
>>
>>        return xdp_frame;
>> }
>> diff -urN a/net/core/xdp.c b/net/core/xdp.c
>> --- a/net/core/xdp.c    2023-06-05 07:21:27.000000000 +0000
>> +++ b/net/core/xdp.c    2024-05-28 14:05:24.531617279 +0000
>> @@ -536,6 +536,7 @@
>>        xdpf->metasize = metasize;
>>        xdpf->frame_sz = PAGE_SIZE;
>>        xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
>> +    xdpf->mem.rx_queue_index = (xdp->rxq != NULL) ? xdp->rxq->queue_index : 0;
>>
>>        xsk_buff_free(xdp);
>>        return xdpf;
>>
>>> On 28 May 2024, at 15:38, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote:
>>>
>>> CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe.
>>>
>>> Sure. The two patches (the second is just documentation) are below.
>>>
>>>       I’m trying to add the rx queue from xdp_buff to the xdp_frame and extract it back. Is the size or alignment of xdp_frame critical? Is it okay to add the rx queue as an extra field, or should I try to add it as a bit field on an existing field like the memory type that takes 32 bits but has only 4 values?
>>>
>>>       Cheers,
>>>               Yuval.
>>>
>>> From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>>> Subject: [PATCH bpf-next 1/2] xsk: support redirect to any socket bound to the same umem
>>> Date: 5 February 2024 at 14:35:50 GMT+2
>>> To: magnus.karlsson@xxxxxxxxx, bjorn@xxxxxxxxxx, ast@xxxxxxxxxx, daniel@xxxxxxxxxxxxx, netdev@xxxxxxxxxxxxxxx, maciej.fijalkowski@xxxxxxxxx, yuvale@xxxxxxxxxxx
>>> Cc: bpf@xxxxxxxxxxxxxxx
>>>
>>> CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe.
>>>
>>> From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>>>
>>> Add support for directing a packet to any socket bound to the same
>>> umem. This makes it possible to use the XDP program to select what
>>> socket the packet should be received on. The user can populate the
>>> XSKMAP with various sockets and as long as they share the same umem,
>>> the XDP program can pick any one of them.
>>>
>>> Suggested-by: Yuval El-Hanany <yuvale@xxxxxxxxxxx>
>>> Signed-off-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>>> ---
>>> net/xdp/xsk.c | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>> index 1eadfac03cc4..a339e9a1b557 100644
>>> --- a/net/xdp/xsk.c
>>> +++ b/net/xdp/xsk.c
>>> @@ -313,10 +313,13 @@ static bool xsk_is_bound(struct xdp_sock *xs)
>>>
>>> static int xsk_rcv_check(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>>> {
>>> +       struct net_device *dev = xdp->rxq->dev;
>>> +       u32 qid = xdp->rxq->queue_index;
>>> +
>>>      if (!xsk_is_bound(xs))
>>>              return -ENXIO;
>>>
>>> -       if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
>>> +       if (!dev->_rx[qid].pool || xs->umem != dev->_rx[qid].pool->umem)
>>>              return -EINVAL;
>>>
>>>      if (len > xsk_pool_get_rx_frame_size(xs->pool) && !xs->sg) {
>>> --
>>> 2.42.0
>>>
>>> From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>>>
>>> Document the ability to redirect to any socket bound to the same umem.
>>>
>>> Signed-off-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>>> ---
>>> Documentation/networking/af_xdp.rst | 33 +++++++++++++++++------------
>>> 1 file changed, 19 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>>> index dceeb0d763aa..72da7057e4cf 100644
>>> --- a/Documentation/networking/af_xdp.rst
>>> +++ b/Documentation/networking/af_xdp.rst
>>> @@ -329,23 +329,24 @@ XDP_SHARED_UMEM option and provide the initial socket's fd in the
>>> sxdp_shared_umem_fd field as you registered the UMEM on that
>>> socket. These two sockets will now share one and the same UMEM.
>>>
>>> -There is no need to supply an XDP program like the one in the previous
>>> -case where sockets were bound to the same queue id and
>>> -device. Instead, use the NIC's packet steering capabilities to steer
>>> -the packets to the right queue. In the previous example, there is only
>>> -one queue shared among sockets, so the NIC cannot do this steering. It
>>> -can only steer between queues.
>>> -
>>> -In libbpf, you need to use the xsk_socket__create_shared() API as it
>>> -takes a reference to a FILL ring and a COMPLETION ring that will be
>>> -created for you and bound to the shared UMEM. You can use this
>>> -function for all the sockets you create, or you can use it for the
>>> -second and following ones and use xsk_socket__create() for the first
>>> -one. Both methods yield the same result.
>>> +In this case, it is possible to use the NIC's packet steering
>>> +capabilities to steer the packets to the right queue. This is not
>>> +possible in the previous example as there is only one queue shared
>>> +among sockets, so the NIC cannot do this steering as it can only steer
>>> +between queues.
>>> +
>>> +In libxdp (or libbpf prior to version 1.0), you need to use the
>>> +xsk_socket__create_shared() API as it takes a reference to a FILL ring
>>> +and a COMPLETION ring that will be created for you and bound to the
>>> +shared UMEM. You can use this function for all the sockets you create,
>>> +or you can use it for the second and following ones and use
>>> +xsk_socket__create() for the first one. Both methods yield the same
>>> +result.
>>>
>>> Note that a UMEM can be shared between sockets on the same queue id
>>> and device, as well as between queues on the same device and between
>>> -devices at the same time.
>>> +devices at the same time. It is also possible to redirect to any
>>> +socket as long as it is bound to the same umem with XDP_SHARED_UMEM.
>>>
>>> XDP_USE_NEED_WAKEUP bind flag
>>> -----------------------------
>>> @@ -822,6 +823,10 @@ A: The short answer is no, that is not supported at the moment. The
>>>  switch, or other distribution mechanism, in your NIC to direct
>>>  traffic to the correct queue id and socket.
>>>
>>> +   Note that if you are using the XDP_SHARED_UMEM option, it is
>>> +   possible to switch traffic between any socket bound to the same
>>> +   umem.
>>> +
>>> Q: My packets are sometimes corrupted. What is wrong?
>>>
>>> A: Care has to be taken not to feed the same buffer in the UMEM into
>>> --
>>> 2.42.0
>>>
>>>> On 28 May 2024, at 15:21, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>>>>
>>>> CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe.
>>>>
>>>> On Mon, 27 May 2024 at 10:11, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>>>>>
>>>>> On Mon, 27 May 2024 at 09:34, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Yeah,
>>>>>>      as far down as I dug in the code, it seems it’s the only one that’s used. The other members are reg_state & napi_id. I’m unfamiliar with their exact purposes, but doesn’t sound like they’re necessary down the line. One other concern is txq IMO. It is not initialised to any value in xdp_buff. While it may not be relevant considering it’s a received packet, it seems wrong to keep a dangling pointer both in terms of security (leaking previous values) and robustness. Maybe just set it to NULL?
>>>>>>
>>>>>>      I’ve been racking my brain as to what is the accepted method to pass the receive queue. Passing it on the packet headroom seemed like a reasonable solution, but it would mean memory access to the packet memory on packet arrival on the second CPU which may be harmful to performance. The queue itself is just a pointer queue so can’t piggyback on it unless using a different kind of queue. Would be great to hear how it should be done.
>>>>>
>>>>> Let me get into this piece of code and scratch my head for a while and
>>>>> I will get back to you. Do not have a solution on top of my head. Will
>>>>> loop in Maciej too.
>>>>
>>>> Yuval,
>>>>
>>>> Could you please send me the patch that you are using? I cannot
>>>> remember what I sent you :-).
>>>>
>>>>> /Magnus
>>>>>
>>>>>>      Thanks,
>>>>>>              Yuval.
>>>>>>
>>>>>>> On 27 May 2024, at 8:48, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe.
>>>>>>>
>>>>>>> On Sun, 26 May 2024 at 14:48, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Correction,
>>>>>>>>     the rxq pointer is init’d globally for all frames before the frame loops, but only the device and the memory information are set from the XDP frame. So still an issue with the queue_index, but the tests on the device and memory info in xsk_rcv should be fine.
>>>>>>>>
>>>>>>>>     Cheers,
>>>>>>>>             Yuval.
>>>>>>>
>>>>>>> Hi Yuval,
>>>>>>>
>>>>>>> Now I am back. Yes, this seems to be a bug so thank you for spotting
>>>>>>> it. This is nothing I have ever tested since I thought you would do an
>>>>>>> XSKMAP redirect to the "right" core first. But there is nothing
>>>>>>> hindering that someone would do a CPUMAP redirect first then an XSKMAP
>>>>>>> redirect, so should be supported. So in your analysis, is it only
>>>>>>> queue_index that needs to be populated at CPUMAP redirect time for
>>>>>>> this to work?
>>>>>>>
>>>>>>> Thanks: Magnus
>>>>>>>
>>>>>>>>> On 26 May 2024, at 12:02, Yuval El-Hanany <yuvale@xxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> so I’ve dug a bit deeper. It seems the receive queue is not maintained across CPUs and this may make the whole practice of redirecting to XSK unsafe if I read the code correctly. It also seems this persists to kernel 6.6.8. Am I missing something?
>>>>>>>>>
>>>>>>>>> The code in cpumap runs XDP programs BUT doesn’t fill the rxq (see also the TODO comment in the code). The XDP frame is converted to XDP buffer but the rxq and txq are not filled. It seems they remain as dangling pointers. When __xsk_map_redirect is called and in turn xsk_rcv the rxq is consulted both to check if the umem is shared (in the new patch) as well as decide whether zero copy is used or not. Since rxq is not valid, I’m not certain what the outcome is. Once I bypassed umem check, I no longer crash the kernel but I'm definitely not seeing traffic running.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yuval.
>>>>>>>>>
>>>>>>>>> Here are the excerpts from the code (5.15.117):
>>>>>>>>>
>>>>>>>>> cpumap.c:212 <cpu_map_bpf_prog_run_xdp> ===== Note the comment in the code. =====
>>>>>>>>> for (i = 0; i < n; i++) {
>>>>>>>>>    struct xdp_frame *xdpf = frames[i];
>>>>>>>>>    u32 act;
>>>>>>>>>    int err;
>>>>>>>>>
>>>>>>>>> rxq.dev = xdpf->dev_rx;
>>>>>>>>>    rxq.mem = xdpf->mem;
>>>>>>>>>    /* TODO: report queue_index to xdp_rxq_info */
>>>>>>>>>
>>>>>>>>>    xdp_convert_frame_to_buff(xdpf, &xdp);
>>>>>>>>>
>>>>>>>>>    act = bpf_prog_run_xdp(rcpu->prog, &xdp);
>>>>>>>>>
>>>>>>>>> xdp.h:175 ===== rxq & txq in the xdp_buff are not filled or even reset. =====
>>>>>>>>> static inline
>>>>>>>>> void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
>>>>>>>>> {
>>>>>>>>> xdp->data_hard_start = frame->data - frame->headroom - sizeof(*frame);
>>>>>>>>> xdp->data = frame->data;
>>>>>>>>> xdp->data_end = frame->data + frame->len;
>>>>>>>>> xdp->data_meta = frame->data - frame->metasize;
>>>>>>>>> xdp->frame_sz = frame->frame_sz;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> xsk.c:269 ===== xsk_rcv consults rxq in xdp_buff which is not filled! =====
>>>>>>>>> static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>>>>>>>> {
>>>>>>>>> int err;
>>>>>>>>> u32 len;
>>>>>>>>>
>>>>>>>>> err = xsk_rcv_check(xs, xdp);
>>>>>>>>> if (err)
>>>>>>>>>    return err;
>>>>>>>>>
>>>>>>>>> if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL) {
>>>>>>>>> len = xdp->data_end - xdp->data;
>>>>>>>>>    return __xsk_rcv_zc(xs, xdp, len);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> err = __xsk_rcv(xs, xdp);
>>>>>>>>> if (!err)
>>>>>>>>> xdp_return_buff(xdp);
>>>>>>>>> return err;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
>>>>>>>>> {
>>>>>>>>> struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
>>>>>>>>> int err;
>>>>>>>>>
>>>>>>>>> err = xsk_rcv(xs, xdp);
>>>>>>>>> if (err)
>>>>>>>>> return err;
>>>>>>>>>
>>>>>>>>> if (!xs->flush_node.prev)
>>>>>>>>>    list_add(&xs->flush_node, flush_list);
>>>>>>>>>
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>> On 24 May 2024, at 18:49, Yuval El-Hanany <YuvalE@xxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> CAUTION:External Email, Do not click on links or open attachments unless you recognize the sender and know the content is safe.
>>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>   I would really appreciate some guidance here. As this issue spans both network XDP and CPU map, it’s a bit more difficult for me to follow in the kernel code. I attempted to trace the issue in the source, but it’s a bit complex to enter existing code base.
>>>>>>>>>>
>>>>>>>>>>   It will be easier if you can tell me the following:
>>>>>>>>>> 1. Is the receive queue metadata maintained correctly through the CPU switch in CPU map? In the description of CPU map in XDP (https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap), it is mentioned that some metadata is not maintained across the CPU switch, but that would make the umem test problematic. Should I assume the receive queue is maintained?
>>>>>>>>>> 2. Does XSK umem user code have access to the receive queue metadata memory? It seems to be in a different pointer, but it could be that it points to the packet headroom in the umem accessible memory.
>>>>>>>>>>
>>>>>>>>>>   If the receive queue is not maintained across CPU switch then I think the shared umem test should run along the lines of using some default receive queue after switching CPUs. Otherwise, it should probably drop packets with invalid receive queues.
>>>>>>>>>>
>>>>>>>>>>   Thanks,
>>>>>>>>>>           Yuval.
>>>>>>>>>>
>>>>>>>>>>> On 20 May 2024, at 18:24, Yuval El-Hanany <YuvalE@xxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Short Version
>>>>>>>>>>> I’ve encountered an issue redirecting traffic to another CPU while using shared UMEM. The frame arrived at the redirected CPU with invalid receive queue. This caused a kernel crash on the patch that supports redirecting to packets to any queue with the same umem due to a test based on the receive queue.
>>>>>>>>>>>
>>>>>>>>>>> Long Version
>>>>>>>>>>> I’ve created a test program that redirects traffic between two interfaces (replacing the MAC address). See the code below. It starts two separate processes on two different cores. When directing traffic directly to an XSK socket all works well. I’m running on kernel 5.15.117 with Magnus patch for shared umem.
>>>>>>>>>>>
>>>>>>>>>>> The next step was an XDP program that redirects each packet to the other CPU, and the xdp CPU program redirect the packets to an XDP socket. Initially this resulted in a kernel crash. The reason is that the receive queue is not received correctly once the packet switches CPU.
>>>>>>>>>>>
>>>>>>>>>>> I’ve patched the kernel so it does sanity check and does not crash but defaults to queue 0 just for the umem test.
>>>>>>>>>>>
>>>>>>>>>>> The kernel no longer crashes but I don’t get the traffic to the XSK sockets, and I log the receive queues and they do not seem random:
>>>>>>>>>>> On first interface:
>>>>>>>>>>> 0xC1323D50
>>>>>>>>>>> 0xC1343D19
>>>>>>>>>>> 0xC1343D50
>>>>>>>>>>>
>>>>>>>>>>> On second interface:
>>>>>>>>>>> 0xC134BD50
>>>>>>>>>>> 0xC135BD50
>>>>>>>>>>>
>>>>>>>>>>> Very few bits differences between each of the “receive queues”. I couldn’t really find something in my program with those or similar values, or in the packets so it makes it less likely it comes from something I do in userspace. Seems to suggest some flags field? The values are not always the same across reboots but pretty close.
>>>>>>>>>>>
>>>>>>>>>>> Am I doing something wrong? It seems pretty basic. Most of the code in the XDP program was just added for debugging. The only oddity perhaps is using the shared umem.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yuval.
>>>>>>>>>>>
>>>>>>>>>>> I’m attaching the program and values from my "debug map”. In the debug map:
>>>>>>>>>>> 0. Values 0/1 - Which type of program ran (with/without CPU redirection).
>>>>>>>>>>> 1. Values 2XY - packet redirected from core X to core Y (1st stage).
>>>>>>>>>>> 2. Values 4XY - packet on core X returns action Y (1st stage).
>>>>>>>>>>> (These suggest that 1st stage worked correctly).
>>>>>>>>>>> 3. Values 1XXYY - packet receive queue XX on core YY (XX = 10 is invalid receive queue outside 2..5).
>>>>>>>>>>> 4. 2XXXX - packet receive queue on any core (XXXX - 4999 imeans queue greater than 5000).
>>>>>>>>>>> 5. Random entry - a receive queue.
>>>>>>>>>>>
>>>>>>>>>>> Debug map values without CPU redirection:
>>>>>>>>>>> +------+------+--------------+
>>>>>>>>>>> | Port | Key  |     Received |
>>>>>>>>>>> +------+------+--------------+
>>>>>>>>>>> |    0 | 1032 |          504 |
>>>>>>>>>>> |    0 | 1024 |     37005992 |
>>>>>>>>>>> |    0 | 1034 |     37544052 |
>>>>>>>>>>> |    0 |    0 |     74546875 |
>>>>>>>>>>> |    1 | 1034 |     35765961 |
>>>>>>>>>>> |    1 |    0 |     71492773 |
>>>>>>>>>>> |    1 | 1024 |     35729107 |
>>>>>>>>>>> |    1 | 1032 |          277 |
>>>>>>>>>>> +------+------+--------------+
>>>>>>>>>>>
>>>>>>>>>>> Debug map value with CPU redirection (no other change):
>>>>>>>>>>> +------+------+--------------+
>>>>>>>>>>> | Port | Key  |     Received |
>>>>>>>>>>> +------+------+--------------+
>>>>>>>>>>> |    0 |    1 |         3400 |
>>>>>>>>>>> |    0 |  223 |         2947 |
>>>>>>>>>>> |    0 |  232 |          454 |
>>>>>>>>>>> |    0 |  424 |          454 |
>>>>>>>>>>> |    0 |  434 |         2947 |
>>>>>>>>>>> |    0 | 11002 |          454 |
>>>>>>>>>>> |    0 | 11003 |         2947 |
>>>>>>>>>>> |    0 | 20000 |          106 |
>>>>>>>>>>> |    0 | 20016 |            7 |
>>>>>>>>>>> |    0 | 20024 |         1058 |
>>>>>>>>>>> |    0 | 20582 |            2 |
>>>>>>>>>>> |    0 | 24999 |         2231 |
>>>>>>>>>>> |    0 | 3241295184 |          339 |
>>>>>>>>>>> |    0 | 3241426201 |            3 |
>>>>>>>>>>> |    0 | 3241426256 |         1891 |
>>>>>>>>>>> |    1 |    1 |         1013 |
>>>>>>>>>>> |    1 |  223 |          610 |
>>>>>>>>>>> |    1 |  232 |          404 |
>>>>>>>>>>> |    1 |  424 |          404 |
>>>>>>>>>>> |    1 |  434 |          610 |
>>>>>>>>>>> |    1 | 11002 |          404 |
>>>>>>>>>>> |    1 | 11003 |          610 |
>>>>>>>>>>> |    1 | 20000 |           17 |
>>>>>>>>>>> |    1 | 20024 |          309 |
>>>>>>>>>>> |    1 | 24999 |          689 |
>>>>>>>>>>> |    1 | 3241459024 |          309 |
>>>>>>>>>>> |    1 | 3241524560 |          381 |
>>>>>>>>>>> +------+------+--------------+
>>>>>>>>>>>
>>>>>>>>>>> The XDP code (#if 0 is for the no CPU redirection):
>>>>>>>>>>> /* SPDX-License-Identifier: GPL-2.0 */
>>>>>>>>>>>
>>>>>>>>>>> #include <linux/if_ether.h>
>>>>>>>>>>> #include <linux/bpf.h>
>>>>>>>>>>> #include <linux/in.h>
>>>>>>>>>>>
>>>>>>>>>>> #include <bpf/bpf_helpers.h>
>>>>>>>>>>> #include <bpf/bpf_endian.h>
>>>>>>>>>>>
>>>>>>>>>>> #include "xdp/parsing_helpers.h"
>>>>>>>>>>>
>>>>>>>>>>> /* Redirect to SP queues map. */
>>>>>>>>>>> struct
>>>>>>>>>>> {
>>>>>>>>>>> __uint (type, BPF_MAP_TYPE_XSKMAP);
>>>>>>>>>>> __type (key, __u32);
>>>>>>>>>>> __type (value, __u32);
>>>>>>>>>>> __uint (max_entries, 128);
>>>>>>>>>>> } xsks_map SEC(".maps");
>>>>>>>>>>>
>>>>>>>>>>> /* Redirect to CPUs map. */
>>>>>>>>>>> struct
>>>>>>>>>>> {
>>>>>>>>>>> __uint (type, BPF_MAP_TYPE_CPUMAP);
>>>>>>>>>>> __uint (key_size, sizeof (__u32));
>>>>>>>>>>> __uint (value_size, sizeof (struct bpf_cpumap_val));
>>>>>>>>>>> __uint (max_entries, 32);
>>>>>>>>>>> } cpu_map SEC(".maps");
>>>>>>>>>>>
>>>>>>>>>>> /* Statistics and debug maps. */
>>>>>>>>>>> struct
>>>>>>>>>>> {
>>>>>>>>>>> __uint (type, BPF_MAP_TYPE_PERCPU_ARRAY);
>>>>>>>>>>> __type (key, __u32);
>>>>>>>>>>> __type (value, __u32);
>>>>>>>>>>> __uint (max_entries, 64);
>>>>>>>>>>> } xdp_stats_map SEC(".maps");
>>>>>>>>>>>
>>>>>>>>>>> struct
>>>>>>>>>>> {
>>>>>>>>>>> __uint (type, BPF_MAP_TYPE_HASH);
>>>>>>>>>>> __type (key, __u32);
>>>>>>>>>>> __type (value, __u32);
>>>>>>>>>>> __uint (max_entries, 64);
>>>>>>>>>>> } xdp_debug_proto SEC(".maps");
>>>>>>>>>>>
>>>>>>>>>>> static __u32 oneval = 1;
>>>>>>>>>>>
>>>>>>>>>>> int my_parse_ethhdr (struct hdr_cursor* nh, void* dataend, struct ethhdr** eth)
>>>>>>>>>>> {
>>>>>>>>>>> __u16 ethtype;
>>>>>>>>>>> *eth = nh->pos;
>>>>>>>>>>> if ((void*) &(*eth) [1] > dataend)
>>>>>>>>>>> {
>>>>>>>>>>> return (-1);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> ethtype = (*eth)->h_proto;
>>>>>>>>>>> nh->pos = &(*eth) [1];
>>>>>>>>>>> if (ethtype == bpf_ntohs (ETH_P_8021Q))
>>>>>>>>>>> {
>>>>>>>>>>> struct vlan_hdr* vlan = (struct vlan_hdr*) &(*eth) [1];
>>>>>>>>>>> if ((void*) &vlan [1] > dataend)
>>>>>>>>>>> {
>>>>>>>>>>>       return (-1);
>>>>>>>>>>> }
>>>>>>>>>>>   ethtype = vlan->h_vlan_encapsulated_proto;
>>>>>>>>>>> /* inc_debug_map (10000+(bpf_ntohs (vlan->h_vlan_TCI) & 0x0FFF)); */
>>>>>>>>>>> /* inc_debug_map (vlan->h_vlan_TCI); */
>>>>>>>>>>> nh->pos = &vlan [1];
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> return (ethtype);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> static void inc_debug_map (__u32 key)
>>>>>>>>>>> {
>>>>>>>>>>> #ifdef WITH_STATS
>>>>>>>>>>> bpf_map_update_elem (&xdp_debug_proto,
>>>>>>>>>>>                    &key, (__u32*) &oneval, BPF_NOEXIST);
>>>>>>>>>>> __u32* rec = bpf_map_lookup_elem (&xdp_debug_proto, (__u32*) &key);
>>>>>>>>>>> if (rec != NULL)
>>>>>>>>>>> {
>>>>>>>>>>> ++(*rec);
>>>>>>>>>>> }
>>>>>>>>>>> #endif
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> SEC ("xdp")
>>>>>>>>>>> int rdwr_xsk_prog (struct xdp_md *ctx)
>>>>>>>>>>> {
>>>>>>>>>>> enum xdp_action action;
>>>>>>>>>>> int index = ctx->rx_queue_index;
>>>>>>>>>>> void* dataend = (void *)(long)ctx->data_end;
>>>>>>>>>>> void* data = (void *)(long)ctx->data;
>>>>>>>>>>> int stage = 0;
>>>>>>>>>>>
>>>>>>>>>>> struct hdr_cursor nh;
>>>>>>>>>>>
>>>>>>>>>>> __u32 *pkt_count;
>>>>>>>>>>> struct ethhdr* eth;
>>>>>>>>>>>
>>>>>>>>>>> #ifdef WITH_STATS
>>>>>>>>>>> pkt_count = bpf_map_lookup_elem(&xdp_stats_map, &index);
>>>>>>>>>>> if (pkt_count != NULL)
>>>>>>>>>>> {
>>>>>>>>>>> ++*pkt_count;
>>>>>>>>>>> }
>>>>>>>>>>> #endif
>>>>>>>>>>>
>>>>>>>>>>> /* Parse Ethernet & VLAN */
>>>>>>>>>>> nh.pos = data;
>>>>>>>>>>> int ethtype = my_parse_ethhdr (&nh, dataend, &eth);
>>>>>>>>>>> if (ethtype != bpf_ntohs (0x0800))
>>>>>>>>>>> {
>>>>>>>>>>> return (XDP_PASS);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> /* Direct XSK redirect. */
>>>>>>>>>>> #if 0
>>>>>>>>>>> inc_debug_map (stage);
>>>>>>>>>>> action =  bpf_redirect_map (&xsks_map, index, XDP_PASS);
>>>>>>>>>>> inc_debug_map (1000 + index*10 + action);
>>>>>>>>>>> #endif
>>>>>>>>>>> /* CPU map redirect. */
>>>>>>>>>>> __u32 cpuid = bpf_get_smp_processor_id ();
>>>>>>>>>>> __u32 targetcpu = cpuid ^ 0x01;
>>>>>>>>>>> ++stage;
>>>>>>>>>>> inc_debug_map (stage);
>>>>>>>>>>> inc_debug_map (200 + cpuid * 10 + targetcpu);
>>>>>>>>>>> action =  bpf_redirect_map (&cpu_map, targetcpu, XDP_PASS);
>>>>>>>>>>> inc_debug_map (400 + targetcpu * 10 + action);
>>>>>>>>>>> return (action);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> SEC ("xdp/cpumap")
>>>>>>>>>>> int rdwr_cpu_prog(struct xdp_md *ctx)
>>>>>>>>>>> {
>>>>>>>>>>> __u64* dataend64 = (__u64*) (long) ctx->data_end;
>>>>>>>>>>> __u64* data64 = (__u64*) (long) ctx->data;
>>>>>>>>>>> __u32 rxqueue = ctx->rx_queue_index;
>>>>>>>>>>> if (rxqueue < 5000)
>>>>>>>>>>> {
>>>>>>>>>>> inc_debug_map (20000 + rxqueue);
>>>>>>>>>>> }
>>>>>>>>>>> else
>>>>>>>>>>> {
>>>>>>>>>>> inc_debug_map (rxqueue);
>>>>>>>>>>> inc_debug_map (24999);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> rxqueue = (rxqueue >= 2 && rxqueue < 6) ? rxqueue : 10;
>>>>>>>>>>> __u32 cpuid = bpf_get_smp_processor_id ();
>>>>>>>>>>> cpuid = (cpuid < 12) ? cpuid : 20;
>>>>>>>>>>> inc_debug_map (10000 + 100*rxqueue + cpuid);
>>>>>>>>>>>
>>>>>>>>>>> return bpf_redirect_map (&xsks_map, bpf_get_smp_processor_id (), 0);
>>>>>>>>>>> }