Re: Is It Possible to RX/Process/TX packets concurrently with AF_XDP?

On Wed, Jan 11, 2023 at 8:03 PM Zhaoxi Zhu <zzhu@xxxxxxxxxxxxx> wrote:
>
> Just an update, and a few more questions:
>
> Both issues I mentioned in my last email were caused by my use and modification of the original xsk_fwd.c program. I was able to mitigate them by setting umem_cfg_default->fill_size to be the same as umem_cfg_default->comp_size.
>
> After this change, my program runs fine and hasn't gotten stuck anymore. I don't think a fix needs to be submitted to the original xsk_fwd.c this time, since it looks more like a problem caused by my own incorrect use/modification of the program.
>
> However, it would be greatly appreciated if you could share the reason for setting the default fill_size to twice the comp_size; it seems natural to assume that these two rings should have the same size.

The rule of thumb is that the fill ring size should be double the Rx
ring size. How the completion and Tx rings compare to the fill ring
does not matter. If you follow this rule of thumb and top up the fill
ring with buffers whenever you have the opportunity, you will not be
able to starve the driver of buffers in softirq mode. If you do starve
the driver of buffers, performance drops and you will get packet
loss, which likely degrades the performance of your app even further.
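
As a rough sketch (not the xsk_fwd.c defaults verbatim, and assuming the
libbpf/libxdp xsk.h API), the configs would look something like this, with
the fill ring sized at twice the Rx ring:

```c
#include <bpf/xsk.h>          /* or <xdp/xsk.h> when building against libxdp */
#include <linux/if_link.h>    /* XDP_FLAGS_DRV_MODE */

/* UMEM rings: fill ring sized at 2x the Rx ring below. */
static const struct xsk_umem_config umem_cfg = {
	.fill_size      = 2 * XSK_RING_CONS__DEFAULT_NUM_DESCS,
	.comp_size      = XSK_RING_PROD__DEFAULT_NUM_DESCS,
	.frame_size     = XSK_UMEM__DEFAULT_FRAME_SIZE,
	.frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM,
	.flags          = 0,
};

/* Socket rings: the Rx ring is half the fill ring above. */
static const struct xsk_socket_config xsk_cfg = {
	.rx_size      = XSK_RING_CONS__DEFAULT_NUM_DESCS,
	.tx_size      = XSK_RING_PROD__DEFAULT_NUM_DESCS,
	.libbpf_flags = 0,
	.xdp_flags    = XDP_FLAGS_DRV_MODE,
	.bind_flags   = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY,
};
```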

> Also, I would like to raise a few more questions about AF_XDP:
>
> 1) We have the XDP_ZEROCOPY flag for port_params_default; do we need to set a similar flag for umem_cfg_default if we want the UMEM rings (fill ring and completion ring) to also support zero copy? What is a good way to check whether zero copy is in use in my rings?

No need to do that. Use getsockopt() with the XDP_OPTIONS option to
check whether you are in zero-copy (zc) mode. See
include/uapi/linux/if_xdp.h.
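
A minimal sketch of the check (xsk_fd is assumed to be the AF_XDP socket's
file descriptor, e.g. obtained via xsk_socket__fd()):

```c
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283	/* older libcs may not define it */
#endif

/* Returns 1 if the socket runs in zero-copy mode, 0 for copy mode,
 * -1 on error. */
static int xsk_in_zc_mode(int xsk_fd)
{
	struct xdp_options opts;
	socklen_t optlen = sizeof(opts);

	if (getsockopt(xsk_fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen))
		return -1;

	return !!(opts.flags & XDP_OPTIONS_ZEROCOPY);
}
```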

> 2) I noticed that the umem_fq doesn't need to wake up but the txq does when I'm using the default umem and port parameters; is there any reason behind that? What should I do if I also want the umem_fq to wake up (and do the poll())? I think I tried to set XDP_RING_NEED_WAKEUP in umem_cfg_default->flags, but it caused the UMEM creation to fail.

Only the Rx and Tx rings use the need_wakeup flag. You wake everything
up by acting on one of them.
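
For example (untested sketch, assuming the xsk.h helpers; "xsk" and "txq"
are placeholder names for an already-created socket and its Tx ring), the
usual Tx-side pattern is:

```c
#include <sys/socket.h>
#include <bpf/xsk.h>          /* or <xdp/xsk.h> when building against libxdp */

/* Kick the kernel only when it has asked for a wakeup on the Tx ring;
 * per the answer above, acting on one of the Rx/Tx rings is enough to
 * wake everything up. */
static void kick_tx_if_needed(struct xsk_socket *xsk, struct xsk_ring_prod *txq)
{
	if (xsk_ring_prod__needs_wakeup(txq))
		sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
}
```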

> 3) Since the xsk_fwd.c code enables us to use multi-threading for the AF_XDP sockets, are there any recommendations on how many threads should be kept for the AF_XDP sockets and how many should be reserved for the rest of the machine? One of my concerns is that if I assign all cores to AF_XDP sockets, other code on the machine would have no CPU resources left. I have seen my test machine crash when I assigned all cores to the AF_XDP sockets.

The million dollar question :-). It all depends on what you are doing
with the rest of the system. Leaving at least core 0 alone is always
a good idea.

> 4) Kind of related to question 3), do we know which core(s) the kernel XDP program runs on? Are there any concerns that kernel XDP and AF_XDP run on the same core(s) and would fight for resources? What is a good way to make sure that kernel XDP and the AF_XDP sockets do not fight for CPU resources?

The XDP program runs on the core to which you have bound that netdev's
irq. By default, core_id == queue_id for most NICs and not-very-large
servers.

> 5) Is there any relationship between the RX/TX queue interrupts and the AF_XDP sockets? Say I have these interrupts for my NIC that runs AF_XDP:
>
>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15      CPU16      CPU17      CPU18      CPU19      CPU20      CPU21      CPU22      CPU23
>   93:         22          0       3825          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099200-edge      NIC-TxRx-0
>   94:          0          0         16          0          0          0          0          0          0          0       2313          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099201-edge      NIC-TxRx-1
>   95:          0          0          0          0         20          0          0          0          0          0       2459          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099202-edge      NIC-TxRx-2
>   96:          0          0          0          0          0          0         16          0          0          0          0          0          0          0          0          0          0          0       4433          0          0          0          0          0  IR-PCI-MSI 2099203-edge      NIC-TxRx-3
>   97:          0          0          0          0          0          0          0          0         16          0          0          0          0          0          0          0          0          0       3266          0          0          0          0          0  IR-PCI-MSI 2099204-edge      NIC-TxRx-4
>   98:          0          0          0          0          0          0          0          0          0          0         22          0          0          0          0          0          0          0          0          0          0          0       6788          0  IR-PCI-MSI 2099205-edge      NIC-TxRx-5
>   99:          0          0          0          0          0          0          0          0          0          0          0          0         16          0          0          0          0          0          0          0          0          0       2116          0  IR-PCI-MSI 2099206-edge      NIC-TxRx-6
>  100:       2498          0          0          0          0          0          0          0          0          0          0          0          0          0         16          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099207-edge      NIC-TxRx-7
>  101:       2493          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0         16          0          0          0          0          0          0          0  IR-PCI-MSI 2099208-edge      NIC-TxRx-8
>  102:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0       2527          0         25          0          0          0          0          0  IR-PCI-MSI 2099209-edge      NIC-TxRx-9
>  103:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0       3746          0          0          0         26          0          0          0  IR-PCI-MSI 2099210-edge      NIC-TxRx-10
>  104:          0          0          0          0          0          0          0          0          0          0          0          0       2575          0          0          0          0          0          0          0          0          0         18          0  IR-PCI-MSI 2099211-edge      NIC-TxRx-11
>  105:          0         16          0          0          0          0          0          0          0          0          0          0       2571          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099212-edge      NIC-TxRx-12
>  106:          0          0          0         16          0          0       2080          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099213-edge      NIC-TxRx-13
>  107:          0          0          0          0          0         16       2670          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099214-edge      NIC-TxRx-14
>  108:          0          0       2715          0          0          0          0         18          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099215-edge      NIC-TxRx-15
>  109:          0          0       1503          0          0          0          0          0          0         16          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099216-edge      NIC-TxRx-16
>  110:          0          0          0          0          0          0          0          0       1503          0          0         16          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099217-edge      NIC-TxRx-17
>  111:          0          0          0          0          0          0          0          0       1503          0          0          0          0         16          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099218-edge      NIC-TxRx-18
>  112:          0          0          0          0          0          0          0          0          0          0          0          0          0          0       1503         16          0          0          0          0          0          0          0          0  IR-PCI-MSI 2099219-edge      NIC-TxRx-19
>  113:          0          0          0          0          0          0          0          0          0          0          0          0          0          0       1503          0          0         16          0          0          0          0          0          0  IR-PCI-MSI 2099220-edge      NIC-TxRx-20
>  114:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0         16       1503          0          0          0  IR-PCI-MSI 2099221-edge      NIC-TxRx-21
>  115:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0       1503         16          0          0  IR-PCI-MSI 2099222-edge      NIC-TxRx-22
>  116:          0          0          0          0       1503          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0         16  IR-PCI-MSI 2099223-edge      NIC-TxRx-23
>  117:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0  IR-PCI-MSI 2099224-edge      NIC
>
>
> Should I be binding the AF_XDP sockets to the corresponding CPUs in the above interrupt matrix? Say the socket with iface_queue 0 binds to CPU 0 and CPU 2, the same CPUs as the NIC-TxRx-0 interrupts?

Yes, that is a good idea if you do not want to change the default irq affinity.
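
A minimal sketch of pinning a per-queue worker thread to a chosen core with
pthreads (the queue-to-core mapping itself is an assumption here; read it
out of /proc/interrupts or the irq's smp_affinity for your NIC, and the
function name below is just a placeholder):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin 'thread' to 'core_id'; returns 0 on success, an errno value on
 * failure. Call this for each per-queue worker, with core_id taken from
 * the interrupt matrix above. */
static int pin_thread_to_core(pthread_t thread, int core_id)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core_id, &set);

	return pthread_setaffinity_np(thread, sizeof(set), &set);
}
```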

> This is a long post with many questions; thank you very much for reading this email and helping us.
> Zhaoxi Zhu
>
>
> On 12/30/22, 1:31 PM, "Zhaoxi Zhu" <zzhu@xxxxxxxxxxxxx> wrote:
>
>     >On 12/29/22, 2:18 AM, "Magnus Karlsson" <magnus.karlsson@xxxxxxxxx> wrote:
>     >
>     >    On Thu, Dec 29, 2022 at 2:07 AM Zhaoxi Zhu <zzhu@xxxxxxxxxxxxx> wrote:
>     >    >
>     >    >
>     >    >
>     >    > >> On 12/22/22, 1:38 AM, "Magnus Karlsson" <magnus.karlsson@xxxxxxxxx> wrote:
>     >    > >>
>     >    > >>   On Thu, Dec 22, 2022 at 12:11 AM Zhaoxi Zhu <zzhu@xxxxxxxxxxxxx> wrote:
>     >    > >>   >
>     >    > >>    >
>     >    > >>    > >    On Tue, Dec 13, 2022 at 8:11 PM Zhaoxi Zhu <zzhu@xxxxxxxxxxxxx> wrote:
>     >    > >>    > >    >
>     >    > >>    > >    > It looks like I didn't include the mailing list in my previous replies. I hope this one does.
>     >    > >>    > >    >
>     >    > >>    > >   > Also, for the AF_XDP-forwarding example, is it able to handle multiple AF_XDP sockets on the same NIC?
>     >    > >>    > >
>     >    > >>    > >    Yes.
>     >    > >>    > >
>     >    > >>    > >    > Such as:
>     >    > >>    > >    >
>     >    > >>    > >    > ```
>     >    > >>    > >    > ./xdp_fwd -i IFA -q Q1 -i IFA -q Q2 -i IFA -q Q3 -i IFA -q Q4 -c CX -c CY
>     >    > >>    > >    >
>     >    > >>    > >    > ```
>     >    > >>    > >    >
>     >    > >>    > >    > If the above is doable, maybe I can have multiple queues, rather than having one, on the same NIC, create one AF_XDP socket per queue, and then use this xdp_fwd example to achieve multi-threading?
>     >    > >>    > >
>     >    > >>    > >   That is the best way to do multithreading without having to resort to
>     >    > >>    > >    expensive locking. One queue and socket per thread is the way to go.
>     >    > >>    > >
>     >    > >>    >
>     >    > >>    > Thank you so much for your reply and suggestions. I made changes following your suggestions and it partly worked!
>     >    > >>    >
>     >    > >>    > At the beginning, I copied the round robin logic of xdpsock_kern.c and put it in my XDP code. In user space, I have one thread per receiving queue, which equals the number of cores of my machine (24), each with an AF_XDP socket for that queue, but none of them were receiving packets. I added some logs to rx_burst and found out that they were all busy polling and n_packets was always 0. I later changed the number of queues and threads to 4, and the results were the same. What could be the reason that the round robin doesn't work?
>     >    > >>    >
>     >    > >>    > Then, I set the number of receiving queues to 8, removed the round robin logic in the XDP code, simply forwarding the packet to the ctx->rx_queue_index entry of the xsks_map, and used 8 threads in userspace for the 8 queues, and it is now able to receive packets. However, when I tried to increase the number of queues and threads to 16, none of the AF_XDP sockets could receive packets again. Do you know what might be causing this? Is it because the number of queues and threads is too high?
>     >    > >>
>     >    > >>    There is no upper limit to the number of queues and threads supported
>     >    > >>    in the AF_XDP code. Your NIC will likely have a limit on the number of
>     >    > >>    queues though.
>     >    > >>
>     >    >
>     >    > Thank you very much for your tips. After some debugging, I found out that it was because of some bugs initializing the bcache, which resulted in the ports' pcache trading new slabs from the bpool.
>     >
>     >    Please submit this fix to the bpf-examples repo so other people can
>     >    benefit from it too.
>
>     Sure, let me verify whether this also happens with the original xsk_fwd program; if it does, I will create an issue and submit a PR for it.
>     >
>     >    > I fixed that and continued testing; my test is to run some iperfs to generate traffic that goes through these AF_XDP sockets.
>     >    >
>     >    > There is one interesting behavior: the iperf traffic runs fine at the beginning. However, after some time, maybe 30 seconds or 1 minute, the traffic stops and I don't see any sockets RX/TX packets (I added some logs there for debugging).
>     >    >
>     >    > The program doesn't crash; it is as if there's no new traffic coming to the sockets. However, if I stop the iperfs and give the sockets a break, say 1 or 2 minutes, and restart the iperfs, the sockets are usually able to RX/TX traffic again, until they stop once more.
>     >    >
>     >    > I wonder have you seen anything like this before? It feels like that I'm very close to having a fully running program, but I'm stuck at this step and I can't figure out why.
>     >
>     >    Have not seen anything like this. TCP is sensitive to packet loss. Are
>     >    packets dropped in your test run?
>     >
>
>     I'm not sure if that's the case; is there a way to find out? Also, when the traffic stopped, it wasn't only TCP traffic that stopped; other traffic, such as ping, seemed to stop as well. However, after a while, it would recover and I would be able to do iperf/ping again.
>
>     One thing I find interesting is that when there are only 1 or 2 iperfs running at the same time, each getting ~3 to 4 Gbits/sec, the AF_XDP sockets tend to stop working at some point. However, if I have a larger number of iperfs running, say 10, each getting less speed (a little less than 1 Gbits/sec), the AF_XDP sockets work without stopping. This behavior surprises me, as the combined traffic the sockets need to handle is higher (9.x Gbits/sec vs. 7.x Gbits/sec), yet the AF_XDP sockets worked fine in my few tests. I guess I need to investigate this issue more.
>
>     Another question I have is: in the xsk_fwd.c code, is there any reason why port_params_default has XDP_USE_NEED_WAKEUP but the same flag isn't set for umem_cfg_default? I was debugging the traffic-stopping issue, and I realized that the umem fill queue doesn't need to wake up while the tx queue does. I tried to add XDP_USE_NEED_WAKEUP to umem_cfg_default, but it caused the umem creation to fail. Is there any way to use this flag for the umem queues, and does it bring any performance benefits?
>
>     Thank you very much and happy new year to the community.
>     Rio
>
>     >    > Thanks again for your help.
>     >    >
>     >    > >>    > Another question is: since I'm not using round robin in my XDP code, the traffic isn't distributed evenly among my queues; it seems to me that 2 queues are always getting most of the traffic and the others are getting very little:
>     >    > >>    >
>     >    > >>    > +------+--------------+---------------+--------------+---------------+
>     >    > >>    > | Port |   RX packets | RX rate (pps) |   TX packets | TX_rate (pps) |
>     >    > >>    > +------+--------------+---------------+--------------+---------------+
>     >    > >>    > |    0 |         2113 |             1 |         2113 |             1 |
>     >    > >>    > |    1 |            0 |             0 |            0 |             0 |
>     >    > >>    > |    2 |            0 |             0 |            0 |             0 |
>     >    > >>    > |    3 |          568 |             0 |          568 |             0 |
>     >    > >>    > |    4 |         2590 |             1 |         2590 |             1 |
>     >    > >>    > |    5 |            0 |             0 |            0 |             0 |
>     >    > >>    > |    6 |            0 |             0 |            0 |             0 |
>     >    > >>    > |    7 |           85 |             0 |           85 |             0 |
>     >    > >>    > +------+--------------+---------------+--------------+---------------+
>     >    > >>    >
>     >    > >>    > I understand this is mainly because I don't have round robin in my XDP code, but I wonder what decides which queue gets the traffic? Also, if round robin works, does it mean that when a packet arrives at XDP on queue x and is then forwarded to an AF_XDP socket bound to queue y, the packet will be copied and zero-copy won't work in this case?
>     >    > >>
>     >    > >>    Your packet distribution among queues is decided by your NIC and the
>     >    > >>    traffic it receives. It probably has RSS enabled by default. You can
>     >    > >>    program the NIC flow steering rules using ethtool. If you want
>     >    > >>    something perfectly spread among the cores, you probably want to have
>     >    > >>    a synthetic workload and enable explicit flow steering rules to
>     >    > >>    achieve perfect control. My tip is to Google some examples and
>     >    > >>    experiment without using XDP.
>     >    > >>
>     >    > >>    You cannot direct packets coming in on queue X to a socket bound to
>     >    > >>    queue Y, regardless of whether it is zero-copy mode or not. You are
>     >    > >>    correct that this could be supported in copy mode, but it is not.
>     >    > >>
>     >    > >>    > Again, thank you very much for reading this and your help.
>     >    > >>    >
>     >    > >>    > Rio
>     >    > >>    >
>     >    > >>    > >    > Thank you very much for your help and time.
>     >    > >>    > >    > Rio
>     >    > >>    > >    >
>     >    > >>    > >    >
>     >    > >>    > >    > From: Zhaoxi Zhu <zzhu@xxxxxxxxxxxxx>
>     >    > >>    > >    > Date: Monday, December 12, 2022 at 11:06 AM
>     >    > >>    > >   > To: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>     >    > >>    > >    > Subject: Re: Is It Possible to RX/Process/TX packets concurrently with AF_XDP?
>     >    > >>    > >    >
>     >    > >>    > >    > Got it, thank you very much for your clarification.
>     >    > >>    > >    >
>     >    > >>    > >    > I have one more question, if I may: if one AF_XDP socket should be handled by one thread, in order to avoid mutexes and to achieve better performance, can I have more than one AF_XDP socket on the same physical NIC and use one thread per AF_XDP socket, in order to process packets coming into this NIC concurrently?
>     >    > >>    > >    >
>     >    > >>    > >   > Currently, the way we are testing AF_XDP with is to have only 1 queue:
>     >    > >>    > >    >
>     >    > >>    > >    > ```
>     >    > >>    > >    > sudo ethtool -L <interface> combined 1
>     >    > >>    > >    > ```
>     >    > >>    > >    >
>     >    > >>    > >    > Can I change the number of queues to something like 4 and, in the user-space program, have one AF_XDP socket per queue and one thread per AF_XDP socket, in order to have four threads processing traffic coming into the same NIC?
>     >    > >>    > >    >
>     >    > >>    > >    > Thank you very much for your help and time.
>     >    > >>    > >    > Rio
>     >    >
>
>



