Re: AF_XDP sockets across multiple NIC queues


On Tue, Mar 30, 2021 at 8:17 AM Magnus Karlsson
<magnus.karlsson@xxxxxxxxx> wrote:
>
> On Tue, Mar 30, 2021 at 7:32 AM Konstantinos Kaffes <kkaffes@xxxxxxxxx> wrote:
> >
> > On Fri, 26 Mar 2021 at 00:36, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 25, 2021 at 7:51 PM Konstantinos Kaffes <kkaffes@xxxxxxxxx> wrote:
> > > >
> > > > Great, thanks for the info! I will look into implementing this.
> > > >
> > > > For the time being, I implemented a version of my design with N^2
> > > > sockets. I observed that when all traffic is directed to a single NIC
> > > > queue, the throughput is higher than when I use all N NIC queues. I am
> > > > using spinlocks to guard concurrent access to UMEM and the
> > > > fill/completion rings. When I use a single NIC queue, I achieve
> > ~1Mpps; when I use multiple queues, ~550Kpps. Are these numbers
> > reasonable, and is this bad scaling behavior expected?
> > >
> > > 1Mpps sounds reasonable with SKB mode. If you use something simple
> > > like the spinlock scheme you describe, then it will not scale. Check
> > > the sample xsk_fwd.c in samples/bpf in the Linux kernel repo. It has a
> > > mempool implementation that should scale better than the one you
> > > implemented. For anything remotely complicated, you usually need
> > > something that manages the buffers in the umem plus the fill and
> > > completion queues; this is most often called a mempool. User-space
> > > network libraries such as DPDK and VPP provide fast and scalable
> > > mempool implementations. It would be nice to add a simple one to
> > > libbpf, or rather to libxdp, as the AF_XDP functionality is moving
> > > over there. Several people have asked for it, but unfortunately I
> > > have not had the time.
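For reference, here is a rough sketch of the per-thread caching idea
(purely illustrative; the names, the CACHE_SIZE of 64 and the struct
layout are my own assumptions, and xsk_fwd.c does this more carefully).
Each thread keeps a small private stash of umem frame addresses and only
takes the shared lock once per batch rather than once per frame:

/* Hypothetical per-thread cache on top of a locked global pool of umem
 * frame addresses. Sketch only: init, error handling and sizing omitted. */
#include <pthread.h>
#include <stdint.h>

#define CACHE_SIZE 64                  /* frames moved per lock acquisition */

struct global_pool {
        pthread_spinlock_t lock;       /* initialized with pthread_spin_init() */
        uint64_t *addrs;               /* free frame addresses */
        uint32_t n_free;
};

struct thread_cache {
        uint64_t addrs[CACHE_SIZE];
        uint32_t n;                    /* frames currently cached by this thread */
};

/* Grab one frame; touch the shared pool only when the private cache is empty. */
static int buf_alloc(struct global_pool *p, struct thread_cache *c, uint64_t *addr)
{
        if (c->n == 0) {
                pthread_spin_lock(&p->lock);
                while (c->n < CACHE_SIZE && p->n_free > 0)
                        c->addrs[c->n++] = p->addrs[--p->n_free];
                pthread_spin_unlock(&p->lock);
                if (c->n == 0)
                        return -1;     /* pool exhausted */
        }
        *addr = c->addrs[--c->n];
        return 0;
}

/* Return one frame; flush half of the cache back under the lock when full. */
static void buf_free(struct global_pool *p, struct thread_cache *c, uint64_t addr)
{
        if (c->n == CACHE_SIZE) {
                pthread_spin_lock(&p->lock);
                while (c->n > CACHE_SIZE / 2)
                        p->addrs[p->n_free++] = c->addrs[--c->n];
                pthread_spin_unlock(&p->lock);
        }
        c->addrs[c->n++] = addr;
}

The whole point is to amortize the shared lock over CACHE_SIZE frames
instead of taking it for every single packet.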
> > >
> >
> > Thanks for the tip! I have also started trying zero-copy DRV mode and
> > came across a weird behavior. When I am using multiple sockets, one
> > for each NIC queue, I observe very low throughput and a lot of time
> > spent on the following loop:
> >
> > uint32_t idx_cq;
> > while (ret < buf_count) {
> >   ret += xsk_ring_cons__peek(&xsk->umem->cq, buf_count, &idx_cq);
> > }
>
> This is very likely a naïve and unscalable implementation from my
> side, or maybe from you or someone else, since I do not know where it
> comes from :-). Here you are waiting for the completion ring to have a
> certain number of entries (buf_count) before moving on. Work with
> whatever you get instead of trying to get a certain amount.

Another good tactic is to just go and do something else if you do not
get buf_count, then come back later and try again. Do not waste your
cycles doing nothing.
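To make that concrete, here is a rough sketch of the "work with what you
get" pattern using the libbpf ring helpers. It assumes the same xsk/umem
structs as in the snippet above; BATCH_SIZE and the buf_free() pool
helper are placeholders of mine:

uint32_t idx_cq;
unsigned int i, completed;

/* Drain whatever is already in the completion ring and recycle those
 * frames, instead of spinning until buf_count entries have shown up. */
completed = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx_cq);
if (completed > 0) {
        for (i = 0; i < completed; i++) {
                uint64_t addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq + i);

                buf_free(pool, cache, addr);   /* hypothetical: back to your mempool */
        }
        xsk_ring_cons__release(&xsk->umem->cq, completed);
}
/* If nothing completed, do other work (RX, app logic) and come back later.
 * With the need_wakeup flag set you may also have to kick Tx, e.g. via
 * sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0). */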

> Also check where the driver code for each queue id is running: is it
> spread evenly across cores, or is it all on the same core? htop is an
> easy way to find out. It seems that your completion rate is bounded and
> does not scale with the number of queue ids. It might be the case that
> Tx driver processing is occurring on a single core. That is at least
> worth examining, and I would do it before changing the logic above.
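On the core-placement point: driver/interrupt processing usually follows
the IRQ affinity of each queue (visible in /proc/interrupts, adjustable
via /proc/irq/<n>/smp_affinity), while the application side is easy to
control from the program itself. A tiny sketch of pinning a per-queue
worker thread, assuming a simple one-thread-per-queue,
one-core-per-queue mapping (an assumption of mine, not from the thread):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling worker thread to the given CPU.
 * Returns 0 on success, an errno value otherwise. */
static int pin_to_cpu(unsigned int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Calling pin_to_cpu(qid) at the top of each per-queue worker at least
rules out the application threads stacking up on one core while you
look at the driver side.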
>
> /Magnus
>
> > This does not happen when I have only one XDP socket bound to a single queue.
> >
> > Any idea on why this might be happening?
> >
> > > >
> > > > On Thu, 25 Mar 2021 at 00:24, Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, Mar 25, 2021 at 7:25 AM Konstantinos Kaffes <kkaffes@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > I want to write a multi-threaded AF_XDP server where all N threads can
> > > > > > read from all N NIC queues. In my design, each thread creates N AF_XDP
> > > > > > sockets, each associated with a different queue. I have the following
> > > > > > questions:
> > > > > >
> > > > > > 1. Do sockets associated with the same queue need to share their UMEM
> > > > > > area and fill and completion rings?
> > > > >
> > > > > Yes. In zero-copy mode this is natural, since the NIC hardware will
> > > > > DMA the packet into a umem that was decided long before the packet
> > > > > was even received, and of course before we even get to pick which
> > > > > socket it should go to. This restriction is currently carried over
> > > > > to copy mode; however, in theory there is nothing preventing support
> > > > > for multiple umems on the same netdev and queue id in copy mode. It
> > > > > is just that nobody has implemented it.
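For anyone following along, this is roughly how that sharing shows up in
the libbpf xsk API: the fill and completion rings are created together
with the umem, and a socket for a given netdev/queue id is then created
on top of that umem. buffer, NUM_FRAMES, FRAME_SIZE and qid below are
placeholders, and the configs are left as NULL defaults:

#include <bpf/xsk.h>

struct xsk_ring_prod fq;     /* fill ring: belongs to the umem */
struct xsk_ring_cons cq;     /* completion ring: belongs to the umem */
struct xsk_ring_cons rx;
struct xsk_ring_prod tx;
struct xsk_umem *umem;
struct xsk_socket *xsk;

/* One umem, and therefore one fill/completion ring pair, per netdev +
 * queue id in zero-copy mode. */
if (xsk_umem__create(&umem, buffer, NUM_FRAMES * FRAME_SIZE, &fq, &cq, NULL))
        /* handle error */;

/* The socket bound to ("eth0", qid) sits on top of that umem. Additional
 * sockets sharing the same umem go through xsk_socket__create_shared(),
 * i.e. the XDP_SHARED_UMEM bind flag. */
if (xsk_socket__create(&xsk, "eth0", qid, umem, &rx, &tx, NULL))
        /* handle error */;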
> > > > >
> > > > > > 2. Will there be a single XSKMAP holding all N^2 sockets? If yes, what
> > > > > > happens if my XDP program redirects a packet to a socket that is
> > > > > > associated with a different NIC queue than the one in which the packet
> > > > > > arrived?
> > > > >
> > > > > You can have multiple XSKMAPs, but you would in any case need N^2
> > > > > sockets in total to cover all cases. Sockets are tied to a specific
> > > > > netdev and queue id. If you try to redirect to a socket with a queue
> > > > > id or netdev that the packet was not received on, the packet will be
> > > > > dropped. Again, for copy mode it would, from a theoretical
> > > > > perspective, be perfectly fine to redirect to another queue id
> > > > > and/or netdev, since the packet is copied anyway. Maybe you want to
> > > > > add support for it :-).
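As a concrete illustration of that queue id tie-in, the usual XDP program
keys a single XSKMAP on the receive queue index, in the same spirit as
the xdpsock sample (map size and names below are just an example):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);       /* >= number of NIC queues */
        __type(key, __u32);
        __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xsk_redirect_prog(struct xdp_md *ctx)
{
        __u32 qid = ctx->rx_queue_index;

        /* Redirect to the socket registered for the queue the packet actually
         * arrived on; redirecting to any other queue's socket would drop it. */
        if (bpf_map_lookup_elem(&xsks_map, &qid))
                return bpf_redirect_map(&xsks_map, qid, 0);

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";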
> > > > >
> > > > > > I should mention that I am using XDP SKB mode, i.e. copy mode.
> > > > > >
> > > > > > Thank you in advance,
> > > > > > Kostis
> > > >
> > > >
> > > >
> > > > --
> > > > Kostis Kaffes
> > > > PhD Student in Electrical Engineering
> > > > Stanford University
> >
> >
> >
> > --
> > Kostis Kaffes
> > PhD Student in Electrical Engineering
> > Stanford University
