On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:
> > > I know I am replying to a stale thread on the patches I've submit
> > > (there is a v5 now [1]), but I just looked at your message - sorry I
> > > didn't reply sooner.
> > >
> > > The per-queue and per-napi netlink APIs look extremely useful, thanks
> > > for pointing this out.
> > >
> > > In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
> > > similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name.
> > > This is useful on machines with multiple NICs where each NIC could be
> > > located in one of many different NUMA zones.
> > >
> > > The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the
> > > NAPI ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID
> > > to compute which NIC the connection came in on. The app would then
> > > (via configuration) know where to pin that worker thread; ideally
> > > somewhere NUMA local to the NIC.
> > >
> > > I had assumed that such a change would be rejected, but I figured I'd
> > > send an RFC for it after the per epoll context stuff was done and see
> > > if anyone thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as
> > > well.
> >
> > I think you should be able to get this functionality via the netdev-genl
> > API to get napi parameters. It returns ifindex as one of the parameters
> > and you should able to get the name from ifindex.
> >
> > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}
>
> FWIW we also have a C library to access those. Out of curiosity what's
> the programming language you'd use in user space, Joe?

I am using C from user space.

I'm curious what you think about SIOCGIFNAME_BY_NAPI_ID, Jakub. I think it
would be very useful, but I'm not sure such an extension would be accepted.
I can send an RFC if you'd like to take a look and consider it; I know you
are busy and I don't want to add too much noise to the list if I can help
it :)

Here's a brief description of what I'm doing, which others might find
helpful:

1. The machine has multiple NICs. Each NIC has one queue per busy poll app
   thread, plus a few extra queues for other, non-busy-poll usage.

2. A custom RSS context is created on each NIC to distribute flows to the
   busy poll queues. The default context directs flows to the non-busy-poll
   queues.

3. Each NIC has n-tuple filters inserted to direct incoming connections
   with certain destination ports (e.g. 80, 443) to the custom RSS context.
   All other incoming connections land in the default context and go to the
   other queues.

4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
   local to the NIC.

5. IRQ coalescing values are set up with busy poll in mind, so IRQs are
   deferred as much as possible, on the assumption that userland will drive
   NAPI via epoll_wait. This is done per queue (using ethtool --per-queue
   and a queue mask). This is where napi_defer_hard_irqs and
   gro_flush_timeout could help even more. IRQ deferral is only needed for
   the busy poll queues.

6. The userspace app config lists each NIC with its NUMA-local CPUs, for
   example:

     - eth0: 0,1,2,3
     - eth1: 4,5,6,7

   The app reads that configuration in when it starts. Ideally, these are
   the same CPUs the IRQs are pinned to in step 4, but hopefully the
   coalesce settings let IRQs be deferred quite a bit so busy poll can take
   over.

7. App threads are created and sockets are opened with SO_REUSEPORT.
   Notably: when the sockets are created, SO_BINDTODEVICE is used* (see
   below for a longer explanation of this).

8. A cBPF reuseport program is inserted to distribute incoming connections
   to threads based on skb->queue_mapping. skb->queue_mapping values are
   not unique across NICs (e.g. each NIC will have a queue with
   queue_mapping == 0), which is why SO_BINDTODEVICE is needed. Again, see
   below. (A sketch of the cBPF program follows this list.)

9. Each worker thread's epoll context is set to busy poll via the ioctl
   I've submitted in my patches (see the second sketch after this list).
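For step 8, here is a minimal sketch of the cBPF reuseport program. It is
attached to one listening socket in each per-NIC reuseport group; the
program returns skb->queue_mapping, which SO_ATTACH_REUSEPORT_CBPF uses as
an index into the group's socket array, so this assumes the listen sockets
were added to each group in queue order (socket 0 handles queue 0, and so
on):

#include <sys/socket.h>
#include <linux/filter.h>

/* Select the listening socket whose index in the reuseport group equals
 * skb->queue_mapping. Assumes sockets were added to the group in queue
 * order.
 */
static int attach_queue_mapping_cbpf(int listen_fd)
{
        struct sock_filter code[] = {
                /* A = skb->queue_mapping */
                { BPF_LD | BPF_W | BPF_ABS, 0, 0,
                  SKF_AD_OFF + SKF_AD_QUEUE },
                /* return A; used as the index of the socket to select */
                { BPF_RET | BPF_A, 0, 0, 0 },
        };
        struct sock_fprog prog = {
                .len = sizeof(code) / sizeof(code[0]),
                .filter = code,
        };

        return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                          &prog, sizeof(prog));
}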
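And for step 9, roughly what each worker does with its epoll fd. The ioctl
name (EPIOCSPARAMS) and the struct epoll_params layout are taken from the
current revision of my series and could still change before (if) it is
merged; the values below are just examples:

#include <sys/ioctl.h>
#include <linux/eventpoll.h> /* EPIOCSPARAMS, struct epoll_params (series) */

static int enable_epoll_busy_poll(int epfd)
{
        /* Example values: busy poll for up to 200 usecs per epoll_wait(),
         * processing at most 64 packets per poll of a NAPI instance.
         */
        struct epoll_params params = {
                .busy_poll_usecs = 200,
                .busy_poll_budget = 64,
        };

        return ioctl(epfd, EPIOCSPARAMS, &params);
}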
The first time a worker thread receives a connection, it:

  1. Calls getsockopt() with SO_INCOMING_NAPI_ID to get the NAPI ID
     associated with the connection it received.

  2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out
     which NIC the connection came in on.

  3. Looks for an unused CPU, from the list it read in at configuration
     time, that is associated with that NIC, and then pins itself to that
     CPU. That CPU is removed from the list so other threads can't take it.

(There is a rough sketch of this flow at the end of this mail.)

All future incoming connections with the same NAPI ID will be distributed
to app threads which are pinned in the appropriate place and are doing busy
polling.

So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
simple.

I plan to eventually add some information to the kernel networking
documentation to capture more details of the above, which I think might be
helpful for others.

Thanks,
Joe

* Longer explanation about SO_BINDTODEVICE (only relevant if you have
multiple NICs):

It turns out that reuseport groups in the kernel are keyed on a few
attributes, port being one of them but also ifindex. Since multiple NICs
can each have a queue with queue_mapping == 0, reuseport groups need to be
constructed in userland with care if there are multiple NICs.

This care is required because each epoll context can only do epoll busy
poll on a single NAPI ID. So, even if multiple NICs have a queue with
queue_mapping == 0, those queues will have different NAPI IDs, and incoming
connections must be distributed to threads uniquely based on NAPI ID.

I am doing this by creating listen sockets for each NIC, one NIC at a time.
When a listen socket is created, SO_BINDTODEVICE is used on the socket
before calling listen. In the kernel, this results in all listen sockets
with the same port and ifindex forming a reuseport group.

So, if I have 2 NICs and 1 listen port (say port 80), this results in 2
reuseport groups -- one for nic0 port 80 and one for nic1 port 80 --
because of SO_BINDTODEVICE. The reuseport cBPF filter is inserted for each
reuseport group, and then the skb->queue_mapping based listen socket
selection works as expected, distributing NAPI IDs to app threads without
breaking epoll busy poll.

Without the above, you can run into an issue where two connections with the
same queue_mapping (but from different NICs) land in the same epoll
context, which breaks busy poll.

Another potential solution to avoid the above might be to write a more
complicated eBPF program that builds a hash mapping NAPI IDs to thread IDs
and distributes connections that way. This seemed cool, but it involved a
lot more work, so I went with the SO_BINDTODEVICE + SIOCGIFNAME_BY_NAPI_ID
method instead, which was pretty simple (C-code wise) and easy to
implement.
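For completeness, a minimal sketch of the per-NIC listen socket setup
described above (error handling trimmed; IPv4, and the port is just a
placeholder). Note that SO_BINDTODEVICE requires CAP_NET_RAW:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Create one listen socket for a given NIC + port. Called once per
 * (NIC, worker thread) pair so that each NIC's sockets form their own
 * reuseport group, thanks to SO_BINDTODEVICE.
 */
static int make_listen_socket(const char *ifname, uint16_t port)
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(port),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        int one = 1;
        int fd;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        /* Scope this socket (and thus its reuseport group) to one NIC. */
        setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname));

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, SOMAXCONN) < 0) {
                close(fd);
                return -1;
        }

        return fd;
}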
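And here is the rough shape of the worker-thread pinning flow from above.
SIOCGIFNAME_BY_NAPI_ID is, of course, the not-yet-upstream ioctl from my
development tree; I'm assuming an ifreq-style calling convention that
mirrors SIOCGIFNAME (NAPI ID in, ifname out), and pick_unused_cpu_for() is
a placeholder for the lookup into the per-NIC CPU lists from the app
config:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Placeholder: look up (and claim) an unused CPU from the configured
 * NUMA-local CPU list for this NIC. Not shown here.
 */
int pick_unused_cpu_for(const char *ifname);

static int pin_self_for_connection(int conn_fd)
{
        unsigned int napi_id;
        socklen_t len = sizeof(napi_id);
        struct ifreq ifr;
        cpu_set_t set;
        int cpu;

        /* 1. Which NAPI instance did this connection arrive on? */
        if (getsockopt(conn_fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                       &napi_id, &len) < 0)
                return -1;

        /* 2. Which NIC does that NAPI ID belong to? Calling convention
         * assumed to mirror SIOCGIFNAME; the netdev-genl napi-get op can
         * answer the same question today via ifindex.
         */
        memset(&ifr, 0, sizeof(ifr));
        /* Assumption: the NAPI ID is passed via the ifr_ifindex field. */
        ifr.ifr_ifindex = (int)napi_id;
        if (ioctl(conn_fd, SIOCGIFNAME_BY_NAPI_ID, &ifr) < 0)
                return -1;

        /* 3. Claim a NUMA-local CPU for this NIC and pin ourselves there. */
        cpu = pick_unused_cpu_for(ifr.ifr_name);
        if (cpu < 0)
                return -1;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}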