Re: [net-next 0/3] Per epoll context busy poll support

Joe Damato <jdamato@xxxxxxxxxx> · Fri, 2 Feb 2024 12:23:44 -0800

On Fri, Feb 02, 2024 at 11:58:28AM -0800, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 11:33:33 -0800 Joe Damato wrote:
> > On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> > > On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:  
> > > > I think you should be able to get this functionality via the netdev-genl 
> > > > API to get napi parameters. It returns ifindex as one of the parameters 
> > > > and you should able to get the name from ifindex.
> > > > 
> > > > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > > > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}  
> > > 
> > > FWIW we also have a C library to access those. Out of curiosity what's
> > > the programming language you'd use in user space, Joe?  
> > 
> > I am using C from user space. 
> 
> Ah, great! Here comes the advert.. :)
> 
>   make -C tools/net/ynl/
> 
> will generate the C lib for you. tools/net/ynl/generated/netdev-user.h
> will have the full API. There are some samples in
> tools/net/ynl/samples/. And basic info also here:
> https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html#ynl-lib
> 
> You should be able to convert Sridhar's cli.py into an equivalent 
> in C in ~10 LoC.
> 
> > Curious what you think about
> > SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
> > sure if such an extension would be accepted. I can send an RFC, if you'd
> > like to take a look and consider it. I know you are busy and I don't want
> > to add too much noise to the list if I can help it :)
> 
> Nothing wrong with it in particular, but we went with the netlink API
> because all the objects are related. There are interrupts, NAPI
> instances, queues, page pools etc. and we need to show all sort of
> attributes, capabilities, stats as well as the linking. So getsockopts
> may not scale, or we'd need to create a monster mux getsockopt?
> Plus with some luck the netlink API will send you notifications of
> things changing.

Yes this all makes sense. The notification on changes would be excellent,
especially if NAPI IDs get changed for some reason  (e.g. the queue count
is adjusted or the queues are rebuilt by the driver for some reason like a
timeout, etc).

I think the issue I'm solving with SIOCGIFNAME_BY_NAPI_ID is related, but
different.

In my case, SIOCGIFNAME_BY_NAPI_ID identifies which NIC a specific fd from
accept arrived from.

AFAICT, the netlink API wouldn't be able to help me answer that question. I
could use SIOCGIFNAME_BY_NAPI_ID to tell me which NIC the fd is from and
then use netlink to figure out which CPU to bind to (for example), but I
think SIOCGIFNAME_BY_NAPI_ID is still needed.

I'll send an RFC about for that shortly, hope that's OK.

> > Here's a brief description of what I'm doing, which others might find
> > helpful:
> > 
> > 1. Machine has multiple NICs. Each NIC has 1 queue per busy poll app
> > thread, plus a few extra queues for other non busy poll usage.
> > 
> > 2. A custom RSS context is created to distribute flows to the busy poll
> > queues. This context is created for each NIC. The default context directs
> > flows to the non-busy poll queues.
> > 
> > 3. Each NIC has n-tuple filters inserted to direct incoming connections
> > with certain destination ports (e.g. 80, 443) to the custom RSS context.
> > All other incoming connections will land in the default context and go to
> > the other queues.
> > 
> > 4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
> > local to the NIC.
> > 
> > 5. IRQ coalescing values are setup with busy poll in mind, so IRQs are
> > deferred as much as possible with the assumption userland will drive NAPI
> > via epoll_wait. This is done per queue (using ethtool --per-queue and a
> > queue mask). This is where napi_defer_hard_irqs and gro_flush_timeout
> > could help even more. IRQ deferral is only needed for the busy poll queues.
> 
> Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
> gro_flush_timeout IIRC). We added it a while back with Bjorn, it seems
> like a great idea to me at the time but I'm unclear if anyone uses it 
> in production..

I have seen it while reading the code, yes. I think maybe I missed
something about its interaction with gro_flush_timeout. In my use case,
the machine has no traffic until after the app is started.

In this case, I haven't needed to worry about regular NAPI monopolizing the
CPU and preventing busy poll from working.

Maybe I am missing something more nuanced, though? I'll have another look
at the code, just incase.

> 
> > 6. userspace app config has NICs with their NUMA local CPUs listed, for
> > example like this:
> > 
> >    - eth0: 0,1,2,3
> >    - eth1: 4,5,6,7
> > 
> > The app reads that configuration in when it starts. Ideally, these are the
> > same CPUs the IRQs are pinned to in step 4, but hopefully the coalesce
> > settings let IRQs be deferred quite a bit so busy poll can take over.
> 
> FWIW if the driver you're using annotates things right you'll also get
> the NAPI <> IRQ mapping via the netdev netlink. Hopefully that
> simplifies the pinning setup.

I looked last night after I learned about the netlink interface. It does
not look like the driver I am using does, but I can fix that fairly easily,
I think.

I'll try to send a patch for this next week.

> > 7. App threads are created and sockets are opened with REUSEPORT. Notably:
> > when the sockets are created, SO_BINDTODEVICE is used* (see below for
> > longer explanation about this).
> > 
> > 8. cbpf reusport program inserted to distribute incoming connections to
> > threads based on skb->queue_mapping. skb->queue_mapping values are not
> > unique (e.g. each NIC will have queue_mapping==0), this is why BINDTODEVICE
> > is needed. Again, see below.
> > 
> > 9. worker thread epoll contexts are set to busy poll by the ioctl I've
> > submit in my patches.
> > 
> > The first time a worker thread receives a connection, it:
> > 
> > 1. calls SO_INCOMING_NAPI_ID to get the NAPI ID associated with the
> > connection it received.
> > 
> > 2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out which
> > NIC the connection came in on.
> > 
> > 3. Looks for an un-unsed CPU from the list it read in at configuration time
> > that is associated with that NIC and then pins itself to that CPU. That CPU
> > is removed from the list so other threads can't take it.
> > 
> > All future incoming connections with the same NAPI ID will be distributed
> > to app threads which are pinned in the appropriate place and are doing busy
> > polling.
> > 
> > So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
> > simple.
> > 
> > I plan to eventually add some information to the kernel networking
> > documentation to capture some more details of the above, which I think
> > might be helpful for others.
> 
> Sounds very sensible & neat indeed. And makes sense to describe this 
> in the docs, that should hopefully put more people on the right path :)
> 
> > Another potential solution to avoid the above might be use an eBPF program
> > and to build a hash that maps NAPI IDs to thread IDs and write a more
> > complicated eBPF program to distribute connections that way. This seemed
> > cool, but involved a lot more work so I went with the SO_BINDTODEVICE +
> > SIOCGIFNAME_BY_NAPI_ID method instead which was pretty simple (C code wise)
> > and easy to implement.
> 
> Interesting!