Re: [net-next 0/3] Per epoll context busy poll support

On Fri, 2 Feb 2024 11:33:33 -0800 Joe Damato wrote:
> On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> > On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:  
> > > I think you should be able to get this functionality via the netdev-genl 
> > > API to get napi parameters. It returns ifindex as one of the parameters 
> > > and you should be able to get the name from ifindex.
> > > 
> > > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}  
> > 
> > FWIW we also have a C library to access those. Out of curiosity what's
> > the programming language you'd use in user space, Joe?  
> 
> I am using C from user space. 

Ah, great! Here comes the advert.. :)

  make -C tools/net/ynl/

will generate the C lib for you. tools/net/ynl/generated/netdev-user.h
will have the full API. There are some samples in
tools/net/ynl/samples/. And basic info also here:
https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html#ynl-lib

You should be able to convert Sridhar's cli.py into an equivalent 
in C in ~10 LoC.
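
For the last step Sridhar mentions (name from ifindex), a minimal C sketch
using if_indextoname(3) could look like the following. The napi-get request
itself is not shown here; ifname_from_ifindex() is a made-up helper name, and
12 is just the ifindex from the sample reply above.

```c
#include <net/if.h>
#include <stdio.h>

/*
 * Hypothetical helper: resolve an interface index (e.g. the 'ifindex'
 * attribute of a napi-get reply) to an interface name.
 * 'name' must have room for at least IF_NAMESIZE bytes.
 * Returns 0 on success, -1 if no interface has that index.
 */
static int ifname_from_ifindex(unsigned int ifindex, char *name)
{
	return if_indextoname(ifindex, name) ? 0 : -1;
}

/* Example use: print the name for the ifindex seen in the reply above. */
static void print_ifname(unsigned int ifindex)
{
	char name[IF_NAMESIZE];

	if (ifname_from_ifindex(ifindex, name))
		perror("if_indextoname");
	else
		printf("ifindex %u is %s\n", ifindex, name);
}
```

Calling print_ifname(12) prints whichever name the running system assigns to
that index, or an error if there is no such interface.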

> Curious what you think about
> SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
> sure if such an extension would be accepted. I can send an RFC, if you'd
> like to take a look and consider it. I know you are busy and I don't want
> to add too much noise to the list if I can help it :)

Nothing wrong with it in particular, but we went with the netlink API
because all the objects are related. There are interrupts, NAPI
instances, queues, page pools etc. and we need to show all sorts of
attributes, capabilities, stats as well as the linking. So getsockopts
may not scale, or we'd need to create a monster mux getsockopt?
Plus with some luck the netlink API will send you notifications of
things changing.

> Here's a brief description of what I'm doing, which others might find
> helpful:
> 
> 1. Machine has multiple NICs. Each NIC has 1 queue per busy poll app
> thread, plus a few extra queues for other non busy poll usage.
> 
> 2. A custom RSS context is created to distribute flows to the busy poll
> queues. This context is created for each NIC. The default context directs
> flows to the non-busy poll queues.
> 
> 3. Each NIC has n-tuple filters inserted to direct incoming connections
> with certain destination ports (e.g. 80, 443) to the custom RSS context.
> All other incoming connections will land in the default context and go to
> the other queues.
> 
> 4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
> local to the NIC.
> 
> 5. IRQ coalescing values are set up with busy poll in mind, so IRQs are
> deferred as much as possible with the assumption userland will drive NAPI
> via epoll_wait. This is done per queue (using ethtool --per-queue and a
> queue mask). This is where napi_defer_hard_irqs and gro_flush_timeout
> could help even more. IRQ deferral is only needed for the busy poll queues.

Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
gro_flush_timeout IIRC). We added it a while back with Bjorn, it seemed
like a great idea to me at the time but I'm unclear if anyone uses it
in production..
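
A rough sketch of how a socket could opt in from C, assuming a kernel and
headers new enough to know these options. enable_prefer_busy_poll() is a
made-up name, the usecs and budget values are up to the caller, and setting
these options may require CAP_NET_ADMIN. gro_flush_timeout and
napi_defer_hard_irqs are per-device settings under /sys/class/net/<dev>/ and
are not touched here.

```c
#include <sys/socket.h>

/*
 * Fallbacks for older libc headers. These are the asm-generic values;
 * a few architectures (alpha, mips, parisc, sparc) use different numbers.
 */
#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

/*
 * Hypothetical helper: ask the kernel to prefer busy polling over
 * interrupt-driven processing for this socket.
 *   usecs  - busy poll time for blocking calls, in microseconds
 *   budget - max packets processed per busy poll iteration
 * Returns 0 on success, -1 with errno set on failure.
 */
static int enable_prefer_busy_poll(int fd, int usecs, int budget)
{
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &one, sizeof(one)))
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)))
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget,
		       sizeof(budget)))
		return -1;
	return 0;
}
```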

> 6. userspace app config has NICs with their NUMA local CPUs listed, for
> example like this:
> 
>    - eth0: 0,1,2,3
>    - eth1: 4,5,6,7
> 
> The app reads that configuration in when it starts. Ideally, these are the
> same CPUs the IRQs are pinned to in step 4, but hopefully the coalesce
> settings let IRQs be deferred quite a bit so busy poll can take over.

FWIW if the driver you're using annotates things right you'll also get
the NAPI <> IRQ mapping via the netdev netlink. Hopefully that
simplifies the pinning setup.

> 7. App threads are created and sockets are opened with REUSEPORT. Notably:
> when the sockets are created, SO_BINDTODEVICE is used* (see below for
> longer explanation about this).
> 
> 8. cbpf reuseport program inserted to distribute incoming connections to
> threads based on skb->queue_mapping. skb->queue_mapping values are not
> unique (e.g. each NIC will have queue_mapping==0), this is why BINDTODEVICE
> is needed. Again, see below.
> 
> 9. worker thread epoll contexts are set to busy poll by the ioctl I've
> submitted in my patches.
> 
> The first time a worker thread receives a connection, it:
> 
> 1. calls getsockopt() with SO_INCOMING_NAPI_ID to get the NAPI ID associated with the
> connection it received.
> 
> 2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out which
> NIC the connection came in on.
> 
> 3. Looks for an unused CPU from the list it read in at configuration time
> that is associated with that NIC and then pins itself to that CPU. That CPU
> is removed from the list so other threads can't take it.
> 
> All future incoming connections with the same NAPI ID will be distributed
> to app threads which are pinned in the appropriate place and are doing busy
> polling.
> 
> So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
> simple.
> 
> I plan to eventually add some information to the kernel networking
> documentation to capture some more details of the above, which I think
> might be helpful for others.

Sounds very sensible & neat indeed. And makes sense to describe this 
in the docs, that should hopefully put more people on the right path :)
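
For the docs, the "first connection" steps could be illustrated with
something like the sketch below. It covers reading the NAPI ID and claiming
a CPU from the per-NIC list; the NAPI ID to NIC lookup and the actual pinning
are left out. struct nic_cpus, incoming_napi_id() and claim_cpu() are
made-up names, and the fixed-size array is only there to keep the example
short.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* asm-generic value */
#endif

#define MAX_NIC_CPUS 64

/* Hypothetical per-NIC entry built from the app config (step 6). */
struct nic_cpus {
	const char *ifname;		/* e.g. "eth0" */
	int cpus[MAX_NIC_CPUS];		/* NUMA local CPUs for this NIC */
	int used[MAX_NIC_CPUS];		/* nonzero once a thread took it */
	size_t ncpus;
};

/* Read the NAPI ID of the queue the connection arrived on. */
static int incoming_napi_id(int fd, unsigned int *napi_id)
{
	socklen_t len = sizeof(*napi_id);

	return getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, napi_id, &len);
}

/*
 * Take the first unused CPU for this NIC and mark it as used.
 * Returns the CPU number, or -1 if none is left.
 * The caller is assumed to hold whatever lock protects the list.
 */
static int claim_cpu(struct nic_cpus *nic)
{
	size_t i;

	for (i = 0; i < nic->ncpus; i++) {
		if (!nic->used[i]) {
			nic->used[i] = 1;
			return nic->cpus[i];
		}
	}
	return -1;
}
```

The CPU returned by claim_cpu() would then be handed to sched_setaffinity(2)
or pthread_setaffinity_np(3) by the worker thread.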

> Another potential solution to avoid the above might be to use an eBPF program
> and to build a hash that maps NAPI IDs to thread IDs and write a more
> complicated eBPF program to distribute connections that way. This seemed
> cool, but involved a lot more work so I went with the SO_BINDTODEVICE +
> SIOCGIFNAME_BY_NAPI_ID method instead which was pretty simple (C code wise)
> and easy to implement.

Interesting!
