On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:
> > > I know I am replying to a stale thread on the patches I've submit
> > > (there is a v5 now [1]), but I just looked at your message - sorry I
> > > didn't reply sooner.
> > >
> > > The per-queue and per-napi netlink APIs look extremely useful, thanks
> > > for pointing this out.
> > >
> > > In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
> > > similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name.
> > > This is useful on machines with multiple NICs where each NIC could be
> > > located in one of many different NUMA zones.
> > >
> > > The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the
> > > NAPI ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID
> > > to compute which NIC the connection came in on. The app would then
> > > (via configuration) know where to pin that worker thread; ideally
> > > somewhere NUMA local to the NIC.
> > >
> > > I had assumed that such a change would be rejected, but I figured I'd
> > > send an RFC for it after the per epoll context stuff was done and see
> > > if anyone thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as
> > > well.
> >
> > I think you should be able to get this functionality via the netdev-genl
> > API to get napi parameters. It returns ifindex as one of the parameters
> > and you should able to get the name from ifindex.
> >
> > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}
>
> FWIW we also have a C library to access those. Out of curiosity what's
> the programming language you'd use in user space, Joe?

I am using C from user space.

I'm curious what you think about SIOCGIFNAME_BY_NAPI_ID, Jakub. I think it
would be very useful, but I'm not sure such an extension would be accepted.
I can send an RFC if you'd like to take a look and consider it; I know you
are busy and I don't want to add too much noise to the list if I can help
it :)

Here's a brief description of what I'm doing, which others might find
helpful:

1. The machine has multiple NICs. Each NIC has one queue per busy poll app
   thread, plus a few extra queues for other, non-busy-poll usage.

2. A custom RSS context is created on each NIC to distribute flows to the
   busy poll queues. The default context directs flows to the non-busy-poll
   queues.

3. Each NIC has n-tuple filters inserted to direct incoming connections
   with certain destination ports (e.g. 80, 443) to the custom RSS context.
   All other incoming connections land in the default context and go to the
   other queues.

4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
   local to the NIC.

5. IRQ coalescing values are set up with busy poll in mind, so IRQs are
   deferred as much as possible, on the assumption that userland will drive
   NAPI via epoll_wait. This is done per queue (using ethtool --per-queue
   and a queue mask). This is where napi_defer_hard_irqs and
   gro_flush_timeout could help even more. IRQ deferral is only needed for
   the busy poll queues.

6. The userspace app config lists each NIC with its NUMA-local CPUs, for
   example:

     - eth0: 0,1,2,3
     - eth1: 4,5,6,7

   The app reads that configuration in when it starts. Ideally, these are
   the same CPUs the IRQs are pinned to in step 4, but hopefully the
   coalesce settings let IRQs be deferred quite a bit so busy poll can take
   over.

7. App threads are created and sockets are opened with SO_REUSEPORT.
   Notably: when the sockets are created, SO_BINDTODEVICE is used* (see
   below for a longer explanation of this).

8. A cBPF reuseport program is inserted to distribute incoming connections
   to threads based on skb->queue_mapping. skb->queue_mapping values are
   not unique across NICs (e.g. each NIC will have a queue with
   queue_mapping == 0), which is why SO_BINDTODEVICE is needed. Again, see
   below. (A sketch of the cBPF program follows this list.)

9. Each worker thread's epoll context is set to busy poll via the ioctl
   I've submitted in my patches (see the second sketch after this list).
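For step 8, here is a minimal sketch of the cBPF reuseport program. It is
attached to one listening socket in each per-NIC reuseport group; the
program returns skb->queue_mapping, which SO_ATTACH_REUSEPORT_CBPF uses as
an index into the group's socket array, so this assumes the listen sockets
were added to each group in queue order (socket 0 handles queue 0, and so
on):

#include <sys/socket.h>
#include <linux/filter.h>

/* Select the listening socket whose index in the reuseport group equals
 * skb->queue_mapping. Assumes sockets were added to the group in queue
 * order.
 */
static int attach_queue_mapping_cbpf(int listen_fd)
{
        struct sock_filter code[] = {
                /* A = skb->queue_mapping */
                { BPF_LD | BPF_W | BPF_ABS, 0, 0,
                  SKF_AD_OFF + SKF_AD_QUEUE },
                /* return A; used as the index of the socket to select */
                { BPF_RET | BPF_A, 0, 0, 0 },
        };
        struct sock_fprog prog = {
                .len = sizeof(code) / sizeof(code[0]),
                .filter = code,
        };

        return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                          &prog, sizeof(prog));
}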
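And for step 9, roughly what each worker does with its epoll fd. The ioctl
name (EPIOCSPARAMS) and the struct epoll_params layout are taken from the
current revision of my series and could still change before (if) it is
merged; the values below are just examples:

#include <sys/ioctl.h>
#include <linux/eventpoll.h> /* EPIOCSPARAMS, struct epoll_params (series) */

static int enable_epoll_busy_poll(int epfd)
{
        /* Example values: busy poll for up to 200 usecs per epoll_wait(),
         * processing at most 64 packets per poll of a NAPI instance.
         */
        struct epoll_params params = {
                .busy_poll_usecs = 200,
                .busy_poll_budget = 64,
        };

        return ioctl(epfd, EPIOCSPARAMS, &params);
}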
The first time a worker thread receives a connection, it:

  1. Calls getsockopt() with SO_INCOMING_NAPI_ID to get the NAPI ID
     associated with the connection it received.

  2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out
     which NIC the connection came in on.

  3. Looks for an unused CPU, from the list it read in at configuration
     time, that is associated with that NIC, and then pins itself to that
     CPU. That CPU is removed from the list so other threads can't take it.

(There is a rough sketch of this flow at the end of this mail.)

All future incoming connections with the same NAPI ID will be distributed
to app threads which are pinned in the appropriate place and are doing busy
polling.

So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
simple.

I plan to eventually add some information to the kernel networking
documentation to capture more details of the above, which I think might be
helpful for others.

Thanks,
Joe

* Longer explanation about SO_BINDTODEVICE (only relevant if you have
multiple NICs):

It turns out that reuseport groups in the kernel are keyed on a few
attributes, port being one of them but also ifindex. Since multiple NICs
can each have a queue with queue_mapping == 0, reuseport groups need to be
constructed in userland with care if there are multiple NICs.

This care is required because each epoll context can only do epoll busy
poll on a single NAPI ID. So, even if multiple NICs have a queue with
queue_mapping == 0, those queues will have different NAPI IDs, and incoming
connections must be distributed to threads uniquely based on NAPI ID.

I am doing this by creating listen sockets for each NIC, one NIC at a time.
When a listen socket is created, SO_BINDTODEVICE is used on the socket
before calling listen. In the kernel, this results in all listen sockets
with the same port and ifindex forming a reuseport group.

So, if I have 2 NICs and 1 listen port (say port 80), this results in 2
reuseport groups -- one for nic0 port 80 and one for nic1 port 80 --
because of SO_BINDTODEVICE. The reuseport cBPF filter is inserted for each
reuseport group, and then the skb->queue_mapping based listen socket
selection works as expected, distributing NAPI IDs to app threads without
breaking epoll busy poll.

Without the above, you can run into an issue where two connections with the
same queue_mapping (but from different NICs) land in the same epoll
context, which breaks busy poll.

Another potential solution to avoid the above might be to write a more
complicated eBPF program that builds a hash mapping NAPI IDs to thread IDs
and distributes connections that way. This seemed cool, but it involved a
lot more work, so I went with the SO_BINDTODEVICE + SIOCGIFNAME_BY_NAPI_ID
method instead, which was pretty simple (C-code wise) and easy to
implement.
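For completeness, a minimal sketch of the per-NIC listen socket setup
described above (error handling trimmed; IPv4, and the port is just a
placeholder). Note that SO_BINDTODEVICE requires CAP_NET_RAW:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Create one listen socket for a given NIC + port. Called once per
 * (NIC, worker thread) pair so that each NIC's sockets form their own
 * reuseport group, thanks to SO_BINDTODEVICE.
 */
static int make_listen_socket(const char *ifname, uint16_t port)
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(port),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        int one = 1;
        int fd;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        /* Scope this socket (and thus its reuseport group) to one NIC. */
        setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname));

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, SOMAXCONN) < 0) {
                close(fd);
                return -1;
        }

        return fd;
}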
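And here is the rough shape of the worker-thread pinning flow from above.
SIOCGIFNAME_BY_NAPI_ID is, of course, the not-yet-upstream ioctl from my
development tree; I'm assuming an ifreq-style calling convention that
mirrors SIOCGIFNAME (NAPI ID in, ifname out), and pick_unused_cpu_for() is
a placeholder for the lookup into the per-NIC CPU lists from the app
config:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Placeholder: look up (and claim) an unused CPU from the configured
 * NUMA-local CPU list for this NIC. Not shown here.
 */
int pick_unused_cpu_for(const char *ifname);

static int pin_self_for_connection(int conn_fd)
{
        unsigned int napi_id;
        socklen_t len = sizeof(napi_id);
        struct ifreq ifr;
        cpu_set_t set;
        int cpu;

        /* 1. Which NAPI instance did this connection arrive on? */
        if (getsockopt(conn_fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                       &napi_id, &len) < 0)
                return -1;

        /* 2. Which NIC does that NAPI ID belong to? Calling convention
         * assumed to mirror SIOCGIFNAME; the netdev-genl napi-get op can
         * answer the same question today via ifindex.
         */
        memset(&ifr, 0, sizeof(ifr));
        /* Assumption: the NAPI ID is passed via the ifr_ifindex field. */
        ifr.ifr_ifindex = (int)napi_id;
        if (ioctl(conn_fd, SIOCGIFNAME_BY_NAPI_ID, &ifr) < 0)
                return -1;

        /* 3. Claim a NUMA-local CPU for this NIC and pin ourselves there. */
        cpu = pick_unused_cpu_for(ifr.ifr_name);
        if (cpu < 0)
                return -1;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}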