Re: [net-next 0/3] Per epoll context busy poll support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2/1/2024 9:28 PM, Joe Damato wrote:
On Tue, Jan 30, 2024 at 12:54:50PM -0600, Samudrala, Sridhar wrote:


On 1/24/2024 2:20 AM, Eric Dumazet wrote:
On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@xxxxxxxxxx> wrote:

Greetings:

TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
epoll with socket fds.") by allowing user applications to enable
epoll-based busy polling and set a busy poll packet budget on a per epoll
context basis.

To allow for this, two ioctls have been added for epoll contexts for
getting and setting a new struct, struct epoll_params.

This makes epoll-based busy polling much more usable for user
applications than the current system-wide sysctl and hardcoded budget.

Agree. looking forward to see this patch series accepted soon.


Longer explanation:

Presently epoll has support for a very useful form of busy poll based on
the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).

This form of busy poll allows epoll_wait to drive NAPI packet processing
which allows for a few interesting user application designs which can
reduce latency and also potentially improve L2/L3 cache hit rates by
deferring NAPI until userland has finished its work.

The documentation available on this is, IMHO, a bit confusing so please
allow me to explain how one might use this:

1. Ensure each application thread has its own epoll instance mapping
1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
direct connections with specific dest ports to these queues.

2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
polling will occur. This can help avoid the userland app from being
pre-empted by a hard IRQ while userland is running. Note this means that
userland must take care to call epoll_wait and not take too long in
userland since it now drives NAPI via epoll_wait.

3. Ensure that all incoming connections added to an epoll instance
have the same NAPI ID. This can be done with a BPF filter when
SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
accept thread is used which dispatches incoming connections to threads.

4. Lastly, busy poll must be enabled via a sysctl
(/proc/sys/net/core/busy_poll).

The unfortunate part about step 4 above is that this enables busy poll
system-wide which affects all user applications on the system,
including epoll-based network applications which were not intended to
be used this way or applications where increased CPU usage for lower
latency network processing is unnecessary or not desirable.

If the user wants to run one low latency epoll-based server application
with epoll-based busy poll, but would like to run the rest of the
applications on the system (which may also use epoll) without busy poll,
this system-wide sysctl presents a significant problem.

This change preserves the system-wide sysctl, but adds a mechanism (via
ioctl) to enable or disable busy poll for epoll contexts as needed by
individual applications, making epoll-based busy poll more usable.


I think this description missed the napi_defer_hard_irqs and
gro_flush_timeout settings ?

I would think that if an application really wants to make sure its
thread is the only one
eventually calling napi->poll(), we must make sure NIC interrupts stay masked.

Current implementations of busy poll always release NAPI_STATE_SCHED bit when
returning to user space.

It seems you want to make sure the application and only the
application calls the napi->poll()
at chosen times.

Some kind of contract is needed, and the presence of the hrtimer
(currently only driven from dev->@gro_flush_timeout)
would allow to do that correctly.

Whenever we 'trust' user space to perform the napi->poll shortly, we
also want to arm the hrtimer to eventually detect
the application took too long, to restart the other mechanisms (NIC irq based)

Note that we added the kthread based napi polling, and we are working
to add a busy polling feature to these kthreads.
allowing to completely mask NIC interrupts and further reduce latencies.


Good to know that you are looking into enabling busy polling for napi
kthreads.
We have something similar in our ice OOT driver that is implemented and we
call it 'independent pollers' as in this mode, busy polling will not be app
dependent or triggered by an application.
Here is a link to the slides we presented at netdev 0x16 driver workshop.
https://netdevconf.info/0x16/slides/48/netdev0x16_driver_workshop_ADQ.pdf

We haven't yet submitted the patches upstream as there is no kernel
interface to configure napi specific timeouts.
With the recent per-queue and per-napi netlink APIs that are accepted
upstream
https://lore.kernel.org/netdev/170147307026.5260.9300080745237900261.stgit@xxxxxxxxxxxxxxxxxxxxxxxxx/

we are thinking of making timeout as a per-napi parameter and can be used as
an interface to enable napi kthread based busy polling.

I know I am replying to a stale thread on the patches I've submit (there is
a v5 now [1]), but I just looked at your message - sorry I didn't reply
sooner.

The per-queue and per-napi netlink APIs look extremely useful, thanks for
pointing this out.

In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name. This is
useful on machines with multiple NICs where each NIC could be located in
one of many different NUMA zones.

The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the NAPI
ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID to
compute which NIC the connection came in on. The app would then (via
configuration) know where to pin that worker thread; ideally somewhere NUMA
local to the NIC.

I had assumed that such a change would be rejected, but I figured I'd send
an RFC for it after the per epoll context stuff was done and see if anyone
thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as well.

I think you should be able to get this functionality via the netdev-genl API to get napi parameters. It returns ifindex as one of the parameters and you should able to get the name from ifindex.

$ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
{'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}


I think even the per-device napi_defer_hard_irqs and gro_flush_timeout
should become per-napi parameters.

I agree.

I had been contemplating implementing this until I tried a different method
similar to an academic paper I was reading [2][3]. I think per-device
defer_hard_irqs and gro_flush_timeout would be extremely useful and a
better approach than the one I'm currently using.

Is this something you are currently working? I may try implementing this,
but didn't want to duplicate effort if you are already working on this.

Not yet. Please go ahead and work on it it if you have time.
napi-get and napi-set can be extended to show and set these parameters.


Thanks,
Joe

[1]: https://lore.kernel.org/all/20240131180811.23566-1-jdamato@xxxxxxxxxx/
[2]: https://dl.acm.org/doi/pdf/10.1145/3626780
[3]: https://gitlab.uwaterloo.ca/p5cai/netstack-exp/-/blob/master/kernel-polling-5.15.79-base.patch?ref_type=heads




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux