On Wed, Jan 24, 2024 at 3:20 PM Joe Damato <jdamato@xxxxxxxxxx> wrote:
>
> On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote:
> > On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@xxxxxxxxxx> wrote:
> > >
> > > Greetings:
> > >
> > > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> > > epoll with socket fds.") by allowing user applications to enable
> > > epoll-based busy polling and set a busy poll packet budget on a per epoll
> > > context basis.
> > >
> > > To allow for this, two ioctls have been added for epoll contexts for
> > > getting and setting a new struct, struct epoll_params.
> > >
> > > This makes epoll-based busy polling much more usable for user
> > > applications than the current system-wide sysctl and hardcoded budget.
> > >
> > > Longer explanation:
> > >
> > > Presently epoll has support for a very useful form of busy poll based on
> > > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> > >
> > > This form of busy poll allows epoll_wait to drive NAPI packet processing
> > > which allows for a few interesting user application designs which can
> > > reduce latency and also potentially improve L2/L3 cache hit rates by
> > > deferring NAPI until userland has finished its work.
> > >
> > > The documentation available on this is, IMHO, a bit confusing so please
> > > allow me to explain how one might use this:
> > >
> > > 1. Ensure each application thread has its own epoll instance mapping
> > > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> > > direct connections with specific dest ports to these queues.
> > >
> > > 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
> > > polling will occur. This can help avoid the userland app from being
> > > pre-empted by a hard IRQ while userland is running. Note this means that
> > > userland must take care to call epoll_wait and not take too long in
> > > userland since it now drives NAPI via epoll_wait.
> > >
> > > 3. Ensure that all incoming connections added to an epoll instance
> > > have the same NAPI ID. This can be done with a BPF filter when
> > > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> > > accept thread is used which dispatches incoming connections to threads.
> > >
> > > 4. Lastly, busy poll must be enabled via a sysctl
> > > (/proc/sys/net/core/busy_poll).
> > >
> > > The unfortunate part about step 4 above is that this enables busy poll
> > > system-wide which affects all user applications on the system,
> > > including epoll-based network applications which were not intended to
> > > be used this way or applications where increased CPU usage for lower
> > > latency network processing is unnecessary or not desirable.
> > >
> > > If the user wants to run one low latency epoll-based server application
> > > with epoll-based busy poll, but would like to run the rest of the
> > > applications on the system (which may also use epoll) without busy poll,
> > > this system-wide sysctl presents a significant problem.
> > >
> > > This change preserves the system-wide sysctl, but adds a mechanism (via
> > > ioctl) to enable or disable busy poll for epoll contexts as needed by
> > > individual applications, making epoll-based busy poll more usable.
> > >
> >
> > I think this description missed the napi_defer_hard_irqs and
> > gro_flush_timeout settings ?
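
(For readers following the thread: step 3 of the quoted cover letter relies on
getsockopt() with SO_INCOMING_NAPI_ID, which reports the NAPI ID of the RX queue
that delivered the most recent packet for a socket. A minimal sketch of the
accept-thread dispatch it describes might look like the following;
dispatch_to_worker() is a hypothetical application helper, not an existing API.)

/* Sketch only: accept connections on one thread and hand each off to the
 * worker that owns the matching NAPI ID. dispatch_to_worker() is a
 * hypothetical application helper, not a kernel or libc API.
 */
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56  /* value from asm-generic/socket.h */
#endif

static void dispatch_to_worker(int conn_fd, unsigned int napi_id)
{
        /* Hand conn_fd to the application thread whose epoll instance
         * handles this NAPI ID (application-specific; left as a stub).
         */
}

int accept_and_dispatch(int listen_fd)
{
        int conn_fd = accept(listen_fd, NULL, NULL);
        unsigned int napi_id = 0;
        socklen_t len = sizeof(napi_id);

        if (conn_fd < 0)
                return -1;

        /* SO_INCOMING_NAPI_ID reports the NAPI ID of the RX queue that
         * delivered the most recent packet for this socket (0 if unknown).
         */
        if (getsockopt(conn_fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                       &napi_id, &len) != 0)
                napi_id = 0;

        dispatch_to_worker(conn_fd, napi_id);
        return 0;
}
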
>
> I'm not sure if those settings are strictly related to the change I am
> proposing, which makes epoll-based busy poll something that can be
> enabled/disabled on a per-epoll context basis and allows the budget to be
> set as well, but maybe I am missing something? Sorry for my
> misunderstanding if so.
>
> IMHO: a single system-wide busy poll setting is difficult to use
> properly and it is unfortunate that the packet budget is hardcoded. It would
> be extremely useful to be able to set both of these on a per-epoll basis
> and I think my suggested change helps to solve this.
>
> Please let me know.
>
> Re the two settings you noted:
>
> I didn't mention those in the interest of brevity, but yes they can be used
> instead of or in addition to what I've described above.
>
> While those settings are very useful, IMHO, they have their own issues
> because they are system-wide as well. If they were settable per-NAPI, that
> would make it much easier to use them because they could be enabled for the
> NAPIs which are being busy-polled by applications that support busy poll.
>
> Imagine you have 3 types of apps running side-by-side:
> - A low latency epoll-based busy poll app,
> - An app where latency doesn't matter as much, and
> - A latency sensitive legacy app which does not yet support epoll-based
>   busy poll.
>
> In the first two cases, the settings you mention would be helpful or not
> make any difference, but in the third case the system-wide impact might be
> undesirable because having IRQs fire might be important to keep latency
> down.
>
> If your comment was more that my cover letter should have mentioned these,
> I can include that in a future cover letter or suggest some kernel
> documentation which will discuss all of these features and how they relate
> to each other.
> >
> > I would think that if an application really wants to make sure its
> > thread is the only one eventually calling napi->poll(), we must make
> > sure NIC interrupts stay masked.
> >
> > Current implementations of busy poll always release the NAPI_STATE_SCHED bit
> > when returning to user space.
> >
> > It seems you want to make sure the application and only the
> > application calls napi->poll() at chosen times.
> >
> > Some kind of contract is needed, and the presence of the hrtimer
> > (currently only driven from dev->gro_flush_timeout)
> > would allow us to do that correctly.
> >
> > Whenever we 'trust' user space to perform the napi->poll shortly, we
> > also want to arm the hrtimer to eventually detect that the application
> > took too long, to restart the other mechanisms (NIC IRQ based).
>
> There is another change [1] I've been looking at from a research paper [2]
> which does something similar to what you've described above -- it keeps
> IRQs suppressed during busy polling. The paper suggests a performance
> improvement is measured when using a mechanism like this to keep IRQs off.
> Please see the paper for more details.
>
> I haven't had a chance to reach out to the authors or to tweak this patch
> to attempt an RFC / submission for it, but it seems fairly promising in my
> initial synthetic tests.
>
> When I tested their patch, as you might expect, no IRQs were generated at
> all for the NAPIs that were being busy polled, but the rest of the
> NAPIs and queues were generating IRQs as expected.
>
> Regardless of the above patch: I think my proposed change is helpful and
> the IRQ suppression bit can be handled in a separate change in the future.
> What do you think?
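
(As a concrete illustration of the per-epoll-context knobs under discussion:
with the proposed ioctls, enabling busy poll and capping the budget for a
single epoll fd might look roughly like the sketch below. The ioctl name,
struct layout, and field names are assumptions modeled on this patch series
and could change between revisions; they are not in released kernel headers
as of this thread.)

/* Sketch only: EPIOCSPARAMS and struct epoll_params below are assumed
 * from the series under review, not an established UAPI.
 */
#include <stdint.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
struct epoll_params {
        uint32_t busy_poll_usecs;       /* 0 disables busy poll for this context */
        uint16_t busy_poll_budget;      /* packets per busy-poll pass */
        uint8_t  __pad[2];              /* assumed padding */
};

#define EPOLL_IOC_TYPE 0x8A             /* assumed ioctl type */
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

/* Enable busy polling for one epoll context without touching the
 * system-wide sysctls.
 */
int enable_epoll_busy_poll(int epfd)
{
        struct epoll_params params = {
                .busy_poll_usecs  = 200,        /* example poll duration */
                .busy_poll_budget = 64,         /* stay within the usual NAPI weight */
        };

        return ioctl(epfd, EPIOCSPARAMS, &params);
}
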
>
> > Note that we added the kthread based napi polling, and we are working
> > to add a busy polling feature to these kthreads,
> > allowing us to completely mask NIC interrupts and further reduce latencies.
>
> I am aware of kthread based NAPI polling, yes, but I was not aware that
> busy polling was being considered as a feature for them, thanks for the
> heads-up.
>
> > Thank you
>
> Thanks for your comments - I appreciate your time and attention.
>
> Could you let me know if your comments are meant as a nack or similar?

Patch #2 needs the 'why' part, and why would we allow user space to ask to
poll up to 65535 packets... There is a reason we have a warning in place
when a driver attempts to set a budget bigger than 64.

You cited recent papers; I wrote this one specifically about Linux busy polling:
https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf

Busy polling had been in the pipeline when Wei sent her patches and follow-ups:

cb038357937ee4f589aab2469ec3896dce90f317 net: fix race between napi kthread mode and busy poll
5fdd2f0e5c64846bf3066689b73fc3b8dddd1c74 net: add sysfs attribute to control napi threaded mode
29863d41bb6e1d969c62fdb15b0961806942960e net: implement threaded-able napi poll loop support

I am saying that I am currently working on the kthread busy polling
implementation, after fixing two bugs in SCTP and UDP (which makes me wonder
whether busy polling is really used these days).

I am also considering unifying napi_threaded_poll() and napi_busy_loop(),
and seeing your patches come in makes this work more challenging.
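
(For reference on the threaded NAPI mode mentioned in the commits above:
since the sysfs attribute added in 5fdd2f0e5c64, threaded polling is toggled
per device. A minimal sketch follows; the device name passed in is a
placeholder chosen by the application.)

/* Sketch only: toggle threaded NAPI for one device by writing to its
 * sysfs "threaded" attribute.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write "1" to /sys/class/net/<dev>/threaded so the device's NAPI
 * instances are serviced by kernel threads instead of softirq context.
 */
int enable_threaded_napi(const char *dev)
{
        char path[256];
        int fd;
        ssize_t n;

        snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", dev);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;

        n = write(fd, "1", 1);
        close(fd);
        return n == 1 ? 0 : -1;
}
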