On Wed, Jan 24, 2024 at 03:38:19PM +0100, Eric Dumazet wrote: > On Wed, Jan 24, 2024 at 3:20 PM Joe Damato <jdamato@xxxxxxxxxx> wrote: > > > > On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote: > > > On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@xxxxxxxxxx> wrote: > > > > > > > > Greetings: > > > > > > > > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to > > > > epoll with socket fds.") by allowing user applications to enable > > > > epoll-based busy polling and set a busy poll packet budget on a per epoll > > > > context basis. > > > > > > > > To allow for this, two ioctls have been added for epoll contexts for > > > > getting and setting a new struct, struct epoll_params. > > > > > > > > This makes epoll-based busy polling much more usable for user > > > > applications than the current system-wide sysctl and hardcoded budget. > > > > > > > > Longer explanation: > > > > > > > > Presently epoll has support for a very useful form of busy poll based on > > > > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]). > > > > > > > > This form of busy poll allows epoll_wait to drive NAPI packet processing > > > > which allows for a few interesting user application designs which can > > > > reduce latency and also potentially improve L2/L3 cache hit rates by > > > > deferring NAPI until userland has finished its work. > > > > > > > > The documentation available on this is, IMHO, a bit confusing so please > > > > allow me to explain how one might use this: > > > > > > > > 1. Ensure each application thread has its own epoll instance mapping > > > > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to > > > > direct connections with specific dest ports to these queues. > > > > > > > > 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy > > > > polling will occur. This can help avoid the userland app from being > > > > pre-empted by a hard IRQ while userland is running. Note this means that > > > > userland must take care to call epoll_wait and not take too long in > > > > userland since it now drives NAPI via epoll_wait. > > > > > > > > 3. Ensure that all incoming connections added to an epoll instance > > > > have the same NAPI ID. This can be done with a BPF filter when > > > > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single > > > > accept thread is used which dispatches incoming connections to threads. > > > > > > > > 4. Lastly, busy poll must be enabled via a sysctl > > > > (/proc/sys/net/core/busy_poll). > > > > > > > > The unfortunate part about step 4 above is that this enables busy poll > > > > system-wide which affects all user applications on the system, > > > > including epoll-based network applications which were not intended to > > > > be used this way or applications where increased CPU usage for lower > > > > latency network processing is unnecessary or not desirable. > > > > > > > > If the user wants to run one low latency epoll-based server application > > > > with epoll-based busy poll, but would like to run the rest of the > > > > applications on the system (which may also use epoll) without busy poll, > > > > this system-wide sysctl presents a significant problem. > > > > > > > > This change preserves the system-wide sysctl, but adds a mechanism (via > > > > ioctl) to enable or disable busy poll for epoll contexts as needed by > > > > individual applications, making epoll-based busy poll more usable. > > > > > > > > > > I think this description missed the napi_defer_hard_irqs and > > > gro_flush_timeout settings ? > > > > I'm not sure if those settings are strictly related to the change I am > > proposing which makes epoll-based busy poll something that can be > > enabled/disabled on a per-epoll context basis and allows the budget to be > > set as well, but maybe I am missing something? Sorry for my > > misunderstanding if so. > > > > IMHO: a single system-wide busy poll setting is difficult to use > > properly and it is unforunate that the packet budget is hardcoded. It would > > be extremely useful to be able to set both of these on a per-epoll basis > > and I think my suggested change helps to solve this. > > > > Please let me know. > > > > Re the two settings you noted: > > > > I didn't mention those in the interest of brevity, but yes they can be used > > instead of or in addition to what I've described above. > > > > While those settings are very useful, IMHO, they have their own issues > > because they are system-wide as well. If they were settable per-NAPI, that > > would make it much easier to use them because they could be enabled for the > > NAPIs which are being busy-polled by applications that support busy-poll. > > > > Imagine you have 3 types of apps running side-by-side: > > - A low latency epoll-based busy poll app, > > - An app where latency doesn't matter as much, and > > - A latency sensitive legacy app which does not yet support epoll-based > > busy poll. > > > > In the first two cases, the settings you mention would be helpful or not > > make any difference, but in the third case the system-wide impact might be > > undesirable because having IRQs fire might be important to keep latency > > down. > > > > If your comment was more that my cover letter should have mentioned these, > > I can include that in a future cover letter or suggest some kernel > > documentation which will discuss all of these features and how they relate > > to each other. > > > > > > > > I would think that if an application really wants to make sure its > > > thread is the only one > > > eventually calling napi->poll(), we must make sure NIC interrupts stay masked. > > > > > > Current implementations of busy poll always release NAPI_STATE_SCHED bit when > > > returning to user space. > > > > > > It seems you want to make sure the application and only the > > > application calls the napi->poll() > > > at chosen times. > > > > > > Some kind of contract is needed, and the presence of the hrtimer > > > (currently only driven from dev->@gro_flush_timeout) > > > would allow to do that correctly. > > > > > > Whenever we 'trust' user space to perform the napi->poll shortly, we > > > also want to arm the hrtimer to eventually detect > > > the application took too long, to restart the other mechanisms (NIC irq based) > > > > There is another change [1] I've been looking at from a research paper [2] > > which does something similar to what you've described above -- it keeps > > IRQs suppressed during busy polling. The paper suggests a performance > > improvement is measured when using a mechanism like this to keep IRQs off. > > Please see the paper for more details. > > > > I haven't had a chance to reach out to the authors or to tweak this patch > > to attempt an RFC / submission for it, but it seems fairly promising in my > > initial synthetic tests. > > > > When I tested their patch, as you might expect, no IRQs were generated at > > all for the NAPIs that were being busy polled, but the rest of the > > NAPIs and queues were generating IRQs as expected. > > > > Regardless of the above patch: I think my proposed change is helpful and > > the IRQ suppression bit can be handled in a separate change in the future. > > What do you think? > > > > > Note that we added the kthread based napi polling, and we are working > > > to add a busy polling feature to these kthreads. > > > allowing to completely mask NIC interrupts and further reduce latencies. > > > > I am aware of kthread based NAPI polling, yes, but I was not aware that > > busy polling was being considered as a feature for them, thanks for the > > head's up. > > > > > Thank you > > > > Thanks for your comments - I appreciate your time and attention. > > > > Could you let me know if your comments are meant as a n-ack or similar? > > Patch #2 needs the 'why' part, and why would we allow user space to > ask to poll up to 65535 packets... > There is a reason we have a warning in place when a driver attempts to > set a budget bigger than 64. Sure, thanks for pointing this out. I am happy to cap the budget to 64 if a user app requests a larger amount and I can add a netdev_warn when this happens, if you'd like. The 'why' has two reasons: - Increasing the budget for fast NICs can help improve throughput under load (i.e. the hardcoded amount might be too low for some users) - other poll methods have user-configurable budget amounts (SO_BUSY_POLL_BUDGET), so epoll stands out as an edge case where the budget is hardcoded. I hope that reasoning is sufficient and I can include that more explicitly in the commit message. FWIW: My reading of SO_BUSY_POLL_BUDGET suggests that any budget amount up to U16_MAX will be accepted. I probably missed it somewhere, but I didn't see a warning in this case. I think in a v2 SO_BUSY_POLL_BUDGET and the epoll ioctl budget should be capped at the same amount for consistency and I am happy to agree to 64 or 128 or similar as a cap. Let me know what you think and thanks again for your thoughts and detailed response. > You cited recent papers, I wrote this one specific to linux busy > polling ( https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf ) Thanks for sending this link, I'll need to take a closer look, but a quick read surfaced two things: Their use is very limited, since they enforce busy polling for all sockets, which is not desirable We agree on the limited usefulness of system-wide settings ;) Another big problem is that Busy Polling was not really deployed in production, because it works well when having no more than one thread per NIC RX queue. I've been testing epoll-based busy poll in production with a few different NICs and application setups and it has been pretty helpful, but I agree that this is application architecture specific as you allude to in your next paragraph about the scheduler. Thanks for linking to the paper. It would be great if all of this context and information could be put in one place in the kernel docs. If I have time in the future, I'll propose a doc change to try to outline all of this. > Busy polling had been in the pipe, when Wei sent her patches and follow ups. > > cb038357937ee4f589aab2469ec3896dce90f317 net: fix race between napi > kthread mode and busy poll > 5fdd2f0e5c64846bf3066689b73fc3b8dddd1c74 net: add sysfs attribute to > control napi threaded mode > 29863d41bb6e1d969c62fdb15b0961806942960e net: implement threaded-able > napi poll loop support Thanks for letting me know. I think I'd seen these in passing, but hadn't remembered until you mentioned it now. > I am saying that I am currently working to implement the kthread busy > polling implementation, > after fixing two bugs in SCTP and UDP (making me wondering if busy > polling is really used these days) Ah, I see. FWIW, I have so far only been trying to use it for TCP and so far I haven't hit any bugs. I was planning to use it with UDP in the future, though, once the TCP epoll-based busy polling stuff I am working on is done... so thanks in advance for the bug fixes in UDP. > I am also considering unifying napi_threaded_poll() and the > napi_busy_loop(), and seeing your patches > coming make this work more challenging. Sorry about that. I am happy to make modifications to my patches if there's anything I could do which would make your work easier in the future. Thanks, Joe