On 2019-06-25 02:24, Eric Wong wrote:
Roman Penyaev <rpenyaev@xxxxxxx> wrote:
Hi all,
+cc Jason Baron
** Limitations
<snip>
4. No support for EPOLLEXCLUSIVE
If a device does not pass pollflags to wake_up(), there is no way to
call poll() from a context that holds a spinlock, so special work is
scheduled to offload the polling. In this specific case we can't
support exclusive wakeups, because we do not know the actual result
of the scheduled work and have to wake up every waiter.
Lacking EPOLLEXCLUSIVE support is probably a showstopper for
common applications using per-task epoll combined with
non-blocking accept4() (e.g. nginx).
For the 'accept' case it seems SO_REUSEPORT can be used:
https://lwn.net/Articles/542629/
Although I've never tried it in an O_NONBLOCK + epoll scenario.
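
A minimal sketch of what that SO_REUSEPORT + O_NONBLOCK + epoll setup
could look like, with each worker owning its own listen socket and its
own epoll instance. Python is used only for brevity; the helper names
(make_listener, make_worker, accept_ready) are hypothetical, and the
backlog of 128 is an arbitrary choice. It assumes Linux 3.9 or later.

```python
import select
import socket


def make_listener(host, port):
    """Create a non-blocking listen socket with SO_REUSEPORT set."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT must be set before bind() on every socket sharing the port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(128)  # arbitrary backlog, chosen for illustration
    s.setblocking(False)
    return s


def make_worker(host, port):
    """One worker = one private listen socket + one private epoll instance."""
    lsock = make_listener(host, port)
    ep = select.epoll()
    ep.register(lsock.fileno(), select.EPOLLIN)
    return lsock, ep


def accept_ready(lsock, ep, timeout):
    """Wait up to `timeout` seconds, then drain this worker's accept queue."""
    accepted = []
    for fd, events in ep.poll(timeout):
        if fd == lsock.fileno() and events & select.EPOLLIN:
            while True:
                try:
                    conn, _addr = lsock.accept()
                except BlockingIOError:
                    break  # queue drained
                accepted.append(conn)
    return accepted
```

Because the kernel picks one of the listen sockets per incoming
connection, no two workers ever wait on the same socket, so exclusive
wakeups are not needed in this arrangement.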
But I've just dived into the add-wait-exclusive logic again, and it
seems possible to support EPOLLEXCLUSIVE by iterating over all "epis"
for the particular fd that has been woken up.
For now I want to leave it as is, so as not to overcomplicate the code.
Fwiw, I'm still a weirdo who prefers a dedicated thread doing
blocking accept4 for distribution between tasks (so epoll never
sees a listen socket). But, depending on what runtime/language
I'm using, I can't always dedicate a blocking thread, so I
recently started using EPOLLEXCLUSIVE from Perl5 where I
couldn't rely on threads being available.
If I could dedicate time to improving epoll, I'd probably
add writev() support for batching epoll_ctl modifications
to reduce syscall traffic, or pick up the kevent()-like interface
started long ago:
https://lore.kernel.org/lkml/1393206162-18151-1-git-send-email-n1ght.4nd.d4y@xxxxxxxxx/
(but I'm not sure I want to increase the size of the syscall table).
There is also the fresh fs/io_uring.c thingy, which supports polling and
batching (among other IO things). But polling there is single-shot
only, so it might make sense to add event subscription there
instead of resurrecting kevent and co.
--
Roman