On 5/31/19 10:02 AM, Roman Penyaev wrote:
> On 2019-05-31 16:48, Jens Axboe wrote:
>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>> Hi all,
>>>
>>> This is v3 which introduces pollable epoll from userspace.
>>>
>>> v3:
>>>  - Measurements made, represented below.
>>>
>>>  - Fix alignment for epoll_uitem structure on all 64-bit archs except
>>>    x86-64. epoll_uitem should be always 16 bit, proper BUILD_BUG_ON
>>>    is added. (Linus)
>>>
>>>  - Check pollflags explicitly on 0 inside work callback, and do
>>>    nothing if 0.
>>>
>>> v2:
>>>  - No reallocations, the max number of items (thus size of the user
>>>    ring) is specified by the caller.
>>>
>>>  - Interface is simplified: -ENOSPC is returned on attempt to add a
>>>    new epoll item if number is reached the max, nothing more.
>>>
>>>  - Alloced pages are accounted using user->locked_vm and limited to
>>>    RLIMIT_MEMLOCK value.
>>>
>>>  - EPOLLONESHOT is handled.
>>>
>>> This series introduces pollable epoll from userspace, i.e. user
>>> creates epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor,
>>> gets header and ring pointers and then consumes ready events from a
>>> ring, avoiding epoll_wait() call. When ring is empty, user has to
>>> call epoll_wait() in order to wait for new events. epoll_wait()
>>> returns -ESTALE if user ring has events in the ring (kind of
>>> indication, that user has to consume events from the user ring
>>> first, I could not invent anything better than returning -ESTALE).
>>>
>>> For user header and user ring allocation I used vmalloc_user(). I
>>> found that it is much easy to reuse remap_vmalloc_range_partial()
>>> instead of dealing with page cache (like aio.c does). What is also
>>> nice is that virtual address is properly aligned on SHMLBA, thus
>>> there should not be any d-cache aliasing problems on archs with
>>> vivt or vipt caches.
>>
>> Why aren't we just adding support to io_uring for this instead? Then
>> we don't need yet another entirely new ring, that's is just a little
>> different from what we have.
>>
>> I haven't looked into the details of your implementation, just curious
>> if there's anything that makes using io_uring a non-starter for this
>> purpose?
>
> Afaict the main difference is that you do not need to recharge an fd
> (submit new poll request in terms of io_uring): once fd has been added
> to epoll with epoll_ctl() - we get events. When you have thousands of
> fds - that should matter.
>
> Also interesting question is how difficult to modify existing event
> loops in event libraries in order to support recharging (EPOLLONESHOT
> in terms of epoll).
>
> Maybe Azat who maintains libevent can shed light on this (currently I
> see that libevent does not support "EPOLLONESHOT" logic).

In terms of existing io_uring poll support, which is what I'm guessing
you're referring to, it is indeed just one-shot. But there's no reason
why we can't have it persist until explicitly canceled with POLL_REMOVE.

-- 
Jens Axboe
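
For reference, the "recharge" cost discussed above can be seen with the
existing epoll API alone. The sketch below is only an illustration: it
uses neither io_uring nor the proposed EPOLL_USERPOLL ring, and the
helper name count_ready_rounds() is hypothetical. It keeps one unread
byte in a pipe and counts how many zero-timeout epoll_wait() calls
report the read end as ready, with or without EPOLLONESHOT and with or
without an EPOLL_CTL_MOD re-arm after each round.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: poll the read end of a pipe
 * that always holds one unread byte, `rounds` times with a zero timeout,
 * and return how many rounds reported it ready (or -1 on failure).
 */
int count_ready_rounds(unsigned int extra_flags, int rearm, int rounds)
{
	struct epoll_event ev = { .events = EPOLLIN | extra_flags };
	struct epoll_event got;
	int pfd[2], epfd, hits = 0;

	assert(rounds >= 0);

	if (pipe(pfd) < 0)
		return -1;
	epfd = epoll_create1(0);
	if (epfd < 0) {
		hits = -1;
		goto close_pipe;
	}
	ev.data.fd = pfd[0];
	/* Register the read end, then make it readable and never drain it. */
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0 ||
	    write(pfd[1], "x", 1) != 1) {
		hits = -1;
		goto close_all;
	}

	for (int i = 0; i < rounds; i++) {
		if (epoll_wait(epfd, &got, 1, 0) == 1)
			hits++;
		/*
		 * EPOLLONESHOT disables the item after one delivery; the
		 * "recharge" is one extra EPOLL_CTL_MOD call per event.
		 */
		if (rearm && epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) < 0) {
			hits = -1;
			break;
		}
	}

close_all:
	close(epfd);
close_pipe:
	close(pfd[0]);
	close(pfd[1]);
	return hits;
}
```

With three rounds, a plain level-triggered registration (extra_flags 0)
is reported every time, EPOLLONESHOT without re-arming is reported only
once, and EPOLLONESHOT with re-arming is reported every time again, at
the price of one additional syscall per event.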