Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

Jens Axboe <axboe@xxxxxxxxx> · Fri, 31 May 2019 10:54:27 -0600

On 5/31/19 10:02 AM, Roman Penyaev wrote:
> On 2019-05-31 16:48, Jens Axboe wrote:
>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>> Hi all,
>>>
>>> This is v3 which introduces pollable epoll from userspace.
>>>
>>> v3:
>>>    - Measurements made, represented below.
>>>
>>>    - Fix alignment for epoll_uitem structure on all 64-bit archs except
>>>      x86-64. epoll_uitem should always be 16 bytes, proper BUILD_BUG_ON
>>>      is added. (Linus)
>>>
>>>    - Check pollflags explicitly on 0 inside work callback, and do nothing
>>>      if 0.
>>>
>>> v2:
>>>    - No reallocations, the max number of items (thus size of the user
>>>      ring) is specified by the caller.
>>>
>>>    - Interface is simplified: -ENOSPC is returned on attempt to add a new
>>>      epoll item if the number has reached the max, nothing more.
>>>
>>>    - Alloced pages are accounted using user->locked_vm and limited to
>>>      RLIMIT_MEMLOCK value.
>>>
>>>    - EPOLLONESHOT is handled.
>>>
>>> This series introduces pollable epoll from userspace, i.e. user creates
>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets header
>>> and ring pointers and then consumes ready events from a ring, avoiding
>>> epoll_wait() call.  When ring is empty, user has to call epoll_wait()
>>> in order to wait for new events.  epoll_wait() returns -ESTALE if user
>>> ring has events in the ring (kind of indication, that user has to consume
>>> events from the user ring first, I could not invent anything better than
>>> returning -ESTALE).
>>>
>>> For user header and user ring allocation I used vmalloc_user().  I found
>>> that it is much easier to reuse remap_vmalloc_range_partial() instead of
>>> dealing with page cache (like aio.c does).  What is also nice is that
>>> virtual address is properly aligned on SHMLBA, thus there should not be
>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>
>> Why aren't we just adding support to io_uring for this instead? Then we
>> don't need yet another entirely new ring, that's just a little
>> different from what we have.
>>
>> I haven't looked into the details of your implementation, just curious
>> if there's anything that makes using io_uring a non-starter for this
>> purpose?
> 
> Afaict the main difference is that you do not need to recharge an fd
> (submit new poll request in terms of io_uring): once fd has been added to
> epoll with epoll_ctl() - we get events.  When you have thousands of fds -
> that should matter.
> 
> Also interesting question is how difficult it is to modify existing event
> loops in event libraries in order to support recharging (EPOLLONESHOT in
> terms of epoll).
> 
> Maybe Azat who maintains libevent can shed light on this (currently I see
> that libevent does not support "EPOLLONESHOT" logic).

In terms of existing io_uring poll support, which is what I'm guessing
you're referring to, it is indeed just one-shot. But there's no reason
why we can't have it persist until explicitly canceled with POLL_REMOVE.

-- 
Jens Axboe