[PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

Fam Zheng <famz@xxxxxxxxxx> · Fri, 13 Feb 2015 17:03:56 +0800

Hi all,

This is the updated series for the new epoll system calls, with the cover
letter rewritten which includes some more explanation. Comments are very
welcome!

Original Motivation
===================

QEMU, and probably many more select/poll based applications, will consider
epoll as an alternative, when its event loop needs to handle a big number of
fds. However, there are currently two concerns with epoll which prevents the
switching from ppoll to epoll. 

The major one is the timeout precision.

For example in QEMU, the main loop takes care of calling callbacks at a
specific timeout - the QEMU timer API. The timeout value in ppoll depends on
the next firing timer. epoll_pwait's millisecond timeout is so coarse that
rounding up the timeout will hurt performance badly.

The minor one is the number of system call to update fd set.

While epoll can handle a large number of fds quickly, it still requires one
epoll_ctl per fd update, compared to the one-shot call to select/poll with an
fd array. This may as well make epoll inferior to ppoll in the cases where a
small, but frequently changing set of fds are polled by the event loop.

This series introduces two new epoll APIs to address them respectively. The
idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
suggested clockid as a parameter in epoll_pwait1.

Discussion
==========

[Note: This is the question part regarding the interface contract of
epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch yet,
please skip this part and probably start with the man page style documentation.
You can resume to this section later.]

[Thanks to Omar Sandoval <osandov@xxxxxxxxxxx>, who pointed out this in
reviewing v1]

We try to report status for each command in epoll_ctl_batch, by writting to
user provided command array (pointed to cmds). The tricky thing in the
implementation is that, copying the results back to userspace comes last, after
the commands are executed. At this point, if the copy_to_user fails, the
effects are done and no return - or if we add some mechanism to revert it, the
code will be too complicated and slow.

In above corner case, the return value of epoll_ctl_batch is smaller than
ncmds, which assures our caller that last N commands failed, where N = ncmds -
ret.  But they'll also find that cmd.result is not changed to error code.

I suppose this is not a big deal, because 1) it should happen very rarely. 2)
user does know the actual number of commands that succeed.

So, do we leave it as is? Or is there any way to improve?

One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
before we execute the commands. If it succeeds, it's even less likely the last
copy_to_user could fail, so that we can even probably assert it won't. The
testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
performance impact, especially when @ncmds is big.

Links
=====

[1]: http://lists.openwall.net/linux-kernel/2015/01/08/542

Changelog
=========

Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
----------------------------------------------------

  - Rename epoll_ctl_cmd.error_hint to "result". [Michael]

  - Add background introduction in cover letter. [Michael]

  - Expand the last struct of epoll_pwait1, add clockid and timespec.

  - Update man page in cover letter accordingly:

    * "error_hint" -> "result".
    * The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.

  Please review!

Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
-----------------------------------------------------

  - As discussed in previous thread [1], split the call to epoll_ctl_batch and
    epoll_pwait. [Michael]

  - Fix memory leaks. [Omar]

  - Add a short comment about the ignored copy_to_user failure. [Omar]

  - Cover letter rewritten.

Documentation of the new system calls
=====================================

[I tried to write in the familiar man page style, but I am not proficient on
this. Thanks for Michael's help and suggestions in helping me improve it
through previous discussions.]

1) epoll_ctl_batch
------------------

NAME
       epoll_ctl_batch - modify an epoll descriptor in batch

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_ctl_batch(int epfd, int flags,
                           int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

       The system call performs a batch of epoll_ctl operations. It allows
       efficient update of events on this epoll descriptor.

       Flags is reserved and must be 0.

       Each operation is specified as an element in the cmds array, defined as:

           struct epoll_ctl_cmd {

                  /* Reserved flags for future extension, must be 0. */
                  int flags;

                  /* The same as epoll_ctl() op parameter. */
                  int op;

                  /* The same as epoll_ctl() fd parameter. */
                  int fd;

                  /* The same as the "events" field in struct epoll_event. */
                  uint32_t events;

                  /* The same as the "data" field in struct epoll_event. */
                  uint64_t data;

                  /* Output field, will be set to the return code after this
                   * command is executed by kernel */
                  int result;
           };

       Commands are executed in their order in cmds, and if one of them failed,
       the rest after it are not tried.

       Not that this call isn't atomic in terms of updating the epoll
       descriptor, which means a second epoll_ctl or epoll_ctl_batch call
       during the first epoll_ctl_batch may make the operation sequence
       interleaved. However, each single epoll_ctl_cmd operation has the same
       semantics as a epoll_ctl call.

RETURN VALUE

       If one or more of the parameters are incorrect, -1 is returned and errno
       is set appropriately. Otherwise, the number of succeeded commands is
       returned.

       Each 'result' field may be set to the error code or 0, depending on the
       result of the command. If the kernel fails to set the field after the
       commands are completed or failed, it may also be unchanged, even though
       the effects of successful commands are done. In this case, it's still
       ensured that 1) the i-th (i = ret) command is the failed command; 2) all
       the preceeding commands are successfully executed; 3) all the subsequent
       ones are not executed.

ERRORS

       Errors for the overall system call (in errno) are:

       EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
              cmds is NULL.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       EFAULT The memory area pointed to by cmds is not accessible with write
              permissions.

       Errors for each command (in result field) are:

       EBADF  epfd or fd is not a valid file descriptor.

       EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
              already registered with this epoll instance.

       EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
              or the requested operation op is not supported by this interface.

       ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
              with this epoll instance.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
              encountered while trying to register (EPOLL_CTL_ADD) a new file
              descriptor on an epoll instance.  See epoll(7) for further
              details.

       EPERM  The target file fd does not support epoll.

CONFORMING TO

       epoll_ctl_batch() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

2) epoll_pwait1
---------------

NAME
       epoll_pwait1 - wait for an I/O event on an epoll file descriptor

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_pwait1(int epfd, int flags,
                        struct epoll_event *events,
                        int maxevents,
                        struct epoll_wait_params *params);

DESCRIPTION

       The epoll_pwait1 system call differs from epoll_pwait only in
       parameter list. The epfd, events and maxevents parameters are the same
       as in epoll_wait and epoll_pwait. The flags and params are new.

       The flags is reserved and must be zero.

       The params is a pointer to a structure parameters for the epoll_pwait1,
       defined as:

           struct epoll_wait_params {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
               size_t sigsetsize;
           };

       The field clockid must be either CLOCK_REALTIME or CLOCK_MONOTONIC, to
       choose the clock type to use for timeout. Note that CLOCK_MONOTONIC is
       the implicit and only possible clock type for epoll_pwait.

       The timeout field specifies the minimum time that epoll_wait() will
       block.  (The effective length will be rounded up to the clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.) Specifying a nagative time
       lenght (for example, timeout.tv_sec = 0 and timeout.tv_nsec = -1, or the
       other way round) causes epoll_pwait1() to block indefinitely, while
       specifying a timeout equal to zero (both fields in timeout are zero)
       causes epoll_wait() to return immediately, even if no events are
       available.

       The sigmask and sigsetsize has the same semantics as epoll_pwait. The
       sigmask field may be specified as NULL, in which case epoll_pwait1()
       will behave like epoll_wait.

   User visibility of sigsetsize

       In epoll_pwait and other syscalls, sigsetsize is not visible to
       application developer as glibc has a wrapper around epoll_pwait system
       call. Now that we pack several parameters in epoll_wait_params, so in
       order to hide sigsetsize from application code, we can still wrap it,
       either by expanding parameters and build the structure in the wrapper
       function, or only ask application to provide the first half:

           struct epoll_wait_params_user {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
           };

      Then in the wrapper function copy to a full structure and fill in
      sigsetsize.

RETURN VALUE

       The same as said in epoll_pwait(2).

ERRORS

       The same as said in man epoll_pwait(2), plus:

       EINVAL flags is not zero, or clockid is neither CLOCK_REALTIME nor
              CLOCK_MONOTONIC.

       EFAULT The memory area pointed to by params is not accessible.

CONFORMING TO

       epoll_pwait1() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Fam Zheng (7):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Specify clockid explicitly
  epoll: Extract ep_ctl_do
  epoll: Add implementation for epoll_ctl_batch
  x86: Hook up epoll_ctl_batch syscall
  epoll: Add implementation for epoll_pwait1
  x86: Hook up epoll_pwait1 syscall

 arch/x86/syscalls/syscall_32.tbl |   2 +
 arch/x86/syscalls/syscall_64.tbl |   2 +
 fs/eventpoll.c                   | 258 +++++++++++++++++++++++++--------------
 include/linux/syscalls.h         |   9 ++
 include/uapi/linux/eventpoll.h   |  18 +++
 5 files changed, 199 insertions(+), 90 deletions(-)

-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html