Hello Fam Zheng,

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This syscall is a sequence of
>
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
>
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications.

Which applications? Could we have some specific examples? This is a
complex API, and it needs good justification.

> For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
>
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
         ^^^^^^^^^^^^^^
         should be epoll_mod_wait

I think you mean to say:

    The ability to batch multiple "epoll_ctl" operations into a single
    call means that even when no wait events are requested (i.e.,
    spec == NULL), epoll_mod_wait() provides a performance optimization
    over using multiple epoll_ctl() calls.

Right? If yes, please amend the commit message, and this text should
also make its way into the revised man page under a heading "NOTES".

> The only complexity is returning the result of each operation. For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
>
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait.
> Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.

The fact that the 'cmds' are not executed in a specified order, plus the
need to initialize the 'error' fields to a positive value, feels a bit
ugly. And indeed the whole "command list was only partially run" case is
not pretty.

Am I correct to understand that if an error is found during execution of
one of the "epoll_ctl" commands in 'cmds' then the system call will
return -1 with errno set, indicating an error, even though the epoll
interest list may have changed because some of the earlier 'cmds'
executed successfully? This all seems a bit of a headache for user space.

I have a couple of questions:

Q1. I can see that batching "epoll_ctl" commands might be useful, since
it results in fewer system calls. But, does it really need to be bound
together with the "epoll_pwait" functionality? (Perhaps this point was
covered in previous discussions, but neither the message accompanying
this patch nor the 0/6 man page provide a compelling rationale for the
need to bind these two operations together.)

Yes, I realize you might save a system call, but it makes for a
cumbersome API that has the above headache, and also forces the need for
double pointer indirection in the 'spec' argument (i.e., spec is a
pointer to an array of structures where each element in turn includes an
'events' pointer that points to another array).

Why not a simpler API with two syscalls such as:

    epoll_ctl_batch(int epfd, int flags, int ncmds,
                    struct epoll_mod_cmd *cmds);

    epoll_pwait1(int epfd, struct epoll_event *events, int maxevents,
                 struct timespec *timeout, int clock_id,
                 const sigset_t *sigmask, size_t sigsetsize);

This gives us much of the benefit of reducing system calls, but with
greater simplicity.
And epoll_ctl_batch() could simply return the number of 'cmds' that
were successfully executed.

Q2. In the man page in 0/6 you said that the 'cmds' were not guaranteed
to be executed in order. Why not? If you did provide such a guarantee,
then, when using your current epoll_mod_wait(), user space could do the
following:

1. Initialize the cmd.error fields to zero.
2. Call epoll_mod_wait().
3. Iterate through the cmd.error fields looking for the first nonzero
   field.

> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision.

Yes, that change seemed inevitable. It slightly puzzled me at the time
when Davide Libenzi added epoll_wait() that the timeout was in
milliseconds, even though pselect() had already demonstrated the need
for higher precision. I should have called it out way back then :-{.

> The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
>
> Signed-off-by: Fam Zheng <famz@xxxxxxxxxx>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  				sigmask ? &ksigmask : NULL);
>  }
>
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;

To make the 'ret' change a little more obvious, maybe it's better to
write

    ret = kcmds[i].error = -EINVAL;

> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

Likewise:

    ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))

Cosmetic point: s/if(/if (/

> +			return -EFAULT;
> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);

If I understand correctly, the implementation means that the
'size_t sigsetsize' field will probably need to be exposed to
applications. In the existing epoll_pwait() call (as in ppoll() and
pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
However, unless we expect glibc to do some structure copying to/from a
structure that hides this field, then we're going to end up exposing
'size_t sigsetsize' to applications. (This could be avoided if we split
the API as I suggest above. glibc would do the same thing in
epoll_pwait1() that it currently does in epoll_pwait().)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
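
For reference, the sentinel-based error handling described in the commit
message could look roughly like the user-space sketch below. It is only
an illustration under stated assumptions: struct mod_cmd, CMD_NOT_RUN
and the three helper functions are hypothetical names, the structure is
a simplified stand-in rather than the layout from the patch, and no
system call is actually made.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one batched command.
 * This is NOT the struct epoll_mod_cmd layout from the patch. */
struct mod_cmd {
    int op;     /* e.g. EPOLL_CTL_ADD */
    int fd;     /* target file descriptor */
    int error;  /* output: 0 on success, -errno on failure */
};

/* Positive sentinel: per the commit message, the kernel only ever
 * stores 0 or a negative errno, so this value means "not executed". */
#define CMD_NOT_RUN 1

/* Mark every command as "not run" before issuing the batch. */
static void cmds_prepare(struct mod_cmd *cmds, size_t n)
{
    assert(cmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        cmds[i].error = CMD_NOT_RUN;
}

/* Count the commands the kernel actually executed, that is, those
 * whose error field no longer holds the sentinel. */
static size_t cmds_executed(const struct mod_cmd *cmds, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (cmds[i].error != CMD_NOT_RUN)
            count++;
    return count;
}

/* Return the index of the first command that failed (error < 0),
 * or n if no command reported a failure. */
static size_t cmds_first_failure(const struct mod_cmd *cmds, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cmds[i].error < 0)
            return i;
    return n;
}
```

An application would call cmds_prepare() before the batch call and the
other two helpers after it. If execution order were guaranteed, as asked
about in Q2, everything before the index returned by
cmds_first_failure() would be known to have succeeded.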