On 02/13/2015 10:53 AM, Omar Sandoval wrote:
> On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
>> Hi all,
>>
>> This is the updated series for the new epoll system calls, with the
>> cover letter rewritten to include some more explanation. Comments are
>> very welcome!
>>
>> Original Motivation
>> ===================
>>
>> QEMU, and probably many more select/poll-based applications, will
>> consider epoll as an alternative when their event loops need to handle
>> a large number of fds. However, there are currently two concerns with
>> epoll which prevent switching from ppoll to epoll.
>>
>> The major one is the timeout precision.
>>
>> For example, in QEMU the main loop takes care of calling callbacks at
>> a specific timeout - the QEMU timer API. The timeout value passed to
>> ppoll depends on the next firing timer. epoll_pwait's millisecond
>> timeout is so coarse that rounding up the timeout would hurt
>> performance badly.
>>
>> The minor one is the number of system calls needed to update the fd
>> set.
>>
>> While epoll can handle a large number of fds quickly, it still
>> requires one epoll_ctl call per fd update, compared to the one-shot
>> call to select/poll with an fd array. This may well make epoll
>> inferior to ppoll in cases where a small but frequently changing set
>> of fds is polled by the event loop.
>>
>> This series introduces two new epoll APIs to address these two
>> concerns respectively. The idea of epoll_ctl_batch was suggested by
>> Andy Lutomirski in [1], who also suggested clockid as a parameter of
>> epoll_pwait1.
>>
>> Discussion
>> ==========
>>
>> [Note: This is the question part regarding the interface contract of
>> epoll_ctl_batch. If you don't yet have the context of what
>> epoll_ctl_batch is, please skip this part and start with the man page
>> style documentation instead. You can come back to this section later.]
>>
>> [Thanks to Omar Sandoval <osandov@xxxxxxxxxxx>, who pointed this out
>> when reviewing v1]
>>
>> We try to report status for each command in epoll_ctl_batch by writing
>> to the user-provided command array (pointed to by cmds). The tricky
>> thing in the implementation is that copying the results back to
>> userspace comes last, after the commands have been executed. At this
>> point, if the copy_to_user fails, the effects are already applied and
>> there is no going back - and if we added some mechanism to revert
>> them, the code would become too complicated and slow.
>>
>> In the above corner case, the return value of epoll_ctl_batch is
>> smaller than ncmds, which assures the caller that the last N commands
>> failed, where N = ncmds - ret. But they will also find that cmd.result
>> has not been changed to an error code.
>>
>> I suppose this is not a big deal, because 1) it should happen very
>> rarely, and 2) the user does know the actual number of commands that
>> succeeded.
>>
>> So, do we leave it as is? Or is there any way to improve it?
>>
>> One tiny improvement (not a complete fix) that comes to mind is a
>> trial copy_to_user before we execute the commands. If that succeeds,
>> it is much less likely that the final copy_to_user will fail, so we
>> could probably even assert that it won't. The trial
>> 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if it is
>> done right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure
>> about the performance impact, especially when @ncmds is big.
>>
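For readers who don't have the patches in front of them, below is a minimal
sketch of the ordering Fam describes: commands are executed first, and the
per-command results are copied back only at the end. The struct layout, the
helper ep_ctl_do_cmd() and the function signature are assumptions made
purely for illustration here - the real proposed interface is in the patch
series itself, which is not quoted in this mail.

    /* Sketch only; field names, helper and signature are assumed. */
    struct epoll_ctl_cmd {
            int op;                         /* EPOLL_CTL_ADD/MOD/DEL */
            int fd;
            struct epoll_event event;
            int result;                     /* per-command status, copied back */
    };

    static long do_epoll_ctl_batch(int epfd, struct epoll_ctl_cmd __user *cmds,
                                   int ncmds)
    {
            struct epoll_ctl_cmd *kcmds;
            int i, done = 0;

            if (ncmds <= 0)
                    return -EINVAL;
            kcmds = kcalloc(ncmds, sizeof(*kcmds), GFP_KERNEL);
            if (!kcmds)
                    return -ENOMEM;
            if (copy_from_user(kcmds, cmds, ncmds * sizeof(*kcmds))) {
                    kfree(kcmds);
                    return -EFAULT;
            }

            /* Execute first: the epoll instance is modified here. */
            for (i = 0; i < ncmds; i++) {
                    kcmds[i].result = ep_ctl_do_cmd(epfd, &kcmds[i]);
                    if (kcmds[i].result)
                            break;
                    done++;
            }

            /*
             * Report last: if this copy fails, the commands above have
             * already taken effect but the caller never sees the
             * per-command results.
             */
            if (copy_to_user(cmds, kcmds, ncmds * sizeof(*kcmds))) {
                    /* v2 behaviour: ignore the failure and return 'done'. */
            }

            kfree(kcmds);
            return done;
    }

The question in this thread is what epoll_ctl_batch should return when that
final copy_to_user fails.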
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding the buffer between the
> "check" copy_to_user and the actual one.
>
> The two alternatives that I see are:
>
> 1. If copy_to_user fails, ignore it and return the number of commands
>    that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
>    commands succeeded.
>
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because it never
> gets the error. (Then, when there is a real error in one of the
> epoll_ctl_cmds but .result is 0, someone will be very confused.)
>
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> ever be part of normal operation anyway? You'll find the memory bug,
> fix it, and rest easy knowing that nothing is silently failing behind
> your back.

What Omar says makes sense to me too. Best to have the user get a clear
error indication for this case.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
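To make the contract Omar argues for (and Michael agrees with) concrete,
here is a hypothetical caller-side view under alternative 2. epoll_ctl_batch
is the proposed, unmerged system call; the wrapper declaration, struct
layout and argument order below are assumptions carried over from the
sketch earlier in this mail, and the wrapper is presumed to follow the
usual libc convention of returning -1 with errno set on error.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    /* Assumed layout and signature, matching the earlier sketch. */
    struct epoll_ctl_cmd {
            int op;
            int fd;
            struct epoll_event event;
            int result;     /* per-command status written back by the kernel */
    };
    int epoll_ctl_batch(int epfd, struct epoll_ctl_cmd *cmds, int ncmds);

    static void update_fds(int epfd, struct epoll_ctl_cmd *cmds, int ncmds)
    {
            int done = epoll_ctl_batch(epfd, cmds, ncmds);

            if (done < 0) {
                    /*
                     * Under alternative 2, EFAULT can only mean the cmds
                     * buffer itself is bad (unmapped, read-only, ...): a bug
                     * in this program, never normal operation, so fail
                     * loudly rather than guess how many commands applied.
                     */
                    perror("epoll_ctl_batch");
                    abort();
            }

            /* Commands [0, done) succeeded and have result == 0. */
            if (done < ncmds)
                    fprintf(stderr, "command %d failed: result=%d\n",
                            done, cmds[done].result);
    }

The point of alternative 2 is exactly that the error branch is unambiguous:
when the buffer itself cannot be written, nothing about the per-command
results has to be guessed at.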