Andy Lutomirski <luto@xxxxxxxxxx> wrote:

> > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
> > that the pipe being created is going to be used for notifications.  This
> > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
> > pipe as calling iov_iter_revert() on a pipe when a kernel notification
> > message has been inserted into the middle of a multi-buffer splice will be
> > messy.
>
> How messy?

Well, iov_iter_revert() on a pipe iterator simply walks backwards along the
ring discarding the last N contiguous slots (where N is normally the number of
slots that were filled by whatever operation is being reverted).

However, unless the code that transfers stuff into the pipe takes the spinlock
and disables softirqs for the duration of its ring filling, what were N
contiguous slots may now have kernel notifications interspersed - even if it
has been holding the pipe mutex.

So, now what do you do?  You have to free up just the buffers relevant to the
iterator and then you can either compact down the ring to free up the space or
you can leave null slots and let the read side clean them up, thereby reducing
the capacity of the pipe temporarily.

Either way, iov_iter_revert() gets more complex and has to hold the spinlock.

And if you don't take the spinlock whilst you're reverting, more notifications
can come in to make your life more interesting.

There's also a problem with splicing out from a notification pipe in that the
messages are scribed onto preallocated buffers, but now the buffers need
refcounts and, in any case, are of limited quantity.

> And is there some way to make it impossible for this to happen?

Yes.  That's what I'm doing by declaring the pipe to be unspliceable up front.

> Adding a new flag to pipe2() to avoid messy kernel code seems
> like a poor tradeoff.

By far the easiest place to check whether a pipe can be spliced to is in
get_pipe_info().  That's checking the file anyway.
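
To make the two points above a bit more concrete, here is a toy userspace
model, not the kernel code.  Every name in it (toy_pipe, toy_push(),
toy_naive_revert(), toy_get_pipe_for_splice(), the "notification" field) is
made up for illustration, and the ring has no wraparound for brevity.  It
shows a naive "discard the last N slots" revert throwing away an interleaved
notification, and an upfront check that simply refuses such a pipe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
#define RING_SLOTS 8

enum slot_kind { SLOT_EMPTY, SLOT_SPLICE, SLOT_NOTIFY };

struct toy_pipe {
	enum slot_kind ring[RING_SLOTS];
	size_t head;		/* next slot to fill (no wraparound here) */
	bool notification;	/* hypothetical "notification pipe" marker */
};

/* Append one slot; returns false if the ring is full. */
static bool toy_push(struct toy_pipe *p, enum slot_kind kind)
{
	assert(p != NULL);
	if (p->head == RING_SLOTS)
		return false;
	p->ring[p->head++] = kind;
	return true;
}

/*
 * Naive revert: walk backwards and discard the last n slots, assuming they
 * are contiguous and all belong to the operation being reverted.
 * Returns the number of notification slots that were wrongly discarded.
 */
static size_t toy_naive_revert(struct toy_pipe *p, size_t n)
{
	size_t lost = 0;

	assert(p != NULL);
	while (n-- && p->head > 0) {
		p->head--;
		if (p->ring[p->head] == SLOT_NOTIFY)
			lost++;
		p->ring[p->head] = SLOT_EMPTY;
	}
	return lost;
}

/* Upfront check: refuse to hand a notification pipe to a splice-like caller. */
static struct toy_pipe *toy_get_pipe_for_splice(struct toy_pipe *p)
{
	assert(p != NULL);
	return p->notification ? NULL : p;
}
```

If a splice-like writer fills two slots and a notification lands between
them, toy_naive_revert(p, 2) discards the notification and leaves one of the
writer's own slots behind.  Rejecting the pipe in toy_get_pipe_for_splice()
means that situation never has to be handled.
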
After that, you can't make the check until the pipe is locked.  Furthermore,
if it's not done upfront, the change to the pipe might happen during a
splicing operation that's residing in pipe_wait()... which drops the pipe
mutex.

David