On Fri, Jun 17, 2022 at 11:48:01PM +0100, Al Viro wrote: > On Fri, Jun 17, 2022 at 04:30:49PM -0600, Jens Axboe wrote: > > > Al, looks good to me from inspection, and I ported stuffed this on top > > of -git and my 5.20 branch, and did my send/recv/recvmsg io_uring change > > on top and see a noticeable reduction there too for some benchmarking. > > Feel free to add: > > > > Reviewed-by: Jens Axboe <axboe@xxxxxxxxx> > > > > to the series. > > > > Side note - of my initial series I played with, I still have this one > > leftover that I do utilize for io_uring: > > > > https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.20/io_uring-iter&id=a59f5c21a6eeb9506163c20aff4846dbec159f47 > > > > Doesn't make sense standalone, but I have it as a prep patch. > > > > Can I consider your work.iov_iter stable at this point, or are you still > > planning rebasing? > > Umm... Rebasing this part - probably no; there's a fun followup to it, though, > I'm finishing the carve up & reorder at the moment. Will post for review > tonight... This stuff sits on top of #work.iov_iter (as posted a week ago) + #fixes (one commit, handling of failures halfway through copy_mc_to_iter() into ITER_PIPE, posted several days ago, backportable minimal fix) + #work.9p (handling of RERROR on zerocopy 9P read/readdir, posted about a week ago). The branch is #work.iov_iter_get_pages; individual patches in followups. NOTE: the older branches are unchanged, but this series on top of them had been repeatedly carved up, reordered, etc. - there had been a lot of recent massage, so at this point it should be treated as absolutely untested. It can shit over memory and/or chew your filesystems; DON'T TRY IT OUTSIDE OF A SCRATCH KVM IMAGE. Said that, review and (cautious) testing would be very welcome. Part 1: ITER_PIPE cleanups ITER_PIPE handling had never been pretty, but by now it has become really obfuscated and hard to read. Untangle it a bit. 1) splice: stop abusing iov_iter_advance() to flush a pipe A really odd (ab)use of iov_iter_advance() - in case of error generic_file_splice_read() wants to free all pipe buffers ->read_iter() has produced. Yes, forcibly resetting ->head and ->iov_offset to original values and calling iov_iter_advance(i, 0), will trigger pipe_advance(), which will trigger pipe_truncate(), which will free buffers. Or we could just go ahead and free the same buffers; pipe_discard_from() does exactly that, no iov_iter stuff needs to be involved. 2) ITER_PIPE: helper for getting pipe buffer by index In a lot of places we want to find pipe_buffer by index; expression is convoluted and hard to read. Provide an inline helper for that, convert trivial open-coded cases. Eventually *all* open-coded instances in iov_iter.c will get converted. 3) ITER_PIPE: helpers for adding pipe buffers There are only two kinds of pipe_buffer in the area used by ITER_PIPE. * anonymous - copy_to_iter() et.al. end up creating those and copying data there. They have zero ->offset, and their ->ops points to default_pipe_page_ops. * zero-copy ones - those come from copy_page_to_iter(), and page comes from caller. ->offset is also caller-supplied - it might be non-zero. ->ops points to page_cache_pipe_buf_ops. Move creation and insertion of those into helpers - push_anon(pipe, size) and push_page(pipe, page, offset, size) resp., separating them from the "could we avoid creating a new buffer by merging with the current head?" logics. 4) ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives New helper: append_pipe(). Extends the last buffer if possible, allocates a new one otherwise. Returns page and offset in it on success, NULL on failure. iov_iter is advanced past the data we've got. Use that instead of push_pipe() in copy-to-pipe primitives; they get simpler that way. Handling of short copy (in "mc" one) is done simply by iov_iter_revert() - iov_iter is in consistent state after that one, so we can use that. 5) ITER_PIPE: fold push_pipe() into __pipe_get_pages() Expand the only remaining call of push_pipe() (in __pipe_get_pages()), combine it with the page-collecting loop there. We don't need to bother with i->count checks or calculation of offset in the first page - the caller already has done that. Note that the only reason it's not a loop doing append_pipe() is that append_pipe() is advancing, while iov_iter_get_pages() is not. As soon as it switches to saner semantics, this thing will switch to using append_pipe(). 6) ITER_PIPE: lose iter_head argument of __pipe_get_pages() Always equal to pipe->head - 1. 7) ITER_PIPE: clean pipe_advance() up Don't bother with pipe_truncate(); adjust the buffer length just as we decide it'll be the last one, then use pipe_discard_from() to release buffers past that one. 8) ITER_PIPE: clean iov_iter_revert() Fold pipe_truncate() in there, clean the things up. 9) ITER_PIPE: cache the type of last buffer We often need to find whether the last buffer is anon or not, and currently it's rather clumsy: check if ->iov_offset is non-zero (i.e. that pipe is not empty) if so, get the corresponding pipe_buffer and check its ->ops if it's &default_pipe_buf_ops, we have an anon buffer. Let's replace the use of ->iov_offset (which is nowhere near similar to its role for other flavours) with signed field (->last_offset), with the following rules: empty, no buffers occupied: 0 anon, with bytes up to N-1 filled: N zero-copy, with bytes up to N-1 filled: -N That way abs(i->last_offset) is equal to what used to be in i->iov_offset and empty vs. anon vs. zero-copy can be distinguished by the sign of i->last_offset. Checks for "should we extend the last buffer or should we start a new one?" become easier to follow that way. Note that most of the operations can only be done in a sane state - i.e. when the pipe has nothing past the current position of iterator. About the only thing that could be done outside of that state is iov_iter_advance(), which transitions to the sane state by truncating the pipe. There are only two cases where we leave the sane state: 1) iov_iter_get_pages()/iov_iter_get_pages_alloc(). Will be dealt with later, when we make get_pages advancing - the callers are actually happier that way. 2) iov_iter copied, then something is put into the copy. Since they share the underlying pipe, the original gets behind. When we decide that we are done with the copy (original is not usable until then) we advance the original. direct_io used to be done that way; nowadays it operates on the original and we do iov_iter_revert() to discard the excessive data. At the moment there's nothing in the kernel that could do that to ITER_PIPE iterators, so this reason for insane state is theoretical right now. 10) ITER_PIPE: fold data_start() and pipe_space_for_user() together All their callers are next to each other; all of them want the total amount of pages and, possibly, the offset in the partial final buffer. Combine into a new helper (pipe_npages()), fix the bogosity in pipe_space_for_user(), while we are at it. Part 2: iov_iter_get_pages()/iov_iter_get_pages_alloc() unification There's a lot of duplication between iov_iter_get_pages() and iov_iter_get_pages_alloc(). With some massage it can be eliminated, along with some of the cruft accumulated there. Flavour-independent arguments validation and, for ..._alloc(), cleanup handling on failure: 11) iov_iter_get_pages{,_alloc}(): cap the maxsize with LONG_MAX 12) iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper 13) iov_iter_get_pages(): sanity-check arguments Mechanically merge parallel ..._get_pages() and ..._get_pages_alloc(). 14) unify pipe_get_pages() and pipe_get_pages_alloc() 15) unify xarray_get_pages() and xarray_get_pages_alloc() 16) unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts Decrufting for XARRAY: 17) ITER_XARRAY: don't open-code DIV_ROUND_UP() Decrufting for iBUF/IOVEC/BVEC: 18) iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() 19) iov_iter: massage calling conventions for first_{iovec,bvec}_segment() 20) found_iovec_segment(): just return address Decrufting for PIPE: 21) fold __pipe_get_pages() into pipe_get_pages() Collapsing the bits that differ for get_pages and get_pages_alloc cases into a common helper: 22) iov_iter: saner helper for page array allocation Part 3: making iov_iter_get_pages{,_alloc}() advancing Most of the callers follow successful ...get_pages... with advance by the amount it had reported. For some it's unconditional, for some it might end up being less in some cases. All of them would be fine with advancing variants of those primitives - those that might want to advance by less than reported could easily use revert by the difference of those amounts. Rather than doing a flagday change (they are exported and signatures remain unchanged), replacement variants are added (iov_iter_get_pages2() and iov_iter_get_pages_alloc2(), initially as wrappers). By the end of the series everything is converted to those and the old ones are removed. 23) iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() 24) block: convert to advancing variants of iov_iter_get_pages{,_alloc}() 25) iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() 26) af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages() 27) 9p: convert to advancing variant of iov_iter_get_pages_alloc() 28) ceph: switch the last caller of iov_iter_get_pages_alloc() 29) get rid of non-advancing variants Part 4: cleanups 30) pipe_get_pages(): switch to append_pipe() 31) expand those iov_iter_advance()... Overall diffstat: arch/powerpc/include/asm/uaccess.h | 2 +- arch/s390/include/asm/uaccess.h | 4 +- block/bio.c | 15 +- block/blk-map.c | 7 +- block/fops.c | 8 +- crypto/af_alg.c | 3 +- crypto/algif_hash.c | 5 +- drivers/nvme/target/io-cmd-file.c | 2 +- drivers/vhost/scsi.c | 4 +- fs/aio.c | 2 +- fs/btrfs/file.c | 19 +- fs/btrfs/inode.c | 3 +- fs/ceph/addr.c | 2 +- fs/ceph/file.c | 5 +- fs/cifs/file.c | 8 +- fs/cifs/misc.c | 3 +- fs/direct-io.c | 7 +- fs/fcntl.c | 1 + fs/file_table.c | 17 +- fs/fuse/dev.c | 7 +- fs/fuse/file.c | 7 +- fs/gfs2/file.c | 2 +- fs/io_uring.c | 2 +- fs/iomap/direct-io.c | 21 +- fs/nfs/direct.c | 8 +- fs/open.c | 1 + fs/read_write.c | 6 +- fs/splice.c | 54 +- fs/zonefs/super.c | 2 +- include/linux/fs.h | 21 +- include/linux/iomap.h | 6 + include/linux/pipe_fs_i.h | 29 +- include/linux/uaccess.h | 4 +- include/linux/uio.h | 50 +- lib/iov_iter.c | 978 ++++++++++++++----------------------- mm/shmem.c | 2 +- net/9p/client.c | 125 +---- net/9p/protocol.c | 3 +- net/9p/trans_virtio.c | 37 +- net/core/datagram.c | 3 +- net/core/skmsg.c | 3 +- net/rds/message.c | 3 +- net/tls/tls_sw.c | 4 +- 43 files changed, 589 insertions(+), 906 deletions(-)