On Wed, 2022-06-22 at 05:10 +0100, Al Viro wrote: > There's a bunch of pending iov_iter-related work; most of that had > been posted, but only one part got anything resembling a review. Currently > it seems to be working, but it obviously needs review and testing. > > It's split into several subseries; the entire series can be observed > as v5.19-rc2..#work.iov_iter_get_pages. Description follows; individual > patches will be posted^Wmailbombed in followups. > > This stuff is not in -next yet; I'd like to put it there, so if you > see any problems - please yell. > > One thing not currently in there, but to be added very soon is > iov_iter_find_pages{,_alloc}() - analogue of iov_iter_get_pages(), except that > it only grabs page references for userland-backed flavours. The callers, > of course, are responsible for keeping the underlying object(s) alive for as > long as they are using the results. Quite a few of iov_iter_get_pages() > callers would be fine with that. Moreover, unlike iov_iter_get_pages() this > could be allowed for ITER_KVEC, potentially eliminating several places where > we special-case the treatment of ITER_KVEC. > > Another pending thing is integration with cifs and ceph series (dhowells > and jlayton resp.) and probably io_uring as well. > > ---------------------------------------------------------------------------- > > Part 1, #work.9p: [rc1-based] > > 1/44: 9p: handling Rerror without copy_from_iter_full() > Self-contained fix, should be easy to backport. What happens > there is that arrival of Rerror in response to zerocopy read or readdir > ends up with error string in the place where the actual data would've gone > in case of success. It needs to be extracted, and copy_from_iter_full() > is only for data-source iterators, not for e.g. ITER_PIPE. And ITER_PIPE > can be used with those... > > ---------------------------------------------------------------------------- > > Part 2, #work.iov_iter: [rc1-based] > > Dealing with the overhead in new_sync_read()/new_sync_write(), mostly. > Several things there - one is that calculation of iocb flags can be > made cheaper, another is that single-segment iovec is sufficiently > common to be worth turning into a new iov_iter flavour (ITER_UBUF). > With all that, the total size of iov_iter.c goes down, mostly due to > removal of magic in iovec copy_page_to_iter()/copy_page_from_iter(). > Generic variant works for those nowadays... > > This had been posted two weeks ago, got a reasonable amount of comments. > > 2/44: No need of likely/unlikely on calls of check_copy_size() > not just in uio.h; the thing is inlined and it has unlikely on > all paths leading to return false > > 3/44: teach iomap_dio_rw() to suppress dsync > new flag for iomap_dio_rw(), telling it to suppress generic_write_sync() > > 4/44: btrfs: use IOMAP_DIO_NOSYNC > use the above instead of currently used kludges. > > 5/44: struct file: use anonymous union member for rcuhead and llist > "f_u" might have been an amusing name, but... we expect anon unions to > work. > > 6/44: iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC > makes iocb_flags() much cheaper, and it's easier to keep track of > the places where it can change. > > 7/44: keep iocb_flags() result cached in struct file > that, along with the previous commit, reduces the overhead of > new_sync_{read,write}(). struct file doesn't grow - we can keep that > thing in the same anon union where rcuhead and llist live; that field > gets used only before ->f_count reaches zero while the other two are > used only after ->f_count has reached zero. > > 8/44: copy_page_{to,from}_iter(): switch iovec variants to generic > kmap_local_page() allows that. And it kills quite a bit of > code. > > 9/44: new iov_iter flavour - ITER_UBUF > iovec analogue, with single segment. That case is fairly common and it > can be handled with less overhead than full-blown iovec. > > 10/44: switch new_sync_{read,write}() to ITER_UBUF > ... and this is why it is so common. Further reduction of overhead > for new_sync_{read,write}(). > > 11/44: iov_iter_bvec_advance(): don't bother with bvec_iter > AFAICS, variant similar to what we do for iovec/kvec generates better > code. Needs profiling, obviously. > > ---------------------------------------------------------------------------- > > Part 3, #fixes [-rc2-based] > > 12/44: fix short copy handling in copy_mc_pipe_to_iter() > Minimal version of fix; it's replaced with prettier one in the next > series, but replacement is not a backport fodder. > > ---------------------------------------------------------------------------- > > Part 4, #work.ITER_PIPE [on top of merge of previous branches] > > ITER_PIPE handling had never been pretty, but by now it has become > really obfuscated and hard to read. Untangle it a bit. Posted last > weekend, some brainos fixed since then. > > 13/44: splice: stop abusing iov_iter_advance() to flush a pipe > A really odd (ab)use of iov_iter_advance() - in case of error > generic_file_splice_read() wants to free all pipe buffers ->read_iter() > has produced. Yes, forcibly resetting ->head and ->iov_offset to > original values and calling iov_iter_advance(i, 0) will trigger > pipe_advance(), which will trigger pipe_truncate(), which will free > buffers. Or we could just go ahead and free the same buffers; > pipe_discard_from() does exactly that, no iov_iter stuff needs to > be involved. > > 14/44: ITER_PIPE: helper for getting pipe buffer by index > In a lot of places we want to find pipe_buffer by index; > expression is convoluted and hard to read. Provide an inline helper > for that, convert trivial open-coded cases. Eventually *all* > open-coded instances in iov_iter.c will be gone. > > 15/44: ITER_PIPE: helpers for adding pipe buffers > There are only two kinds of pipe_buffer in the area used by ITER_PIPE. > * anonymous - copy_to_iter() et.al. end up creating those and copying data > there. They have zero ->offset, and their ->ops points to > default_pipe_page_ops. > * zero-copy ones - those come from copy_page_to_iter(), and page comes from > caller. ->offset is also caller-supplied - it might be non-zero. > ->ops points to page_cache_pipe_buf_ops. > Move creation and insertion of those into helpers - > push_anon(pipe, size) and push_page(pipe, page, offset, size) resp., separating > them from the "could we avoid creating a new buffer by merging with the current > head?" logics. > > 16/44: ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives > New helper: append_pipe(). Extends the last buffer if possible, > allocates a new one otherwise. Returns page and offset in it on success, > NULL on failure. iov_iter is advanced past the data we've got. > Use that instead of push_pipe() in copy-to-pipe primitives; > they get simpler that way. Handling of short copy (in "mc" one) > is done simply by iov_iter_revert() - iov_iter is in consistent > state after that one, so we can use that. > > 17/44: ITER_PIPE: fold push_pipe() into __pipe_get_pages() > Expand the only remaining call of push_pipe() (in > __pipe_get_pages()), combine it with the page-collecting loop there. > We don't need to bother with i->count checks or calculation of offset > in the first page - the caller already has done that. > Note that the only reason it's not a loop doing append_pipe() > is that append_pipe() is advancing, while iov_iter_get_pages() is not. > As soon as it switches to saner semantics, this thing will switch > to using append_pipe(). > > 18/44: ITER_PIPE: lose iter_head argument of __pipe_get_pages() > Redundant. > > 19/44: ITER_PIPE: clean pipe_advance() up > Don't bother with pipe_truncate(); adjust the buffer > length just as we decide it'll be the last one, then use > pipe_discard_from() to release buffers past that one. > > 20/44: ITER_PIPE: clean iov_iter_revert() > Fold pipe_truncate() in there, clean the things up. > > 21/44: ITER_PIPE: cache the type of last buffer > We often need to find whether the last buffer is anon or not, and > currently it's rather clumsy: > check if ->iov_offset is non-zero (i.e. that pipe is not empty) > if so, get the corresponding pipe_buffer and check its ->ops > if it's &default_pipe_buf_ops, we have an anon buffer. > Let's replace the use of ->iov_offset (which is nowhere near similar to > its role for other flavours) with signed field (->last_offset), with > the following rules: > empty, no buffers occupied: 0 > anon, with bytes up to N-1 filled: N > zero-copy, with bytes up to N-1 filled: -N > That way abs(i->last_offset) is equal to what used to be in i->iov_offset > and empty vs. anon vs. zero-copy can be distinguished by the sign of > i->last_offset. > Checks for "should we extend the last buffer or should we start > a new one?" become easier to follow that way. > Note that most of the operations can only be done in a sane > state - i.e. when the pipe has nothing past the current position of > iterator. About the only thing that could be done outside of that > state is iov_iter_advance(), which transitions to the sane state by > truncating the pipe. There are only two cases where we leave the > sane state: > 1) iov_iter_get_pages()/iov_iter_get_pages_alloc(). Will be > dealt with later, when we make get_pages advancing - the callers are > actually happier that way. > 2) iov_iter copied, then something is put into the copy. Since > they share the underlying pipe, the original gets behind. When we > decide that we are done with the copy (original is not usable until then) > we advance the original. direct_io used to be done that way; nowadays > it operates on the original and we do iov_iter_revert() to discard > the excessive data. At the moment there's nothing in the kernel that > could do that to ITER_PIPE iterators, so this reason for insane state > is theoretical right now. > > 22/44: ITER_PIPE: fold data_start() and pipe_space_for_user() together > All their callers are next to each other; all of them want > the total amount of pages and, possibly, the offset in the partial > final buffer. > Combine into a new helper (pipe_npages()), fix the > bogosity in pipe_space_for_user(), while we are at it. > > ---------------------------------------------------------------------------- > > Part 5, #work.unify_iov_iter_get_pages [on top of previous] > > iov_iter_get_pages() and iov_iter_get_pages_alloc() have a lot of code > duplication and are bloody hard to read. With some massage duplication > can be eliminated, along with some of the cruft accumulated there. > > Flavour-independent arguments validation and, for ..._alloc(), > cleanup handling on failure: > 23/44: iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT > 24/44: iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper > 25/44: iov_iter_get_pages(): sanity-check arguments > > Mechanically merge parallel ..._get_pages() and ..._get_pages_alloc(). > 26/44: unify pipe_get_pages() and pipe_get_pages_alloc() > 27/44: unify xarray_get_pages() and xarray_get_pages_alloc() > 28/44: unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts > > Decrufting for XARRAY: > 29/44: ITER_XARRAY: don't open-code DIV_ROUND_UP() > > Decrufting for UBUF/IOVEC/BVEC: that bunch suffers from really convoluted > helpers; untangling those takes a bit of care, so I'd carved that up into fairly > small chunks. Could be collapsed together, but... > 30/44: iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() > 31/44: iov_iter: first_{iovec,bvec}_segment() - simplify a bit > 32/44: iov_iter: massage calling conventions for first_{iovec,bvec}_segment() > 33/44: found_iovec_segment(): just return address > > Decrufting for PIPE: > 34/44: fold __pipe_get_pages() into pipe_get_pages() > > Now we can finally get a helper encapsulating the array allocations > right way: > 35/44: iov_iter: saner helper for page array allocation > > ---------------------------------------------------------------------------- > > Part 6, #work.iov_iter_get_pages-advance [on top of previous] > Convert iov_iter_get_pages{,_alloc}() to iterator-advancing semantics. > > Most of the callers follow successful ...get_pages... with advance > by the amount it had reported. For some it's unconditional, for some it > might end up being less in some cases. All of them would be fine with > advancing variants of those primitives - those that might want to advance > by less than reported could easily use revert by the difference of those > amounts. > Rather than doing a flagday change (they are exported and signatures > remain unchanged), replacement variants are added (iov_iter_get_pages2() > and iov_iter_get_pages_alloc2(), initially as wrappers). By the end of > the series everything is converted to those and the old ones are removed. > > Makes for simpler rules for ITER_PIPE, among other things, and > advancing semantics is consistent with all data-copying primitives. > Series is pretty obvious - introduce variants with new semantics, switch > users one by one, fold the old variants into new ones. > > 36/44: iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() > 37/44: block: convert to advancing variants of iov_iter_get_pages{,_alloc}() > 38/44: iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() > 39/44: af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages() > 40/44: 9p: convert to advancing variant of iov_iter_get_pages_alloc() > 41/44: ceph: switch the last caller of iov_iter_get_pages_alloc() > 42/44: get rid of non-advancing variants > > ---------------------------------------------------------------------------- > > Part 7, #wort.iov_iter_get_pages [on top of previous] > Trivial followups, with more to be added here... > > 43/44: pipe_get_pages(): switch to append_pipe() > 44/44: expand those iov_iter_advance()... > > Overall diffstat: > > arch/powerpc/include/asm/uaccess.h | 2 +- > arch/s390/include/asm/uaccess.h | 4 +- > block/bio.c | 15 +- > block/blk-map.c | 7 +- > block/fops.c | 8 +- > crypto/af_alg.c | 3 +- > crypto/algif_hash.c | 5 +- > drivers/nvme/target/io-cmd-file.c | 2 +- > drivers/vhost/scsi.c | 4 +- > fs/aio.c | 2 +- > fs/btrfs/file.c | 19 +- > fs/btrfs/inode.c | 3 +- > fs/ceph/addr.c | 2 +- > fs/ceph/file.c | 5 +- > fs/cifs/file.c | 8 +- > fs/cifs/misc.c | 3 +- > fs/direct-io.c | 7 +- > fs/fcntl.c | 1 + > fs/file_table.c | 17 +- > fs/fuse/dev.c | 7 +- > fs/fuse/file.c | 7 +- > fs/gfs2/file.c | 2 +- > fs/io_uring.c | 2 +- > fs/iomap/direct-io.c | 21 +- > fs/nfs/direct.c | 8 +- > fs/open.c | 1 + > fs/read_write.c | 6 +- > fs/splice.c | 54 +- > fs/zonefs/super.c | 2 +- > include/linux/fs.h | 21 +- > include/linux/iomap.h | 6 + > include/linux/pipe_fs_i.h | 29 +- > include/linux/uaccess.h | 4 +- > include/linux/uio.h | 50 +- > lib/iov_iter.c | 993 ++++++++++++++----------------------- > mm/shmem.c | 2 +- > net/9p/client.c | 125 +---- > net/9p/protocol.c | 3 +- > net/9p/trans_virtio.c | 37 +- > net/core/datagram.c | 3 +- > net/core/skmsg.c | 3 +- > net/rds/message.c | 3 +- > net/tls/tls_sw.c | 4 +- > 43 files changed, 599 insertions(+), 911 deletions(-) I ported the CEPH_MSG_DATA_ITER patches on top of this, and ran ceph through xfstests and it seemed to do just fine. You can add: Tested-by: Jeff Layton <jlayton@xxxxxxxxxx>