Large part of the problems with iov_iter comes from its history - it was not designed, it accreted over years. Worse, its users sit on rather hots paths, so touching any of the primitives can come with considerable performance cost. iov_iter first appeared as a replacement for mutable iovec arrays - it carried the state that needed to be modified, allowing to leave iovecs themselves unchanged through the operations. Rewrites of iovec arrays had been costly and avoiding them on the fairly hot paths made a lot of sense. Several years later it had been expanded - instead of wrapping the operations dealing with kernel memory rather than userland one into set_fs(), we'd added the "do those iovecs refer to userland memory?" flag into iov_iter and had primitives act accordingly. Soon after that we'd added a flavour that used <page,offset,length> triples instead of <address,length> pairs (i.e. bio_vec instead of iovec); that simplified a lot of stuff. At that stage the code coalesced into lib/iov_iter.c and include/linux/uio.h. Another thing that happened at that time was the conversion of sendmsg/recvmsg to iov_iter; that added some primitives and mostly killed the mutable iovecs off. Then we'd grown yet another flavour, when destination had been a pipe. That simplified a lot of stuff around splice(2). Unfortunately, all of that had lead to a lot of boilerplate code in lib/iov_iter.c. As much as I loathe templates, I'd ended up implementing something similar. In preprocessor, at that. Result had been predictably revolting. Later more primitives got added; these macros made that more or less easy. Unfortunately, the same macros had lead to shitty code generation - usual results of trying to be too smart with microoptimizations. Then a new flavour ("discard") had been added. That one was trivial, but the last cycle a new one ("xarray") went in. Dave tried to turn that into per-flavour set of methods, but performance cost of indirect calls had been considerable, so we ended up with even grottier macros. It really needs to be cleaned up, without performance loss - this stuff *does* sit on performance-critical paths. We also need to fix a bunch of corner cases in there. The series attempting to do so lives in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.iov_iter and it seems to survive the local beating. Diffstat is Documentation/filesystems/porting.rst | 9 + fs/btrfs/file.c | 23 +- fs/fuse/file.c | 4 +- fs/iomap/buffered-io.c | 35 +- fs/ntfs/file.c | 33 +- include/linux/uio.h | 66 +- include/net/checksum.h | 14 +- lib/iov_iter.c | 1236 +++++++++++++++------------------ mm/filemap.c | 36 +- 9 files changed, 652 insertions(+), 804 deletions(-) and IMO iov_iter.c does get better. 37 commits, individual patches in followups, series overview follows. Please, review and help with testing. Part 1: preliminary bits and pieces. First, several patches removing the bogus uses of iov_iter_single_seg_count() in "copy into page cache" kind of loops. It used to make sense back when that was the way to tell how much would iov_iter_fault_in_readable() try to fault in; these days iov_iter_fault_in_readable() is not limited to the first iovec. There are uses besides the heuristics for short copy handling, so the primitive itself is not going away. Many of its uses are, though. (1/37) ntfs_copy_from_user_iter(): don't bother with copying iov_iter (2/37) generic_perform_write()/iomap_write_actor(): saner logics for short copy (3/37) fuse_fill_write_pages(): don't bother with iov_iter_single_seg_count() Dead code removal: (4/37) iov_iter: Remove iov_iter_for_each_range() Dave's patch that should've probably been included into xarray series. The last user of that primitive went away when lustre got dropped. Very prone to abuse - trying to get it right (at the moment it's only implemented for iovec and kvec flavours) would lead to very tricky locking environments for the callback. E.g. for xarray flavour it would happen under rcu_read_lock, etc. RIP. Several fixes: (5/37) teach copy_page_to_iter() to handle compound pages In situation when copy_page_to_iter() got a compound page the current code would only work on systems with no CONFIG_HIGHMEM. It *is* the majority of real-world setups, or we would've drown in bug reports by now. Rare setups or not, it needed fixing. -stable fodder. (6/37) copy_page_to_iter(): fix ITER_DISCARD case Another corner case - copy_page_to_iter() forgot to advance the iterator for "discard" flavour. Unfortunately, while discard doesn't have anything like current position, it does have the amount of space left and users expect it to change properly. -stable fodder. (7/37) [xarray] iov_iter_fault_in_readable() should do nothing in xarray case iov_iter_fault_in_readable() used to check that the flavour had been neither kvec nor bvec. It relied upon being called only for data sources, which excludes the possibility of pipe and discard flavours. It worked (despite being ugly as hell) until xarray got merged - xarray *can* be data source and it does need to be excluded. The predicate it wanted to check all along was "it's an iovec-backed one"... (8/37) iov_iter_advance(): use consistent semantics for move past the end Another corner case - iov_iter_advance() on more than left in iov_iter should move to the very end. bvec and discard used to be broken in such case. Trimming the primitives: (9/37) iov_iter: switch ..._full() variants of primitives to use of iov_iter_revert() As suggested by Linus, all "copy everything or fail, advance only in case of success" can be done as normal copy + revert in the unlikely case a short copy happens. Part 2: saner flavour representation Flavour encoding is one of the microoptimizations that shouldn't have been done in the first place; we store it in a kinda-sorta bitmap, along with "is it a data source or data destination" flag. The ability to check if flavour is e.g. kvec or bvec in one test is not worth the trouble - compiler can't show that we never get more than one of "type" bits set in iter->type and the effects on code generation are not nice. (10/37) iov_iter: reorder handling of flavours in primitives Make all flavour checks use iov_iter_is_...(), *including* the iovec case. Allows to reorder them sanely and makes the things a lot more robust, since iovec case is checked explicitly and does not depend upon everything else having been already excluded by the earlier tests. (11/37) iov_iter_advance(): don't modify ->iov_offset for ITER_DISCARD ... seeing that ->iov_offset is not used for those. What's more, now we can drop the iov_iter_is_discard() case there completely. (12/37) iov_iter: separate direction from flavour Separate the "direction" bit into a separate field and turn the type into flat enumeration. Oh, and rename the 'type' field into something easier to grep for - my fault, that. Part 3: reducing abuses of iterating macros In a bunch of places the iterating macros are misused - it's simpler *and* faster to do explicit loops. The next pile eliminates those. (13/37) iov_iter: optimize iov_iter_advance() for iovec and kvec First to go: iov_iter_advance(). Sure, iterate_and_advance() with "do nothing" as step sounds like an elegant solution. Except that it's a bad overkill in this case. That had already been spotted for bvec a while ago, but iovec/kvec cases are no different. (14/37) sanitize iov_iter_fault_in_readable() Stray iterate_iovec() use gone. Particularly unpleasant one, since it had step that might return from the entire function. Use of ({...}) with execution leaving *not* through the end of block should be considered a bug, IMO. (15/37) iov_iter_alignment(): don't bother with iterate_all_kinds() (16/37) iov_iter_gap_alignment(): get rid of iterate_all_kinds() (17/37) get rid of iterate_all_kinds() in iov_iter_get_pages()/iov_iter_get_pages_alloc() Another "return inside a step" one; all we used iterate_all_kinds() for was to locate the first non-empty iovec. This one has an extra (minor) behaviour improvement in it - bvec case with compound page in it will get all subpages of that page now, not just the first one. (18/37) iov_iter_npages(): don't bother with iterate_all_kinds() Yet another "return inside a step" case... (19/37) [xarray] iov_iter_npages(): just use DIV_ROUND_UP() minor followup to the previous (20/37) iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant iov_iter_copy_from_user_atomic() is the last remaining user of iterate_all_kinds(), and we could just as well make its replacement use iterate_and_advance() instead. Callers can revert if they end up using only a part of what's been copied. Replacement primitive is called copy_page_from_iter_atomic(), all callers converted. With that we are rid of the last user of iterate_all_kinds(). RIP. What's more, all remaining users of iterate_and_advance() are copying stuff to/from iterator - they actually access the data, not just counting pages, etc. (21/37) csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter() Namely, have off counted starting from 0 rather than from csstate->off. It's not hard to do so - in case of odd initial offset, just rotate the initial sum by 8 bits and do the same to result. What we get out of that is a bit more redundancy in our variables - from is always equal to addr + off, which will be useful several commits down the road. We could mix that into the next series, but that would make it harder to follow. Part 4: iterate_and_advance() massage By that point we are down to iterate_and_advance() alone. There are several problems with it: * misguided attempts at microoptimizations in iterate_iovec/iterate_kvec * only iovec side is ready to deal with short copies; that used to be OK, but copy_mc_to_iter() can run into short copies on any flavour. It tries to cope, but that's inconsistent and not pretty, to put it mildly * each of those gets 4 callbacks - for iovec, bvec, kvec and xarray cases. First of all, xarray and bvec cases are essentially identical. Furthermore, both are easily derived from kvec one. * there's an additional headache due to the need to maintain the offset in source or destination of all those copies. Leads to boilerplate all over the place; iterate_iovec() et.al. could easily keep track of that stuff. The next group massages this stuff to deal with all that fun. By the end of that the users go from things like iterate_and_advance(i, bytes, v, __copy_from_user_inatomic_nocache((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len), memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page, v.bv_offset, v.bv_len), memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len), memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page, v.bv_offset, v.bv_len) ) to iterate_and_advance(i, bytes, base, len, off, __copy_from_user_inatomic_nocache(addr + off, base, len), memcpy(addr + off, base, len) ) and generated code shrinks quite a bit. (22/37) iterate_and_advance(): get rid of magic in case when n is 0 One irregularity had been the handling of case when requested length is 0. We don't need to bother with that anymore, now that the only user that cared about that (iov_iter_advance()) is not using iterate_and_advance(). For everything else length 0 should be a no-op. (23/37) iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec Attempts to handle the first (partial) segment separately had actually lead to worse code; better let the compiler deal with optimizations. Part of that had been the need to reuse iterate_iovec() and iterate_kvec() among the advancing and non-advancing iterator macros. The latter is gone now, allowing to simplify things. (24/37) iov_iter: unify iterate_iovec and iterate_kvec Essentially identical now; the only difference is that iovec knows about the possibility of short copies. (25/37) iterate_bvec(): expand bvec.h macro forest, massage a bit The logics is not changed, but now one does not follow 4 levels of macros buried in bvec.h to understand what's going on. That matters for the next step, when we teach the sucker about the possibility of short copies. (26/37) iov_iter: teach iterate_{bvec,xarray}() about possible short copies ... allowing to kill "returns in the callback" mess in copy_mc_to_iter() (27/37) iov_iter: get rid of separate bvec and xarray callbacks payoff of the previous commit - now the relationship between bvec, kvec and xarray callbacks has lost the last irregularities; we can reduce the callbacks just to "userland" and "kernel" variants. (28/37) iov_iter: make the amount already copied available to iterator callbacks teach iterate_...() to keep track of the amount we'd already copied. Kills quite a bit of boilerplate in users. (29/37) iov_iter: make iterator callbacks use base and len instead of iovec we used to let the callbacks play with struct iovec, bvec and kvec resp. With the unifications above it's always struct iovec or struct kvec. Might as well pass pointer (userland or kernel one) and length separately. (30/37) pull handling of ->iov_offset into iterate_{iovec,bvec,xarray} clean iterate_iovec() et.al. a bit. Part 5: followup cleanups (31/37) iterate_xarray(): only of the first iteration we might get offset != 0 (32/37) copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases (33/37) copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases (34/37) iov_iter: clean csum_and_copy_...() primitives up a bit (35/37) pipe_zero(): we don't need no stinkin' kmap_atomic()... (36/37) clean up copy_mc_pipe_to_iter() (37/37) csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller