Hi Al, Linus,

Here's my latest go at converting the iov_iter iteration macros to inline
functions. The first four patches are the main part of the series:

 (1) To the recently added iov_iter kunit tests, add three benchmarking
     tests that copy 256MiB from one buffer to another using three
     different methods (a single BVEC covering the whole buffer, BVECs
     allocated dynamically for 256-page chunks and a whole-buffer
     XARRAY). The results are dumped to dmesg. No setting up is required
     with null blockdevs or anything like that.

 (2) Renumber the type enum so that the ITER_* constants match the order
     in iterate_and_advance*().

 (3) Since (2) puts UBUF and IOVEC at 0 and 1, change user_backed_iter()
     to just use the type value and get rid of the extra flag (see the
     first sketch below).

 (4) Convert the iov_iter iteration macros to always-inline functions to
     make the code easier to follow. This uses function pointers, but
     they get optimised away; the priv2 argument likewise gets optimised
     away if unused (see the second sketch below).

The rest of the patches are some options for consideration:

 (5) Move the iov_iter iteration macros to a header file so that bespoke
     iterators can be created elsewhere. For instance, rbd has an
     optimisation that requires it to scan the buffer it is given to see
     if it is all zeros. It would be nice if this could use
     iterate_and_advance() - but that's currently buried inside
     lib/iov_iter.c.

 (6) On top of (5), provide a cut-down iteration function that can only
     handle kernel-backed iterators (ie. BVEC, KVEC, XARRAY and DISCARD)
     for situations where we know that we won't see UBUF/IOVEC (see the
     third sketch below).

 (7-8) Make copy_to/from_iter() always catch an MCE and return a short
     copy. This doesn't particularly increase the code size as the
     handling works via exception handling tables. That said, there may
     be code that doesn't check the result of the copy that could be
     adversely affected (see the fourth sketch below). If we go with
     this, it might be worth having an 'MCE happened' flag in the
     iterator or something by which this can be checked for.

     [?] Is it better to kill the thread than to return a short copy if
         an MCE occurs?

     [?] Is it better to make the MCE handling manually selectable?

 (9) On top of (5), move the copy-and-csum code to net/ where it can be
     in proximity with the code that uses it. This eliminates the code if
     CONFIG_NET=n and allows for the slim possibility of it being
     inlined.

(10) On top of (9), fold csum_and_memcpy() into its two users.

(11) On top of (9), move csum_and_copy_from_iter_full() out of line and
     merge in csum_and_copy_from_iter() since the former is the only
     caller of the latter.

(12) Move hash_and_copy_to_iter() to net/ where it can be with its only
     caller.

(13) Add a testing misc device for testing/benchmarking ITER_UBUF and
     ITER_IOVEC iterators. It is driven by read/write/readv/writev and
     the results are dumped through a tracepoint.

Further changes I could make:

 (1) Add an 'ITER_OTHER' type and an ops table pointer and have
     iterate_and_advance2(), iov_iter_advance(), iov_iter_revert(), etc.
     jump through it if they see the ITER_OTHER type. This would allow
     types for, say, scatterlist, bio list and skbuff to be added without
     further expanding the core.

 (2) Move the ITER_XARRAY type to being an ITER_OTHER type. This would
     shrink the core iterators quite a lot and reduce the stack usage as
     the xarray-walking stuff wouldn't be there.

To illustrate points (2)-(4), (6) and (7-8), here are some rough
sketches (simplified; not the exact patch code):
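First, for (2) and (3): with the enum renumbered so that the user-backed
types come first, user-backedness falls out of the type value. A minimal
sketch, assuming iov_iter_type() as the accessor (the actual patch may
spell this differently):

	enum iter_type {
		/* Ordered to match the checks in iterate_and_advance*() */
		ITER_UBUF,
		ITER_IOVEC,
		ITER_BVEC,
		ITER_KVEC,
		ITER_XARRAY,
		ITER_DISCARD,
	};

	static inline bool user_backed_iter(const struct iov_iter *i)
	{
		/* UBUF and IOVEC are 0 and 1, so the type value alone
		 * says whether the buffer is in userspace and the
		 * separate ->user_backed flag can be dropped.
		 */
		return iov_iter_type(i) <= ITER_IOVEC;
	}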
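Second, for (4), here's roughly the shape one of the converted iterators
takes (the callback typedef is illustrative). Because both the iterator
and the step function passed to it are always-inline, the compiler
devirtualises the call and the function pointer vanishes from the object
code:

	typedef size_t (*iov_step_f)(void *base, size_t progress,
				     size_t len, void *priv, void *priv2);

	static __always_inline
	size_t iterate_kvec(struct iov_iter *iter, size_t len,
			    void *priv, void *priv2, iov_step_f step)
	{
		const struct kvec *p = iter->kvec;
		size_t progress = 0, skip = iter->iov_offset;

		do {
			size_t remain, consumed;
			size_t part = min_t(size_t, len - progress,
					    p->iov_len - skip);

			if (likely(part)) {
				/* step() returns how much it couldn't
				 * process; non-zero means a short copy.
				 */
				remain = step(p->iov_base + skip, progress,
					      part, priv, priv2);
				consumed = part - remain;
				progress += consumed;
				skip += consumed;
				if (remain)
					break;
			}
			if (skip >= p->iov_len) {
				skip = 0;
				p++;
			}
		} while (progress < len);

		/* nr_segs is recalculated before iter->kvec is updated
		 * (the ver #4 fix); iter->count is adjusted by the
		 * iterate_and_advance2() wrapper.
		 */
		iter->nr_segs -= p - iter->kvec;
		iter->kvec = p;
		iter->iov_offset = skip;
		return progress;
	}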
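Third, for (6), the cut-down entry point need only dispatch to the
kernel-backed per-type iterators. A sketch (the name
iterate_and_advance_kernel() is just for illustration; iterate_bvec()
etc. are the per-type iterators from (4)):

	static __always_inline
	size_t iterate_and_advance_kernel(struct iov_iter *iter, size_t len,
					  void *priv, void *priv2,
					  iov_step_f step)
	{
		size_t progress;

		if (unlikely(iter->count < len))
			len = iter->count;
		if (unlikely(!len))
			return 0;

		/* Kernel-backed types only: having no UBUF/IOVEC branches
		 * means no user-copy machinery gets pulled in here.
		 */
		if (iov_iter_is_bvec(iter))
			progress = iterate_bvec(iter, len, priv, priv2, step);
		else if (iov_iter_is_kvec(iter))
			progress = iterate_kvec(iter, len, priv, priv2, step);
		else if (iov_iter_is_xarray(iter))
			progress = iterate_xarray(iter, len, priv, priv2, step);
		else
			progress = len; /* DISCARD just skips */
		iter->count -= progress;
		return progress;
	}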
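Finally, for (7-8), the caller-visible effect is just a short copy, so
code that ignores the return value is the worry. A hypothetical caller
would need to do something like:

	/* After (7-8), a short copy may mean an MCE was caught mid-copy
	 * as well as a fault, so the return value must be checked.
	 */
	size_t n = copy_from_iter(buffer, len, iter);
	if (n != len)
		return -EFAULT;	/* or kill the thread - see the [?]s above */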
Anyway, the changes in compiled function size either side of patch (4) on
x86_64 look like:

	_copy_from_iter                    inc 0x360 -> 0x3d5 +0x75
	_copy_from_iter_flushcache         inc 0x34c -> 0x358 +0xc
	_copy_from_iter_nocache            dcr 0x354 -> 0x346 -0xe
	_copy_mc_to_iter                   inc 0x396 -> 0x3cf +0x39
	_copy_to_iter                      inc 0x33b -> 0x35d +0x22
	copy_page_from_iter_atomic.part.0  inc 0x393 -> 0x408 +0x75
	copy_page_to_iter_nofault.part.0   dcr 0x3de -> 0x3b2 -0x2c
	copyin                             del 0x30
	copyout                            del 0x2d
	copyout_mc                         del 0x2b
	csum_and_copy_from_iter            inc 0x3db -> 0x3f4 +0x19
	csum_and_copy_to_iter              dcr 0x45d -> 0x45b -0x2
	iov_iter_zero                      dcr 0x34a -> 0x342 -0x8
	memcpy_from_iter.isra.0            del 0x1f
	memcpy_from_iter_mc                new 0x2b

Note that there's a noticeable expansion in some of the main functions
because a number of the helpers get inlined into them instead of being
called.

In terms of benchmarking patch (4), three runs without it:

	# iov_kunit_benchmark_bvec: avg 3175 uS
	# iov_kunit_benchmark_bvec_split: avg 3404 uS
	# iov_kunit_benchmark_xarray: avg 3611 uS
	# iov_kunit_benchmark_bvec: avg 3175 uS
	# iov_kunit_benchmark_bvec_split: avg 3403 uS
	# iov_kunit_benchmark_xarray: avg 3611 uS
	# iov_kunit_benchmark_bvec: avg 3172 uS
	# iov_kunit_benchmark_bvec_split: avg 3401 uS
	# iov_kunit_benchmark_xarray: avg 3614 uS

and three runs with it:

	# iov_kunit_benchmark_bvec: avg 3141 uS
	# iov_kunit_benchmark_bvec_split: avg 3405 uS
	# iov_kunit_benchmark_xarray: avg 3546 uS
	# iov_kunit_benchmark_bvec: avg 3140 uS
	# iov_kunit_benchmark_bvec_split: avg 3405 uS
	# iov_kunit_benchmark_xarray: avg 3546 uS
	# iov_kunit_benchmark_bvec: avg 3138 uS
	# iov_kunit_benchmark_bvec_split: avg 3401 uS
	# iov_kunit_benchmark_xarray: avg 3542 uS

It looks like patch (4) makes things a little bit faster, probably due to
the extra inlining.

I've also pushed the patches here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-cleanup

David

Changes
=======
ver #4)
 - Fix iterate_bvec() and iterate_kvec() to update iter->bvec and
   iter->kvec only after subtracting them to calculate iter->nr_segs.
 - Change iterate_xarray() to use start+progress rather than increasing
   start, to reduce the code size.
 - Add patches to move some iteration functions over to net/ as the
   files there can #include the iterator framework.
 - Add a patch to benchmark the iteration.

ver #3)
 - Use min_t(size_t,) not min() to avoid a warning on Hexagon.
 - Inline all the step functions.
 - Add a patch to better handle copy_mc.

ver #2)
 - Rebase on top of Willy's changes in linux-next.
 - Change the checksum argument to the iteration functions to be a
   general void* and use it to pass the iter->copy_mc flag to
   memcpy_from_iter_mc() to avoid using a function pointer.
 - Arrange the end of the iterate_*() functions to look the same to give
   the optimiser the best chance.
 - Make iterate_and_advance() a wrapper around iterate_and_advance2().
 - Adjust iterate_and_advance2() to use if-else-if-else-if-else rather
   than switch(), to put ITER_BVEC before KVEC and to mark UBUF and
   IOVEC as likely().
 - Move "iter->count -= progress" into iterate_and_advance2() from the
   iterate functions.
 - Mark a number of the iterator helpers with __always_inline.
 - Fix _copy_from_iter_flushcache() to use memcpy_from_iter_flushcache(),
   not memcpy_from_iter().
Link: https://lore.kernel.org/r/3710261.1691764329@xxxxxxxxxxxxxxxxxxxxxx/ # v1
Link: https://lore.kernel.org/r/855.1692047347@xxxxxxxxxxxxxxxxxxxxxx/ # v2
Link: https://lore.kernel.org/r/20230816120741.534415-1-dhowells@xxxxxxxxxx/ # v3

David Howells (13):
  iov_iter: Add a benchmarking kunit test
  iov_iter: Renumber ITER_* constants
  iov_iter: Derive user-backedness from the iterator type
  iov_iter: Convert iterate*() to inline funcs
  iov: Move iterator functions to a header file
  iov_iter: Add a kernel-type iterator-only iteration function
  iov_iter: Make copy_from_iter() always handle MCE
  iov_iter: Remove the copy_mc flag and associated functions
  iov_iter, net: Move csum_and_copy_to/from_iter() to net/
  iov_iter, net: Fold in csum_and_memcpy()
  iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
  iov_iter, net: Move hash_and_copy_to_iter() to net/
  iov_iter: Create a fake device to allow iov_iter testing/benchmarking

 arch/x86/include/asm/mce.h |  23 ++
 fs/coredump.c              |   1 -
 include/linux/iov_iter.h   | 296 +++++++++++++++++++++++++
 include/linux/skbuff.h     |   3 +
 include/linux/uio.h        |  45 +---
 lib/Kconfig.debug          |   8 +
 lib/Makefile               |   1 +
 lib/iov_iter.c             | 429 +++++++++++--------------------------
 lib/kunit_iov_iter.c       | 181 ++++++++++++++++
 lib/test_iov_iter.c        | 134 ++++++++++++
 lib/test_iov_iter_trace.h  |  80 +++++++
 net/core/datagram.c        |  75 ++++++-
 net/core/skbuff.c          |  40 ++++
 13 files changed, 966 insertions(+), 350 deletions(-)
 create mode 100644 include/linux/iov_iter.h
 create mode 100644 lib/test_iov_iter.c
 create mode 100644 lib/test_iov_iter_trace.h