On 5/23/22 9:12 AM, Jens Axboe wrote:
>> Current branch pushed to #new.iov_iter (at the moment; will rename
>> back to work.iov_iter once it gets more or less stable).
> 
> Sounds good, I'll see what I need to rebase.

On the previous branch, I ran a few quick numbers: dd from /dev/zero to
/dev/null, with /dev/zero using ->read() as it does by default:

Size	BW
32	260MB/sec
1k	6.6GB/sec
4k	17.9GB/sec
16k	28.8GB/sec

Now comment out ->read() so it uses ->read_iter() instead:

Size	BW
32	259MB/sec
1k	6.6GB/sec
4k	18.0GB/sec
16k	28.6GB/sec

which are roughly identical, all things considered. Just a sanity
check, but it looks good from a performance POV in this basic test.

Now let's do ->read_iter(), but make iov_iter_zero() copy from the zero
page instead:

Size	BW
32	250MB/sec
1k	7.7GB/sec
4k	28.8GB/sec
16k	71.2GB/sec

It's a tad slower at 32 bytes, considerably better at 1k, and massively
better at page size and above. This is on an Intel 12900K, so a recent
CPU.

Let's try cacheline size and above:

Size	Method			BW
64	copy_from_zero()	508MB/sec
128	copy_from_zero()	1.0GB/sec
64	clear_user()		513MB/sec
128	clear_user()		1.0GB/sec

Something like the below may make sense to do; the wins at bigger sizes
are substantial, and that gets me the best of both worlds. If we really
care, we could move the check earlier and not have it per-segment. I
doubt it matters in practice, though.

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e93fcfcf2176..f4b80ef446b9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1049,12 +1049,19 @@ static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 	return bytes;
 }
 
+static unsigned long copy_from_zero(void __user *buf, size_t len)
+{
+	if (len >= 128)
+		return copy_to_user(buf, page_address(ZERO_PAGE(0)), len);
+	return clear_user(buf, len);
+}
+
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
 	if (unlikely(iov_iter_is_pipe(i)))
 		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, base, len, count,
-		clear_user(base, len),
+		copy_from_zero(base, len),
 		memset(base, 0, len)
 	)

-- 
Jens Axboe
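
For reference, a rough userspace stand-in for the dd test above could
look like the sketch below: it just read()s /dev/zero in fixed-size
chunks and reports throughput. The program name, the 1 GiB transfer
size, and the timing details are illustrative choices, not part of the
patch or the original test setup.

/*
 * zero_read_bench.c - rough approximation of
 *   dd if=/dev/zero of=/dev/null bs=<size>
 * Reads 1 GiB from /dev/zero in <size>-byte chunks and prints MB/sec.
 *
 * Build: gcc -O2 -o zero_read_bench zero_read_bench.c
 * Run:   ./zero_read_bench 32
 *        ./zero_read_bench 4096
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t bs = argc > 1 ? strtoul(argv[1], NULL, 0) : 4096;
	size_t total = 1ULL << 30;	/* 1 GiB per run */
	char *buf = malloc(bs);
	struct timespec t0, t1;
	size_t done = 0;
	double secs;
	int fd;

	if (!buf)
		return 1;
	fd = open("/dev/zero", O_RDONLY);
	if (fd < 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (done < total) {
		ssize_t ret = read(fd, buf, bs);

		if (ret <= 0)
			return 1;
		done += ret;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("bs=%zu: %.1f MB/sec\n", bs, done / secs / (1024 * 1024));

	close(fd);
	free(buf);
	return 0;
}

Running it with 32, 1024, 4096, and 16384 as the argument mirrors the
block sizes used in the tables above.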