On 9/10/21 7:57 AM, Jens Axboe wrote: > On 9/9/21 9:36 PM, Al Viro wrote: >> On Thu, Sep 09, 2021 at 09:30:03PM -0600, Jens Axboe wrote: >> >>>> Again, we should never, ever modify the iovec (or bvec, etc.) array in >>>> ->read_iter()/->write_iter()/->sendmsg()/etc. instances. If you see >>>> such behaviour anywhere, report it immediately. Any such is a blatant >>>> bug. >>> >>> Yes that was wrong, the iovec is obviously const. But that really >>> doesn't change the original point, which was that copying the iov_iter >>> itself unconditionally would be miserable. >> >> Might very well be true, but... won't your patch hit the reimport on >> every short read? And the cost of uaccess in there is *much* higher >> than copying of 48 bytes into local variable... >> >> Or am I misreading your patch? Note that short reads on reaching >> EOF are obviously normal - it's not a rare case at all. > > It was just a quick hack, might very well be too eager to go through > those motions. But pondering this instead of sleeping, we don't need to > copy all of iov_iter in order to restore the state, and we can use the > same advance after restoring. So something like this may be more > palatable. Caveat - again untested, and I haven't tested the performance > impact of this at all. Passes basic testing for me. I added a sysctl switch for this so I can compare performance, running my usual peak-perf-single-core benchmark. That one does ~3.5M IOPS, using polled IO. There's always a slight variability between boots and builds, hence the sysctl so I could toggle this behavior on the fly. Did a few runs, and the differences are very stable. With this enabled, we spend about 0.15% more time in io_read(). That's only worth about 5K IOPS at 3.5M, not enough to notice as the variation for the 1 second reporting window usually swings more than that: Old behavior: IOPS=3536512, IOS/call=32/31, inflight=(75) IOPS=3541888, IOS/call=32/32, inflight=(64) IOPS=3529056, IOS/call=32/31, inflight=(119) IOPS=3521184, IOS/call=32/32, inflight=(96) IOPS=3527456, IOS/call=32/31, inflight=(128) IOPS=3525504, IOS/call=32/32, inflight=(128) IOPS=3524288, IOS/call=32/32, inflight=(128) IOPS=3536192, IOS/call=32/32, inflight=(96) IOPS=3535840, IOS/call=32/32, inflight=(96) IOPS=3533728, IOS/call=32/31, inflight=(128) IOPS=3528384, IOS/call=32/32, inflight=(128) IOPS=3518400, IOS/call=32/32, inflight=(64) Turning it on: IOPS=3533824, IOS/call=32/31, inflight=(64) IOPS=3541408, IOS/call=32/32, inflight=(32) IOPS=3533024, IOS/call=32/31, inflight=(64) IOPS=3528672, IOS/call=32/32, inflight=(35) IOPS=3522272, IOS/call=32/31, inflight=(107) IOPS=3517632, IOS/call=32/32, inflight=(57) IOPS=3516000, IOS/call=32/31, inflight=(96) IOPS=3513568, IOS/call=32/32, inflight=(34) IOPS=3525600, IOS/call=32/31, inflight=(96) IOPS=3527136, IOS/call=32/31, inflight=(101) I think that's tolerable, it was never going to be absolutely free. What do you think of this approach? Parts of iov_iter are going to remain constant, like iter_type and data_source. io_uring already copies iter->count, so that just leaves restoring iov (and unionized friends), nr_segs union, and iov_offset; We could pretty this up and have the state part be explicit in iov_iter, and have the store/restore parts end up in uio.h. That'd tie them closer together, though I don't expect iov_iter changes to be an issue. It would make it more maintainable, though. I'll try and hack up this generic solution, see if that looks any better. -- Jens Axboe