On 5/21/22 1:03 PM, Jens Axboe wrote:
> On 5/21/22 11:48 AM, Al Viro wrote:
>> [resurrecting an old thread]
>>
>> On Mon, Jun 21, 2021 at 04:35:01PM +0200, Christoph Hellwig wrote:
>>> On Mon, Jun 21, 2021 at 02:32:46PM +0000, Al Viro wrote:
>>>> I'd rather have a single helper for those checks, rather than
>>>> open-coding IS_SYNC() + IOCB_DSYNC in each, for obvious reasons...
>>>
>>> Yes, I think something like:
>>>
>>> static inline bool iocb_is_sync(struct kiocb *iocb)
>>> {
>>> 	return (iocb->ki_flags & IOCB_DSYNC) ||
>>> 		IS_SYNC(iocb->ki_filp->f_mapping->host);
>>> }
>>>
>>> should do the job.
>>
>> There's a problem with that variant. Take a look at btrfs_direct_write():
>>
>> 	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
>> 	...
>> 	/*
>> 	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
>> 	 * calls generic_write_sync() (through iomap_dio_complete()), because
>> 	 * that results in calling fsync (btrfs_sync_file()) which will try to
>> 	 * lock the inode in exclusive/write mode.
>> 	 */
>> 	if (is_sync_write)
>> 		iocb->ki_flags &= ~IOCB_DSYNC;
>> 	...
>> 	/*
>> 	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
>> 	 * the fsync (call generic_write_sync()).
>> 	 */
>> 	if (is_sync_write)
>> 		iocb->ki_flags |= IOCB_DSYNC;
>>
>> will run into trouble. How about this (completely untested):
>>
>> Precalculate iocb_flags()
>>
>> Store the value in file->f_i_flags; calculate it at open time
>> (do_dentry_open() for opens, alloc_file() for pipe(2)/socket(2)/etc.)
>> and update it at fcntl(F_SETFL) time.
>>
>> IOCB_DSYNC is... special in that respect; replace checks for it with an
>> inlined helper (iocb_is_dsync()) and defer the checks for the underlying
>> fs being mounted with -o sync and/or the inode being marked sync until
>> I/O time. To avoid breaking the btrfs deadlock avoidance above, add an
>> explicit "no dsync" flag that suppresses IOCB_DSYNC; that way
>> btrfs_direct_write() can set it for the duration of the work where it
>> wants to avoid triggering generic_write_sync().
>>
>> That ought to reduce the overhead of new_sync_{read,write}() quite a bit.
>>
>> NEEDS TESTING; NOT FOR MERGE IN THAT FORM.
>
> Definitely generates better code here, unsurprisingly. Unfortunately it
> doesn't seem to help a whole lot for a direct comparison, though it does
> nudge us in the right direction.
>
> The main issue here, using urandom and 4k reads (iter reads done,
> non-iter writes), is the copy overhead. With the non-iter reads, it
> looks like this:
>
> +    1.80%  dd  [kernel.kallsyms]  [k] __arch_copy_to_user
> +    0.74%  dd  [kernel.kallsyms]  [k] _copy_to_user
>
> and with the iter variant, for the same workload, it looks like this:
>
> +    4.13%  dd  [kernel.kallsyms]  [k] _copy_to_iter
> +    0.88%  dd  [kernel.kallsyms]  [k] __arch_copy_to_user
> +    0.72%  dd  [kernel.kallsyms]  [k] copyout
>
> plus about 1% just doing the iov_iter advance. Since this test case is a
> single-iovec read, I ran a quick test where we simply use this helper:
>
> static size_t random_copy(void *src, int size, struct iov_iter *to)
> {
> 	const struct iovec *iov = to->iov;
> 	size_t to_copy = min_t(size_t, size, iov->iov_len);
>
> 	if (copy_to_user(iov->iov_base, src, to_copy))
> 		return 0;
>
> 	to->count -= to_copy;
> 	return to_copy;
> }
>
> rather than copy_to_iter().
> That brings us within 0.8% of the non-iter read performance:
>
> non-iter:	408MB/sec
> iter:		395MB/sec	-3.2%
> iter+al+hack:	406MB/sec	-0.8%
>
> for 32-byte reads:
>
> non-iter:	73.1MB/sec
> iter:		72.1MB/sec	-1.4%
> iter+al+hack:	72.5MB/sec	-0.8%
>
> which, as mentioned in my initial email, is closer than the 4k case, but
> still sees a nice reduction in overhead.

Experimented a bit more - if we special case nr_segs == 1 && is_iovec in
_copy_to_iter():

_copy_to_iter()
{
	...
	if (iter_is_iovec(i)) {
		might_fault();
		if (i->nr_segs == 1) {
			/* copy */
			iov_iter_advance(i, copied);
			return copied;
		}
	}
	iterate_and_advance(i, bytes, base, len, off, ...
	return bytes;
}

Then we're almost on par, and it looks like we just need to special case
iov_iter_advance() for nr_segs == 1 as well to be fully on par. This is
on top of your patch as well, fwiw.

It might make sense to special case the single-segment case for setup,
iteration, and advancing. With that, I think we'll be where we want to
be, and there will be no discernible difference between the iter paths
and the old style paths.

-- 
Jens Axboe
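
A side note on the "no dsync" suppression idea quoted above: a minimal
sketch of what iocb_is_dsync() plus such a flag might look like is below.
The flag name IOCB_NO_DSYNC and the btrfs usage are assumptions for
illustration only, not code from this thread or from a posted patch:

	/* Sketch only; IOCB_NO_DSYNC is a hypothetical flag name. */
	static inline bool iocb_is_dsync(const struct kiocb *iocb)
	{
		/* explicit suppression wins over both checks below */
		if (iocb->ki_flags & IOCB_NO_DSYNC)
			return false;
		return (iocb->ki_flags & IOCB_DSYNC) ||
			IS_SYNC(iocb->ki_filp->f_mapping->host);
	}

	/* btrfs_direct_write() could then bracket the dio call, leaving
	 * the fsync to its caller via generic_write_sync(): */
	iocb->ki_flags |= IOCB_NO_DSYNC;
	written = iomap_dio_rw(iocb, from, ...);
	iocb->ki_flags &= ~IOCB_NO_DSYNC;

Returning false outright while the flag is set also covers the IS_SYNC()
case, which is exactly what broke the original iocb_is_sync() proposal
for btrfs.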
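
And on the nr_segs == 1 special-casing at the end: for illustration, a
combined single-segment copy-and-advance fast path might reduce to
something like the following. The helper name and exact structure are
assumptions, not from the thread; the real _copy_to_iter() must keep
handling multi-segment, kvec, bvec, pipe and xarray iterators:

	/*
	 * Hypothetical ITER_IOVEC, nr_segs == 1 fast path: copy and
	 * advance without walking segments. Mirrors the semantics of
	 * the random_copy() test helper above, but respects any part
	 * of the segment already consumed (iov_offset).
	 */
	static size_t copy_to_iter_single(const void *src, size_t bytes,
					  struct iov_iter *i)
	{
		const struct iovec *iov = i->iov;
		size_t n = min(bytes, iov->iov_len - i->iov_offset);

		might_fault();
		if (copy_to_user(iov->iov_base + i->iov_offset, src, n))
			return 0;

		/* the nr_segs == 1 advance is just two field updates */
		i->iov_offset += n;
		i->count -= n;
		return n;
	}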