On 5/21/22 11:48 AM, Al Viro wrote:
> [resurrecting an old thread]
>
> On Mon, Jun 21, 2021 at 04:35:01PM +0200, Christoph Hellwig wrote:
>> On Mon, Jun 21, 2021 at 02:32:46PM +0000, Al Viro wrote:
>>> I'd rather have a single helper for those checks, rather than
>>> open-coding IS_SYNC() + IOCB_DSYNC in each, for obvious reasons...
>>
>> Yes, I think something like:
>>
>> static inline bool iocb_is_sync(struct kiocb *iocb)
>> {
>>         return (iocb->ki_flags & IOCB_DSYNC) ||
>>                 IS_SYNC(iocb->ki_filp->f_mapping->host);
>> }
>>
>> should do the job.
>
> There's a problem with that variant.  Take a look at btrfs_direct_write():
>
>         const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
>         ...
>         /*
>          * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
>          * calls generic_write_sync() (through iomap_dio_complete()), because
>          * that results in calling fsync (btrfs_sync_file()) which will try to
>          * lock the inode in exclusive/write mode.
>          */
>         if (is_sync_write)
>                 iocb->ki_flags &= ~IOCB_DSYNC;
>         ...
>
>         /*
>          * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
>          * the fsync (call generic_write_sync()).
>          */
>         if (is_sync_write)
>                 iocb->ki_flags |= IOCB_DSYNC;
>
> will run into trouble.  How about this (completely untested):
>
> Precalculate iocb_flags()
>
> Store the value in file->f_i_flags; calculate at open time (do_dentry_open()
> for opens, alloc_file() for pipe(2)/socket(2)/etc.), update at FCNTL_SETFL
> time.
>
> IOCB_DSYNC is... special in that respect; replace checks for it with
> an inlined helper (iocb_is_dsync()), leave the checks of underlying fs
> mounted with -o sync and/or inode being marked sync until then.
> To avoid breaking btrfs deadlock avoidance, add an explicit "no dsync" flag
> that would suppress IOCB_DSYNC; that way btrfs_direct_write() can set it
> for the duration of work where it wants to avoid generic_write_sync()
> triggering.
>
> That ought to reduce the overhead of new_sync_{read,write}() quite a bit.
>
> NEEDS TESTING; NOT FOR MERGE IN THAT FORM.

Definitely generates better code here, unsurprisingly. Unfortunately it
doesn't seem to help a whole lot for a direct comparison, though it does
nudge us in the right direction.

The main issue here, using urandom and 4k reads (iter reads done,
non-iter writes), seems to be the copy overhead. With the non-iter
reads it looks like this:

+    1.80%  dd  [kernel.kallsyms]  [k] __arch_copy_to_user
+    0.74%  dd  [kernel.kallsyms]  [k] _copy_to_user

and with the iter variant, for the same workload, it looks like this:

+    4.13%  dd  [kernel.kallsyms]  [k] _copy_to_iter
+    0.88%  dd  [kernel.kallsyms]  [k] __arch_copy_to_user
+    0.72%  dd  [kernel.kallsyms]  [k] copyout

and about 1% just doing the iov_iter advance.

Since this test case is a single-iovec read, I ran a quick test where we
simply use this helper:

static size_t random_copy(void *src, int size, struct iov_iter *to)
{
        const struct iovec *iov = to->iov;
        size_t to_copy = min_t(size_t, size, iov->iov_len);

        if (copy_to_user(iov->iov_base, src, to_copy))
                return 0;

        to->count -= to_copy;
        return to_copy;
}

rather than copy_to_iter(). That brings us within 0.8% of the non-iter
read performance:

non-iter:       408MB/sec
iter:           395MB/sec       -3.2%
iter+al+hack    406MB/sec       -0.8%

for 32-byte reads:

non-iter        73.1MB/sec
iter:           72.1MB/sec      -1.4%
iter+al+hack    72.5MB/sec      -0.8%

which as mentioned in my initial email is closer than the 4k case, but
still sees a nice reduction in overhead.

-- 
Jens Axboe
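
For concreteness, the helper Al describes above might end up looking
roughly like this. This is only a sketch and completely untested;
IOCB_NO_DSYNC is just a placeholder name for the proposed "no dsync"
suppression flag, and the final form may well differ:

static inline bool iocb_is_dsync(const struct kiocb *iocb)
{
        /*
         * An explicit suppression flag (placeholder name IOCB_NO_DSYNC)
         * turns off dsync handling entirely, so btrfs_direct_write() can
         * set it around iomap_dio_rw() instead of clearing and restoring
         * IOCB_DSYNC to dodge the generic_write_sync() deadlock.
         */
        if (iocb->ki_flags & IOCB_NO_DSYNC)
                return false;

        /* Delay the sync-mount/sync-inode check until a caller asks */
        return (iocb->ki_flags & IOCB_DSYNC) ||
                IS_SYNC(iocb->ki_filp->f_mapping->host);
}

Callers that currently test iocb->ki_flags & IOCB_DSYNC would switch to
iocb_is_dsync(iocb), and btrfs_direct_write() would set the suppression
flag for the duration of the direct write rather than fiddling with
IOCB_DSYNC itself.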