On Thu, 7 Jan 2021, Matthew Wilcox wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > I'd like to ask about this piece of code in __kernel_read:
> > 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > 		return warn_unsupported...
> > and __kernel_write:
> > 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > 		return warn_unsupported...
> >
> > - It exits with an error if both read_iter and read or write_iter and
> > write are present.
> >
> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads
> > the same 4k page over and over again - and the cost of creating and
> > parsing the kiocb and iov_iter structures is just that high.
>
> Which part of it is so expensive?

The read_iter path is much bigger:
	vfs_read		- 0x160 bytes
	new_sync_read		- 0x160 bytes
	nvfs_rw_iter		- 0x100 bytes
	nvfs_rw_iter_locked	- 0x4a0 bytes
	iov_iter_advance	- 0x300 bytes

If we go with the "read" method, there's just:
	vfs_read		- 0x160 bytes
	nvfs_read		- 0x200 bytes

> Is it worth, eg adding an iov_iter type that points to a single buffer
> instead of a single-member iov?
>
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>
>  enum iter_type {
>  	/* iter types */
> +	ITER_UBUF = 2,
>  	ITER_IOVEC = 4,
>  	ITER_KVEC = 8,
>  	ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
>  	size_t iov_offset;
>  	size_t count;
>  	union {
> +		void __user *buf;
>  		const struct iovec *iov;
>  		const struct kvec *kvec;
>  		const struct bio_vec *bvec;
>
> and then doing all the appropriate changes to make that work.
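To make that concrete, the initialization of such a single-buffer iterator
could look roughly like this - a sketch only, on top of the uio.h diff
above; iov_iter_ubuf() is a name I'm making up here, it is not existing
API:

	/*
	 * Sketch: build an iov_iter that carries one user buffer directly,
	 * using the ITER_UBUF type and the "buf" union member proposed
	 * above, instead of wrapping the buffer in a one-element iovec.
	 * Uses the same type+direction encoding as iov_iter_init().
	 */
	static inline void iov_iter_ubuf(struct iov_iter *i,
					 unsigned int direction,
					 void __user *buf, size_t count)
	{
		WARN_ON(direction & ~(READ | WRITE));
		i->type = ITER_UBUF | (direction & (READ | WRITE));
		i->buf = buf;
		i->iov_offset = 0;
		i->count = count;
	}

new_sync_read() and new_sync_write() could then presumably call this
instead of filling in a one-element iovec and calling iov_iter_init(),
and the iterate/advance helpers would need an ITER_UBUF case that
dereferences i->buf directly.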
I tried this benchmark on nvfs:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	unsigned long i;
	unsigned long l = 1UL << 38;
	unsigned s = 4096;
	void *a = valloc(s);
	if (!a) perror("malloc"), exit(1);
	for (i = 0; i < l; i += s) {
		if (pread(0, a, s, 0) != s) perror("read"), exit(1);
	}
	return 0;
}

Result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1049885560
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .....................................
#
    47.32%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     7.83%  pread    [kernel.vmlinux]  [k] current_time
     6.57%  pread    [nvfs]            [k] nvfs_rw_iter_locked
     5.59%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.23%  pread    libc-2.31.so      [.] __libc_pread
     3.51%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.34%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.34%  pread    [kernel.vmlinux]  [k] vfs_read
     2.34%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.31%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.21%  pread    [nvfs]            [k] nvfs_bmap
     1.89%  pread    [kernel.vmlinux]  [k] iov_iter_advance
     1.71%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.59%  pread    [kernel.vmlinux]  [k] atime_needs_update
     1.24%  pread    [nvfs]            [k] nvfs_rw_iter
     0.94%  pread    [kernel.vmlinux]  [k] touch_atime
     0.75%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.72%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.68%  pread    [kernel.vmlinux]  [k] down_read
     0.62%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.52%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.47%  pread    [kernel.vmlinux]  [k] __fget_light
     0.46%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.42%  pread    pread             [.] main
     0.33%  pread    [kernel.vmlinux]  [k] up_read
     0.29%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.03%  pread    pread             [.] pread@plt
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: Use --symfs <dir> if your symbol files are in non-standard locations)
#

Result, using the read method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1312158116
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .....................................
#
    60.77%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.14%  pread    [kernel.vmlinux]  [k] current_time
     3.88%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     3.55%  pread    libc-2.31.so      [.] __libc_pread
     3.04%  pread    [nvfs]            [k] nvfs_bmap
     2.84%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.71%  pread    [nvfs]            [k] nvfs_read
     2.56%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.00%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.98%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     1.77%  pread    [kernel.vmlinux]  [k] vfs_read
     1.35%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.94%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.91%  pread    [kernel.vmlinux]  [k] __fget_light
     0.83%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.70%  pread    [kernel.vmlinux]  [k] down_read
     0.70%  pread    [kernel.vmlinux]  [k] touch_atime
     0.65%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.55%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] up_read
     0.44%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.39%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.34%  pread    pread             [.] main
     0.26%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    pread             [.] pread@plt
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: To set sample time separation other than 100ms with --sort time use --time-quantum)
#

Note that if we sum the percentages of nvfs_rw_iter_locked, new_sync_read,
iov_iter_advance and nvfs_rw_iter in the first trace, we get 12.01%
(6.57% + 2.31% + 1.89% + 1.24%). On the other hand, in the second trace,
nvfs_read consumes just 2.71% - and it replaces the functionality of all
these functions. That is the reason for the 10% degradation with read_iter.

Mikulas