On Thu, Jan 07, 2021 at 01:59:01PM -0500, Mikulas Patocka wrote:
> On Thu, 7 Jan 2021, Matthew Wilcox wrote:
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > I'd like to ask about this piece of code in __kernel_read:
> > > 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > > 		return warn_unsupported...
> > > and __kernel_write:
> > > 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > > 		return warn_unsupported...
> > >
> > > It exits with an error if both read_iter and read (or write_iter
> > > and write) are present.
> > >
> > > I found out that on NVFS, reading a file with the read method has
> > > 10% better performance than the read_iter method. The benchmark
> > > just reads the same 4k page over and over again - and the cost of
> > > creating and parsing the kiocb and iov_iter structures is just
> > > that high.
> >
> > Which part of it is so expensive?
>
> The read_iter path is much bigger:
> 	vfs_read            - 0x160 bytes
> 	new_sync_read       - 0x160 bytes
> 	nvfs_rw_iter        - 0x100 bytes
> 	nvfs_rw_iter_locked - 0x4a0 bytes
> 	iov_iter_advance    - 0x300 bytes

The number of bytes in a function isn't really correlated with how
expensive that function is.  That said, looking at new_sync_read()
shows one part that's particularly bad, init_sync_kiocb():

static inline int iocb_flags(struct file *file)
{
	int res = 0;
	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
     7ec:	8b 57 40             	mov    0x40(%rdi),%edx
     7ef:	48 89 75 80          	mov    %rsi,-0x80(%rbp)
	if (file->f_flags & O_DIRECT)
     7f3:	89 d0                	mov    %edx,%eax
     7f5:	c1 e8 06             	shr    $0x6,%eax
     7f8:	83 e0 10             	and    $0x10,%eax
		res |= IOCB_DIRECT;
	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
     7fb:	89 c1                	mov    %eax,%ecx
     7fd:	81 c9 00 00 02 00    	or     $0x20000,%ecx
     803:	f6 c6 40             	test   $0x40,%dh
     806:	0f 45 c1             	cmovne %ecx,%eax
		res |= IOCB_DSYNC;
     809:	f6 c6 10             	test   $0x10,%dh
     80c:	75 18                	jne    826 <new_sync_read+0x66>
     80e:	48 8b 8f d8 00 00 00 	mov    0xd8(%rdi),%rcx
     815:	48 8b 09             	mov    (%rcx),%rcx
     818:	48 8b 71 28          	mov    0x28(%rcx),%rsi
     81c:	f6 46 50 10          	testb  $0x10,0x50(%rsi)
     820:	0f 84 e2 00 00 00    	je     908 <new_sync_read+0x148>
	if (file->f_flags & __O_SYNC)
     826:	83 c8 02             	or     $0x2,%eax
		res |= IOCB_SYNC;
	return res;
     829:	89 c1                	mov    %eax,%ecx
     82b:	83 c9 04             	or     $0x4,%ecx
     82e:	81 e2 00 00 10 00    	and    $0x100000,%edx

We could optimise this by, eg, checking for (__O_SYNC | O_DIRECT |
O_APPEND) and returning 0 if none of them are set, since they're all
pretty rare.  It might be better to maintain an f_iocb_flags in the
struct file and just copy that unconditionally.  We'd need to remember
to update it in fcntl(F_SETFL), but I think that's the only place.
(Rough sketch at the end of this mail.)

> If we go with the "read" method, there's just:
> 	vfs_read  - 0x160 bytes
> 	nvfs_read - 0x200 bytes
>
> > Is it worth, eg adding an iov_iter type that points to a single
> > buffer instead of a single-member iov?
>
>   6.57%  pread  [nvfs]            [k] nvfs_rw_iter_locked
>   2.31%  pread  [kernel.vmlinux]  [k] new_sync_read
>   1.89%  pread  [kernel.vmlinux]  [k] iov_iter_advance
>   1.24%  pread  [nvfs]            [k] nvfs_rw_iter
>   0.29%  pread  [kernel.vmlinux]  [k] iov_iter_init
>
>   2.71%  pread  [nvfs]            [k] nvfs_read
>
> Note that if we sum the percentages of nvfs_rw_iter_locked,
> new_sync_read, iov_iter_advance and nvfs_rw_iter, we get 12.01%. On
> the other hand, in the second trace, nvfs_read consumes just 2.71% -
> and it replaces the functionality of all of these functions.
>
> That is the reason for the 10% degradation with read_iter.

You seem to be focusing on your argument for "let's just permit
filesystems to implement both ->read and ->read_iter".
My suggestion is that we need to optimise the ->read_iter path, but to do that we need to know what's expensive. nvfs_rw_iter_locked() looks very complicated. I suspect it can be simplified. Of course new_sync_read() needs to be improved too, as do the other functions here, but fully a third of the difference between read() and read_iter() is the difference between nvfs_read() and nvfs_rw_iter_locked().
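
For concreteness, here's a rough, untested sketch of the cached-flags
idea.  f_iocb_flags and file_update_iocb_flags() are made-up names,
and the IS_SYNC() test is the awkward part: the inode's sync bit can
change after open (eg by remount), so it probably has to stay in the
I/O path:

	/* hypothetical new field in struct file, next to f_flags */
	struct file {
		...
		unsigned int		f_flags;
		unsigned int		f_iocb_flags;	/* cached flag translation */
		...
	};

	/* call from do_dentry_open() and from setfl() for fcntl(F_SETFL) */
	static void file_update_iocb_flags(struct file *file)
	{
		int res = 0;

		if (file->f_flags & O_APPEND)
			res |= IOCB_APPEND;
		if (file->f_flags & O_DIRECT)
			res |= IOCB_DIRECT;
		if (file->f_flags & O_DSYNC)
			res |= IOCB_DSYNC;
		if (file->f_flags & __O_SYNC)
			res |= IOCB_SYNC;
		file->f_iocb_flags = res;
	}

	static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
	{
		*kiocb = (struct kiocb) {
			.ki_filp = filp,
			/* unconditional copy, no recomputation from f_flags */
			.ki_flags = filp->f_iocb_flags,
			.ki_hint = ki_hint_validate(file_write_hint(filp)),
			.ki_ioprio = get_current_ioprio(),
		};
		/* IS_SYNC() can't be cached at open time */
		if (IS_SYNC(filp->f_mapping->host))
			kiocb->ki_flags |= IOCB_DSYNC;
	}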
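
And on the single-buffer iov_iter question quoted above: a hypothetical
new flavour could carry the user pointer directly, so callers wouldn't
have to build a one-element iovec and iov_iter_advance() wouldn't have
to walk segments.  None of these names exist in the tree; this is just
a sketch of the shape:

	/* hypothetical new iter type bit; iov_iter grows a ubuf member */
	#define ITER_UBUF	128

	static inline void iov_iter_ubuf(struct iov_iter *i,
			unsigned int direction, void __user *buf, size_t count)
	{
		*i = (struct iov_iter) {
			.type = ITER_UBUF | direction,
			.ubuf = buf,
			.count = count,
			.iov_offset = 0,
		};
	}

	/* advancing becomes plain arithmetic - no segment walk */
	static inline void iov_iter_ubuf_advance(struct iov_iter *i, size_t size)
	{
		i->iov_offset += size;
		i->count -= size;
	}

new_sync_read() could then call iov_iter_ubuf(&iter, READ, buf, len)
instead of initialising a single-member iovec.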