On Tue, Sep 13, 2022 at 05:40:46PM +0300, Amir Goldstein wrote:
> On Tue, Jun 21, 2022 at 3:53 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote:
> > > > > How exactly do you imagine the synchronization of buffered read against
> > > > > buffered write would work? Lock all pages for the read range in the page
> > > > > cache? You'd need to be careful to not bring the machine OOM when someone
> > > > > asks to read a huge range...
> > > >
> > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is
> > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(),
> > > > when reading data into the user buffer, but before that, I would like to issue
> > > > and wait for read of the pages in the range to reduce the probability
> > > > of doing the read I/O under XFS_IOLOCK_SHARED.
> > > >
> > > > The pre-warm of the page cache does not need to abide by the atomic read
> > > > semantics and it is also tolerable if some pages are evicted in between
> > > > pre-warm and read to user buffer - in the worst case this will result in
> > > > I/O amplification, but for the common case, it will be a big win for the
> > > > mixed random r/w performance on xfs.
> > > >
> > > > To reduce the risk of page cache thrashing we can limit this optimization
> > > > to a maximum number of pre-warmed pages.
> > > >
> > > > The questions are:
> > > > 1. Does this plan sound reasonable?
> > >
> > > Ah, I see now. So essentially the idea is to pull the readahead (which is
> > > currently happening from filemap_read() -> filemap_get_pages()) out from under
> > > the i_rwsem. It looks like a fine idea to me.
> >
>
> Although I was able to demonstrate performance improvement
> with page cache pre-warming on low latency disks, when testing
> on a common standard system [*], page cache pre-warming did not
> yield any improvement to the mixed rw workload.
>
> [*] I ran the following fio workload on an e2-standard-8 GCE machine:
>
> [global]
> filename=/mnt/xfs/testfile.fio
> norandommap
> randrepeat=0
> size=5G
> bs=8K
> ioengine=psync
> numjobs=8
> group_reporting=1
> direct=0
> fallocate=1
> end_fsync=0
> runtime=60
>
> [xfs-read]
> readwrite=randread
>
> [xfs-write]
> readwrite=randwrite
>
> The difference between ext4 and xfs with this machine/workload was
> two orders of magnitude:
>
> root@xfstests:~# fio ./ext4.fio
> ...
> Run status group 0 (all jobs):
> READ: bw=826MiB/s (866MB/s), 826MiB/s-826MiB/s (866MB/s-866MB/s),
> io=40.0GiB (42.9GB), run=49585-49585msec
> WRITE: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s),
> io=18.1GiB (19.5GB), run=60003-60003msec
>
> root@xfstests:~# fio ./xfs.fio
> ...
> Run status group 0 (all jobs):
> READ: bw=7053KiB/s (7223kB/s), 7053KiB/s-7053KiB/s
> (7223kB/s-7223kB/s), io=413MiB (433MB), run=60007-60007msec
> WRITE: bw=155MiB/s (163MB/s), 155MiB/s-155MiB/s (163MB/s-163MB/s),
> io=9324MiB (9777MB), run=60006-60006msec
>
> I verified that without XFS_IOLOCK_SHARED the xfs fio results are on par
> with the ext4 results for this workload.
>
> > > I just cannot comment on whether calling this without i_rwsem does not break
> > > some internal XFS expectations for stuff like reflink etc.
> >
> > reflink is done under xfs_ilock2_io_mmap => filemap_invalidate_lock_two,
> > so it should not be a problem.
> >
> > pNFS leases I need to look into.
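A minimal sketch of what the pre-warming idea above might look like in
xfs_file_buffered_read() is below.  The helper name, the pre-warm cap and
the use of page_cache_sync_readahead() are all illustrative assumptions,
not an actual patch, and the "wait for the pages to be read" part of the
proposal is omitted here:

#define XFS_PREWARM_MAX_PAGES	256	/* made-up cap against cache thrashing */

/* Kick off readahead for the read range before taking the iolock. */
static void
xfs_file_prewarm_pagecache(
	struct kiocb		*iocb,
	struct iov_iter		*to)
{
	struct file		*file = iocb->ki_filp;
	pgoff_t			first = iocb->ki_pos >> PAGE_SHIFT;
	pgoff_t			last;
	unsigned long		nr;

	if (!iov_iter_count(to))
		return;
	last = (iocb->ki_pos + iov_iter_count(to) - 1) >> PAGE_SHIFT;
	nr = min_t(unsigned long, last - first + 1, XFS_PREWARM_MAX_PAGES);

	page_cache_sync_readahead(file->f_mapping, &file->f_ra, file,
				  first, nr);
}

STATIC ssize_t
xfs_file_buffered_read(
	struct kiocb		*iocb,
	struct iov_iter		*to)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
	ssize_t			ret;

	/* Pre-warm outside the iolock; atomicity is still provided below. */
	xfs_file_prewarm_pagecache(iocb, to);

	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
	if (ret)
		return ret;
	ret = generic_file_read_iter(iocb, to);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);

	return ret;
}
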
> I wonder if xfs_fs_map_blocks() and xfs_fs_commit_blocks()
> should not be taking the invalidate lock before calling
> invalidate_inode_pages2(), like the xfs callers of
> truncate_pagecache_range() do?
>
> If we do that, then I don't see a problem with buffered read
> without XFS_IOLOCK_SHARED w.r.t. correctness of layout leases.
>
> Dave, Christoph,
>
> I know that you said that changing the atomic buffered read semantics
> is out of the question and that you also objected to a mount option
> (which nobody will know how to use) and I accept that.
>
> Given that a performant range locks implementation is not something
> trivial to accomplish (Dave, please correct me if I am wrong),
> and given the massive performance impact of XFS_IOLOCK_SHARED
> on this workload,
> what do you think about a POSIX_FADV_TORN_RW hint that a specific
> application can use to opt out of atomic buffered read semantics?
>
> The specific application that I want to modify to use this hint is Samba.
> Samba uses IO threads by default to issue pread/pwrite on the server
> for IO requested by the SMB client. The IO size is normally larger than
> the xfs block size and the range may not be block aligned.
>
> The SMB protocol has explicit byte range locks and the server implements
> them, so it is pretty safe to assume that a client that did not request
> range locks does not need xfs to do the implicit range locking for it.
>
> For this reason and because of the huge performance win,
> I would like to implement POSIX_FADV_TORN_RW in xfs and
> have Samba try to set this hint when supported.
>
> It is very much possible that NFSv4 servers (user and kernel)
> would also want to set this hint for very similar reasons.
>
> Thoughts?

How about range locks for i_rwsem and invalidate_lock?  That could
reduce contention on VM farms, though I can only assume that, given
that I don't have a reference implementation to play with...

--D

> Thanks,
> Amir.
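For the xfs_fs_map_blocks() / xfs_fs_commit_blocks() question quoted
above, the suggested change would presumably amount to wrapping the page
cache invalidation in the invalidate lock, roughly like the sketch below
(hypothetical helper name, not taken from an actual patch):

/*
 * Invalidate the page cache under the invalidate lock, the way the xfs
 * callers of truncate_pagecache_range() do, so that it serializes
 * against buffered reads that would no longer take XFS_IOLOCK_SHARED.
 */
static int
xfs_pnfs_invalidate_pagecache(
	struct inode		*inode)
{
	int			error;

	filemap_invalidate_lock(inode->i_mapping);
	error = invalidate_inode_pages2(inode->i_mapping);
	filemap_invalidate_unlock(inode->i_mapping);
	return error;
}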
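On the application side, a Samba-like opt-in would presumably be a single
hint per open file, along the lines of the sketch below.  Since
POSIX_FADV_TORN_RW is only a proposal, the constant's value and the helper
name here are made up:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#ifndef POSIX_FADV_TORN_RW
#define POSIX_FADV_TORN_RW	8	/* made-up value, not a real kernel ABI */
#endif

/* Ask the kernel to drop atomic buffered read vs. write semantics on fd. */
static void relax_read_atomicity(int fd)
{
	int err = posix_fadvise(fd, 0, 0, POSIX_FADV_TORN_RW);

	/* A kernel or filesystem without support would reject the hint. */
	if (err != 0 && err != EINVAL)
		fprintf(stderr, "posix_fadvise(TORN_RW): %s\n", strerror(err));
}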