Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Tue, Sep 13, 2022 at 05:40:46PM +0300, Amir Goldstein wrote:
> On Tue, Jun 21, 2022 at 3:53 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote:
> > > > > How exactly do you imagine the synchronization of buffered read against
> > > > > buffered write would work? Lock all pages for the read range in the page
> > > > > cache? You'd need to be careful to not bring the machine OOM when someone
> > > > > asks to read a huge range...
> > > >
> > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is
> > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(),
> > > > when reading data into the user buffer, but before that, I would like to issue
> > > > and wait for reads of the pages in the range, to reduce the probability
> > > > of doing the read I/O under XFS_IOLOCK_SHARED.
> > > >
> > > > The pre-warm of the page cache does not need to abide by the atomic read
> > > > semantics, and it is also tolerable if some pages are evicted between the
> > > > pre-warm and the read into the user buffer - in the worst case this will
> > > > result in I/O amplification, but in the common case it will be a big win for
> > > > mixed random r/w performance on xfs.
> > > >
> > > > To reduce the risk of page cache thrashing, we can limit this optimization
> > > > to a maximum number of pre-warmed pages.
> > > >
> > > > The questions are:
> > > > 1. Does this plan sound reasonable?
> > >
> > > Ah, I see now. So essentially the idea is to pull the readahead (which is
> > > currently happening from filemap_read() -> filemap_get_pages()) out from under
> > > the i_rwsem. It looks like a fine idea to me.
> >
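To make that idea concrete, here is an untested sketch of what the pre-warm
could look like in fs/xfs/xfs_file.c.  The function name and the 256-page cap
are made up for illustration, and unlike the plan described above it only
issues readahead and does not wait for the pages:

static void
xfs_file_prewarm_pagecache(
	struct kiocb		*iocb,
	struct iov_iter		*to)
{
	struct file		*file = iocb->ki_filp;
	struct address_space	*mapping = file->f_mapping;
	size_t			count = iov_iter_count(to);
	pgoff_t			first = iocb->ki_pos >> PAGE_SHIFT;
	pgoff_t			last;
	unsigned long		nr;

	if (!count)
		return;

	last = (iocb->ki_pos + count - 1) >> PAGE_SHIFT;
	/* Arbitrary cap to limit page cache thrashing. */
	nr = min_t(unsigned long, last - first + 1, 256);

	/* Kick off reads into the page cache; no iolock is held here. */
	page_cache_sync_readahead(mapping, &file->f_ra, file, first, nr);
}

The call would go into xfs_file_buffered_read() right before
XFS_IOLOCK_SHARED is taken.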
> 
> Although I was able to demonstrate a performance improvement from page cache
> pre-warming on low-latency disks, when testing on a common standard system [*],
> page cache pre-warming did not yield any improvement for the mixed r/w workload.
> 
> [*] I ran the following fio workload on e2-standard-8 GCE machine:
> 
> [global]
> filename=/mnt/xfs/testfile.fio
> norandommap
> randrepeat=0
> size=5G
> bs=8K
> ioengine=psync
> numjobs=8
> group_reporting=1
> direct=0
> fallocate=1
> end_fsync=0
> runtime=60
> 
> [xfs-read]
> readwrite=randread
> 
> [xfs-write]
> readwrite=randwrite
> 
> The difference between ext4 and xfs with this machine/workload was
> two orders of magnitude:
> 
> root@xfstests:~# fio ./ext4.fio
> ...
> Run status group 0 (all jobs):
>    READ: bw=826MiB/s (866MB/s), 826MiB/s-826MiB/s (866MB/s-866MB/s), io=40.0GiB (42.9GB), run=49585-49585msec
>   WRITE: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s), io=18.1GiB (19.5GB), run=60003-60003msec
> 
> root@xfstests:~# fio ./xfs.fio
> ...
> Run status group 0 (all jobs):
>    READ: bw=7053KiB/s (7223kB/s), 7053KiB/s-7053KiB/s (7223kB/s-7223kB/s), io=413MiB (433MB), run=60007-60007msec
>   WRITE: bw=155MiB/s (163MB/s), 155MiB/s-155MiB/s (163MB/s-163MB/s), io=9324MiB (9777MB), run=60006-60006msec
> 
> I verified that without XFS_IOLOCK_SHARED xfs fio results are on par
> with ext4 results for this workload.
> 
> >
> > > I just cannot comment on whether calling this without i_rwsem breaks
> > > some internal XFS expectations for stuff like reflink etc.
> >
> > reflink is done under xfs_ilock2_io_mmap => filemap_invalidate_lock_two
> > so it should not be a problem.
> >
> > pNFS leases I need to look into.
> >
> 
> I wonder whether xfs_fs_map_blocks() and xfs_fs_commit_blocks()
> should be taking the invalidate lock before calling
> invalidate_inode_pages2(), like the xfs callers of
> truncate_pagecache_range() do?
> 
> If we do that, then I don't see a problem with buffered read
> without XFS_IOLOCK_SHARED w.r.t. correctness of layout leases.
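If I read that suggestion right, the change would look roughly like the
following in xfs_fs_map_blocks() (untested sketch; the out_unlock label is
invented here, and xfs might equally spell this as
xfs_ilock(ip, XFS_MMAPLOCK_EXCL)):

	/* Keep page faults away while we shoot down the page cache. */
	filemap_invalidate_lock(inode->i_mapping);
	error = invalidate_inode_pages2(inode->i_mapping);
	filemap_invalidate_unlock(inode->i_mapping);
	if (error)
		goto out_unlock;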
> 
> Dave, Christoph,
> 
> I know that you said that changing the atomic buffered read semantics
> is out of the question, and that you also objected to a mount option
> (which nobody would know how to use), and I accept that.
> 
> Given that a performant range lock implementation is not something
> trivial to accomplish (Dave, please correct me if I am wrong),
> and given the massive performance impact of XFS_IOLOCK_SHARED
> on this workload, what do you think about a POSIX_FADV_TORN_RW
> hint that a specific application could use to opt out of atomic
> buffered read semantics?
> 
> The specific application that I want to modify to use this hint is Samba.
> Samba uses IO threads by default to issue pread/pwrite on the server
> for IO requested by the SMB client. The IO size is normally larger than
> xfs block size and the range may not be block aligned.
> 
> The SMB protocol has explicit byte range locks and the server implements
> them, so it is pretty safe to assume that a client that did not request
> range locks does not need xfs to do the implicit range locking for it.
> 
> For this reason and because of the huge performance win,
> I would like to implement POSIX_FADV_TORN_RW in xfs and
> have Samba try to set this hint when supported.
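For what it's worth, the application side of that would be tiny.  Here is a
sketch of how Samba (or any file server) could opt in, with the caveat that
POSIX_FADV_TORN_RW is only a proposal and the value below is a placeholder,
not an allocated advice number:

#include <fcntl.h>
#include <stdio.h>

#ifndef POSIX_FADV_TORN_RW
#define POSIX_FADV_TORN_RW	8	/* placeholder; no number is allocated */
#endif

/* Opt this fd out of atomic buffered read semantics, if the kernel allows. */
static void try_torn_rw_hint(int fd)
{
	/* posix_fadvise() returns an errno value; old kernels give EINVAL. */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_TORN_RW) != 0)
		fprintf(stderr, "torn r/w hint not supported, keeping atomic reads\n");
}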
> 
> It is very much possible that NFSv4 servers (userspace and kernel)
> would also want to set this hint for very similar reasons.
> 
> Thoughts?

How about range locks for i_rwsem and invalidate_lock?  That could
reduce contention on VM farms, though that is only an assumption,
since I don't have a reference implementation to play with...
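Purely as a strawman, the interface I have in mind is something like the
below.  None of these names exist in mainline; the point is only that readers
and writers would serialize just when their byte ranges overlap:

/* Hypothetical range lock API; all names invented for illustration. */
struct range_rwsem;	/* one per inode, protecting [start, end] byte ranges */

void range_down_read(struct range_rwsem *sem, loff_t start, loff_t end);
void range_up_read(struct range_rwsem *sem, loff_t start, loff_t end);
void range_down_write(struct range_rwsem *sem, loff_t start, loff_t end);
void range_up_write(struct range_rwsem *sem, loff_t start, loff_t end);

/*
 * A buffered read would then take a shared lock on just its own range,
 * e.g. in xfs_file_buffered_read():
 *
 *	loff_t start = iocb->ki_pos;
 *	loff_t end = start + iov_iter_count(to) - 1;
 *
 *	range_down_read(&ip->i_iorange, start, end);
 *	ret = generic_file_read_iter(iocb, to);
 *	range_up_read(&ip->i_iorange, start, end);
 */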

--D

> Thanks,
> Amir.


