Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload

On Tue, Jun 21, 2022 at 3:53 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Tue 21-06-22 10:49:48, Amir Goldstein wrote:
> > > > How exactly do you imagine the synchronization of buffered read against
> > > > buffered write would work? Lock all pages for the read range in the page
> > > > cache? You'd need to be careful to not bring the machine OOM when someone
> > > > asks to read a huge range...
> > >
> > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is
> > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(),
> > > when reading data into user buffer, but before that, I would like to issue
> > > and wait for read of the pages in the range to reduce the probability
> > > of doing the read I/O under XFS_IOLOCK_SHARED.
> > >
> > > The pre-warm of the page cache does not need to abide by the atomic read
> > > semantics and it is also tolerable if some pages are evicted in between
> > > pre-warm and the read into the user buffer - in the worst case this will
> > > result in I/O amplification, but for the common case, it will be a big
> > > win for the mixed random r/w performance on xfs.
> > >
> > > To reduce the risk of page cache thrashing we can limit this optimization
> > > to a maximum number of pre-warmed pages.
> > >
> > > The questions are:
> > > 1. Does this plan sound reasonable?
> >
> > Ah, I see now. So essentially the idea is to pull the readahead (which is
> > currently happening from filemap_read() -> filemap_get_pages()) out from under
> > the i_rwsem. It looks like a fine idea to me.
>

Although I was able to demonstrate a performance improvement from
page cache pre-warming on low latency disks, when testing on a more
typical system [*], page cache pre-warming did not yield any
improvement for the mixed rw workload.
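
For clarity, here is a minimal sketch of what "pre-warming outside the
iolock" could look like in xfs_file_buffered_read(). This is not the
actual POC patch - unlike the plan above, it only issues readahead and
does not wait for the pages:

static ssize_t
xfs_file_buffered_read(
	struct kiocb		*iocb,
	struct iov_iter		*to)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
	struct address_space	*mapping = iocb->ki_filp->f_mapping;
	pgoff_t			first, nr;
	ssize_t			ret;

	/*
	 * Best-effort page cache pre-warm outside of XFS_IOLOCK_SHARED,
	 * so that the read I/O is less likely to happen under the lock.
	 */
	if (iov_iter_count(to)) {
		first = iocb->ki_pos >> PAGE_SHIFT;
		nr = ((iocb->ki_pos + iov_iter_count(to) - 1) >> PAGE_SHIFT) -
				first + 1;
		page_cache_sync_readahead(mapping, &iocb->ki_filp->f_ra,
					  iocb->ki_filp, first, nr);
	}

	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
	if (ret)
		return ret;
	ret = generic_file_read_iter(iocb, to);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);

	return ret;
}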

[*] I ran the following fio workload on e2-standard-8 GCE machine:

[global]
filename=/mnt/xfs/testfile.fio
norandommap
randrepeat=0
size=5G
bs=8K
ioengine=psync
numjobs=8
group_reporting=1
direct=0
fallocate=1
end_fsync=0
runtime=60

[xfs-read]
readwrite=randread

[xfs-write]
readwrite=randwrite

The read bandwidth difference between ext4 and xfs with this
machine/workload was two orders of magnitude:

root@xfstests:~# fio ./ext4.fio
...
Run status group 0 (all jobs):
   READ: bw=826MiB/s (866MB/s), 826MiB/s-826MiB/s (866MB/s-866MB/s),
io=40.0GiB (42.9GB), run=49585-49585msec
  WRITE: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s),
io=18.1GiB (19.5GB), run=60003-60003msec

root@xfstests:~# fio ./xfs.fio
...
Run status group 0 (all jobs):
   READ: bw=7053KiB/s (7223kB/s), 7053KiB/s-7053KiB/s
(7223kB/s-7223kB/s), io=413MiB (433MB), run=60007-60007msec
  WRITE: bw=155MiB/s (163MB/s), 155MiB/s-155MiB/s (163MB/s-163MB/s),
io=9324MiB (9777MB), run=60006-60006msec

I verified that without XFS_IOLOCK_SHARED, the xfs fio results are on par
with the ext4 results for this workload.

>
> > just cannot comment on whether calling this without i_rwsem does not break
> > some internal XFS expectations for stuff like reflink etc.
>
> reflink is done under xfs_ilock2_io_mmap => filemap_invalidate_lock_two
> so it should not be a problem.
>
> pNFS leases I need to look into.
>

I wonder whether xfs_fs_map_blocks() and xfs_fs_commit_blocks()
should be taking the invalidate lock before calling
invalidate_inode_pages2(), like the xfs callers of
truncate_pagecache_range() do?

If we do that, then I don't see a problem with buffered read
without XFS_IOLOCK_SHARED w.r.t. correctness of layout leases.
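
Something along these lines is what I have in mind for
xfs_fs_map_blocks() - an untested sketch, with the surrounding code
and error handling elided:

	/*
	 * Untested sketch: serialize the page cache invalidation with
	 * anyone populating the page cache without XFS_IOLOCK_SHARED,
	 * like the xfs callers of truncate_pagecache_range() do.
	 */
	filemap_invalidate_lock(inode->i_mapping);
	error = invalidate_inode_pages2(inode->i_mapping);
	filemap_invalidate_unlock(inode->i_mapping);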

Dave, Christoph,

I know that you said that changing the atomic buffered read semantics
is out of the question, and that you also objected to a mount option
(which nobody will know how to use), and I accept that.

Given that a performant range lock implementation is not trivial to
accomplish (Dave, please correct me if I am wrong), and given the
massive performance impact of XFS_IOLOCK_SHARED on this workload,
what do you think about a POSIX_FADV_TORN_RW hint that a specific
application could use to opt out of atomic buffered read semantics?

The specific application that I want to modify to use this hint is Samba.
Samba uses IO threads by default to issue pread/pwrite on the server
for IO requested by the SMB client. The IO size is normally larger than
the xfs block size and the range may not be block aligned.

The SMB protocol has explicit byte range locks and the server implements
them, so it is pretty safe to assume that a client that did not request
range locks does not need xfs to do the implicit range locking for it.

For this reason and because of the huge performance win,
I would like to implement POSIX_FADV_TORN_RW in xfs and
have Samba try to set this hint when supported.
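
To illustrate, this is roughly how Samba (or any application) could
try to set the hint. The constant is only a proposal, so its name and
value below are placeholders:

#include <fcntl.h>

/* Placeholder - the hint does not exist yet and has no assigned value. */
#ifndef POSIX_FADV_TORN_RW
#define POSIX_FADV_TORN_RW	8
#endif

/*
 * Opt an fd out of atomic buffered read semantics if the kernel
 * supports the hint.  This is best effort: posix_fadvise() returns
 * EINVAL on an older kernel and we simply keep the default behavior.
 */
static void try_torn_rw_hint(int fd)
{
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_TORN_RW);
}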

It is quite possible that NFSv4 servers (userspace and kernel)
would also want to set this hint for very similar reasons.

Thoughts?

Thanks,
Amir.


