Unlike many other Linux filesystems, xfs takes a shared inode lock on
buffered reads and an exclusive inode lock on buffered writes to
guarantee that buffered writes are atomic w.r.t. buffered reads.
ext4, for example, takes an exclusive inode lock on buffered writes,
but does not take an inode lock on buffered reads. This difference in
behavior is attributed to legacy xfs code that pre-dates Linux.

The contention on the inode lock in a concurrent mixed randrw workload
causes severe performance degradation in xfs compared to ext4.

When the write range is within the bounds of a single page, the page
lock guarantees that the write is atomic w.r.t. buffered reads. In that
case, take a shared inode lock on write to avoid the unneeded contention
in case of a randrw workload using 4K IO buffer size.

This optimization could be extended to slightly larger IO buffer sizes
that fall within the bounds of a single folio, by limiting iomap to map
a single folio and, if that fails, falling back to the exclusive inode
lock.

The performance improvement was demonstrated by running the following
fio workload on an e2-standard-8 GCE machine:

  [global]
  filename=testfile.fio
  norandommap
  randrepeat=0
  size=5G
  bs=4K
  ioengine=psync
  numjobs=8
  group_reporting=1
  direct=0
  fallocate=1
  end_fsync=0
  runtime=60

  [read]
  readwrite=randread

  [write]
  readwrite=randwrite

The numbers in the table below represent the range of results in
different test runs per tested filesystem. Each test was run on a
freshly formatted fs with a single test file with written extents and
cold caches.

  FS:     ext4              xfs                xfs+ (this patch)
  READ:   136MB/s-210MB/s   2.7MB/s-6.3MB/s    747MB/s-771MB/s
  WRITE:  227MB/s-248MB/s   86.3MB/s-171MB/s   973MB/s-1050MB/s

Needless to say, xfs performance on the same workload with 8K IO buffer
size is not improved by this patch and is just as bad as with 4K IO
buffer size without the patch.
Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/linux-xfs/20220920022439.GP3600936@xxxxxxxxxxxxxxxxxxx/
Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
---

Hi Dave and Darrick,

Following the discussions on my previous patch [1]
("xfs: reduce ilock contention on buffered randrw workload"),
here is a patch that optimizes the low hanging case of a single page.

As you can see in the commit message, the optimization fixes the
performance of the 4K IO workload. I verified no regressions with
-g quick on the default 4K block config and started running -g auto
on all kdevops configs, including 1K block configs.

Regarding the followup higher order folio optimizations, please let me
know if you intend to work on the large folio mapping iomap write code
or if you prefer to guide me to do that work.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-xfs/20220920022439.GP3600936@xxxxxxxxxxxxxxxxxxx/

 fs/xfs/xfs_file.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6c80265c0b2..3c17f2461ec2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -703,12 +703,29 @@ xfs_file_buffered_write(
 {
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
-	ssize_t			ret;
+	ssize_t			ret = 0;
+	off_t			end_pos = iocb->ki_pos + iov_iter_count(from);
 	bool			cleared_space = false;
 	unsigned int		iolock;
 
 write_retry:
 	iolock = XFS_IOLOCK_EXCL;
+	/*
+	 * When write range is within the bounds of a single page, the page
+	 * lock guarantees that the write is atomic w.r.t buffered reads.
+	 * In that case, take shared iolock to avoid contention on iolock in
+	 * concurrent mixed read-write workloads.
+	 *
+	 * Write retry after space cleanup is not interesting to optimize,
+	 * so don't bother optimizing this case.
+	 *
+	 * TODO: Expand this optimization to write range that may fall within
+	 * the bounds of a single folio, ask iomap to try to write into a single
+	 * folio and if that fails, write_retry with exclusive iolock.
+	 */
+	if (!ret && iocb->ki_pos >> PAGE_SHIFT == (end_pos - 1) >> PAGE_SHIFT)
+		iolock = XFS_IOLOCK_SHARED;
+
 	ret = xfs_ilock_iocb(iocb, iolock);
 	if (ret)
 		return ret;
@@ -740,6 +757,7 @@ xfs_file_buffered_write(
 		xfs_iunlock(ip, iolock);
 		xfs_blockgc_free_quota(ip, XFS_ICWALK_FLAG_SYNC);
 		cleared_space = true;
+		ret = -EAGAIN;
 		goto write_retry;
 	} else if (ret == -ENOSPC && !cleared_space) {
 		struct xfs_icwalk	icw = {0};
@@ -750,6 +768,7 @@ xfs_file_buffered_write(
 		xfs_iunlock(ip, iolock);
 		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
 		xfs_blockgc_free_space(ip->i_mount, &icw);
+		ret = -EAGAIN;
 		goto write_retry;
 	}
-- 
2.25.1