Unlike many other Linux filesystems, xfs takes a shared inode lock on
buffered reads and an exclusive inode lock on buffered writes to
guarantee that buffered writes are atomic w.r.t. buffered reads.
ext4, for example, takes an exclusive inode lock on buffered writes,
but does not take an inode lock on buffered reads. This difference in
behavior is attributed to legacy xfs code that pre-dates Linux.

The contention on the inode lock in a concurrent mixed randrw workload
causes severe performance degradation in xfs compared to ext4.

When the write range is within the bounds of a single page, the page
lock guarantees that the write is atomic w.r.t. buffered reads. In that
case, take a shared inode lock on write to avoid the unneeded contention
in case of a randrw workload using 4K IO buffer size.

This optimization could be extended to slightly larger IO buffer sizes
that fall within the bounds of a single folio, by limiting iomap to map
a single folio and, if that fails, falling back to the exclusive inode
lock.

The performance improvement was demonstrated by running the following
fio workload on an e2-standard-8 GCE machine:

  [global]
  filename=testfile.fio
  norandommap
  randrepeat=0
  size=5G
  bs=4K
  ioengine=psync
  numjobs=8
  group_reporting=1
  direct=0
  fallocate=1
  end_fsync=0
  runtime=60

  [read]
  readwrite=randread

  [write]
  readwrite=randwrite

The numbers in the table below represent the range of results in
different test runs per tested filesystem. Each test was run on a
freshly formatted fs with a single test file with written extents and
cold caches.

  FS:     ext4              xfs                xfs+ (this patch)
  READ:   136MB/s-210MB/s   2.7MB/s-6.3MB/s    747MB/s-771MB/s
  WRITE:  227MB/s-248MB/s   86.3MB/s-171MB/s   973MB/s-1050MB/s

Needless to say, xfs performance on the same workload with 8K IO buffer
size is not improved by this patch and is just as bad as with 4K IO
buffer size without the patch.
Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/linux-xfs/20220920022439.GP3600936@xxxxxxxxxxxxxxxxxxx/
Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
---

Hi Dave and Darrick,

Following the discussions on my previous patch [1]
("xfs: reduce ilock contention on buffered randrw workload"),
here is a patch that optimizes the low hanging case of a single page.

As you can see in the commit message, the optimization fixes the
performance of the 4K IO workload. I verified no regressions with
-g quick on the default 4K block config and started running -g auto
on all kdevops configs, including 1K block configs.

Regarding the followup higher order folio optimizations, please let me
know if you intend to work on the large folio mapping iomap write code
or if you prefer to guide me to do that work.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-xfs/20220920022439.GP3600936@xxxxxxxxxxxxxxxxxxx/

 fs/xfs/xfs_file.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6c80265c0b2..3c17f2461ec2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -703,12 +703,29 @@ xfs_file_buffered_write(
 {
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
-	ssize_t			ret;
+	ssize_t			ret = 0;
+	off_t			end_pos = iocb->ki_pos + iov_iter_count(from);
 	bool			cleared_space = false;
 	unsigned int		iolock;
 
 write_retry:
 	iolock = XFS_IOLOCK_EXCL;
+	/*
+	 * When write range is within the bounds of a single page, the page
+	 * lock guarantees that the write is atomic w.r.t buffered reads.
+	 * In that case, take shared iolock to avoid contention on iolock in
+	 * concurrent mixed read-write workloads.
+	 *
+	 * Write retry after space cleanup is not interesting to optimize,
+	 * so don't bother optimizing this case.
+	 *
+	 * TODO: Expand this optimization to write range that may fall within
+	 * the bounds of a single folio, ask iomap to try to write into a single
+	 * folio and if that fails, write_retry with exclusive iolock.
+	 */
+	if (!ret && iocb->ki_pos >> PAGE_SHIFT == (end_pos - 1) >> PAGE_SHIFT)
+		iolock = XFS_IOLOCK_SHARED;
+
 	ret = xfs_ilock_iocb(iocb, iolock);
 	if (ret)
 		return ret;
@@ -740,6 +757,7 @@ xfs_file_buffered_write(
 		xfs_iunlock(ip, iolock);
 		xfs_blockgc_free_quota(ip, XFS_ICWALK_FLAG_SYNC);
 		cleared_space = true;
+		ret = -EAGAIN;
 		goto write_retry;
 	} else if (ret == -ENOSPC && !cleared_space) {
 		struct xfs_icwalk	icw = {0};
@@ -750,6 +768,7 @@ xfs_file_buffered_write(
 		xfs_iunlock(ip, iolock);
 		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
 		xfs_blockgc_free_space(ip->i_mount, &icw);
+		ret = -EAGAIN;
 		goto write_retry;
 	}
-- 
2.25.1