xfs_release lock contention

Mateusz Guzik <mjguzik@xxxxxxxxx> · Wed, 7 Aug 2024 06:27:21 +0200

I'm looking at false-sharing problems concerning multicore open + read +
close cycle on one inode and during my survey I found xfs is heavily
serializing on a spinlock in xfs_release, making it perform the worst
out of the btrfs/ext4/xfs trio.

A trivial test case plopped into will-it-scale is at the end.

bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }' tells me:
[snip]
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+49
    rwsem_wake.isra.0+57
    up_write+69
    xfs_iunlock+244
    xfs_release+175
    __fput+238
    __x64_sys_close+60
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 41132
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irq+42
    rwsem_down_read_slowpath+164
    down_read+72
    xfs_ilock+125
    xfs_file_buffered_read+71
    xfs_file_read_iter+115
    vfs_read+604
    ksys_read+103
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 137639
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock+41
    xfs_release+196
    __fput+238
    __x64_sys_close+60
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 1432766

The xfs_release -> _raw_spin_lock stack (the third one, by far the
hottest) is the XFS_ITRUNCATED flag test.

Also note how the eofblocks code taking the write trylock gets in the
way of doing the read (first 2 stacks).

General note is that for most real files there are no "blocks past
eof" and no truncation going on, and there is presumably a way to
handle that case locklessly, which should also reduce single-threaded
overhead.
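
To illustrate what "locklessly" could mean here, below is a minimal
userspace sketch of the general check-before-lock pattern: read the flag
without the lock and only take the lock when the flag is actually set.
This is not XFS code. struct demo_inode, DEMO_TRUNCATED and the demo_*
functions are hypothetical stand-ins, a pthread mutex plays the role of
the spinlock, and whether an unlocked read is safe in the real
xfs_release is exactly the question for people who know the fs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_TRUNCATED 0x1u	/* hypothetical stand-in for XFS_ITRUNCATED */

struct demo_inode {
	pthread_mutex_t flags_lock;	/* stand-in for the flags spinlock */
	atomic_uint flags;
	unsigned long lock_acquisitions; /* slow-path entries, demo only */
};

static void demo_inode_init(struct demo_inode *ip)
{
	int rc = pthread_mutex_init(&ip->flags_lock, NULL);

	assert(rc == 0);
	(void)rc;
	atomic_init(&ip->flags, 0);
	ip->lock_acquisitions = 0;
}

/* Writers always set the flag under the lock. */
static void demo_set_truncated(struct demo_inode *ip)
{
	pthread_mutex_lock(&ip->flags_lock);
	atomic_fetch_or_explicit(&ip->flags, DEMO_TRUNCATED,
				 memory_order_release);
	pthread_mutex_unlock(&ip->flags_lock);
}

/* Returns true if the flag was set and this caller cleared it. */
static bool demo_test_and_clear_truncated(struct demo_inode *ip)
{
	bool was_set;

	/* Fast path: unlocked read, the common case touches no lock. */
	if (!(atomic_load_explicit(&ip->flags, memory_order_acquire) &
	      DEMO_TRUNCATED))
		return false;

	/* Slow path: recheck under the lock, then clear. */
	pthread_mutex_lock(&ip->flags_lock);
	ip->lock_acquisitions++;
	was_set = atomic_load_explicit(&ip->flags, memory_order_relaxed) &
		  DEMO_TRUNCATED;
	if (was_set)
		atomic_fetch_and_explicit(&ip->flags, ~DEMO_TRUNCATED,
					  memory_order_relaxed);
	pthread_mutex_unlock(&ip->flags_lock);
	return was_set;
}
```

In this sketch callers that never see the flag set never touch the lock,
so the shared cacheline is only read, not written, in the common case.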

For testing purposes I wrote a total hack which merely branches on
i_flags and i_delayed_blks == 0, but I have no idea if that's any good
-- *something* definitely is doable here, I leave that to people who
know the fs.

When running a kernel with a change whacking one lockref cycle on open:
https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@xxxxxxxxx/T/#u

... and toggling the short-circuit outlined above I'm seeing +50% ops/s
and going above btrfs. Then top of the profile is the false sharing I'm
looking at with other filesystems.

So that would be my report. Whatever you guys do with it is your
business, I'm bailing to the false-sharing problem.

Should you address this in a committable manner, there is no need to
credit or cc me.

Cheers.

the not patch:

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7dc6f326936c..1cc62c21e709 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1079,6 +1079,8 @@ xfs_itruncate_extents_flags(
        return error;
 }

+extern unsigned long magic_tunable;
+
 int
 xfs_release(
        xfs_inode_t     *ip)
@@ -1089,6 +1091,13 @@ xfs_release(
        if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0))
                return 0;

+       if (magic_tunable) {
+               if (!(ip->i_flags & XFS_ITRUNCATED))
+                       return 0;
+               if (ip->i_delayed_blks == 0)
+                       return 0;
+       }
+
        /* If this is a read-only mount, don't do this (would generate I/O) */
        if (xfs_is_readonly(mp))
                return 0;


test case (plop into will-it-scale, say tests/openreadclose3.c and run
./openreadclose3_processes -t 24):

#include <stdlib.h>
#include <string.h>	/* memset */
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>

#define BUFSIZE 1024

static char tmpfile[] = "/tmp/willitscale.XXXXXX";

char *testcase_description = "Same file open/read/close";

void testcase_prepare(unsigned long nr_tasks)
{
        char buf[BUFSIZE];
        int fd = mkstemp(tmpfile);

        assert(fd >= 0);
        memset(buf, 'A', sizeof(buf));
        assert(write(fd, buf, sizeof(buf)) == sizeof(buf));
        close(fd);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
        char buf[BUFSIZE];

        while (1) {
                int fd = open(tmpfile, O_RDONLY);
                assert(fd >= 0);
                assert(read(fd, buf, sizeof(buf)) == sizeof(buf));
                close(fd);

                (*iterations)++;
        }
}

void testcase_cleanup(void)
{
        unlink(tmpfile);
}