I'm looking at false-sharing problems concerning multicore open + read + close cycle on one inode and during my survey I found xfs is heavily serializing on a spinlock in xfs_release, making it perform the worst out of the btrfs/ext4/xfs trio. A trivial test case plopped into will-it-scale is at the end. bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }' tells me: [snip] @[ __pv_queued_spin_lock_slowpath+5 _raw_spin_lock_irqsave+49 rwsem_wake.isra.0+57 up_write+69 xfs_iunlock+244 xfs_release+175 __fput+238 __x64_sys_close+60 do_syscall_64+82 entry_SYSCALL_64_after_hwframe+118 ]: 41132 @[ __pv_queued_spin_lock_slowpath+5 _raw_spin_lock_irq+42 rwsem_down_read_slowpath+164 down_read+72 xfs_ilock+125 xfs_file_buffered_read+71 xfs_file_read_iter+115 vfs_read+604 ksys_read+103 do_syscall_64+82 entry_SYSCALL_64_after_hwframe+118 ]: 137639 @[ __pv_queued_spin_lock_slowpath+5 _raw_spin_lock+41 xfs_release+196 __fput+238 __x64_sys_close+60 do_syscall_64+82 entry_SYSCALL_64_after_hwframe+118 ]: 1432766 The xfs_release -> _raw_spin_lock thing is the XFS_ITRUNCATED flag test. Also note how eofblock code inducing write trylock gets in the way of doing the read (first 2 stacks). General note is that for most files real files there are no "blocks past eof" or otherwise truncation going on and there is presumably a way to locklessly handle that, which should also reduce single-threaded overhead. For testing purposes I wrote a total hack which merely branches on i_flags and i_delayed_blks == 0, but I have no idea if that's any good -- *something* definitely is doable here, I leave that to people who know the fs. When running a kernel with a change whacking one lockref cycle on open: https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@xxxxxxxxx/T/#u ... and toggling the short-circuit outlined above I'm seeing +50% ops/s and going above btrfs. Then top of the profile is the false sharing I'm looking at with other filesystems. So that would be my report. Whatever you guys do with it is your business, I'm bailing to the false-sharing problem. Should you address this in a committable manner, there is no need to credit or cc me. Cheers. the not patch: diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 7dc6f326936c..1cc62c21e709 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1079,6 +1079,8 @@ xfs_itruncate_extents_flags( return error; } +extern unsigned long magic_tunable; + int xfs_release( xfs_inode_t *ip) @@ -1089,6 +1091,13 @@ xfs_release( if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0)) return 0; + if (magic_tunable) { + if (!(ip->i_flags & XFS_ITRUNCATED)) + return 0; + if (ip->i_delayed_blks == 0) + return 0; + } + /* If this is a read-only mount, don't do this (would generate I/O) */ if (xfs_is_readonly(mp)) return 0; test case (plop into will-it-scale, say tests/openreadclose3.c and run ./openreadclose3_processes -t 24): #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <assert.h> #define BUFSIZE 1024 static char tmpfile[] = "/tmp/willitscale.XXXXXX"; char *testcase_description = "Same file open/read/close"; void testcase_prepare(unsigned long nr_tasks) { char buf[BUFSIZE]; int fd = mkstemp(tmpfile); assert(fd >= 0); memset(buf, 'A', sizeof(buf)); assert(write(fd, buf, sizeof(buf)) == sizeof(buf)); close(fd); } void testcase(unsigned long long *iterations, unsigned long nr) { char buf[BUFSIZE]; while (1) { int fd = open(tmpfile, O_RDONLY); assert(fd >= 0); assert(read(fd, buf, sizeof(buf)) == sizeof(buf)); close(fd); (*iterations)++; } } void testcase_cleanup(void) { unlink(tmpfile); }