[PATCH 0/11] vfs: inode cache scalability improvements

We all know that the global inode_hash_lock and the per-fs global 
sb->s_inode_list_lock locks are contention points in filesystem workloads
that stream inodes through memory, so it's about time we addressed these
limitations.

The first part of the patchset addresses the sb->s_inode_list_lock.
This work was done a long time ago by Waiman Long, who converted the
global linked list to a per-cpu linked list - those infrastructure
patches are pretty much unchanged from when Waiman first wrote them,
and as such they still carry the RVB that Jan Kara gave for them. I
have no idea if the problem that Waiman was trying to solve still
exists, but that's largely irrelevant because there are other
problems that I can easily reproduce.

That is, once ~16 threads are trying to instantiate or tear down
inodes at the same time in a filesystem, the sb->s_inode_list_lock
becomes a single point of contention. Adding an inode to the inode
cache requires adding it to the sb->s_inodes list, and removing an
inode from the cache requires removing it from the sb->s_inodes
list. That's two exclusive lock operations per inode we cycle
through the inode cache.
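
For reference, this is roughly the existing serialisation in
fs/inode.c - a single per-sb spinlock wrapping plain list operations
(sketch, details elided):

	void inode_sb_list_add(struct inode *inode)
	{
		spin_lock(&inode->i_sb->s_inode_list_lock);
		list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
		spin_unlock(&inode->i_sb->s_inode_list_lock);
	}

	static inline void inode_sb_list_del(struct inode *inode)
	{
		if (!list_empty(&inode->i_sb_list)) {
			spin_lock(&inode->i_sb->s_inode_list_lock);
			list_del_init(&inode->i_sb_list);
			spin_unlock(&inode->i_sb->s_inode_list_lock);
		}
	}

Every instantiation and every eviction takes that one per-sb lock
exclusively, no matter how many CPUs are doing it.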

This creates a hard limit on the number of inodes we can cycle
through memory in a single filesystem. It tops out at around
600-700,000 inodes per second on XFS, and at that point we see
catastrophic cacheline contention breakdown and nothing goes any
faster. We can easily burn hundreds of CPUs on the sb->s_inodes list
operations, yet we still can only get 600-700k inodes/s through the
cache.

Converting the sb->s_inodes list to a dlist gets rid of this single
contention point and makes the sb->s_inodes list operations
disappear from the profiles completely. Prior to this change, at 32
threads XFS could pull 12.8 million inodes into cache in ~20s
(that's ~600k inodes/s - sound familiar?). With this change, those
12.8 million inodes are pulled into cache in ~10s. That's double the
rate at which XFS can pull inodes into memory from the
filesystem....
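
For those unfamiliar with the dlist concept: instead of one global
list and lock, there is a list head and lock per CPU, and insertions
go onto the list of whatever CPU they happen to run on. The names
below are illustrative only, not the API in the patches:

	/* illustrative sketch only - not the patchset's actual API */
	struct dlist_head {
		struct list_head	list;
		spinlock_t		lock;
	} ____cacheline_aligned_in_smp;

	struct dlist_heads {
		struct dlist_head __percpu *heads;
	};

	static void dlist_add(struct dlist_heads *dh, struct list_head *node)
	{
		/*
		 * It doesn't matter if we migrate CPUs before taking
		 * the lock - the per-head lock, not CPU locality,
		 * provides the correctness.
		 */
		struct dlist_head *head = raw_cpu_ptr(dh->heads);

		spin_lock(&head->lock);
		list_add(node, &head->list);
		spin_unlock(&head->lock);
	}

Deletion has to remember which per-cpu head the node was added to
(so it can take the right lock), and full walks have to iterate
every per-cpu list, but both of those still avoid the single global
point of contention.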

I'm talking about XFS here, because none of the other filesystems
actually stress the sb->s_inode_list_lock at all. They all hit
catastrophic cacheline contention on the inode_hash_lock long before
they get anywhere near the sb->s_inodes limits. For ext4 and
bcachefs, the inode_hash_lock becomes a limiting factor at 8
threads. btrfs hits internal namespace tree contention limits at 2
threads, so it's not even stressing the inode_hash_lock unless
highly threaded workloads are manually sharded across subvolumes.

So to bring the rest of the filesystems up, we need to fix the
inode_hash_lock contention problems.  This patchset replaces the
global inode_hash_lock with the same lock-per-chain implementation
that the dentry cache uses. i.e. hash-bl lists. This is more complex
than the dentry cache implementation, however, because we nest spin
locks inside the inode_hash_lock. This conversion means we nest spin
locks inside bit spin locks in the inode cache.
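
To illustrate the nesting, a hash-bl based lookup ends up looking
something like this. This is a sketch only: it assumes inode->i_hash
has been converted to a struct hlist_bl_node, and it omits the
I_FREEING/I_WILL_FREE handling that the real lookup needs:

	static struct inode *find_inode_by_ino(struct hlist_bl_head *b,
					       unsigned long ino)
	{
		struct inode *inode;
		struct hlist_bl_node *node;

		hlist_bl_lock(b);		/* per-chain bit spin lock */
		hlist_bl_for_each_entry(inode, node, b, i_hash) {
			if (inode->i_ino != ino)
				continue;
			/* spinlock nested inside the bit spin lock */
			spin_lock(&inode->i_lock);
			__iget(inode);
			spin_unlock(&inode->i_lock);
			hlist_bl_unlock(b);
			return inode;
		}
		hlist_bl_unlock(b);
		return NULL;
	}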

Whilst this doesn't sound particularly problematic, the issue arises
on CONFIG_PREEMPT_RT kernels, where spinlocks are converted to
sleeping locks. We can't place sleeping locks inside spinning bit
locks, and that's exactly what happens if we use hash-bl lists in
the inode cache and then turn on CONFIG_PREEMPT_RT.

The other downside to converting to hash-bl is that we lose lockdep
coverage of the inode hash table - lockdep does not track bit locks
at all.

Both of these issues can be solved the same way: whenever either of
these two config options is turned on, we change the hash-bl
implementation from using a bit spin lock on the lowest bit of the
chain head pointer to using a dedicated spinlock per chain. This
trades off performance and memory footprint for configurations where
correctness is more important than performance, but allows optimal
implementations of hash-bl lists when performance is the primary
concern.
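
Conceptually, the chain head then looks something like this. Again,
this is an illustrative sketch - the real list_bl.h differs in
detail - but it shows the shape of the trade-off:

	#if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_DEBUG_LOCK_ALLOC)
	struct hlist_bl_head {
		struct hlist_bl_node *first;
		spinlock_t lock;	/* sleeps on RT, tracked by lockdep */
	};

	static inline void hlist_bl_lock(struct hlist_bl_head *b)
	{
		spin_lock(&b->lock);
	}

	static inline void hlist_bl_unlock(struct hlist_bl_head *b)
	{
		spin_unlock(&b->lock);
	}
	#else
	struct hlist_bl_head {
		struct hlist_bl_node *first;	/* bit 0 is the chain lock */
	};

	static inline void hlist_bl_lock(struct hlist_bl_head *b)
	{
		bit_spin_lock(0, (unsigned long *)&b->first);
	}

	static inline void hlist_bl_unlock(struct hlist_bl_head *b)
	{
		__bit_spin_unlock(0, (unsigned long *)&b->first);
	}
	#endif

On normal production configs the head stays a single pointer and the
per-chain lock costs no extra memory; with lockdep or PREEMPT_RT
enabled each chain pays for a real spinlock, and in return we get
lockdep coverage and RT-safe locking.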

In making this conversion, we make all hash-bl implementations safe
for PREEMPT_RT usage and gain lockdep coverage of all hash-bl lists.
The conversion also showed that several hash-bl list users did not
actually initialise their hash list heads at all - they elided the
initialisation and only got away with it because they allocated
zeroed memory, and a zeroed hash list head happens to be a valid
empty list. This needed fixing for lockdep....
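
For example, setting up a table of hash-bl heads should look
something like this (hypothetical helper, not code from the
patches):

	static struct hlist_bl_head *alloc_hash_table(unsigned int hash_bits)
	{
		struct hlist_bl_head *table;
		unsigned int i;

		table = kvcalloc(1U << hash_bits, sizeof(*table), GFP_KERNEL);
		if (!table)
			return NULL;

		/*
		 * A zeroed allocation happens to be a valid empty head
		 * when the lock is just bit 0 of ->first, which is why
		 * eliding this ever worked. Once the head carries a
		 * per-chain spinlock (lockdep/RT builds), the explicit
		 * init that sets up that lock is required.
		 */
		for (i = 0; i < (1U << hash_bits); i++)
			INIT_HLIST_BL_HEAD(&table[i]);
		return table;
	}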

The result of this conversion is that inode cache lookup heavy
workloads such as filesystem traversals and inode creation/removal
no longer contend on the inode_hash_lock to stream inodes through
the inode cache. This results in big performance improvements at
thread counts of >= 8.

I've run this through fstests with lockdep enabled on ext4 and XFS
without discovering any issues except for dm-snapshot needing
lockdep nesting annotations for list-bl locks. I've run a bunch of
"will-it-scale" like tests across XFS, ext4, bcachefs and btrfs, and
the raw table results for 6.7-rc4 are below.

The tests run a fixed number of files per thread, so as the thread
count increases we should see runtimes stay constant if scalability
is perfect. All numbers in the tables below are wall clock runtimes
in seconds. I'm not caring about operation rates, I'm not caring
about which filesystems are faster; all I'm looking at is whether
the scalability of individual filesystems improves with the changes.

base:  vanilla 6.7-rc4 kernel
scale: 6.7-rc4 plus this patchset

Filesystem      Files  Threads      Create             Walk             chmod            Unlink
				base    scale	  base    scale     base    scale    base    scale
       xfs     400000     1     11.217 10.477     11.621 11.570     14.980 14.797    18.314 18.248
       xfs     800000     2     12.313 11.470     11.673 11.158     15.271 14.782    19.413 18.533
       xfs    1600000     4     14.130 13.246      9.665  9.444     14.794 13.710    19.582 17.015
       xfs    3200000     8     16.654 16.108     10.622  9.275     15.854 14.575    20.679 19.237
       xfs    6400000    16     17.587 18.151     12.148  9.508     16.655 17.691    26.044 21.865
       xfs   12800000    32     20.833 21.491     20.518 10.308     23.614 19.334    42.572 27.404

All of the operations that require directory traversal show
significant improvements at 16 or more threads on XFS. This is
entirely from the sb->s_inodes modifications.

Filesystem      Files  Threads      Create             Walk             chmod            Unlink
				base    scale	  base    scale     base    scale    base     scale
      ext4     400000     1      9.344  9.394      7.691  7.847      9.188  9.212    11.340  12.517
      ext4     800000     2     10.445 10.375      7.923  7.358     10.158 10.114    14.366  14.093
      ext4    1600000     4     11.008 10.948      8.152  7.530     11.140 11.150    18.093  17.153
      ext4    3200000     8     23.348 12.134     13.090  7.871     15.437 12.824    30.806  31.968
      ext4    6400000    16     17.343 29.112     24.602  9.540     31.034 22.057    60.538  57.636
      ext4   12800000    32     40.125 44.638     49.536 19.314     63.183 38.905   138.688 138.632

Walk on ext4 shows major improvements at 8 threads and above, as
does the recursive chmod. This largely comes from the inode hash
lock removal, but the full scalability improvements are not realised
until the sb->s_inodes changes are added as well. 

Note that unlink doesn't scale or improve because the mkfs.ext4
binary in debian unstable does not support the orphan file option
and so it is completely bottlenecked on orphan list scalability
issues.

Filesystem      Files  Threads      Create             Walk             chmod            Unlink
				base    scale	  base    scale     base    scale    base     scale
  bcachefs     400000     1     16.999 17.193      6.546  6.355    13.973  13.024    28.890  19.014
  bcachefs     800000     2     20.133 19.597      8.003  7.276    22.042  20.070    28.959  29.141
  bcachefs    1600000     4     22.836 23.764      9.097  8.506    58.827  56.108    38.955  37.435
  bcachefs    3200000     8     27.932 27.545     11.752 10.015   192.802 185.834    64.402  77.188
  bcachefs    6400000    16     32.389 32.021     24.614 13.989   409.873 408.098   243.757 249.220
  bcachefs   12800000    32     39.809 40.221     49.179 25.982   879.721 821.497   501.357 485.781

bcachefs walk shows major improvements at 16 threads and above, but
chmod and unlink are drowned by internal contention problems.

Filesystem      Files  Threads      Create             Walk             chmod            Unlink
				base    scale	  base     scale    base     scale    base      scale
     btrfs     400000     1     10.307  10.228    12.597  12.104    14.744  14.030    24.171   24.273
     btrfs     800000     2     15.956  14.018    19.693  17.180    24.859  20.872    59.338   48.725
     btrfs    1600000     4     22.441  20.951    32.855  29.013    37.975  33.575   140.601  125.305
     btrfs    3200000     8     34.157  32.693    55.066  56.726    66.676  64.661   343.379  325.816
     btrfs    6400000    16     60.847  59.123    90.097  89.340   116.994 114.280   683.244  681.953
     btrfs   12800000    32    122.525 118.510   118.036 125.761   206.122 212.102  1612.940 1629.689

There's little point in doing scalability testing on plain btrfs -
it is entirely bottlenecked on internal algorithms long before
anything in the VFS becomes a scalability limitation.

Filesystem      Files  Threads      Create             Walk             chmod            Unlink
				base    scale	  base    scale     base    scale    base     scale
btrfs-svol     400000     1     10.417  9.830     12.011 12.154     14.894 14.913    24.157  23.447
btrfs-svol     800000     2     12.079 11.681     12.596 12.208     16.535 15.310    28.031  26.412
btrfs-svol    1600000     4     15.219 15.074     12.711 10.735     18.079 16.948    34.330  31.949
btrfs-svol    3200000     8     23.140 21.307     14.706 10.934     22.580 21.907    53.183  52.129
btrfs-svol    6400000    16     40.657 40.226     26.062 11.471     34.058 33.333   101.133  99.504
btrfs-svol   12800000    32     81.516 79.412     50.320 12.406     65.691 58.229   193.847 200.050

Once btrfs is running with a sharded namespace (i.e. a subvol per
thread) we see results very similar in nature to bcachefs - walk
improves dramatically at high thread counts, but nothing else
changes as all the scalability limitations are internal to the
filesystem.

I have tested to 64 threads, but there's not a lot extra to add. The
XFS walk was done in 14.1s, so scalability is falling off, but I
haven't spent any time looking at it in detail because there's just
so much other internal stuff to fix up before the rest of this
benchmark scales to 64 threads on XFS....

Git tree containing this series can be pulled from:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale

-Dave.





