On Tue, Jun 11, 2024 at 12:16:30PM +0200, Mateusz Guzik wrote:
> I think the appropriate blurb which needs to land here also needs to
> be in the commit message for the first patch, so here it is,
> copy-pasted with some modifications at the end:
>
> [quote]
> Instantiating a new inode normally takes the global inode hash lock
> twice:
> 1. once to check if it happens to already be present
> 2. once to add it to the hash
>
> The back-to-back lock/unlock pattern is known to degrade performance
> significantly, and it is further exacerbated if the hash is heavily
> populated (long chains to walk, extending hold time). Arguably the
> hash sizing and hashing algorithm need to be revisited, but that's
> beyond the scope of this patch.
>
> A long-term fix would introduce finer-grained locking. Attempts have
> been made several times, most recently in [1], but the effort appears
> stalled.
>
> A simpler idea, which solves the majority of the problem and may be
> good enough for the time being, is to use RCU for the initial lookup.
> Basic RCU support is already present in the hash. This being a
> temporary measure, I tried to keep the change as small as possible.
>
> iget_locked consumers (notably ext4) get away without any changes
> because the inode comparison method is built in.
>
> iget5_locked and ilookup5_nowait consumers pass a custom callback.
> Since removing the locking adds more problems (the inode can be
> changing under the callback), it's not safe to assume all filesystems
> happen to cope. Thus iget5_locked_rcu, ilookup5_rcu and
> ilookup5_nowait_rcu get added, requiring manual conversion.
>
> In order to reduce code duplication, find_inode and find_inode_fast
> grow an argument indicating whether the inode hash lock is held,
> which is passed down should sleeping be necessary. They always take
> rcu_read_lock, which is redundant when the hash lock is held but
> harmless. Doing it conditionally would reduce readability for no real
> gain that I can see. RCU-like restrictions were already placed on the
> callbacks due to the hash spinlock being held.
>
> There is a real cache-busting workload scanning millions of files in
> parallel (it's a backup server thing), where the initial lookup is
> guaranteed to fail, resulting in the 2 lock acquires.
>
> Implemented below is a synthetic benchmark which provides the same
> behavior. [I shall note the workload is not running on Linux; it was
> causing trouble elsewhere. The benchmark below was used while
> addressing said problems and was found to adequately represent the
> real workload.]
>
> Total real time fluctuates by 1-2s.
>
> With 20 threads each walking a dedicated 1000 dirs * 1000 files
> directory tree to stat(2) on a 32 core + 24GB RAM vm:
> [/quote]
>
> Specific results:
>
> ext4 (needed mkfs.ext4 -N 24000000):
> before: 3.77s user 890.90s system 1939% cpu 46.118 total
> after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
>
> btrfs (s/iget5_locked/iget5_locked_rcu/ in fs/btrfs/inode.c):
> before: 3.54s user 892.30s system 1966% cpu 45.549 total
> after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
>
> btrfs bottlenecks itself on its own locks here.
>
> The benchmark can be found here:
> https://people.freebsd.org/~mjg/fstree.tgz
>
> The fs rundown is as follows:
> - ext4 patched implicitly
> - xfs does not use the inode hash
> - bcachefs is out of the picture, as Kent decided to implement his
>   own inode hashing based on rhashtable, for now private to his fs
>
> I have not looked at others.
>
> [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@xxxxxxxxxxxxx/
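For anyone skimming the thread, the fast path being described boils
down to roughly the sketch below. This is a simplified illustration of
the idea rather than the patch itself: find_inode_rcu_sketch is a
made-up name, and the real find_inode in fs/inode.c additionally
handles I_CREATING, waits on inodes that are being freed, and takes
the "is the hash lock held" argument mentioned above.

#include <linux/fs.h>
#include <linux/rculist.h>
#include <linux/spinlock.h>

/*
 * Illustrative only: probe the hash chain under rcu_read_lock()
 * instead of inode_hash_lock. Freeing inodes are simply punted to the
 * caller's locked slow path, which can sleep safely.
 */
static struct inode *find_inode_rcu_sketch(struct super_block *sb,
                                           struct hlist_head *head,
                                           int (*test)(struct inode *, void *),
                                           void *data)
{
        struct inode *inode;

        rcu_read_lock();
        hlist_for_each_entry_rcu(inode, head, i_hash) {
                if (inode->i_sb != sb)
                        continue;
                /*
                 * Without the hash lock the callback may observe an
                 * inode that is concurrently changing, which is why
                 * the _rcu variants are opt-in per filesystem.
                 */
                if (!test(inode, data))
                        continue;
                spin_lock(&inode->i_lock);
                if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
                        /* let the locked slow path wait for it */
                        spin_unlock(&inode->i_lock);
                        break;
                }
                __iget(inode);  /* grab a reference while i_lock is held */
                spin_unlock(&inode->i_lock);
                rcu_read_unlock();
                return inode;
        }
        rcu_read_unlock();
        /* miss: caller takes inode_hash_lock to allocate and insert */
        return NULL;
}

The miss case is exactly what the benchmark hammers: the RCU probe
fails without ever touching inode_hash_lock, so the caller only takes
the global lock once, to insert the freshly allocated inode, instead
of twice as before.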
Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>

Thanks,

Josef