On Tue 11-06-24 12:16:31, Mateusz Guzik wrote:
> Instantiating a new inode normally takes the global inode hash lock
> twice:
> 1. once to check if it happens to already be present
> 2. once to add it to the hash
>
> The back-to-back lock/unlock pattern is known to degrade performance
> significantly, which is further exacerbated if the hash is heavily
> populated (long chains to walk, extending hold time). Arguably hash
> sizing and hashing algo need to be revisited, but that's beyond the
> scope of this patch.
>
> A long-term fix would introduce finer-grained locking. An attempt was
> made several times, most recently in [1], but the effort appears
> stalled.
>
> A simpler idea which solves the majority of the problem and which may
> be good enough for the time being is to use RCU for the initial
> lookup. Basic RCU support is already present in the hash. This being
> a temporary measure I tried to keep the change as small as possible.
>
> iget_locked consumers (notably ext4) get away without any changes
> because the inode comparison method is built-in.
>
> iget5_locked and ilookup5_nowait consumers pass a custom callback.
> Since removal of locking adds more problems (inode can be changing)
> it's not safe to assume all filesystems happen to cope. Thus
> iget5_locked_rcu, ilookup5_rcu and ilookup5_nowait_rcu get added,
> requiring manual conversion.
>
> In order to reduce code duplication find_inode and find_inode_fast
> grow an argument indicating whether the inode hash lock is held,
> which is passed down should sleeping be necessary. They always
> rcu_read_lock, which is redundant but harmless. Doing it
> conditionally reduces readability for no real gain that I can see.
> RCU-alike restrictions were already put on callbacks due to the hash
> spinlock being held.
>
> There is a real cache-busting workload scanning millions of files in
> parallel (it's a backup server thing), where the initial lookup is
> guaranteed to fail resulting in the 2 lock acquires.
>
> Implemented below is a synthetic benchmark which provides the same
> behavior. [I shall note the workload is not running on Linux, instead
> it was causing trouble elsewhere. Benchmark below was used while
> addressing said problems and was found to adequately represent the
> real workload.]
>
> Total real time fluctuates by 1-2s.
>
> With 20 threads each walking a dedicated 1000 dirs * 1000 files
> directory tree to stat(2) on a 32 core + 24GB RAM vm:
>
> ext4 (needed mkfs.ext4 -N 24000000):
> before: 3.77s user 890.90s system 1939% cpu 46.118 total
> after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
>
> Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
>
> [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@xxxxxxxxxxxxx/
>
> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>

Nice speedups and the patch looks good to me. It would be lovely to get
Dave's speedups finished but this is already nice. I've found just two
nits:

> +/**
> + * ilookup5 - search for an inode in the inode cache
        ^^^ ilookup5_rcu
> + * @sb: super block of file system to search
> + * @hashval: hash value (usually inode number) to search for
> + * @test: callback used for comparisons between inodes
> + * @data: opaque data pointer to pass to @test
> + *
> + * This is equivalent to ilookup5, except the @test callback must
> + * tolerate the inode not being stable, including being mid-teardown.
> + */

...

> +struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
> +		int (*test)(struct inode *, void *), void *data);

I'd prefer wrapping the above so that it fits into 80 columns.
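E.g. a whitespace-only reflow along these lines would do (just a
suggestion, pick whatever indentation matches the surrounding
declarations):

	struct inode *ilookup5_nowait_rcu(struct super_block *sb,
			unsigned long hashval,
			int (*test)(struct inode *, void *), void *data);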
Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR