On Tue 11-06-24 12:16:31, Mateusz Guzik wrote:
> Instantiating a new inode normally takes the global inode hash lock
> twice:
> 1. once to check if it happens to already be present
> 2. once to add it to the hash
>
> The back-to-back lock/unlock pattern is known to degrade performance
> significantly, which is further exacerbated if the hash is heavily
> populated (long chains to walk, extending hold time). Arguably hash
> sizing and hashing algo need to be revisited, but that's beyond the
> scope of this patch.
>
> A long-term fix would introduce finer-grained locking. An attempt was
> made several times, most recently in [1], but the effort appears
> stalled.
>
> A simpler idea which solves the majority of the problem and which may
> be good enough for the time being is to use RCU for the initial
> lookup. Basic RCU support is already present in the hash. This being
> a temporary measure I tried to keep the change as small as possible.
>
> iget_locked consumers (notably ext4) get away without any changes
> because the inode comparison method is built-in.
>
> iget5_locked and ilookup5_nowait consumers pass a custom callback.
> Since removal of locking adds more problems (inode can be changing)
> it's not safe to assume all filesystems happen to cope. Thus
> iget5_locked_rcu, ilookup5_rcu and ilookup5_nowait_rcu get added,
> requiring manual conversion.
>
> In order to reduce code duplication find_inode and find_inode_fast
> grow an argument indicating whether the inode hash lock is held,
> which is passed down should sleeping be necessary. They always
> rcu_read_lock, which is redundant but harmless. Doing it
> conditionally reduces readability for no real gain that I can see.
> RCU-alike restrictions were already put on callbacks due to the hash
> spinlock being held.
>
> There is a real cache-busting workload scanning millions of files in
> parallel (it's a backup server thing), where the initial lookup is
> guaranteed to fail resulting in the 2 lock acquires.
>
> Implemented below is a synthetic benchmark which provides the same
> behavior. [I shall note the workload is not running on Linux, instead
> it was causing trouble elsewhere. Benchmark below was used while
> addressing said problems and was found to adequately represent the
> real workload.]
>
> Total real time fluctuates by 1-2s.
>
> With 20 threads each walking a dedicated 1000 dirs * 1000 files
> directory tree to stat(2) on a 32 core + 24GB RAM vm:
>
> ext4 (needed mkfs.ext4 -N 24000000):
> before: 3.77s user 890.90s system 1939% cpu 46.118 total
> after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
>
> Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
>
> [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@xxxxxxxxxxxxx/
>
> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>

Nice speedups and the patch looks good to me. It would be lovely to get
Dave's speedups finished but this is already nice. I've found just two
nits:

> +/**
> + * ilookup5 - search for an inode in the inode cache
        ^^^ ilookup5_rcu
> + * @sb: super block of file system to search
> + * @hashval: hash value (usually inode number) to search for
> + * @test: callback used for comparisons between inodes
> + * @data: opaque data pointer to pass to @test
> + *
> + * This is equivalent to ilookup5, except the @test callback must
> + * tolerate the inode not being stable, including being mid-teardown.
> + */

...

> +struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
> +		int (*test)(struct inode *, void *), void *data);

I'd prefer wrapping the above so that it fits into 80 columns.
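E.g. a whitespace-only reflow along these lines would do (just a
suggestion, pick whatever indentation matches the surrounding
declarations):

	struct inode *ilookup5_nowait_rcu(struct super_block *sb,
			unsigned long hashval,
			int (*test)(struct inode *, void *), void *data);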
Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR