Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 16 May 2023 17:45:19 +0200

On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> On Tue, May 09, 2023 at 12:56:47PM -0400, Kent Overstreet wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Because scalability of the global inode_hash_lock really, really
> > sucks.
> > 
> > 32-way concurrent create on a couple of different filesystems
> > before:
> > 
> > -   52.13%     0.04%  [kernel]            [k] ext4_create
> >    - 52.09% ext4_create
> >       - 41.03% __ext4_new_inode
> >          - 29.92% insert_inode_locked
> >             - 25.35% _raw_spin_lock
> >                - do_raw_spin_lock
> >                   - 24.97% __pv_queued_spin_lock_slowpath
> > 
> > -   72.33%     0.02%  [kernel]            [k] do_filp_open
> >    - 72.31% do_filp_open
> >       - 72.28% path_openat
> >          - 57.03% bch2_create
> >             - 56.46% __bch2_create
> >                - 40.43% inode_insert5
> >                   - 36.07% _raw_spin_lock
> >                      - do_raw_spin_lock
> >                           35.86% __pv_queued_spin_lock_slowpath
> >                     4.02% find_inode
> > 
> > Convert the inode hash table to a RCU-aware hash-bl table just like
> > the dentry cache. Note that we need to store a pointer to the
> > hlist_bl_head the inode has been added to in the inode so that when
> > it comes to unhash the inode we know what list to lock. We need to
> > do this because the hash value that is used to hash the inode is
> > generated from the inode itself - filesystems can provide this
> > themselves so we have to either store the hash or the head pointer
> > in the inode to be able to find the right list head for removal...
> > 
> > Same workload after:
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> > Cc: Christian Brauner <brauner@xxxxxxxxxx>
> > Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> > Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> 
> I have been maintaining this patchset uptodate in my own local trees
> and the code in this patch looks the same. The commit message above,
> however, has been mangled. The full commit message should be:
> 
> vfs: inode cache conversion to hash-bl
> 
> Because scalability of the global inode_hash_lock really, really
> sucks and prevents me from doing scalability characterisation and
> analysis of bcachefs algorithms.
> 
> Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
> on a couple of different filesystems on a 5.10 kernel:
> 
> -   52.13%     0.04%  [kernel]            [k] ext4_create
>    - 52.09% ext4_create
>       - 41.03% __ext4_new_inode
>          - 29.92% insert_inode_locked
>             - 25.35% _raw_spin_lock
>                - do_raw_spin_lock
>                   - 24.97% __pv_queued_spin_lock_slowpath
> 
> 
> -   72.33%     0.02%  [kernel]            [k] do_filp_open
>    - 72.31% do_filp_open
>       - 72.28% path_openat
>          - 57.03% bch2_create
>             - 56.46% __bch2_create
>                - 40.43% inode_insert5
>                   - 36.07% _raw_spin_lock
>                      - do_raw_spin_lock
>                           35.86% __pv_queued_spin_lock_slowpath
>                     4.02% find_inode
> 
> btrfs was tested but it is limited by internal lock contention at
> >=2 threads on this workload, so never hammers the inode cache lock
> hard enough for this change to matter to it's performance.
> 
> However, both bcachefs and ext4 demonstrate poor scaling at >=8
> threads on concurrent lookup or create workloads.
> 
> Hence convert the inode hash table to a RCU-aware hash-bl table just
> like the dentry cache. Note that we need to store a pointer to the
> hlist_bl_head the inode has been added to in the inode so that when
> it comes to unhash the inode we know what list to lock. We need to
> do this because, unlike the dentry cache, the hash value that is
> used to hash the inode is not generated from the inode itself. i.e.
> filesystems can provide this themselves so we have to either store
> the hashval or the hlist head pointer in the inode to be able to
> find the right list head for removal...
> 
> Concurrent create with variying thread count (files/s):
> 
>                 ext4                    bcachefs
> threads         vanilla  patched        vanilla patched
> 2               117k     112k            80k     85k
> 4               185k     190k           133k    145k
> 8               303k     346k           185k    255k
> 16              389k     465k           190k    420k
> 32              360k     437k           142k    481k
> 
> CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
> halved on the patched kernel, while performance has increased
> marginally on ext4 and massively on bcachefs. Internal filesystem
> algorithms now limit performance on these workloads, not the global
> inode_hash_lock.
> 
> Profile of the workloads on the patched kernels:
> 
> -   35.94%     0.07%  [kernel]                  [k] ext4_create
>    - 35.87% ext4_create
>       - 20.45% __ext4_new_inode
> ...
>            3.36% insert_inode_locked
> 
>    - 78.43% do_filp_open
>       - 78.36% path_openat
>          - 53.95% bch2_create
>             - 47.99% __bch2_create
> ....
>               - 7.57% inode_insert5
>                     6.94% find_inode
> 
> Spinlock contention is largely gone from the inode hash operations
> and the filesystems are limited by contention in their internal
> algorithms.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
> 
> Other than that, the diffstat is the same and I don't see any obvious
> differences in the code comapred to what I've been running locally.

There's a bit of a backlog before I get around to looking at this but
it'd be great if we'd have a few reviewers for this change.