Re: [RFC PATCH] vfs: add rcu-based find_inode variants for iget ops

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jun 6, 2024 at 6:31 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Thu 06-06-24 16:05:15, Mateusz Guzik wrote:
> > Instantiating a new inode normally takes the global inode hash lock
> > twice:
> > 1. once to check if it happens to already be present
> > 2. once to add it to the hash
> >
> > The back-to-back lock/unlock pattern is known to degrade performance
> > significantly, which is further exacerbated if the hash is heavily
> > populated (long chains to walk, extending hold time). Arguably hash
> > sizing and hashing algo need to be revisited, but that's beyond the
> > scope of this patch.
> >
> > A long term fix would introduce fine-grained locking, this was attempted
> > in [1], but that patchset was already posted several times and appears
> > stalled.
> >
> > A simpler idea which solves majority of the problem and which may be
> > good enough for the time being is to use RCU for the initial lookup.
> > Basic RCU support is already present in the hash, it is just not being
> > used for lookup on inode creation.
> >
> > iget_locked consumers (notably ext4) get away without any changes
> > because inode comparison method is built-in.
> >
> > iget5_locked and ilookup5_nowait consumers pass a custom callback. Since
> > removal of locking adds more problems (inode can be changing) it's not
> > safe to assume all filesystems happen to cope.  Thus iget5_locked_rcu
> > ilookup5_nowait_rcu get added, requiring manual conversion.
>
> BTW, why not ilookup5_rcu() as well? To keep symmetry with non-RCU APIs and
> iget5_locked_rcu() could then use ilookup5_rcu().

I don't have a strong opinion. Note that the routine as implemented
right now mimicks iget_locked.

> I presume eventually we'd like to trasition everything to these RCU based methods?
>

That is up in the air, but I would not go for it. Note every
iget5_locked consumer would have to get reviewed for safety of the
callback, which is a lot of work for no real gain for vast majority of
filesystems out there.

Also note the rcu variants are only used in cases which tolerate a
false negative -- if the inode fails to match, the caller is expected
to cope by creating a new inode and performing a locked lookup. It
does not have to be this way, but I find it less error prone.

> > In order to reduce code duplication find_inode and find_inode_fast grow
> > an argument indicating whether inode hash lock is held, which is passed
> > down should sleeping be necessary. They always rcu_read_lock, which is
> > redundant but harmless. Doing it conditionally reduces readability for
> > no real gain that I can see. RCU-alike restrictions were already put on
> > callbacks due to the hash spinlock being held.
> >
> > Benchmarked with the following: a 32-core vm with 24GB of RAM, a
> > dedicated fs partition. 20 separate trees with 1000 directories * 1000
> > files.  Then walked by 20 processes issuing stat on files, each on a
> > dedicated tree. Testcase is at [2].
> >
> > In this particular workload, mimicking a real-world setup $elsewhere,
> > the initial lookup is guaranteed to fail, guaranteeing the 2 lock
> > acquires. At the same time RAM is scarce enough enough compared to the
> > demand that inodes keep needing to be recycled.
> >
> > Total real time fluctuates by 1-2s, sample results:
> >
> > ext4 (needed mkfs.ext4 -N 24000000):
> > before:       3.77s user 890.90s system 1939% cpu 46.118 total
> > after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
> >
> > btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c):
> > before: 3.54s user 892.30s system 1966% cpu 45.549 total
> > after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
> >
> > btrfs is heavily bottlenecked on its own locks, so the improvement is
> > small in comparison.
> >
> > [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@xxxxxxxxxxxxx/
> > [2] https://people.freebsd.org/~mjg/fstree.tgz
>
> Nice results. I've looked through the patch and otherwise I didn't find any
> issue.
>
>                                                                 Honza
>
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>
> > ---
> >
> > This is an initial submission to gauge interest.
> >
> > I do claim this provides great bang for the buck, I don't claim it
> > solves the problem overall. *something* finer-grained will need to
> > land.
> >
> > I wanted to add bcachefs to the list, but I ran into memory reclamation
> > issues again (first time here:
> > https://lore.kernel.org/all/CAGudoHGenxzk0ZqPXXi1_QDbfqQhGHu+wUwzyS6WmfkUZ1HiXA@xxxxxxxxxxxxxx/),
> > did not have time to mess with diagnostic to write a report yet.
> >
> > I'll post a patchset with this (+ tidy ups to comments and whatnot) +
> > btrfs + bcachefs conversion after the above gets reported and sorted
> > out.
> >
> > Also interestingly things improved since last year, when Linux needed
> > about a minute.
> >
> >  fs/inode.c         | 106 +++++++++++++++++++++++++++++++++++++--------
> >  include/linux/fs.h |  10 ++++-
> >  2 files changed, 98 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 3a41f83a4ba5..f40b868f491f 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -886,36 +886,43 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
> >       return freed;
> >  }
> >
> > -static void __wait_on_freeing_inode(struct inode *inode);
> > +static void __wait_on_freeing_inode(struct inode *inode, bool locked);
> >  /*
> >   * Called with the inode lock held.
> >   */
> >  static struct inode *find_inode(struct super_block *sb,
> >                               struct hlist_head *head,
> >                               int (*test)(struct inode *, void *),
> > -                             void *data)
> > +                             void *data, bool locked)
> >  {
> >       struct inode *inode = NULL;
> >
> > +     if (locked)
> > +             lockdep_assert_held(&inode_hash_lock);
> > +
> > +     rcu_read_lock();
> >  repeat:
> > -     hlist_for_each_entry(inode, head, i_hash) {
> > +     hlist_for_each_entry_rcu(inode, head, i_hash) {
> >               if (inode->i_sb != sb)
> >                       continue;
> >               if (!test(inode, data))
> >                       continue;
> >               spin_lock(&inode->i_lock);
> >               if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> > -                     __wait_on_freeing_inode(inode);
> > +                     __wait_on_freeing_inode(inode, locked);
> >                       goto repeat;
> >               }
> >               if (unlikely(inode->i_state & I_CREATING)) {
> >                       spin_unlock(&inode->i_lock);
> > +                     rcu_read_unlock();
> >                       return ERR_PTR(-ESTALE);
> >               }
> >               __iget(inode);
> >               spin_unlock(&inode->i_lock);
> > +             rcu_read_unlock();
> >               return inode;
> >       }
> > +     rcu_read_unlock();
> >       return NULL;
> >  }
> >
> > @@ -924,29 +931,37 @@ static struct inode *find_inode(struct super_block *sb,
> >   * iget_locked for details.
> >   */
> >  static struct inode *find_inode_fast(struct super_block *sb,
> > -                             struct hlist_head *head, unsigned long ino)
> > +                             struct hlist_head *head, unsigned long ino,
> > +                             bool locked)
> >  {
> >       struct inode *inode = NULL;
> >
> > +     if (locked)
> > +             lockdep_assert_held(&inode_hash_lock);
> > +
> > +     rcu_read_lock();
> >  repeat:
> > -     hlist_for_each_entry(inode, head, i_hash) {
> > +     hlist_for_each_entry_rcu(inode, head, i_hash) {
> >               if (inode->i_ino != ino)
> >                       continue;
> >               if (inode->i_sb != sb)
> >                       continue;
> >               spin_lock(&inode->i_lock);
> >               if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> > -                     __wait_on_freeing_inode(inode);
> > +                     __wait_on_freeing_inode(inode, locked);
> >                       goto repeat;
> >               }
> >               if (unlikely(inode->i_state & I_CREATING)) {
> >                       spin_unlock(&inode->i_lock);
> > +                     rcu_read_unlock();
> >                       return ERR_PTR(-ESTALE);
> >               }
> >               __iget(inode);
> >               spin_unlock(&inode->i_lock);
> > +             rcu_read_unlock();
> >               return inode;
> >       }
> > +     rcu_read_unlock();
> >       return NULL;
> >  }
> >
> > @@ -1161,7 +1176,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
> >
> >  again:
> >       spin_lock(&inode_hash_lock);
> > -     old = find_inode(inode->i_sb, head, test, data);
> > +     old = find_inode(inode->i_sb, head, test, data, true);
> >       if (unlikely(old)) {
> >               /*
> >                * Uhhuh, somebody else created the same inode under us.
> > @@ -1245,6 +1260,43 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
> >  }
> >  EXPORT_SYMBOL(iget5_locked);
> >
> > +/**
> > + * iget5_locked_rcu - obtain an inode from a mounted file system
> > + *
> > + * This is equivalent to iget5_locked, except the @test callback must
> > + * tolerate inode not being stable, including being mid-teardown.
> > + */
> > +struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
> > +             int (*test)(struct inode *, void *),
> > +             int (*set)(struct inode *, void *), void *data)
> > +{
> > +     struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> > +     struct inode *inode, *new;
> > +
> > +again:
> > +     inode = find_inode(sb, head, test, data, false);
> > +     if (inode) {
> > +             if (IS_ERR(inode))
> > +                     return NULL;
> > +             wait_on_inode(inode);
> > +             if (unlikely(inode_unhashed(inode))) {
> > +                     iput(inode);
> > +                     goto again;
> > +             }
> > +             return inode;
> > +     }
> > +
> > +     new = alloc_inode(sb);
> > +     if (new) {
> > +             new->i_state = 0;
> > +             inode = inode_insert5(new, hashval, test, set, data);
> > +             if (unlikely(inode != new))
> > +                     destroy_inode(new);
> > +     }
> > +     return inode;
> > +}
> > +EXPORT_SYMBOL(iget5_locked_rcu);
> > +
> >  /**
> >   * iget_locked - obtain an inode from a mounted file system
> >   * @sb:              super block of file system
> > @@ -1263,9 +1315,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> >       struct hlist_head *head = inode_hashtable + hash(sb, ino);
> >       struct inode *inode;
> >  again:
> > -     spin_lock(&inode_hash_lock);
> > -     inode = find_inode_fast(sb, head, ino);
> > -     spin_unlock(&inode_hash_lock);
> > +     inode = find_inode_fast(sb, head, ino, false);
> >       if (inode) {
> >               if (IS_ERR(inode))
> >                       return NULL;
> > @@ -1283,7 +1333,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> >
> >               spin_lock(&inode_hash_lock);
> >               /* We released the lock, so.. */
> > -             old = find_inode_fast(sb, head, ino);
> > +             old = find_inode_fast(sb, head, ino, true);
> >               if (!old) {
> >                       inode->i_ino = ino;
> >                       spin_lock(&inode->i_lock);
> > @@ -1419,13 +1469,31 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
> >       struct inode *inode;
> >
> >       spin_lock(&inode_hash_lock);
> > -     inode = find_inode(sb, head, test, data);
> > +     inode = find_inode(sb, head, test, data, true);
> >       spin_unlock(&inode_hash_lock);
> >
> >       return IS_ERR(inode) ? NULL : inode;
> >  }
> >  EXPORT_SYMBOL(ilookup5_nowait);
> >
> > +/**
> > + * ilookup5_nowait_rcu - search for an inode in the inode cache
> > + *
> > + * This is equivalent to ilookup5_nowait, except the @test callback must
> > + * tolerate inode not being stable, including being mid-teardown.
> > + */
> > +struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
> > +             int (*test)(struct inode *, void *), void *data)
> > +{
> > +     struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> > +     struct inode *inode;
> > +
> > +     inode = find_inode(sb, head, test, data, false);
> > +
> > +     return IS_ERR(inode) ? NULL : inode;
> > +}
> > +EXPORT_SYMBOL(ilookup5_nowait_rcu);
> > +
> >  /**
> >   * ilookup5 - search for an inode in the inode cache
> >   * @sb:              super block of file system to search
> > @@ -1474,7 +1542,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
> >       struct inode *inode;
> >  again:
> >       spin_lock(&inode_hash_lock);
> > -     inode = find_inode_fast(sb, head, ino);
> > +     inode = find_inode_fast(sb, head, ino, true);
> >       spin_unlock(&inode_hash_lock);
> >
> >       if (inode) {
> > @@ -2235,17 +2303,21 @@ EXPORT_SYMBOL(inode_needs_sync);
> >   * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
> >   * will DTRT.
> >   */
> > -static void __wait_on_freeing_inode(struct inode *inode)
> > +static void __wait_on_freeing_inode(struct inode *inode, bool locked)
> >  {
> >       wait_queue_head_t *wq;
> >       DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
> >       wq = bit_waitqueue(&inode->i_state, __I_NEW);
> >       prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> >       spin_unlock(&inode->i_lock);
> > -     spin_unlock(&inode_hash_lock);
> > +     rcu_read_unlock();
> > +     if (locked)
> > +             spin_unlock(&inode_hash_lock);
> >       schedule();
> >       finish_wait(wq, &wait.wq_entry);
> > -     spin_lock(&inode_hash_lock);
> > +     if (locked)
> > +             spin_lock(&inode_hash_lock);
> > +     rcu_read_lock();
> >  }
> >
> >  static __initdata unsigned long ihash_entries;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 0283cf366c2a..2817c915d355 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3021,6 +3021,9 @@ extern void d_mark_dontcache(struct inode *inode);
> >  extern struct inode *ilookup5_nowait(struct super_block *sb,
> >               unsigned long hashval, int (*test)(struct inode *, void *),
> >               void *data);
> > +extern struct inode *ilookup5_nowait_rcu(struct super_block *sb,
> > +             unsigned long hashval, int (*test)(struct inode *, void *),
> > +             void *data);
> >  extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
> >               int (*test)(struct inode *, void *), void *data);
> >  extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
> > @@ -3029,7 +3032,12 @@ extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
> >               int (*test)(struct inode *, void *),
> >               int (*set)(struct inode *, void *),
> >               void *data);
> > -extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
> > +extern struct inode * iget5_locked(struct super_block *, unsigned long,
> > +                                int (*test)(struct inode *, void *),
> > +                                int (*set)(struct inode *, void *), void *);
> > +extern struct inode * iget5_locked_rcu(struct super_block *, unsigned long,
> > +                                    int (*test)(struct inode *, void *),
> > +                                    int (*set)(struct inode *, void *), void *);
> >  extern struct inode * iget_locked(struct super_block *, unsigned long);
> >  extern struct inode *find_inode_nowait(struct super_block *,
> >                                      unsigned long,
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR



-- 
Mateusz Guzik <mjguzik gmail.com>





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux