Re: [RFC PATCH 06/19] rust: fs: introduce `FileSystem::init_root`

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 30 Oct 2023 13:29:47 +1100

On Mon, Oct 23, 2023 at 09:55:08AM -0300, Wedson Almeida Filho wrote:
> On Mon, 23 Oct 2023 at 02:29, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Sat, Oct 21, 2023 at 12:33:57PM -0700, Boqun Feng wrote:
> > > On Sat, Oct 21, 2023 at 06:01:02PM +0100, Matthew Wilcox wrote:
> > > > I'm only an expert on the page cache, not the rest of the VFS.  So
> > > > what are the rules around modifying i_state for the VFS?
> > >
> > > Agreed, same question here.
> >
> > inode->i_state should only be modified under inode->i_lock.
> >
> > And in most situations, you have to hold the inode->i_lock to read
> > state flags as well so that reads are serialised against
> > modifications which are typically non-atomic RMW operations.
> >
> > There is, I think, one main exception to read side locking and this
> > is find_inode_rcu() which does an unlocked check for I_WILL_FREE |
> > I_FREEING. In this case, the inode->i_state updates in iput_final()
> > use WRITE_ONCE under the inode->i_lock to provide the necessary
> > semantics for the unlocked READ_ONCE() done under rcu_read_lock().
> >
> > IOWs, if you follow the general rule that any inode->i_state access
> > (read or write) needs to hold inode->i_lock, you probably won't
> > screw up.
> 
> I don't see filesystems doing this though. In particular, see
> iget_locked() -- if a new inode is returned, then it is locked, but if
> a cached one is found, it's not locked.

I did say "if you follow the general rule".

And where there is a "general rule" there is the implication that
there are special cases where the "general rule" doesn't get
applied, yes? :)

I_NEW is the exception to the general rule, and very few people
writing filesystems actually know about it let alone care about
it...

> So we're in this situation where a returned inode may or may not be
> locked. And the way to determine if it's locked or not is to read
> i_state.
> 
> Here are examples of kernfs, ext2, ext4 and squashfs doing it:
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/kernfs/inode.c#L252
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/ext2/inode.c#L1392
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/ext4/inode.c#L4707
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/squashfs/inode.c#L82
> 
> They all call iget_locked(), and if I_NEW is set, they initialise the
> inode and unlock it with unlock_new_inode(); otherwise they just
> return the unlocked inode.

All of them are perfectly fine.

I_NEW is the bit we use to synchronise inode initialisation - we
have to ensure there is only a single initialisation running while
there are concurrent lookups that can find the inode whilst it is
being initialised. We cannot hold a spin lock over inode
initialisation (it may have to do IO!), so we set the I_NEW flag
under the i_lock and the inode_hash_lock during hash insertion so
that they are set atomically from the hash lookup POV. If the inode
is then found in cache, wait_on_inode() does the serialisation
against the running initialisation indicated by the __I_NEW bit in
the i_state word.

Hence if the caller of iget_locked() ever sees I_NEW, it is
guaranteed to have exclusive access to the inode and -must- first
initialise the inode and then call unlock_new_inode() when it has
completed. It doesn't need to hold inode->i_lock in this case
because there's nothing it needs to serialise against as
iget_locked() has already done all that work.

If the inode is found in cache by iget_locked, then the
wait_on_inode() call is guaranteed to ensure that I_NEW is not set
when it returns. The atomic bit operations on __I_NEW and the memory
barriers in unlock_new_inode() plays an important part in this
dance, and they guarantee that I_NEW has been cleared before
iget_locked() returns. No need for inode->i_lock to be held in this
case, either, because iget_locked() did all the serialisation for
us.

This special dance is an optimisation that avoids the need to take
inode->i_lock in the inode lookup fast path just to check I_NEW. It
is an exception to the general rule but internal it uses
inode->i_lock in the places it is needed to ensure anything using
the general rule about accessing i_state still behaves correctly.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx