Re: [RFC PATCH 06/19] rust: fs: introduce `FileSystem::init_root`

On Tue, Oct 31, 2023 at 05:49:19PM -0300, Wedson Almeida Filho wrote:
> On Sun, 29 Oct 2023 at 23:29, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Oct 23, 2023 at 09:55:08AM -0300, Wedson Almeida Filho wrote:
> > > On Mon, 23 Oct 2023 at 02:29, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > IOWs, if you follow the general rule that any inode->i_state access
> > > > (read or write) needs to hold inode->i_lock, you probably won't
> > > > screw up.
> > >
> > > I don't see filesystems doing this though. In particular, see
> > > iget_locked() -- if a new inode is returned, then it is locked, but if
> > > a cached one is found, it's not locked.
> >
> > I did say "if you follow the general rule".
> >
> > And where there is a "general rule" there is the implication that
> > there are special cases where the "general rule" doesn't get
> > applied, yes? :)
> 
> Sure. But when you say "if _you_ do X", it gives me the impression
> that I have a choice. But if I want to use `iget_locked`, I don't have
> the option to follow the "general rule" you state.
> 
> I guess I have the option to ignore `iget_locked`. :)
> 
> > I_NEW is the exception to the general rule, and very few people
> > writing filesystems actually know about it let alone care about
> > it...
> <snip>
> > All of them are perfectly fine.
> 
> I'm not sure I agree with this. They may be fine, but I wouldn't say
> perfectly. :)
> 
> > I_NEW is the bit we use to synchronise inode initialisation - we
> > have to ensure there is only a single initialisation running while
> > there are concurrent lookups that can find the inode whilst it is
> > being initialised. We cannot hold a spin lock over inode
> > initialisation (it may have to do IO!), so we set the I_NEW flag
> > under the i_lock and the inode_hash_lock during hash insertion so
> > that they are set atomically from the hash lookup POV. If the inode
> > is then found in cache, wait_on_inode() does the serialisation
> > against the running initialisation indicated by the __I_NEW bit in
> > the i_state word.
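
(To make that concrete: the insertion side in fs/inode.c does,
heavily simplified, something like this:

	spin_lock(&inode_hash_lock);
	spin_lock(&inode->i_lock);
	inode->i_state |= I_NEW;
	hlist_add_head(&inode->i_hash, head);
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_hash_lock);

so any lookup that finds the inode in the hash is also guaranteed
to observe I_NEW on it.)
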
> >
> > Hence if the caller of iget_locked() ever sees I_NEW, it is
> > guaranteed to have exclusive access to the inode and -must- first
> > initialise the inode and then call unlock_new_inode() when it has
> > completed. It doesn't need to hold inode->i_lock in this case
> > because there's nothing it needs to serialise against as
> > iget_locked() has already done all that work.
> >
> > If the inode is found in cache by iget_locked, then the
> > wait_on_inode() call is guaranteed to ensure that I_NEW is not set
> > when it returns. The atomic bit operations on __I_NEW and the memory
> > barriers in unlock_new_inode() play an important part in this
> > dance, and they guarantee that I_NEW has been cleared before
> > iget_locked() returns. No need for inode->i_lock to be held in this
> > case, either, because iget_locked() did all the serialisation for
> > us.
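
In code, the caller-side contract being described is the classic
pattern below - a sketch with error handling trimmed, where
fill_inode_from_disk() is a made-up stand-in for the filesystem's
own initialisation:

	struct inode *inode = iget_locked(sb, ino);
	int err;

	if (!inode)
		return ERR_PTR(-ENOMEM);

	/* NB: this is the unlocked read of i_state discussed below. */
	if (!(inode->i_state & I_NEW))
		return inode;		/* cached, fully initialised */

	/* I_NEW is set: we have exclusive ownership, initialise it. */
	err = fill_inode_from_disk(inode);	/* may do IO */
	if (err) {
		iget_failed(inode);	/* clears I_NEW, wakes waiters */
		return ERR_PTR(err);
	}
	unlock_new_inode(inode);
	return inode;
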
> 
> Thanks for the explanation!
> 
> Let's consider the case when I call `iget_locked`, and it finds an inode
> that _has_ been fully initialised before, so I_NEW is not set in
> inode->i_state and the inode is _not_ locked.
> 
> But the only means of checking that is by inspecting the i_state
> field, so I do something like:
> 
> if (!(inode->i_state & I_NEW))
>     return inode;
> 
> But now suppose that while I'm doing a naked load on inode->i_state,
> another cpu is running concurrently and happens to be holding the
> inode->i_lock, so it is within its right to write to inode->i_state,
> for example through a call to __inode_add_lru, which has the
> following:
> 
> inode->i_state |= I_REFERENCED;
> 
> So we have a thread doing a naked read and another thread doing a
> naked write, no ordering between them.
> 
> Would you agree that this is a data race? (Note that I'm not asking if
> "it will be ok" or "the compilers today generate the right code", I'm
> asking merely if you agree this is a data race.)

I'll agree that technically it is a data race on the entire i_state
word. Practically, however, it is not a data race on the I_NEW bit
within that word. The I_NEW bit remains unchanged across the entire
operation.

i.e. it does not matter where the read of i_state intersects with
> the RMW of the I_REFERENCED bit, the I_NEW bit remains unchanged in
memory across the operation. If the above operation results in the
I_NEW bit changing state in memory - even transiently - then the
compiler implementation is simply broken...
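
Spelled out as code, the two accesses in question are (a sketch of
the scenario, not a patch):

	/* CPU 0: lookup path, plain load of i_state */
	if (!(inode->i_state & I_NEW))
		return inode;

	/* CPU 1: __inode_add_lru(), RMW under i_lock */
	spin_lock(&inode->i_lock);
	inode->i_state |= I_REFERENCED;	/* load; or; store */
	spin_unlock(&inode->i_lock);

The store on CPU 1 writes back the same I_NEW bit value it loaded,
so whichever value the load on CPU 0 observes, the I_NEW bit in it
is correct.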

> If you do, then you'd have to agree that we are in undefined-behaviour
> territory. I can quote the spec if you'd like.

/me shrugs

I can point you at lots of code that will break if bit operations
are allowed to randomly change other bits in the word transiently.

> Anyway, the discussion here is that this is also undefined behaviour
> in Rust. And we're trying really hard to avoid that. Of course, in
> cases like this there's not much we can do on the Rust side alone so
> the conclusion now appears to be that we'll introduce helper functions
> for this and live with it. If one day we have a better solution,
> we'll update just one place.

All the rust code that calls iget_locked() needs to do to "be safe"
is the rust equivalent of:

	spin_lock(&inode->i_lock);
	if (!(inode->i_state & I_NEW)) {
		spin_unlock(&inode->i_lock);
		return inode;
	}
	spin_unlock(&inode->i_lock);

IOWs, we solve the "safety" concern by ensuring that Rust filesystem
implementations follow the general rule of "always hold the i_lock
when accessing inode->i_state" I originally outlined, yes?
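
If the helper functions you mention end up wrapping exactly that,
they can be as small as the sketch below (the name is made up, the
shape is the point):

	/* Hypothetical helper for the Rust bindings to call. */
	static inline bool inode_test_flags_locked(struct inode *inode,
						   unsigned long flags)
	{
		bool set;

		spin_lock(&inode->i_lock);
		set = inode->i_state & flags;
		spin_unlock(&inode->i_lock);
		return set;
	}

The Rust side then only ever observes i_state through a locked
accessor, and the race you're worried about never exists in the
Rust code at all.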

> But we want to be very deliberate about these. We don't want to
> accidentally introduce data races (and therefore potential undefined
> behaviour).

Then stop looking at the C code and all the exceptions we make for
special case optimisations and just code to the generic rules for
safe access to given fields. Yes, rust will then have to give up the
optimisations we make in the C code, but there's always a price for
safety...

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx



