On Sun, 29 Oct 2023 at 23:29, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Mon, Oct 23, 2023 at 09:55:08AM -0300, Wedson Almeida Filho wrote: > > On Mon, 23 Oct 2023 at 02:29, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > IOWs, if you follow the general rule that any inode->i_state access > > > (read or write) needs to hold inode->i_lock, you probably won't > > > screw up. > > > > I don't see filesystems doing this though. In particular, see > > iget_locked() -- if a new inode is returned, then it is locked, but if > > a cached one is found, it's not locked. > > I did say "if you follow the general rule". > > And where there is a "general rule" there is the implication that > there are special cases where the "general rule" doesn't get > applied, yes? :) Sure. But when you say "if _you_ do X", it gives me the impression that I have a choice. But if want to use `iget_locked`, I don't have the option to follow the "general rule" you state. I guess I have the option to ignore `iget_locked`. :) > I_NEW is the exception to the general rule, and very few people > writing filesystems actually know about it let alone care about > it... <snip> > All of them are perfectly fine. I'm not sure I agree with this. They may be fine, but I wouldn't say perfectly. :) > I_NEW is the bit we use to synchronise inode initialisation - we > have to ensure there is only a single initialisation running while > there are concurrent lookups that can find the inode whilst it is > being initialised. We cannot hold a spin lock over inode > initialisation (it may have to do IO!), so we set the I_NEW flag > under the i_lock and the inode_hash_lock during hash insertion so > that they are set atomically from the hash lookup POV. If the inode > is then found in cache, wait_on_inode() does the serialisation > against the running initialisation indicated by the __I_NEW bit in > the i_state word. > > Hence if the caller of iget_locked() ever sees I_NEW, it is > guaranteed to have exclusive access to the inode and -must- first > initialise the inode and then call unlock_new_inode() when it has > completed. It doesn't need to hold inode->i_lock in this case > because there's nothing it needs to serialise against as > iget_locked() has already done all that work. > > If the inode is found in cache by iget_locked, then the > wait_on_inode() call is guaranteed to ensure that I_NEW is not set > when it returns. The atomic bit operations on __I_NEW and the memory > barriers in unlock_new_inode() plays an important part in this > dance, and they guarantee that I_NEW has been cleared before > iget_locked() returns. No need for inode->i_lock to be held in this > case, either, because iget_locked() did all the serialisation for > us. Thanks for explanation! Let's consider the case when I call `inode_get`, and it finds an inode that _has_ been fully initialised before, so I_NEW is not set in inode->i_state and the inode is _not_ locked. But the only means of checking that is by inspecting the i_state field, so I do something like: if (!(inode->i_state & I_NEW)) return inode; But now suppose that while I'm doing a naked load on inode->i_state, another cpu is running concurrently and happens to be holding the inode->i_lock, so it is within its right to write to inode->i_state, for example through a call to __inode_add_lru, which has the following: inode->i_state |= I_REFERENCED; So we have a thread doing a naked read and another thread doing a naked write, no ordering between them. Would you agree that this is a data race? (Note that I'm not asking if "it will be ok" or "the compilers today generate the right code", I'm asking merely if you agree this is a data race.) If you do, then you'd have to agree that we are in undefined-behaviour territory. I can quote the spec if you'd like. Anyway, the discussion here is that this is also undefined behaviour in Rust. And we're trying really hard to avoid that. Of course, in cases like this there's not much we can do on the Rust side alone so the conclusion now appears to be that we'll introduce helper functions for this now and live with it. If one day we have a better solution, we'll update just one place. But we want the be very deliberate about these. We don't want to accidentally introduce data races (and therefore potential undefined behaviour). Cheers, -Wedson