On Mon, Oct 23, 2023 at 09:55:08AM -0300, Wedson Almeida Filho wrote: > On Mon, 23 Oct 2023 at 02:29, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > On Sat, Oct 21, 2023 at 12:33:57PM -0700, Boqun Feng wrote: > > > On Sat, Oct 21, 2023 at 06:01:02PM +0100, Matthew Wilcox wrote: > > > > I'm only an expert on the page cache, not the rest of the VFS. So > > > > what are the rules around modifying i_state for the VFS? > > > > > > Agreed, same question here. > > > > inode->i_state should only be modified under inode->i_lock. > > > > And in most situations, you have to hold the inode->i_lock to read > > state flags as well so that reads are serialised against > > modifications which are typically non-atomic RMW operations. > > > > There is, I think, one main exception to read side locking and this > > is find_inode_rcu() which does an unlocked check for I_WILL_FREE | > > I_FREEING. In this case, the inode->i_state updates in iput_final() > > use WRITE_ONCE under the inode->i_lock to provide the necessary > > semantics for the unlocked READ_ONCE() done under rcu_read_lock(). > > > > IOWs, if you follow the general rule that any inode->i_state access > > (read or write) needs to hold inode->i_lock, you probably won't > > screw up. > > I don't see filesystems doing this though. In particular, see > iget_locked() -- if a new inode is returned, then it is locked, but if > a cached one is found, it's not locked. I did say "if you follow the general rule". And where there is a "general rule" there is the implication that there are special cases where the "general rule" doesn't get applied, yes? :) I_NEW is the exception to the general rule, and very few people writing filesystems actually know about it let alone care about it... > So we're in this situation where a returned inode may or may not be > locked. And the way to determine if it's locked or not is to read > i_state. > > Here are examples of kernfs, ext2, ext4 and squashfs doing it: > https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/kernfs/inode.c#L252 > https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/ext2/inode.c#L1392 > https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/ext4/inode.c#L4707 > https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/squashfs/inode.c#L82 > > They all call iget_locked(), and if I_NEW is set, they initialise the > inode and unlock it with unlock_new_inode(); otherwise they just > return the unlocked inode. All of them are perfectly fine. I_NEW is the bit we use to synchronise inode initialisation - we have to ensure there is only a single initialisation running while there are concurrent lookups that can find the inode whilst it is being initialised. We cannot hold a spin lock over inode initialisation (it may have to do IO!), so we set the I_NEW flag under the i_lock and the inode_hash_lock during hash insertion so that they are set atomically from the hash lookup POV. If the inode is then found in cache, wait_on_inode() does the serialisation against the running initialisation indicated by the __I_NEW bit in the i_state word. Hence if the caller of iget_locked() ever sees I_NEW, it is guaranteed to have exclusive access to the inode and -must- first initialise the inode and then call unlock_new_inode() when it has completed. It doesn't need to hold inode->i_lock in this case because there's nothing it needs to serialise against as iget_locked() has already done all that work. If the inode is found in cache by iget_locked, then the wait_on_inode() call is guaranteed to ensure that I_NEW is not set when it returns. The atomic bit operations on __I_NEW and the memory barriers in unlock_new_inode() plays an important part in this dance, and they guarantee that I_NEW has been cleared before iget_locked() returns. No need for inode->i_lock to be held in this case, either, because iget_locked() did all the serialisation for us. This special dance is an optimisation that avoids the need to take inode->i_lock in the inode lookup fast path just to check I_NEW. It is an exception to the general rule but internal it uses inode->i_lock in the places it is needed to ensure anything using the general rule about accessing i_state still behaves correctly. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx