On 8/9/23, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
>> Hello,
>>
>> new_inode_pseudo is:
>>         struct inode *inode = alloc_inode(sb);
>>
>>         if (inode) {
>>                 spin_lock(&inode->i_lock);
>>                 inode->i_state = 0;
>>                 spin_unlock(&inode->i_lock);
>>         }
>>
>> I'm trying to understand:
>> 1. why is it zeroing i_state (as opposed to having it happen in
>> inode_init_always)
>> 2. why is the zeroing taking place with i_lock held
>>
>> The inode is freshly allocated, not yet added to the hash -- I would
>> expect that nobody else can see it.
>
> Maybe not at this point, but as soon as the function returns with
> the new inode, it could be published in some list that can be
> accessed concurrently, and then the i_state visible on other CPUs
> better be correct.
>
> I'll come back to this, because the answer lies in this code:
>
>> Moreover, another consumer of alloc_inode zeroes without bothering
>> to lock -- see iget5_locked:
>> [snip]
>>         struct inode *new = alloc_inode(sb);
>>
>>         if (new) {
>>                 new->i_state = 0;
>> [/snip]
>
> Yes, that one is fine because the inode has not been published yet.
> The actual i_state serialisation needed to publish the inode happens
> in the function called in the very next line - inode_insert5().
>
> That does:
>
>         spin_lock(&inode_hash_lock);
>         .....
>         /*
>          * Return the locked inode with I_NEW set, the
>          * caller is responsible for filling in the contents
>          */
>         spin_lock(&inode->i_lock);
>         inode->i_state |= I_NEW;
>         hlist_add_head_rcu(&inode->i_hash, head);
>         spin_unlock(&inode->i_lock);
>         .....
>         spin_unlock(&inode_hash_lock);
>
> The i_lock is held across the inode state initialisation and hash
> list insert so that if anything finds the inode in the hash
> immediately after insert, it will see an initialised value.
>
> Don't be fooled by the inode_hash_lock here. We have
> find_inode_rcu() which walks hash lists without holding the hash
> lock, hence anything that needs to do a state check on the found
> inode is guaranteed to see I_NEW after grabbing the i_lock....
>
> Further, inode_insert5() adds the inode to the superblock inode
> list, which means concurrent sb inode list walkers can also see this
> inode whilst the inode_hash_lock is still held by inode_insert5().
> Those inode list walkers *must* see I_NEW at this point, and they
> are guaranteed to do so by taking i_lock before checking i_state....
>
> IOWs, the initialisation of inode->i_state for normal inodes must be
> done under i_lock so that lookups which occur after the hash/sb list
> insert are guaranteed to see the correct value.
>
> If we now go back to new_inode_pseudo(), we see one of the callers
> is new_inode(), and it does this:
>
> struct inode *new_inode(struct super_block *sb)
> {
>         struct inode *inode;
>
>         spin_lock_prefetch(&sb->s_inode_list_lock);
>
>         inode = new_inode_pseudo(sb);
>         if (inode)
>                 inode_sb_list_add(inode);
>         return inode;
> }
>
> IOWs, the inode is immediately published on the superblock inode
> list, and so inode list walkers can see it immediately. As per
> inode_insert5(), this requires the inode state to be fully
> initialised and memory barriers in place such that any walker will
> see the correct value of i_state. The simplest, safest way to do
> this is to initialise i_state under the i_lock....
>

Thanks for the detailed answer. I do think you have a valid point,
but I don't think it works with the given example ;)

inode_sb_list_add is:

        spin_lock(&inode->i_sb->s_inode_list_lock);
        list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
        spin_unlock(&inode->i_sb->s_inode_list_lock);

... thus i_state is published by the time it unlocks. According to my
grep, all iterations over the list hold s_inode_list_lock, so they are
guaranteed to see the update, making the release fence in
new_inode_pseudo redundant for this case.
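To spell out the pattern I mean, here is a minimal sketch of such a
walker (walk_sb_inodes() is a made-up name and the body is modeled
loosely on what evict_inodes() does -- an illustration, not verbatim
kernel code):

        static void walk_sb_inodes(struct super_block *sb)
        {
                struct inode *inode;

                spin_lock(&sb->s_inode_list_lock);
                list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                        /*
                         * Acquiring s_inode_list_lock pairs with the
                         * unlock in inode_sb_list_add(), so any i_state
                         * store done before the list insertion is
                         * already visible here.
                         */
                        spin_lock(&inode->i_lock);
                        if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                                spin_unlock(&inode->i_lock);
                                continue;
                        }
                        /* ... per-inode work ... */
                        spin_unlock(&inode->i_lock);
                }
                spin_unlock(&sb->s_inode_list_lock);
        }

A walker of this shape observes the zeroed i_state by virtue of the
list lock alone; it takes i_lock to serialise against later state
transitions, not to make the initial store visible.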
With this in mind, I'm assuming the fence was there as a safety
measure, for consumers which might need it. Then the code can be:

        struct inode *inode = alloc_inode(sb);

        if (inode) {
                inode->i_state = 0;
                /*
                 * make sure i_state update will be visible before we
                 * insert the inode anywhere
                 */
                smp_wmb();
        }

Upshots:
- replaces 2 atomics with a mere release fence, which is way cheaper
  everywhere and virtually free on x86-64
- people reading the code don't wonder who on earth we are locking
  against

All that said, if the (possibly redundant) fence is literally the
only reason for the lock trip, I would once more propose zeroing in
inode_init_always:

diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..ce9664c4efe9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -232,6 +232,13 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 		return -ENOMEM;
 	this_cpu_inc(nr_inodes);
 
+	inode->i_state = 0;
+	/*
+	 * Make sure i_state update is visible before this inode gets
+	 * inserted anywhere.
+	 */
+	smp_wmb();
+
 	return 0;
 }
 EXPORT_SYMBOL(inode_init_always);

This is more in the spirit of making sure everybody has published
i_state = 0, and it facilitates cleanups:
- new_inode_pseudo becomes just alloc_inode
- the confusing unlocked/unfenced i_state = 0 disappears from
  iget5_locked
... and probably some more tidyups.

Now, I'm not going to flame anyone over doing smp_wmb instead of the
lock trip (it looks like a no-brainer to me, but I got flamed over
another one earlier today ;>). I am, however, going to /strongly
suggest/ that a comment explaining what's going on is added there if
the current state is to remain. As far as I'm concerned, *locking*
when a mere smp_wmb would suffice is heavily misleading and should be
whacked if only for that reason.

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>