On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
> Hello,
>
> new_inode_pseudo is:
>
>         struct inode *inode = alloc_inode(sb);
>
>         if (inode) {
>                 spin_lock(&inode->i_lock);
>                 inode->i_state = 0;
>                 spin_unlock(&inode->i_lock);
>         }
>
> I'm trying to understand:
> 1. why is it zeroing i_state (as opposed to have it happen in inode_init_always)
> 2. why is zeroing taking place with i_lock held
>
> The inode is freshly allocated, not yet added to the hash -- I would
> expect that nobody else can see it.

Maybe not at this point, but as soon as the function returns with the
new inode, it could be published in some list that can be accessed
concurrently, and then the i_state visible on other CPUs had better
be correct.

I'll come back to this, because the answer lies in this code:

> Moreover, another consumer of alloc_inode zeroes without bothering to
> lock -- see iget5_locked:
> [snip]
>         struct inode *new = alloc_inode(sb);
>
>         if (new) {
>                 new->i_state = 0;
> [/snip]

Yes, that one is fine because the inode has not been published yet.
The actual i_state serialisation needed to publish the inode happens
in the function called on the very next line - inode_insert5(). That
does:

	spin_lock(&inode_hash_lock);
	.....
	/*
	 * Return the locked inode with I_NEW set, the
	 * caller is responsible for filling in the contents
	 */
	spin_lock(&inode->i_lock);
	inode->i_state |= I_NEW;
	hlist_add_head_rcu(&inode->i_hash, head);
	spin_unlock(&inode->i_lock);
	.....
	spin_unlock(&inode_hash_lock);

The i_lock is held across the inode state initialisation and hash
list insert so that if anything finds the inode in the hash
immediately after insert, it will see an initialised value.

Don't be fooled by the inode_hash_lock here. We have find_inode_rcu(),
which walks hash lists without holding the hash lock, hence if
anything needs to do a state check on the found inode, it is
guaranteed to see I_NEW after grabbing the i_lock....

Further, inode_insert5() adds the inode to the superblock inode list,
which means concurrent sb inode list walkers can also see this inode
whilst the inode_hash_lock is still held by inode_insert5(). Those
inode list walkers *must* see I_NEW at this point, and they are
guaranteed to do so by taking i_lock before checking i_state....

IOWs, the initialisation of inode->i_state for normal inodes must be
done under i_lock so that lookups that occur after hash/sb list
insert are guaranteed to see the correct value.

If we now go back to new_inode_pseudo(), we see one of the callers is
new_inode(), and it does this:

	struct inode *new_inode(struct super_block *sb)
	{
		struct inode *inode;

		spin_lock_prefetch(&sb->s_inode_list_lock);

		inode = new_inode_pseudo(sb);
		if (inode)
			inode_sb_list_add(inode);
		return inode;
	}

IOWs, the inode is immediately published on the superblock inode
list, and so inode list walkers can see it immediately. As per
inode_insert5(), this requires the inode state to be fully
initialised and memory barriers in place such that any walker will
see the correct value of i_state. The simplest, safest way to do this
is to initialise i_state under the i_lock....
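To make the walker side of that concrete, here is a minimal sketch of
the pattern the sb inode list walkers follow - modelled loosely on
evict_inodes() in fs/inode.c, with a made-up function name - showing
why a walker is guaranteed to observe the i_state value the publisher
stored under i_lock:

	/*
	 * Minimal sketch, not actual kernel code: an sb inode list
	 * walker takes i_lock before it inspects i_state, so it
	 * observes either I_NEW or the fully initialised state that
	 * was stored while holding the same lock.
	 */
	static void example_walk_sb_inodes(struct super_block *sb)
	{
		struct inode *inode;

		spin_lock(&sb->s_inode_list_lock);
		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
			spin_lock(&inode->i_lock);
			if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
				/* still being set up or torn down - skip */
				spin_unlock(&inode->i_lock);
				continue;
			}
			/* ... safe to look at the inode here ... */
			spin_unlock(&inode->i_lock);
		}
		spin_unlock(&sb->s_inode_list_lock);
	}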
> I don't know the original justification nor whether it made sense at
> the time, this is definitely problematic today in the rather heavy
> multicore era -- there is tons of work happening between the prefetch
> and actually take the s_inode_list_lock lock, meaning if there is
> contention, the cacheline is going to be marked invalid by the time
> spin_lock on it is called. But then this only adds to cacheline
> bouncing.

Well known problem - sb->s_inode_list_lock has been the most heavily
contended lock in the VFS for various XFS workloads for the best part
of a decade, yet XFS does not use new_inode().

See this branch for how we fix the s_inode_list_lock contention
issues:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/log/?h=vfs-scale

This commit removes the spin_lock_prefetch() you talk about:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=vfs-scale&id=fb17545b70d8c228295105044dd6b52085197d75

If only I had the time and resources available right now to push this
to completion.....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx