On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
> Hello,
>
> new_inode_pseudo is:
>
>         struct inode *inode = alloc_inode(sb);
>
>         if (inode) {
>                 spin_lock(&inode->i_lock);
>                 inode->i_state = 0;
>                 spin_unlock(&inode->i_lock);
>         }
>
> I'm trying to understand:
> 1. why is it zeroing i_state (as opposed to have it happen in inode_init_always)
> 2. why is zeroing taking place with i_lock held
>
> The inode is freshly allocated, not yet added to the hash -- I would
> expect that nobody else can see it.

Maybe not at this point, but as soon as the function returns with the
new inode, it could be published in some list that can be accessed
concurrently, and then the i_state visible on other CPUs had better
be correct.

I'll come back to this, because the answer lies in this code:

> Moreover, another consumer of alloc_inode zeroes without bothering to
> lock -- see iget5_locked:
> [snip]
>         struct inode *new = alloc_inode(sb);
>
>         if (new) {
>                 new->i_state = 0;
> [/snip]

Yes, that one is fine because the inode has not been published yet.
The actual i_state serialisation needed to publish the inode happens
in the function called on the very next line - inode_insert5(). That
does:

	spin_lock(&inode_hash_lock);
	.....
	/*
	 * Return the locked inode with I_NEW set, the
	 * caller is responsible for filling in the contents
	 */
	spin_lock(&inode->i_lock);
	inode->i_state |= I_NEW;
	hlist_add_head_rcu(&inode->i_hash, head);
	spin_unlock(&inode->i_lock);
	.....
	spin_unlock(&inode_hash_lock);

The i_lock is held across the inode state initialisation and hash
list insert so that if anything finds the inode in the hash
immediately after insert, it will see an initialised value.

Don't be fooled by the inode_hash_lock here. We have find_inode_rcu(),
which walks hash lists without holding the hash lock, hence if
anything needs to do a state check on the found inode, it is
guaranteed to see I_NEW after grabbing the i_lock....

Further, inode_insert5() adds the inode to the superblock inode list,
which means concurrent sb inode list walkers can also see this inode
whilst the inode_hash_lock is still held by inode_insert5(). Those
inode list walkers *must* see I_NEW at this point, and they are
guaranteed to do so by taking i_lock before checking i_state....

IOWs, the initialisation of inode->i_state for normal inodes must be
done under i_lock so that lookups that occur after hash/sb list
insert are guaranteed to see the correct value.

If we now go back to new_inode_pseudo(), we see one of the callers is
new_inode(), and it does this:

	struct inode *new_inode(struct super_block *sb)
	{
		struct inode *inode;

		spin_lock_prefetch(&sb->s_inode_list_lock);

		inode = new_inode_pseudo(sb);
		if (inode)
			inode_sb_list_add(inode);
		return inode;
	}

IOWs, the inode is immediately published on the superblock inode
list, and so inode list walkers can see it immediately. As per
inode_insert5(), this requires the inode state to be fully
initialised and memory barriers in place such that any walker will
see the correct value of i_state. The simplest, safest way to do this
is to initialise i_state under the i_lock....
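To make the walker side of that concrete, here is a minimal sketch of
the pattern the sb inode list walkers follow - modelled loosely on
evict_inodes() in fs/inode.c, with a made-up function name - showing
why a walker is guaranteed to observe the i_state value the publisher
stored under i_lock:

	/*
	 * Minimal sketch, not actual kernel code: an sb inode list
	 * walker takes i_lock before it inspects i_state, so it
	 * observes either I_NEW or the fully initialised state that
	 * was stored while holding the same lock.
	 */
	static void example_walk_sb_inodes(struct super_block *sb)
	{
		struct inode *inode;

		spin_lock(&sb->s_inode_list_lock);
		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
			spin_lock(&inode->i_lock);
			if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
				/* still being set up or torn down - skip */
				spin_unlock(&inode->i_lock);
				continue;
			}
			/* ... safe to look at the inode here ... */
			spin_unlock(&inode->i_lock);
		}
		spin_unlock(&sb->s_inode_list_lock);
	}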
> I don't know the original justification nor whether it made sense at
> the time, this is definitely problematic today in the rather heavy
> multicore era -- there is tons of work happening between the prefetch
> and actually take the s_inode_list_lock lock, meaning if there is
> contention, the cacheline is going to be marked invalid by the time
> spin_lock on it is called. But then this only adds to cacheline
> bouncing.

Well known problem - sb->s_inode_list_lock has been the most heavily
contended lock in the VFS for various XFS workloads for the best part
of a decade, yet XFS does not use new_inode().

See this branch for how we fix the s_inode_list_lock contention
issues:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/log/?h=vfs-scale

This commit removes the spin_lock_prefetch() you talk about:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=vfs-scale&id=fb17545b70d8c228295105044dd6b52085197d75

If only I had the time and resources available right now to push this
to completion.....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx