Re: new_inode_pseudo vs locked inode->i_state = 0

Mateusz Guzik <mjguzik@xxxxxxxxx> · Wed, 9 Aug 2023 02:23:59 +0200

On 8/9/23, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
>> Hello,
>>
>> new_inode_pseudo is:
>>         struct inode *inode = alloc_inode(sb);
>>
>> 	if (inode) {
>> 		spin_lock(&inode->i_lock);
>> 		inode->i_state = 0;
>> 		spin_unlock(&inode->i_lock);
>> 	}
>>
>> I'm trying to understand:
>> 1. why is it zeroing i_state (as opposed to have it happen in
>> inode_init_always)
>> 2. why is zeroing taking place with i_lock held
>>
>> The inode is freshly allocated, not yet added to the hash -- I would
>> expect that nobody else can see it.
>
> Maybe not at this point, but as soon as the function returns with
> the new inode, it could be published in some list that can be
> accessed concurrently and then the i_state visible on other CPUs
> better be correct.
>
> I'll come back to this, because the answer lies in this code:
>
>> Moreover, another consumer of alloc_inode zeroes without bothering to
>> lock -- see iget5_locked:
>> [snip]
>> 	struct inode *new = alloc_inode(sb);
>>
>> 		if (new) {
>> 			new->i_state = 0;
>> [/snip]
>
> Yes, that one is fine because the inode has not been published yet.
> The actual i_state serialisation needed to publish the inode happens
> in the function called in the very next line - inode_insert5().
>
> That does:
>
> 	spin_lock(&inode_hash_lock);
>
> 	.....
>         /*
>          * Return the locked inode with I_NEW set, the
>          * caller is responsible for filling in the contents
>          */
>         spin_lock(&inode->i_lock);
>         inode->i_state |= I_NEW;
>         hlist_add_head_rcu(&inode->i_hash, head);
>         spin_unlock(&inode->i_lock);
> 	.....
>
> 	spin_unlock(&inode_hash_lock);
>
> The i_lock is held across the inode state initialisation and hash
> list insert so that if anything finds the inode in the hash
> immediately after insert, they should set an initialised value.
>
> Don't be fooled by the inode_hash_lock here. We have
> find_inode_rcu() which walks hash lists without holding the hash
> lock, hence if anything needs to do a state check on the found
> inode, they are guaranteed to see I_NEW after grabbing the i_lock....
>
> Further, inode_insert5() adds the inode to the superblock inode
> list, which means concurrent sb inode list walkers can also see this
> inode whilst the inode_hash_lock is still held by inode_insert5().
> Those inode list walkers *must* see I_NEW at this point, and they
> are guaranteed to do so by taking i_lock before checking i_state....
>
> IOWs, the initialisation of inode->i_state for normal inodes must be
> done under i_lock so that lookups that occur after hash/sb list
> insert are guaranteed to see the correct value.
>
> If we now go back to new_inode_pseudo(), we see one of the callers
> is new_inode(), and it does this:
>
> struct inode *new_inode(struct super_block *sb)
> {
>         struct inode *inode;
>
>         spin_lock_prefetch(&sb->s_inode_list_lock);
>
>         inode = new_inode_pseudo(sb);
>         if (inode)
>                 inode_sb_list_add(inode);
>         return inode;
> }
>
> IOWs, the inode is immediately published on the superblock inode
> list, and so inode list walkers can see it immediately. As per
> inode_insert5(), this requires the inode state to be fully
> initialised and memory barriers in place such that any walker will
> see the correct value of i_state. The simplest, safest way to do
> this is to initialise i_state under the i_lock....
>

Thanks for the detailed answer, I do think you have a valid point but
I don't think it works with the given example. ;)

inode_sb_list_add is:
        spin_lock(&inode->i_sb->s_inode_list_lock);
        list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
        spin_unlock(&inode->i_sb->s_inode_list_lock);

... thus i_state is published by the time it unlocks.

According to my grep all iterations over the list hold the
s_inode_list_lock, thus they are guaranteed to see the update, making
the release fence in new_inode_pseudo redundant for this case.

With this in mind I'm assuming the fence was there as a safety
measure, for consumers which would maybe need it.

Then the code can:
        struct inode *inode = alloc_inode(sb);

        if (inode) {
                inode->i_state = 0;
                /* make sure i_state update will be visible before we insert
                 * the inode anywhere */
                smp_wmb();
        }

Upshots:
- replaces 2 atomics with a mere release fence, which is way cheaper
to do everywhere and virtually free on x86-64
- people reading the code don't wonder who on earth are we locking against

All that said, if the (possibly redundant) fence is literally the only
reason for the lock trip, I would once more propose zeroing in
inode_init_always:

diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..ce9664c4efe9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -232,6 +232,13 @@ int inode_init_always(struct super_block *sb,
struct inode *inode)
                return -ENOMEM;
        this_cpu_inc(nr_inodes);

+       inode->i_state = 0;
+       /*
+        * Make sure i_state update is visible before this inode gets inserted
+        * anywhere.
+        */
+       smp_wmb();
+
        return 0;
 }
 EXPORT_SYMBOL(inode_init_always);

This is more in the spirit of making sure everybody has published
i_state = 0 and facilitates cleanup.
- new_inode_pseudo is now just alloc_inode
- confusing unlocked/unfenced i_state = 0 disappears from iget5_locked

And probably some more tidyups.

Now, I'm not going to flame with anyone over doing smp_wmb instead of
the lock trip (looks like a no-brainer to me, but I got flamed for
another one earlier today ;>).

I am however going to /strongly suggest/ that a comment explaining
what's going on is added there, if the current state is to remain.

As far as I'm concerned *locking* when a mere smp_wmb would sufficne
is heavily misleading and should be whacked if only for that reason.

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>