On Sat, 5 Oct 2024 at 12:17, Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> Note that atomic_inc_not_zero() contained a full memory barrier that we
> relied upon. But we only need an acquire barrier and so I replaced the
> second load from the file table with a smp_load_acquire(). I'm not
> completely sure this is correct or if we could get away with something
> else. Linus, maybe you have input here?

I don't think this is valid.

You go from this:

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
                return ERR_PTR(-EAGAIN);
        file_reloaded = rcu_dereference_raw(*f);

to this:

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        if (unlikely(!rcuref_long_get(&file->f_count)))
                return ERR_PTR(-EAGAIN);
        file_reloaded = smp_load_acquire(f);

and the thing is, that rcuref_long_get() had better be a *full* memory
barrier. The smp_load_acquire() does absolutely nothing here: it means
that the load will be done before *subsequent* memory operations, but it
is not a barrier against preceding memory operations.

So if rcuref_long_get() isn't ordered (and you made it relaxed, the same
way the existing rcuref_get() is), the CPU can basically move it down
past the smp_load_acquire(), so the code effectively turns into this:

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        file_reloaded = smp_load_acquire(f);
        if (unlikely(!rcuref_long_get(&file->f_count)))
                return ERR_PTR(-EAGAIN);

and now the "file_reloaded" thing is completely pointless.

Of course, you will never *see* this on x86, because all atomics are
always full memory barriers on x86, so when you test this all on that
nice 256-thread CPU, it all works fine. Because on x86, the "relaxed"
memory ordering isn't.

So no. You can't use atomic_long_add_negative_relaxed() in the file
handling, because it really does require a proper memory barrier.

Now, I do think that replacing the cmpxchg loop with
atomic_long_add_negative() is a good idea. But you can't do it this way,
and you can't use the RCUREF logic and just extend it to the file ref.

I would suggest you take that whole "rcuref_long_get()" code and *not*
try to make it look like the rcuref code, but instead make it explicitly
just about the file counting. Because the file counting really does have
these special rules.

Also, honestly, the only reason the file counting is using a "long" is
because the code does *NOT* do overflow checking. But once you start
looking at the sign and doing conditional increments, you can actually
just make the whole refcount an "int" instead, and make "struct file"
potentially smaller.

And yes, that requires that getting a file reference will fail when the
file count goes negative, but that's *good*. That's in fact exactly what
we do for the page ref count, for the exact same reason, btw.

IOW, I think you should make "f_count" be a regular "atomic_t", and you
should make the above sequence actually just be more along the lines of

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        ret = atomic_inc_return(&file->f_count);
        file_reloaded = smp_load_acquire(f);

and now you've got the right memory ordering, but you also have the
newly incremented return value and the reloaded file value, so now you
can continue on with something like this:

        // Check that we didn't increment the refcount
        // out of bounds or from zero..
        if (ret > 1) {
                struct file *file_reloaded_cmp = file_reloaded;

                OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
                if (file == file_reloaded_cmp)
                        return file_reloaded;
        }

        // Uhhuh, we got some other file or the file count had
        // overflowed. Decrement the refcount again
        fput(file);
        return ERR_PTR(-EAGAIN);

and I think that might work, although the zero count case worries me
(ie 'fput twice').

Currently we avoid the fput twice because we use that "inc_not_zero()".
So that needs some thinking about.
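For reference, folding the fragments above into a single helper would
look roughly like the sketch below. This is only a sketch of the
suggested direction, not a tested patch: it assumes f_count has already
been switched to a plain atomic_t as suggested, the helper name and the
__rcu-less signature are just placeholders, and the zero-count / 'fput
twice' hole is deliberately left open.

#include <linux/atomic.h>
#include <linux/compiler.h>     /* OPTIMIZER_HIDE_VAR() */
#include <linux/err.h>          /* ERR_PTR() */
#include <linux/file.h>         /* fput() */
#include <linux/fs.h>           /* struct file */
#include <linux/rcupdate.h>     /* rcu_dereference_raw() */
#include <asm/barrier.h>        /* smp_load_acquire() */

/*
 * Sketch of the suggested sequence: grab a reference to the file in
 * slot *f while under rcu_read_lock(), with f_count assumed to be a
 * plain atomic_t.
 *
 * Returns NULL if the slot was empty, ERR_PTR(-EAGAIN) if the caller
 * should retry the lookup, or the file with its refcount bumped.
 */
static struct file *get_file_ref_sketch(struct file **f)
{
        struct file *file, *file_reloaded, *file_reloaded_cmp;
        int ret;

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;

        /*
         * atomic_inc_return() is fully ordered, so the reload below
         * cannot be done before the increment.
         */
        ret = atomic_inc_return(&file->f_count);
        file_reloaded = smp_load_acquire(f);

        /*
         * Check that we didn't increment the refcount out of bounds
         * or from zero, and that the slot still points at the same
         * file.  OPTIMIZER_HIDE_VAR() keeps the compiler from just
         * reusing the first load of *f for the comparison.
         */
        if (ret > 1) {
                file_reloaded_cmp = file_reloaded;
                OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
                if (file == file_reloaded_cmp)
                        return file_reloaded;
        }

        /*
         * Some other file showed up, or the count overflowed, or it
         * was zero - the 'fput twice' case that still needs thinking
         * about.  Drop the reference we took and make the caller
         * retry.
         */
        fput(file);
        return ERR_PTR(-EAGAIN);
}

A caller under rcu_read_lock() would treat NULL as an empty slot and
-EAGAIN as a prompt to retry the lookup, exactly the way the fragments
above already do.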
              Linus