On Sat, 5 Oct 2024 at 12:17, Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> Note that atomic_inc_not_zero() contained a full memory barrier that we
> relied upon. But we only need an acquire barrier and so I replaced the
> second load from the file table with a smp_load_acquire(). I'm not
> completely sure this is correct or if we could get away with something
> else. Linus, maybe you have input here?

I don't think this is valid.

You go from this:

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
                return ERR_PTR(-EAGAIN);
        file_reloaded = rcu_dereference_raw(*f);

to this:

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        if (unlikely(!rcuref_long_get(&file->f_count)))
                return ERR_PTR(-EAGAIN);
        file_reloaded = smp_load_acquire(f);

and the thing is, that rcuref_long_get() had better be a *full* memory
barrier. The smp_load_acquire() does absolutely nothing here: it means
that the load will be done before *subsequent* memory operations, but it
is not a barrier against preceding memory operations.

So if rcuref_long_get() isn't ordered (and you made it relaxed, the same
way the existing rcuref_get() is), the CPU can basically move it down
past the smp_load_acquire(), so the code effectively turns into this:

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        file_reloaded = smp_load_acquire(f);
        if (unlikely(!rcuref_long_get(&file->f_count)))
                return ERR_PTR(-EAGAIN);

and now the "file_reloaded" thing is completely pointless.

Of course, you will never *see* this on x86, because all atomics are
always full memory barriers on x86, so when you test this all on that
nice 256-thread CPU, it all works fine. Because on x86, the "relaxed"
memory ordering isn't.

So no. You can't use atomic_long_add_negative_relaxed() in the file
handling, because it really does require a proper memory barrier.

Now, I do think that replacing the cmpxchg loop with
atomic_long_add_negative() is a good idea. But you can't do it this way,
and you can't use the RCUREF logic and just extend it to the file ref.

I would suggest you take that whole "rcuref_long_get()" code and *not*
try to make it look like the rcuref code, but instead make it explicitly
just about the file counting. Because the file counting really does have
these special rules.

Also, honestly, the only reason the file counting is using a "long" is
because the code does *NOT* do overflow checking. But once you start
looking at the sign and doing conditional increments, you can actually
just make the whole refcount an "int" instead, and make "struct file"
potentially smaller.

And yes, that requires that getting a file reference will fail when the
file count goes negative, but that's *good*. That's in fact exactly what
we do for the page ref count, for the exact same reason, btw.

IOW, I think you should make "f_count" be a regular "atomic_t", and you
should make the above sequence actually just be more along the lines of

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;
        ret = atomic_inc_return(&file->f_count);
        file_reloaded = smp_load_acquire(f);

and now you've got the right memory ordering, but you also have the
newly incremented return value and the reloaded file value, so now you
can continue on with something like this:

        // Check that we didn't increment the refcount
        // out of bounds or from zero..
        if (ret > 1) {
                struct file *file_reloaded_cmp = file_reloaded;

                OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
                if (file == file_reloaded_cmp)
                        return file_reloaded;
        }

        // Uhhuh, we got some other file or the file count had
        // overflowed. Decrement the refcount again
        fput(file);
        return ERR_PTR(-EAGAIN);

and I think that might work, although the zero count case worries me
(ie 'fput twice').

Currently we avoid the fput twice because we use that "inc_not_zero()".
So that needs some thinking about.
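For reference, folding the fragments above into a single helper would
look roughly like the sketch below. This is only a sketch of the
suggested direction, not a tested patch: it assumes f_count has already
been switched to a plain atomic_t as suggested, the helper name and the
__rcu-less signature are just placeholders, and the zero-count / 'fput
twice' hole is deliberately left open.

#include <linux/atomic.h>
#include <linux/compiler.h>     /* OPTIMIZER_HIDE_VAR() */
#include <linux/err.h>          /* ERR_PTR() */
#include <linux/file.h>         /* fput() */
#include <linux/fs.h>           /* struct file */
#include <linux/rcupdate.h>     /* rcu_dereference_raw() */
#include <asm/barrier.h>        /* smp_load_acquire() */

/*
 * Sketch of the suggested sequence: grab a reference to the file in
 * slot *f while under rcu_read_lock(), with f_count assumed to be a
 * plain atomic_t.
 *
 * Returns NULL if the slot was empty, ERR_PTR(-EAGAIN) if the caller
 * should retry the lookup, or the file with its refcount bumped.
 */
static struct file *get_file_ref_sketch(struct file **f)
{
        struct file *file, *file_reloaded, *file_reloaded_cmp;
        int ret;

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;

        /*
         * atomic_inc_return() is fully ordered, so the reload below
         * cannot be done before the increment.
         */
        ret = atomic_inc_return(&file->f_count);
        file_reloaded = smp_load_acquire(f);

        /*
         * Check that we didn't increment the refcount out of bounds
         * or from zero, and that the slot still points at the same
         * file.  OPTIMIZER_HIDE_VAR() keeps the compiler from just
         * reusing the first load of *f for the comparison.
         */
        if (ret > 1) {
                file_reloaded_cmp = file_reloaded;
                OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
                if (file == file_reloaded_cmp)
                        return file_reloaded;
        }

        /*
         * Some other file showed up, or the count overflowed, or it
         * was zero - the 'fput twice' case that still needs thinking
         * about.  Drop the reference we took and make the caller
         * retry.
         */
        fput(file);
        return ERR_PTR(-EAGAIN);
}

A caller under rcu_read_lock() would treat NULL as an empty slot and
-EAGAIN as a prompt to retry the lookup, exactly the way the fragments
above already do.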
              Linus