As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it
has O(N^2) behaviour under contention with N concurrent operations.
The rcuref infrastructure uses atomic_add_negative_relaxed() for the
fast path, which scales better under contention, and we get overflow
protection for free.

I've been testing this with will-it-scale using a multi-threaded
fstat() on the same file descriptor on a machine that Jens gave me
access to (thank you very much!):

processor       : 511
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on workloads with 256 or
more threads, comparing v6.12-rc1 as the base against the same kernel
with these patches applied.

Note that atomic_inc_not_zero() implied a full memory barrier that we
relied upon, but we only need an acquire barrier, so I replaced the
second load from the file table with a smp_load_acquire(). I'm not
completely sure this is correct or whether we could get away with
something else. Linus, maybe you have input here?

Maybe this is all a bad idea but I've spent enough time on performance
testing this that I at least wanted to have it on the list.

Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
Christian Brauner (4):
      fs: protect backing files with rcu
      types: add rcuref_long_t
      rcuref: add rcuref_long_*() helpers
      fs: port files to rcuref_long_t

 drivers/gpu/drm/i915/gt/shmem_utils.c |   2 +-
 drivers/gpu/drm/vmwgfx/ttm_object.c   |   2 +-
 fs/eventpoll.c                        |   2 +-
 fs/file.c                             |  17 ++--
 fs/file_table.c                       |  18 ++--
 include/linux/fs.h                    |   9 +-
 include/linux/rcuref_long.h           | 166 ++++++++++++++++++++++++++++++++++
 include/linux/types.h                 |  10 ++
 lib/rcuref.c                          | 104 +++++++++++++++++++++
 9 files changed, 308 insertions(+), 22 deletions(-)
---
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
change-id: 20240927-brauner-file-rcuref-bfa4a4ba915b
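
For illustration only, here is a rough sketch of the fast-path
difference described above and of the lookup pattern the acquire
question refers to. This is not code from these patches: the helper
names are made up, f_count is shown as a plain atomic_long_t (the
series switches it to rcuref_long_t), and the error paths simply
return instead of retrying the way the real lookup loop does.

#include <linux/atomic.h>
#include <linux/fs.h>
#include <linux/rcupdate.h>

/*
 * Roughly what atomic_long_inc_not_zero() boils down to: a cmpxchg
 * retry loop. Every lost race with another CPU forces a retry, so N
 * CPUs bumping the same counter do O(N^2) work in total.
 */
static inline bool get_ref_cmpxchg(atomic_long_t *cnt)
{
	long old = atomic_long_read(cnt);

	do {
		if (!old)
			return false;
	} while (!atomic_long_try_cmpxchg(cnt, &old, old + 1));

	return true;
}

/*
 * The rcuref-style fast path is a single unconditional add; the sign
 * of the result flags the dead/saturated cases (handled in a slow
 * path in the real code), so there is no retry loop to collapse under
 * contention and overflow is detected for free.
 */
static inline bool get_ref_rcuref(atomic_long_t *cnt)
{
	return likely(!atomic_long_add_negative_relaxed(1, cnt));
}

/*
 * Heavily simplified RCU file lookup. The second load of the file
 * table slot is the one that becomes smp_load_acquire() now that the
 * refcount bump no longer implies a full barrier; whether acquire is
 * the right strength is exactly the open question above.
 */
static struct file *lookup_file_sketch(struct file __rcu **fdentry)
{
	struct file *file = rcu_dereference_raw(*fdentry);

	if (!file)
		return NULL;

	/* relaxed fast path, no implied full memory barrier */
	if (!get_ref_rcuref(&file->f_count))
		return NULL;

	/* re-check that the slot still points to the same file */
	if (file != smp_load_acquire(fdentry)) {
		fput(file);
		return NULL;
	}

	return file;
}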