On 10/7/24 8:23 AM, Christian Brauner wrote:
> As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop, it has
> O(N^2) behaviour under contention with N concurrent operations, and it
> sits in a hot path in __fget_files_rcu().
>
> The rcuref infrastructure remedies this problem by using an
> unconditional increment, relying on safe and dead zones to make this
> work and requiring RCU protection for the data structure in question.
> This not only scales better, it also introduces overflow protection.
>
> However, in contrast to generic rcuref, files require a memory barrier
> and thus cannot rely on *_relaxed() atomic operations. They also need
> to be built on atomic_long_t, as having massive numbers of references
> isn't unheard of, even if it is just an attack.
>
> As suggested by Linus, add a file-specific variant instead of making
> this a generic library.
>
> I've been testing this with will-it-scale using a multi-threaded fstat()
> on the same file descriptor, on a machine that Jens gave me access to
> (thank you very much!):
>
> processor  : 511
> vendor_id  : AuthenticAMD
> cpu family : 25
> model      : 160
> model name : AMD EPYC 9754 128-Core Processor
>
> and I consistently get a 3-5% improvement on workloads with 256 or more
> threads, comparing v6.12-rc1 as base with and without these patches
> applied.

FWIW, I ran this on another box, which is a 2-socket with these CPUs:

AMD EPYC 7763 64-Core Processor

hence 128 cores, 256 threads. I ran my usual max iops test case, which
is 24 threads, each driving a fast drive. If I run without io_uring
direct descriptors, then fget/fput is hit decently hard. In that case, I
see a net reduction of about 1.2% CPU time for the fget/fput parts. So
not as huge a win as mentioned above, but it's also using way fewer
threads and different file descriptors. I'd say that's a pretty
noticeable win!

-- 
Jens Axboe
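
As background for the cmpxchg-loop vs. unconditional-increment
distinction in the quoted cover letter, here is a hedged userspace C
sketch of the two get-reference patterns. It is not the kernel's rcuref
or the file_ref code from these patches; the names (ref_get_cmpxchg,
ref_get_unconditional, REF_DEAD) and the zone layout are invented for
illustration, and C11 atomics stand in for the kernel's atomic_long_t
operations.

	/*
	 * Illustrative sketch only; all names and the zone boundary
	 * are made up, and C11 atomics stand in for kernel atomics.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/*
	 * atomic_inc_not_zero() style: read the counter, then try to
	 * compare-exchange in the incremented value. Every contended
	 * failure forces a re-read and retry, so N concurrent getters
	 * can perform O(N^2) total work on the cache line.
	 */
	static bool ref_get_cmpxchg(_Atomic int64_t *cnt)
	{
		int64_t old = atomic_load_explicit(cnt, memory_order_relaxed);

		do {
			if (old == 0)
				return false;	/* already dead, don't resurrect */
		} while (!atomic_compare_exchange_weak(cnt, &old, old + 1));

		return true;
	}

	/*
	 * rcuref-style idea: increment unconditionally (one atomic op
	 * per getter, no retry loop), then check which zone the counter
	 * landed in. A large dead zone means an increment on an
	 * already-dead ref is detected after the fact rather than
	 * prevented up front; RCU keeps the object's memory valid long
	 * enough for that after-the-fact check to be safe.
	 */
	#define REF_DEAD	(INT64_MIN / 2)	/* illustrative zone boundary */

	static bool ref_get_unconditional(_Atomic int64_t *cnt)
	{
		int64_t new = atomic_fetch_add(cnt, 1) + 1;

		return new > REF_DEAD;	/* landed in the live zone? */
	}

	int main(void)
	{
		_Atomic int64_t ref = 1;	/* one live reference */

		printf("cmpxchg get: %d\n", ref_get_cmpxchg(&ref));
		printf("unconditional get: %d\n", ref_get_unconditional(&ref));
		return 0;
	}

The contended cost is the point of the comparison: the cmpxchg loop can
retry arbitrarily often when many CPUs race on the same counter, while
the fetch-add completes in a single atomic operation per getter, which
is where the reported scalability win comes from.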