On Fri, May 31, 2019 at 9:52 AM Paul E. McKenney <paulmck@xxxxxxxxxxxxx> wrote:
>
> On Fri, May 31, 2019 at 09:10:16AM -0400, Joel Fernandes wrote:
> > Hi,
> > As per the rationale for percpu-rwsem, the Documentation says:
> >
> > The problem with traditional read-write semaphores is that when multiple
> > cores take the lock for reading, the cache line containing the semaphore
> > is bouncing between L1 caches of the cores, causing performance
> > degradation.
> >
> > However, it appears to me that the struct percpu_rwsem "rss" element
> > which is used by the RCU-sync is not a per-cpu element. So even in the
> > fastpath case (only readers and no writers), the cacheline containing
> > rss is shared and will bounce between multiple CPUs. For that matter,
> > even the cacheline containing the percpu_rw_semaphore itself will
> > bounce among multiple reader CPUs.
> >
> > So how does percpu-rwsem eliminate cache line bouncing in the common
> > case? Could you let me know what I am missing?
> >
> > Thanks a lot.
>
> The accesses are loads, except for the __this_cpu_inc(), which updates
> a per-CPU variable. The locations loaded will replicate across the
> CPUs' caches and the per-CPU variables are private to each CPU. Hence
> no cacheline bouncing.

Makes sense, thanks for the answer!

> Either way, it would be good for you to just try it. Create a kernel
> module or similar that hammers on percpu_down_read() and percpu_up_read(),
> and empirically check the scalability on a largish system. Then compare
> this to down_read() and up_read().

Will do! thanks.
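Concretely, the reader fastpath Paul describes looks roughly like the sketch
below. This is a simplified paraphrase of percpu_down_read() from
include/linux/percpu-rwsem.h; lockdep annotations, barriers, and
version-specific details are elided, so treat the exact shape as
approximate:

/* Simplified sketch of the reader fastpath (approximate, details elided). */
static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
{
	preempt_disable();
	/* Private per-CPU increment: no cross-CPU cacheline traffic. */
	__this_cpu_inc(*sem->read_count);
	/*
	 * A plain load of the rcu_sync ("rss") state.  With no writer
	 * active this word is never stored to, so its cacheline
	 * replicates read-only into every CPU's cache and never bounces.
	 */
	if (unlikely(!rcu_sync_is_idle(&sem->rss)))
		__percpu_down_read(sem, false);	/* writer active: slowpath */
	preempt_enable();
}

So the "rss" cacheline is indeed shared, but sharing only causes bouncing
when some CPU writes the line; read-only sharing is exactly what caches
handle well.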
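And for the experiment, a minimal sketch of such a test module, assuming a
stock kernel tree (the module name "pcrwsem_test", the "use_percpu"
parameter, and the reporting format are illustrative, not an existing kernel
selftest; error handling and CPU hotplug are elided). It pins one kthread
per CPU, each hammering the read-side path, and counts acquisitions:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/percpu.h>
#include <linux/percpu-rwsem.h>
#include <linux/rwsem.h>

DEFINE_STATIC_PERCPU_RWSEM(test_pcpu_rwsem);
static DECLARE_RWSEM(test_rwsem);

static bool use_percpu = true;	/* false: benchmark plain rwsem instead */
module_param(use_percpu, bool, 0444);

static struct task_struct **threads;
static DEFINE_PER_CPU(unsigned long, loops);

/* One of these runs pinned to each CPU, hammering the read-side path. */
static int reader_fn(void *unused)
{
	while (!kthread_should_stop()) {
		if (use_percpu) {
			percpu_down_read(&test_pcpu_rwsem);
			percpu_up_read(&test_pcpu_rwsem);
		} else {
			down_read(&test_rwsem);
			up_read(&test_rwsem);
		}
		this_cpu_inc(loops);
		cond_resched();
	}
	return 0;
}

static int __init pcrwsem_test_init(void)
{
	int cpu;

	threads = kcalloc(nr_cpu_ids, sizeof(*threads), GFP_KERNEL);
	if (!threads)
		return -ENOMEM;
	for_each_online_cpu(cpu) {
		struct task_struct *t;

		t = kthread_create(reader_fn, NULL, "pcrwsem/%d", cpu);
		if (IS_ERR(t))
			continue;
		kthread_bind(t, cpu);	/* pin before first wakeup */
		threads[cpu] = t;
		wake_up_process(t);
	}
	return 0;
}

static void __exit pcrwsem_test_exit(void)
{
	unsigned long total = 0;
	int cpu;

	for_each_online_cpu(cpu) {
		if (threads[cpu])
			kthread_stop(threads[cpu]);
		total += per_cpu(loops, cpu);
	}
	kfree(threads);
	pr_info("pcrwsem_test: use_percpu=%d, %lu total read acquisitions\n",
		use_percpu, total);
}

module_init(pcrwsem_test_init);
module_exit(pcrwsem_test_exit);
MODULE_LICENSE("GPL");

Load it for a fixed interval on a many-CPU box, rmmod, and compare the
reported totals for use_percpu=1 versus use_percpu=0. If the explanation
above holds, the percpu variant's throughput should scale roughly linearly
with CPU count, while the plain rwsem's should collapse as the shared
counter cacheline bounces between cores.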