> > Things to note about the results:
> >
> > - The results are slightly variable, so don't get too caught up on
> >   individual thread counts - it's the trend that is important.
> > - In terms of throughput with this specific benchmark, a *very* macro
> >   view is that the RW spinlock provides 40-60% more throughput than
> >   the spinlock, and the per-CPU RW semaphore provides on the order of
> >   50-100% more throughput than the spinlock.
> > - This doesn't fully reflect the large reduction in latency that we
> >   have seen in application-based measurements. However, it does
> >   demonstrate that even the trivial change of going to an RW spinlock
> >   provides significant benefits.
>
> This is probably because trig-uprobe-nop creates a single uprobe that
> is triggered on many CPUs, while in production we also have *many*
> uprobes running on many CPUs. In this benchmark, besides contention on
> uprobes_treelock, we are also hammering on other per-uprobe locks
> (register_rwsem, and, if you don't have the [0] patch applied locally,
> another filter lock taken on each hit, filter->rwlock). There is also
> atomic refcounting going on, which, when every CPU hits the same
> uprobe at the same time, causes a bunch of cache line bouncing.
>
> So yes, it's understandable that in practice, in production, you see
> an even larger effect from optimizing uprobes_treelock than in this
> micro-benchmark.
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

Thanks for the reply and the thoughts on this, Andrii. Yes, I do have
the filter->rwlock fix applied but, as you say, there are no doubt
other effects at play here, as is to be expected in such a synthetic
workload. I'm pleased with the outcomes though, as they show a good
result even if it is at the lower end of what I expected. The results
also show that pursuing an RCU solution is definitely worth it, but
that write penalty is brutal in the case of a full synchronize_rcu()!
Should be fun. (Rough sketches of both the rwlock read path and the
RCU write-side cost are at the end of this mail.)

> > for num_threads in {1..20}
> > do
> >     sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
>
> Just want to mention the -a (affinity) option that you can pass to
> the bench tool; it will pin each thread on its own CPU. It generally
> makes tests more uniform, eliminating CPU-migration variability.

Thanks for pointing that flag out! (Adjusted invocation below as well.)

Jon.

> > done
> >
> >
> > spinlock
> >
> > Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
> > Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
> > Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)
>
> I also wanted to point out that the first measurement (1.453M/s in
> the first row) is the total throughput across all threads, while the
> value in parentheses (0.900M/prod in the three-thread row) is the
> average throughput per thread, i.e. the total divided by the thread
> count. So this M/prod value is the most interesting one in this
> benchmark, where we assess the effect of reducing contention.
>
> > Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
> > Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
> > Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)
>
> [...]
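
P.S. A few concrete bits for the archives. First, the benchmark loop
with the affinity flag folded in, which is what I'll use for the next
run (assuming -a takes no argument - I'm going from Andrii's
description rather than from bench's usage text):

for num_threads in {1..20}
do
    sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary
done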
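
Second, for anyone reading along, this is roughly the shape of the
"trivial change" measured above. A sketch against
kernel/events/uprobes.c rather than the actual diff: __find_uprobe()
does the real rb-tree walk, and the refcounting on the returned uprobe
is elided here.

/*
 * Sketch only, not the actual patch: the hot lookup path merely reads
 * the rb-tree, so with an rwlock_t concurrent lookups stop serialising
 * against each other and only serialise against register/unregister,
 * which take write_lock(&uprobes_treelock).
 */
static DEFINE_RWLOCK(uprobes_treelock);        /* was DEFINE_SPINLOCK() */

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
        struct uprobe *uprobe;

        read_lock(&uprobes_treelock);          /* was spin_lock() */
        uprobe = __find_uprobe(inode, offset); /* rb-tree walk */
        read_unlock(&uprobes_treelock);        /* was spin_unlock() */

        return uprobe;
}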
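
And this is where I expect the RCU write penalty to show up: reads go
lockless, but every delete then sits out a full grace period before the
memory can be reused. Again only a sketch, hand-waving over the fact
that plain rb-tree rotations are not RCU-safe as-is (so the real thing
needs something like a seqcount-with-retry or a different structure)
and over the uprobe refcounting:

static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
{
        struct uprobe *uprobe;

        rcu_read_lock();        /* no treelock at all on the read side */
        uprobe = __find_uprobe(inode, offset);
        rcu_read_unlock();

        return uprobe;
}

static void delete_uprobe(struct uprobe *uprobe)
{
        /* uprobes_tree is the global rb_root from uprobes.c */
        write_lock(&uprobes_treelock);
        rb_erase(&uprobe->rb_node, &uprobes_tree);
        write_unlock(&uprobes_treelock);

        synchronize_rcu();      /* the brutal part: one full grace */
        kfree(uprobe);          /* period per delete before freeing */
}

(Something like kfree_rcu() or SRCU would trade that blocking cost for
deferred freeing, which is part of what should make this fun.)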