On Wed, Apr 3, 2024 at 4:05 AM Jonathan Haslam <jonathan.haslam@xxxxxxxxx> wrote:
>
> > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > suggested by Ingo.
> > > >
> > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > I would like to wait for the next version.
> > >
> > > My initial tests show a nice improvement over the RW spinlocks but a
> > > significant regression in acquiring a write lock. I've got a few days of
> > > vacation over Easter, but I'll aim to get some more formalised results
> > > out to the thread toward the end of next week.
> >
> > As long as the write lock is only on the cold path, I think you can
> > choose the per-cpu RW semaphore. Since it does not busy-wait, the total
> > system performance impact will be small.
> > I look forward to your formalized results :)
>
> Sorry for the delay in getting back to you on this, Masami.
>
> I have used one of the BPF selftest benchmarks to provide some form of
> comparison of the three different approaches (spinlock, RW spinlock and
> per-CPU RW semaphore). The benchmark used here is 'trig-uprobe-nop',
> which just executes a single uprobe with a minimal BPF program attached.
> The tests were done on a 32-core qemu/kvm instance.

Thanks a lot for running benchmarks and providing results!

> Things to note about the results:
>
> - The results are slightly variable, so don't get too caught up on
>   individual thread counts - it's the trend that is important.
> - In terms of throughput with this specific benchmark, a *very* macro
>   view is that the RW spinlock provides 40-60% more throughput than the
>   spinlock.
> - The per-CPU RW semaphore provides on the order of 50-100% more
>   throughput than the spinlock.
> - This doesn't fully reflect the large reduction in latency that we have
>   seen in application-based measurements. However, it does demonstrate
>   that even the trivial change of going to a RW spinlock provides
>   significant benefits.

This is probably because trig-uprobe-nop creates a single uprobe that is
triggered on many CPUs, while in production we also have *many* uprobes
running on many CPUs. In this benchmark, besides contention on
uprobes_treelock, we are also hammering on other per-uprobe locks
(register_rwsem; also, if you don't have the [0] patch locally, another
filter lock, filter->rwlock, is taken each time). There is also atomic
refcounting going on, which, when the same uprobe is hit on all CPUs at
the same time, will cause a bunch of cache line bouncing.

So yes, it's understandable that in practice in production you see an even
larger effect from optimizing uprobes_treelock than in this
micro-benchmark.

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

> I haven't included the measurements of per-CPU RW semaphore write
> performance as they are completely in line with those that Paul McKenney
> posted on his journal [0]. On a 32-core system I see semaphore writes
> take on the order of 25-28 milliseconds - the cost of the
> synchronize_rcu().
>
> Each block of results below shows one line per execution of the
> benchmark (the "Summary" line), and each line is a run with one more
> thread added - a thread is a "producer". The lines are edited to remove
> extraneous output that adds no value here.
>
> The tests were executed with this driver script:
>
> for num_threads in {1..20}
> do
>   sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary

Just want to mention the -a (affinity) option that you can pass to the
bench tool; it will pin each thread on its own CPU.
It generally makes tests more uniform, eliminating CPU migration
variability.

> done
>
>
> spinlock
>
> Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
> Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
> Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)

I also want to point out that the first measurement (e.g. 2.701M/s in the
last row above) is the total throughput across all threads, while the
value in parentheses (0.900M/prod) is the average throughput per thread.
So this M/prod value is the most interesting one in this benchmark, where
we assess the effect of reducing contention.

> Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
> Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
> Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)

[...]
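
To make the relationship between the two Summary numbers concrete, the
M/prod value is just the total throughput divided by the number of
producer threads in that run. A small illustrative calculation (not part
of the bench tool; figures are the first three spinlock rows quoted
above):

```python
# Per-producer throughput = total hits/s divided by producer count.
# (producers, total M/s) pairs taken from the spinlock results above.
runs = [(1, 1.453), (2, 2.087), (3, 2.701)]

for producers, total in runs:
    print(f"{producers} producers: {total / producers:.3f} M/prod")
```

The printed per-producer values (1.453, 1.043, 0.900) match the
parenthesised M/prod column in the results.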
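
For background on the read/write asymmetry discussed above: the kernel's
per-CPU RW semaphore (percpu_down_read()/percpu_down_write()) makes the
read path cheap by keeping reader state per CPU, while a writer must
synchronize with every CPU, which is where the synchronize_rcu() cost
comes in. A rough userspace sketch of that asymmetry using a "big-reader"
array of per-CPU locks (illustrative only - this is not the kernel
implementation, and all names here are hypothetical):

```python
import threading

NCPU = 4

# One lock per "CPU": readers touch only their own lock, so readers on
# different CPUs never contend; a writer must sweep all of them.
cpu_locks = [threading.Lock() for _ in range(NCPU)]

def read_lock(cpu):
    cpu_locks[cpu].acquire()      # cheap: one local lock

def read_unlock(cpu):
    cpu_locks[cpu].release()

def write_lock():
    for lk in cpu_locks:          # expensive: O(NCPU) acquisitions
        lk.acquire()

def write_unlock():
    for lk in reversed(cpu_locks):
        lk.release()

counter = 0

read_lock(0)                      # read side: touches only CPU 0's lock
snapshot = counter
read_unlock(0)

write_lock()                      # write side: sweeps every per-CPU lock
counter += 1
write_unlock()

print(snapshot, counter)          # prints: 0 1
```

This matches the benchmark picture: the hot (read) path scales across
CPUs, while the cold (write/registration) path pays a per-CPU sweep - in
the real kernel implementation, an RCU grace period on the order of tens
of milliseconds.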