On Tue, Sep 3, 2024 at 3:08 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Sep 3, 2024 at 9:32 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > Hello!
> >
> > This series provides light-weight readers for SRCU.  This lightness
> > is selected by the caller by using the new srcu_read_lock_lite() and
> > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > srcu_read_unlock() flavors.  Although this passes significant rcutorture
> > testing, this should still be considered to be experimental.
> >
> > There are a few restrictions:  (1) If srcu_read_lock_lite() is called
> > on a given srcu_struct structure, then no other flavor may be used on
> > that srcu_struct structure, before, during, or after.  (2) The _lite()
> > readers may only be invoked from regions of code where RCU is watching
> > (as in those regions in which rcu_is_watching() returns true).  (3)
> > There is no auto-expediting for srcu_struct structures that have
> > been passed to _lite() readers.  (4) SRCU grace periods for _lite()
> > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > having longer latencies than their non-_lite() counterparts.  (5) Even
> > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > the IPI-happy synchronize_rcu_expedited() function.  (6) Just as with
> > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > from NMI handlers (that is what the _nmisafe() interfaces are for).
> > Although one could imagine readers that were both _lite() and _nmisafe(),
> > one might also imagine that the read-modify-write atomic operations that
> > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > from a performance perspective.
> >
> > All that said, the patches in this series are as follows:
> >
> > 1.      Rename srcu_might_be_idle() to srcu_should_expedite().
> >
> > 2.      Introduce srcu_gp_is_expedited() helper function.
> >
> > 3.      Renaming in preparation for additional reader flavor.
> >
> > 4.      Bit manipulation changes for additional reader flavor.
> >
> > 5.      Standardize srcu_data pointers to "sdp" and similar.
> >
> > 6.      Convert srcu_data ->srcu_reader_flavor to bit field.
> >
> > 7.      Add srcu_read_lock_lite() and srcu_read_unlock_lite().
> >
> > 8.      rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits.
> >
> > 9.      rcutorture: Add reader_flavor parameter for SRCU readers.
> >
> > 10.     rcutorture: Add srcu_read_lock_lite() support to
> >         rcutorture.reader_flavor.
> >
> > 11.     refscale: Add srcu_read_lock_lite() support using "srcu-lite".
> >
> >                                                 Thanx, Paul
> >
>
> Thanks Paul for working on this!
>
> I applied your patches on top of all my uprobe changes (including the
> RFC patches that remove locks, optimize VMA to inode resolution, etc,
> etc; basically the fastest uprobe/uretprobe state I can get to). And
> then tested a few changes:
>
>   - A) baseline (no SRCU-lite, RCU Tasks Trace for uprobe, normal SRCU
> for uretprobes)
>   - B) A + SRCU-lite for uretprobes (i.e., SRCU to SRCU-lite conversion)
>   - C) B + RCU Tasks Trace converted to SRCU-lite
>   - D) I also pessimized baseline by reverting RCU Tasks Trace, so
> both uprobes and uretprobes are SRCU protected. This allowed me to see
> a pure gain of SRCU-lite over SRCU for uprobes, taking RCU Tasks Trace
> performance out of the equation.
>
> In uprobes I used basically two benchmarks. One, uprobe-nop, that
> benchmarks entry uprobes (which are the fastest, most optimized case,
> using RCU Tasks Trace in A and SRCU in D), and another that benchmarks
> return uprobes (uretprobes), called uretprobe-nop, which is normal
> SRCU both in A) and D). The latter uretprobe-nop benchmark basically
> combines entry and return probe overheads, because that's how
> uretprobes work.
>

Ok, so I created B' and C' cases, which are just like B and C from
before, but each now uses inlined versions of SRCU-lite API. I also
re-ran the latest BASELINE, which I'll call A', just to make sure all
the results are compatible and based off of the same tip/perf/core
branch state (uretprobe performance significantly improved for >64
CPUs, I don't know exactly why, tbh).

I'll augment benchmark results below inline for easier comparison.

> So, below are the most meaningful comparisons. First, SRCU vs
> SRCU-lite for uretprobes:
>
> BASELINE (A)
> ============
> uretprobe-nop ( 1 cpus):  1.941 ± 0.002M/s ( 1.941M/s/cpu)
> uretprobe-nop ( 2 cpus):  3.731 ± 0.001M/s ( 1.866M/s/cpu)
> uretprobe-nop ( 3 cpus):  5.492 ± 0.002M/s ( 1.831M/s/cpu)
> uretprobe-nop ( 4 cpus):  7.234 ± 0.003M/s ( 1.808M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.448 ± 0.098M/s ( 1.681M/s/cpu)
> uretprobe-nop (16 cpus): 22.905 ± 0.009M/s ( 1.432M/s/cpu)
> uretprobe-nop (32 cpus): 44.760 ± 0.069M/s ( 1.399M/s/cpu)
> uretprobe-nop (40 cpus): 52.986 ± 0.104M/s ( 1.325M/s/cpu)
> uretprobe-nop (64 cpus): 43.650 ± 0.435M/s ( 0.682M/s/cpu)
> uretprobe-nop (80 cpus): 46.831 ± 0.938M/s ( 0.585M/s/cpu)
>
> SRCU-lite for uretprobe (B)
> ===========================
> uretprobe-nop ( 1 cpus):  2.014 ± 0.014M/s ( 2.014M/s/cpu)
> uretprobe-nop ( 2 cpus):  3.820 ± 0.002M/s ( 1.910M/s/cpu)
> uretprobe-nop ( 3 cpus):  5.640 ± 0.003M/s ( 1.880M/s/cpu)
> uretprobe-nop ( 4 cpus):  7.410 ± 0.003M/s ( 1.852M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.877 ± 0.009M/s ( 1.735M/s/cpu)
> uretprobe-nop (16 cpus): 23.372 ± 0.022M/s ( 1.461M/s/cpu)
> uretprobe-nop (32 cpus): 45.748 ± 0.048M/s ( 1.430M/s/cpu)
> uretprobe-nop (40 cpus): 54.327 ± 0.093M/s ( 1.358M/s/cpu)
> uretprobe-nop (64 cpus): 43.672 ± 0.371M/s ( 0.682M/s/cpu)
> uretprobe-nop (80 cpus): 47.470 ± 0.753M/s ( 0.593M/s/cpu)
>

NEW BASELINE (A')
=================
uretprobe-nop ( 1 cpus):  1.946 ± 0.001M/s ( 1.946M/s/cpu)
uretprobe-nop ( 2 cpus):  3.660 ± 0.002M/s ( 1.830M/s/cpu)
uretprobe-nop ( 3 cpus):  5.522 ± 0.002M/s ( 1.841M/s/cpu)
uretprobe-nop ( 4 cpus):  7.145 ± 0.001M/s ( 1.786M/s/cpu)
uretprobe-nop ( 8 cpus): 13.449 ± 0.004M/s ( 1.681M/s/cpu)
uretprobe-nop (16 cpus): 22.374 ± 0.008M/s ( 1.398M/s/cpu)
uretprobe-nop (32 cpus): 45.039 ± 0.011M/s ( 1.407M/s/cpu)
uretprobe-nop (40 cpus): 42.422 ± 0.073M/s ( 1.061M/s/cpu)
uretprobe-nop (64 cpus): 65.136 ± 0.084M/s ( 1.018M/s/cpu)
uretprobe-nop (80 cpus): 76.004 ± 0.066M/s ( 0.950M/s/cpu)

SRCU-lite for uretprobe (B')
============================
uretprobe-nop ( 1 cpus):  1.973 ± 0.001M/s ( 1.973M/s/cpu)
uretprobe-nop ( 2 cpus):  3.756 ± 0.002M/s ( 1.878M/s/cpu)
uretprobe-nop ( 3 cpus):  5.623 ± 0.003M/s ( 1.874M/s/cpu)
uretprobe-nop ( 4 cpus):  7.206 ± 0.029M/s ( 1.802M/s/cpu)
uretprobe-nop ( 8 cpus): 13.668 ± 0.004M/s ( 1.708M/s/cpu)
uretprobe-nop (16 cpus): 23.067 ± 0.016M/s ( 1.442M/s/cpu)
uretprobe-nop (32 cpus): 45.757 ± 0.030M/s ( 1.430M/s/cpu)
uretprobe-nop (40 cpus): 54.550 ± 0.035M/s ( 1.364M/s/cpu)
uretprobe-nop (64 cpus): 67.124 ± 0.057M/s ( 1.049M/s/cpu)
uretprobe-nop (80 cpus): 77.150 ± 0.158M/s ( 0.964M/s/cpu)

Inlining does help a bit, adding +200-300K/s in some cases.

> You can see that across the board (except for noisy 64 CPU case)
> SRCU-lite is faster.
>
>
> Now, comparing A) vs C) on uprobe-nop, so we can see RCU Tasks Trace
> vs SRCU-lite for uprobes.
>
> BASELINE (A)
> ============
> uprobe-nop ( 1 cpus):   3.574 ± 0.004M/s ( 3.574M/s/cpu)
> uprobe-nop ( 2 cpus):   6.735 ± 0.006M/s ( 3.368M/s/cpu)
> uprobe-nop ( 3 cpus):  10.102 ± 0.005M/s ( 3.367M/s/cpu)
> uprobe-nop ( 4 cpus):  13.087 ± 0.008M/s ( 3.272M/s/cpu)
> uprobe-nop ( 8 cpus):  24.622 ± 0.031M/s ( 3.078M/s/cpu)
> uprobe-nop (16 cpus):  41.752 ± 0.020M/s ( 2.610M/s/cpu)
> uprobe-nop (32 cpus):  84.973 ± 0.115M/s ( 2.655M/s/cpu)
> uprobe-nop (40 cpus): 102.229 ± 0.030M/s ( 2.556M/s/cpu)
> uprobe-nop (64 cpus): 125.537 ± 0.045M/s ( 1.962M/s/cpu)
> uprobe-nop (80 cpus): 143.091 ± 0.044M/s ( 1.789M/s/cpu)
>
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop ( 1 cpus):   3.446 ± 0.010M/s ( 3.446M/s/cpu)
> uprobe-nop ( 2 cpus):   6.411 ± 0.003M/s ( 3.206M/s/cpu)
> uprobe-nop ( 3 cpus):   9.563 ± 0.039M/s ( 3.188M/s/cpu)
> uprobe-nop ( 4 cpus):  12.454 ± 0.016M/s ( 3.113M/s/cpu)
> uprobe-nop ( 8 cpus):  23.172 ± 0.013M/s ( 2.897M/s/cpu)
> uprobe-nop (16 cpus):  39.793 ± 0.005M/s ( 2.487M/s/cpu)
> uprobe-nop (32 cpus):  79.616 ± 0.207M/s ( 2.488M/s/cpu)
> uprobe-nop (40 cpus):  96.851 ± 0.128M/s ( 2.421M/s/cpu)
> uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
>

NEW BASELINE (A')
=================
uprobe-nop ( 1 cpus):   3.480 ± 0.036M/s ( 3.480M/s/cpu)
uprobe-nop ( 2 cpus):   6.652 ± 0.026M/s ( 3.326M/s/cpu)
uprobe-nop ( 3 cpus):  10.050 ± 0.011M/s ( 3.350M/s/cpu)
uprobe-nop ( 4 cpus):  13.079 ± 0.008M/s ( 3.270M/s/cpu)
uprobe-nop ( 8 cpus):  24.620 ± 0.004M/s ( 3.077M/s/cpu)
uprobe-nop (16 cpus):  41.566 ± 0.030M/s ( 2.598M/s/cpu)
uprobe-nop (32 cpus):  77.314 ± 1.620M/s ( 2.416M/s/cpu)
uprobe-nop (40 cpus): 102.667 ± 0.047M/s ( 2.567M/s/cpu)
uprobe-nop (64 cpus): 126.298 ± 0.026M/s ( 1.973M/s/cpu)
uprobe-nop (80 cpus): 146.682 ± 0.035M/s ( 1.834M/s/cpu)

SRCU-lite for uprobes w/ inlining (C')
======================================
uprobe-nop ( 1 cpus):   3.444 ± 0.014M/s ( 3.444M/s/cpu)
uprobe-nop ( 2 cpus):   6.400 ± 0.021M/s ( 3.200M/s/cpu)
uprobe-nop ( 3 cpus):   9.568 ± 0.025M/s ( 3.189M/s/cpu)
uprobe-nop ( 4 cpus):  12.473 ± 0.020M/s ( 3.118M/s/cpu)
uprobe-nop ( 8 cpus):  23.552 ± 0.007M/s ( 2.944M/s/cpu)
uprobe-nop (16 cpus):  39.844 ± 0.016M/s ( 2.490M/s/cpu)
uprobe-nop (32 cpus):  78.667 ± 0.201M/s ( 2.458M/s/cpu)
uprobe-nop (40 cpus):  97.477 ± 0.094M/s ( 2.437M/s/cpu)
uprobe-nop (64 cpus): 119.472 ± 0.120M/s ( 1.867M/s/cpu)
uprobe-nop (80 cpus): 139.825 ± 0.042M/s ( 1.748M/s/cpu)

>
> Overall, RCU Tasks Trace beats SRCU-lite, which I think is expected,
> so consider this just a confirmation. I'm not sure I'd like to switch
> from RCU Tasks Trace to SRCU-lite for the uprobes part, but at least we
> have numbers to make that decision.
>
> Finally, to see SRCU vs SRCU-lite for entry uprobes improvements
> (i.e., if we never had RCU Tasks Trace). I've included a somewhat more
> extensive set of CPU counts for completeness.
>
> BASELINE w/ SRCU for uprobes (D)
> ================================
> uprobe-nop ( 1 cpus):   3.413 ± 0.003M/s ( 3.413M/s/cpu)
> uprobe-nop ( 2 cpus):   6.305 ± 0.003M/s ( 3.153M/s/cpu)
> uprobe-nop ( 3 cpus):   9.442 ± 0.018M/s ( 3.147M/s/cpu)
> uprobe-nop ( 4 cpus):  12.253 ± 0.006M/s ( 3.063M/s/cpu)
> uprobe-nop ( 5 cpus):  15.316 ± 0.007M/s ( 3.063M/s/cpu)
> uprobe-nop ( 6 cpus):  18.287 ± 0.030M/s ( 3.048M/s/cpu)
> uprobe-nop ( 7 cpus):  21.378 ± 0.025M/s ( 3.054M/s/cpu)
> uprobe-nop ( 8 cpus):  23.044 ± 0.010M/s ( 2.881M/s/cpu)
> uprobe-nop (10 cpus):  28.778 ± 0.012M/s ( 2.878M/s/cpu)
> uprobe-nop (12 cpus):  31.300 ± 0.016M/s ( 2.608M/s/cpu)
> uprobe-nop (14 cpus):  36.580 ± 0.007M/s ( 2.613M/s/cpu)
> uprobe-nop (16 cpus):  38.848 ± 0.017M/s ( 2.428M/s/cpu)
> uprobe-nop (24 cpus):  60.298 ± 0.080M/s ( 2.512M/s/cpu)
> uprobe-nop (32 cpus):  77.137 ± 1.957M/s ( 2.411M/s/cpu)
> uprobe-nop (40 cpus):  89.205 ± 1.278M/s ( 2.230M/s/cpu)
> uprobe-nop (48 cpus):  99.207 ± 0.444M/s ( 2.067M/s/cpu)
> uprobe-nop (56 cpus): 102.399 ± 0.484M/s ( 1.829M/s/cpu)
> uprobe-nop (64 cpus): 115.390 ± 0.972M/s ( 1.803M/s/cpu)
> uprobe-nop (72 cpus): 127.476 ± 0.050M/s ( 1.770M/s/cpu)
> uprobe-nop (80 cpus): 137.304 ± 0.068M/s ( 1.716M/s/cpu)
>
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop ( 1 cpus):   3.446 ± 0.010M/s ( 3.446M/s/cpu)
> uprobe-nop ( 2 cpus):   6.411 ± 0.003M/s ( 3.206M/s/cpu)
> uprobe-nop ( 3 cpus):   9.563 ± 0.039M/s ( 3.188M/s/cpu)
> uprobe-nop ( 4 cpus):  12.454 ± 0.016M/s ( 3.113M/s/cpu)
> uprobe-nop ( 5 cpus):  15.634 ± 0.008M/s ( 3.127M/s/cpu)
> uprobe-nop ( 6 cpus):  18.443 ± 0.018M/s ( 3.074M/s/cpu)
> uprobe-nop ( 7 cpus):  21.793 ± 0.057M/s ( 3.113M/s/cpu)
> uprobe-nop ( 8 cpus):  23.172 ± 0.013M/s ( 2.897M/s/cpu)
> uprobe-nop (10 cpus):  29.430 ± 0.021M/s ( 2.943M/s/cpu)
> uprobe-nop (12 cpus):  32.035 ± 0.008M/s ( 2.670M/s/cpu)
> uprobe-nop (14 cpus):  37.174 ± 0.046M/s ( 2.655M/s/cpu)
> uprobe-nop (16 cpus):  39.793 ± 0.005M/s ( 2.487M/s/cpu)
> uprobe-nop (24 cpus):  61.656 ± 0.187M/s ( 2.569M/s/cpu)
> uprobe-nop (32 cpus):  79.616 ± 0.207M/s ( 2.488M/s/cpu)
> uprobe-nop (40 cpus):  96.851 ± 0.128M/s ( 2.421M/s/cpu)
> uprobe-nop (48 cpus): 104.178 ± 0.033M/s ( 2.170M/s/cpu)
> uprobe-nop (56 cpus): 105.689 ± 0.703M/s ( 1.887M/s/cpu)
> uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> uprobe-nop (72 cpus): 127.574 ± 0.033M/s ( 1.772M/s/cpu)
> uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
>
> So, say, at 32 threads, we get 79.6 vs 77.1, which is about a 3%
> throughput win, which is not negligible!
>
> Note that as we get to 80 cores the data is noisier (hyperthreading,
> background system noise, etc). But you can still see an improvement
> across basically the entire range.
>
> Hopefully the above data is useful.
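
For anyone who wants to re-derive the percentages from the tables, here
is a minimal sketch in Python. The relative_gain() helper and the
dictionary names are made up for illustration and are not part of the
benchmark tooling; the values are a few uprobe-nop rows copied from the
(D) and (C) tables above.

```python
# Throughput in M/s for uprobe-nop, keyed by CPU count.
# Values are copied from the (D) and (C) tables above; only a few
# rows are included to keep the sketch short.
srcu = {1: 3.413, 8: 23.044, 32: 77.137, 40: 89.205, 80: 137.304}
srcu_lite = {1: 3.446, 8: 23.172, 32: 79.616, 40: 96.851, 80: 135.162}


def relative_gain(new, old):
    """Return the throughput change of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


for cpus in sorted(srcu):
    gain = relative_gain(srcu_lite[cpus], srcu[cpus])
    print(f"{cpus:2d} cpus: {gain:+.1f}%")
```

At 32 CPUs this works out to roughly +3.2%, matching the "about 3%"
quoted above. With these rows the sign flips at 80 CPUs, which is
consistent with the remark about noise at the high end of the range.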
>
> > ------------------------------------------------------------------------
> >
> >  Documentation/admin-guide/kernel-parameters.txt   |    4
> >  b/Documentation/admin-guide/kernel-parameters.txt |    8 +
> >  b/include/linux/srcu.h                            |   21 +-
> >  b/include/linux/srcutree.h                        |    2
> >  b/kernel/rcu/rcutorture.c                         |   28 +--
> >  b/kernel/rcu/refscale.c                           |   54 +++++--
> >  b/kernel/rcu/srcutree.c                           |   16 +-
> >  include/linux/srcu.h                              |   86 +++++++++--
> >  include/linux/srcutree.h                          |    5
> >  kernel/rcu/rcutorture.c                           |   37 +++-
> >  kernel/rcu/srcutree.c                             |  168 +++++++++++++++-------
> >  11 files changed, 308 insertions(+), 121 deletions(-)