On Wed, 31 Jan 2024 at 20:52, Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > > On 1/31/24 6:18 AM, Marco Elver wrote: > > To allow the compiler to inline the bpf_local_storage_lookup() fast- > > path, factor it out by making bpf_local_storage_lookup() a static inline > > function and move the slow-path to bpf_local_storage_lookup_slowpath(). > > > > Base on results from './benchs/run_bench_local_storage.sh' this produces > > improvements in throughput and latency in the majority of cases: > > > > | Hashmap Control > > | =============== > > | num keys: 10 > > | hashmap (control) sequential get: > > | <before> | <after> > > | hits throughput: 13.895 ± 0.024 M ops/s | 14.022 ± 0.095 M ops/s (+0.9%) > > | hits latency: 71.968 ns/op | 71.318 ns/op (-0.9%) > > | important_hits throughput: 13.895 ± 0.024 M ops/s | 14.022 ± 0.095 M ops/s (+0.9%) > > | > > | num keys: 1000 > > | hashmap (control) sequential get: > > | <before> | <after> > > | hits throughput: 11.793 ± 0.018 M ops/s | 11.645 ± 0.370 M ops/s (-1.3%) > > | hits latency: 84.794 ns/op | 85.874 ns/op (+1.3%) > > | important_hits throughput: 11.793 ± 0.018 M ops/s | 11.645 ± 0.370 M ops/s (-1.3%) > > | > > | num keys: 10000 > > | hashmap (control) sequential get: > > | <before> | <after> > > | hits throughput: 7.113 ± 0.012 M ops/s | 7.037 ± 0.051 M ops/s (-1.1%) > > | hits latency: 140.581 ns/op | 142.113 ns/op (+1.1%) > > | important_hits throughput: 7.113 ± 0.012 M ops/s | 7.037 ± 0.051 M ops/s (-1.1%) > > My understanding is the change in this patch should not affect the hashmap > control result, so the above +/- ~1% change could be mostly noise. Yes, I think they are noise. > > | > > | num keys: 100000 > > | hashmap (control) sequential get: > > | <before> | <after> > > | hits throughput: 4.793 ± 0.034 M ops/s | 4.990 ± 0.025 M ops/s (+4.1%) > > | hits latency: 208.623 ns/op | 200.401 ns/op (-3.9%) > > | important_hits throughput: 4.793 ± 0.034 M ops/s | 4.990 ± 0.025 M ops/s (+4.1%) > > | > > | num keys: 4194304 > > | hashmap (control) sequential get: > > | <before> | <after> > > | hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%) > > | hits latency: 478.851 ns/op | 337.648 ns/op (-29.5%) > > | important_hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%) > > The last one has a big difference. Did you run it a couple of times without the > change and check if the result was consistent ? Based on what you say above this might be noise. I will rerun a few times (and also rebased against the latest v6.8-rc). > > | > > | Local Storage > > | ============= > > | num_maps: 1 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 32.598 ± 0.008 M ops/s | 38.480 ± 0.054 M ops/s (+18.0%) > > | hits latency: 30.676 ns/op | 25.988 ns/op (-15.3%) > > | important_hits throughput: 32.598 ± 0.008 M ops/s | 38.480 ± 0.054 M ops/s (+18.0%) > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 36.963 ± 0.045 M ops/s | 43.847 ± 0.037 M ops/s (+18.6%) > > | hits latency: 27.054 ns/op | 22.807 ns/op (-15.7%) > > | important_hits throughput: 36.963 ± 0.045 M ops/s | 43.847 ± 0.037 M ops/s (+18.6%) > > | > > | num_maps: 10 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 32.078 ± 0.004 M ops/s | 37.813 ± 0.020 M ops/s (+17.9%) > > | hits latency: 31.174 ns/op | 26.446 ns/op (-15.2%) > > | important_hits throughput: 3.208 ± 0.000 M ops/s | 3.781 ± 0.002 M ops/s (+17.9%) > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 34.564 ± 0.011 M ops/s | 40.082 ± 0.037 M ops/s (+16.0%) > > | hits latency: 28.932 ns/op | 24.949 ns/op (-13.8%) > > | important_hits throughput: 12.344 ± 0.004 M ops/s | 14.315 ± 0.013 M ops/s (+16.0%) > > | > > | num_maps: 16 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 32.493 ± 0.023 M ops/s | 38.147 ± 0.029 M ops/s (+17.4%) > > | hits latency: 30.776 ns/op | 26.215 ns/op (-14.8%) > > | important_hits throughput: 2.031 ± 0.001 M ops/s | 2.384 ± 0.002 M ops/s (+17.4%) > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 34.380 ± 0.521 M ops/s | 41.605 ± 0.095 M ops/s (+21.0%) > > | hits latency: 29.087 ns/op | 24.035 ns/op (-17.4%) > > | important_hits throughput: 10.939 ± 0.166 M ops/s | 13.238 ± 0.030 M ops/s (+21.0%) > > | > > | num_maps: 17 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 28.748 ± 0.028 M ops/s | 32.248 ± 0.080 M ops/s (+12.2%) > > | hits latency: 34.785 ns/op | 31.009 ns/op (-10.9%) > > | important_hits throughput: 1.693 ± 0.002 M ops/s | 1.899 ± 0.005 M ops/s (+12.2%) > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 31.313 ± 0.030 M ops/s | 35.911 ± 0.020 M ops/s (+14.7%) > > | hits latency: 31.936 ns/op | 27.847 ns/op (-12.8%) > > | important_hits throughput: 9.533 ± 0.009 M ops/s | 10.933 ± 0.006 M ops/s (+14.7%) > > | > > | num_maps: 24 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 18.475 ± 0.027 M ops/s | 19.000 ± 0.006 M ops/s (+2.8%) > > | hits latency: 54.127 ns/op | 52.632 ns/op (-2.8%) > > | important_hits throughput: 0.770 ± 0.001 M ops/s | 0.792 ± 0.000 M ops/s (+2.9%) > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 21.361 ± 0.028 M ops/s | 22.388 ± 0.099 M ops/s (+4.8%) > > | hits latency: 46.814 ns/op | 44.667 ns/op (-4.6%) > > | important_hits throughput: 6.009 ± 0.008 M ops/s | 6.298 ± 0.028 M ops/s (+4.8%) > > | > > | num_maps: 32 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 14.220 ± 0.006 M ops/s | 14.168 ± 0.020 M ops/s (-0.4%) > > | hits latency: 70.323 ns/op | 70.580 ns/op (+0.4%) > > | important_hits throughput: 0.445 ± 0.000 M ops/s | 0.443 ± 0.001 M ops/s (-0.4%) > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 17.250 ± 0.011 M ops/s | 16.650 ± 0.021 M ops/s (-3.5%) > > | hits latency: 57.971 ns/op | 60.061 ns/op (+3.6%) > > | important_hits throughput: 4.815 ± 0.003 M ops/s | 4.647 ± 0.006 M ops/s (-3.5%) > > | > > | num_maps: 100 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 5.212 ± 0.012 M ops/s | 5.878 ± 0.004 M ops/s (+12.8%) > > | hits latency: 191.877 ns/op | 170.116 ns/op (-11.3%) > > | important_hits throughput: 0.052 ± 0.000 M ops/s | 0.059 ± 0.000 M ops/s (+13.5%) > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 6.521 ± 0.053 M ops/s | 7.086 ± 0.010 M ops/s (+8.7%) > > | hits latency: 153.343 ns/op | 141.116 ns/op (-8.0%) > > | important_hits throughput: 1.703 ± 0.014 M ops/s | 1.851 ± 0.003 M ops/s (+8.7%) > > | > > | num_maps: 1000 > > | local_storage cache sequential get: > > | <before> | <after> > > | hits throughput: 0.357 ± 0.005 M ops/s | 0.325 ± 0.005 M ops/s (-9.0%) > > | hits latency: 2803.738 ns/op | 3076.923 ns/op (+9.7%) > > Is it understood why the slow down here? The same goes for the "num_maps: 32" > case above but not as bad as here. num_maps:32 could be noise. > > | important_hits throughput: 0.000 ± 0.000 M ops/s | 0.000 ± 0.000 M ops/s > > The important_hits is very little in this case? It seems to be below 0.000M on the test machine. > > | local_storage cache interleaved get: > > | <before> | <after> > > | hits throughput: 0.434 ± 0.007 M ops/s | 0.447 ± 0.007 M ops/s (+3.0%) > > | hits latency: 2306.539 ns/op | 2237.687 ns/op (-3.0%) > > | important_hits throughput: 0.109 ± 0.002 M ops/s | 0.112 ± 0.002 M ops/s (+2.8%) > > > > Signed-off-by: Marco Elver <elver@xxxxxxxxxx> > > --- > > include/linux/bpf_local_storage.h | 17 ++++++++++++++++- > > kernel/bpf/bpf_local_storage.c | 14 ++++---------- > > .../selftests/bpf/progs/cgrp_ls_recursion.c | 2 +- > > .../selftests/bpf/progs/task_ls_recursion.c | 2 +- > > 4 files changed, 22 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h > > index 173ec7f43ed1..c8cecf7fff87 100644 > > --- a/include/linux/bpf_local_storage.h > > +++ b/include/linux/bpf_local_storage.h > > @@ -130,9 +130,24 @@ bpf_local_storage_map_alloc(union bpf_attr *attr, > > bool bpf_ma); > > > > struct bpf_local_storage_data * > > +bpf_local_storage_lookup_slowpath(struct bpf_local_storage *local_storage, > > + struct bpf_local_storage_map *smap, > > + bool cacheit_lockit); > > +static inline struct bpf_local_storage_data * > > bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > > struct bpf_local_storage_map *smap, > > - bool cacheit_lockit); > > + bool cacheit_lockit) > > +{ > > + struct bpf_local_storage_data *sdata; > > + > > + /* Fast path (cache hit) */ > > + sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], > > + bpf_rcu_lock_held()); > > + if (likely(sdata && rcu_access_pointer(sdata->smap) == smap)) > > + return sdata; > > + > > + return bpf_local_storage_lookup_slowpath(local_storage, smap, cacheit_lockit); > > +} > > > > void bpf_local_storage_destroy(struct bpf_local_storage *local_storage); > > > > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c > > index 146824cc9689..2ef782a1bd6f 100644 > > --- a/kernel/bpf/bpf_local_storage.c > > +++ b/kernel/bpf/bpf_local_storage.c > > @@ -415,20 +415,14 @@ void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now) > > } > > > > /* If cacheit_lockit is false, this lookup function is lockless */ > > -struct bpf_local_storage_data * > > -bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > > - struct bpf_local_storage_map *smap, > > - bool cacheit_lockit) > > +noinline struct bpf_local_storage_data * > > Is noinline needed ? Yes, so that this TU or LTO kernels do not inline the slowpath, which would cause worse codegen in the caller. > > +bpf_local_storage_lookup_slowpath(struct bpf_local_storage *local_storage, > > + struct bpf_local_storage_map *smap, > > + bool cacheit_lockit) > > { > > struct bpf_local_storage_data *sdata; > > struct bpf_local_storage_elem *selem; > > > > - /* Fast path (cache hit) */ > > - sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], > > - bpf_rcu_lock_held()); > > - if (sdata && rcu_access_pointer(sdata->smap) == smap) > > - return sdata; > > - > > /* Slow path (cache miss) */ > > hlist_for_each_entry_rcu(selem, &local_storage->list, snode, > > rcu_read_lock_trace_held()) > > diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > > index a043d8fefdac..9895087a9235 100644 > > --- a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > > +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > > @@ -21,7 +21,7 @@ struct { > > __type(value, long); > > } map_b SEC(".maps"); > > > > -SEC("fentry/bpf_local_storage_lookup") > > +SEC("fentry/bpf_local_storage_lookup_slowpath") > > The selftest is trying to catch recursion. The change here cannot test the same > thing because the slowpath will never be hit in the test_progs. I don't have a > better idea for now also. > > It has a conflict with the bpf-next tree also. Was the patch created against an > internal tree? Base was v6.7. I will do a rebase and rerun benchmarks.