On Thu, Feb 08, 2024 at 08:37AM +0100, Marco Elver wrote: > On Thu, 8 Feb 2024 at 00:58, Yonghong Song <yonghong.song@xxxxxxxxx> wrote: > > On 2/7/24 4:26 AM, Marco Elver wrote: > > > In various performance profiles of kernels with BPF programs attached, > > > bpf_local_storage_lookup() appears as a significant portion of CPU > > > cycles spent. To enable the compiler generate more optimal code, turn > > > bpf_local_storage_lookup() into a static inline function, where only the > > > cache insertion code path is outlined > > > > > > Notably, outlining cache insertion helps avoid bloating callers by > > > duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on > > > architectures which do not inline spin_lock/unlock, such as x86), which > > > would cause the compiler produce worse code by deciding to outline > > > otherwise inlinable functions. The call overhead is neutral, because we > > > make 2 calls either way: either calling raw_spin_lock_irqsave() and > > > raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(), > > > which calls raw_spin_lock_irqsave(), followed by a tail-call to > > > raw_spin_unlock_irqsave() where the compiler can perform TCO and (in > > > optimized uninstrumented builds) turns it into a plain jump. The call to > > > __bpf_local_storage_insert_cache() can be elided entirely if > > > cacheit_lockit is a false constant expression. > > > > > > Based on results from './benchs/run_bench_local_storage.sh' (21 trials, > > > reboot between each trial; x86 defconfig + BPF, clang 16) this produces > > > improvements in throughput and latency in the majority of cases, with an > > > average (geomean) improvement of 8%: > [...] > > > include/linux/bpf_local_storage.h | 30 ++++++++++- > > > kernel/bpf/bpf_local_storage.c | 52 +++++-------------- > > > .../bpf/prog_tests/task_local_storage.c | 6 --- > > > .../selftests/bpf/progs/cgrp_ls_recursion.c | 26 ---------- > > > .../selftests/bpf/progs/task_ls_recursion.c | 17 ------ > > > 5 files changed, 41 insertions(+), 90 deletions(-) > > > > > > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h > > > index 173ec7f43ed1..dcddb0aef7d8 100644 > > > --- a/include/linux/bpf_local_storage.h > > > +++ b/include/linux/bpf_local_storage.h > > > @@ -129,10 +129,36 @@ bpf_local_storage_map_alloc(union bpf_attr *attr, > > > struct bpf_local_storage_cache *cache, > > > bool bpf_ma); > > > > > > -struct bpf_local_storage_data * > > > +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage, > > > + struct bpf_local_storage_map *smap, > > > + struct bpf_local_storage_elem *selem); > > > +/* If cacheit_lockit is false, this lookup function is lockless */ > > > +static inline struct bpf_local_storage_data * > > > bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > > > struct bpf_local_storage_map *smap, > > > - bool cacheit_lockit); > > > + bool cacheit_lockit) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + struct bpf_local_storage_elem *selem; > > > + > > > + /* Fast path (cache hit) */ > > > + sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], > > > + bpf_rcu_lock_held()); > > > + if (sdata && rcu_access_pointer(sdata->smap) == smap) > > > + return sdata; > > > > I think we should focus on fast path (your v1 patch) > > and I suppose most production environments > > want to hit fast path in most times. In your production environment did > > you see more than 16 local storage maps per entity (task/sk/inode)? > > I think having more than 16 local storage maps isn't entirely unlikely > as eBPF usage grows. But at the moment, it should be rare. > > > In the fast path, the memory accesses are > > two from local_storage->cache[smap->cache_idx] and > > one from sdata->smap > > > > > > > + > > > + /* Slow path (cache miss) */ > > > + hlist_for_each_entry_rcu(selem, &local_storage->list, snode, > > > + rcu_read_lock_trace_held()) > > > + if (rcu_access_pointer(SDATA(selem)->smap) == smap) > > > + break; > > > > But if we reach slow path here which means we have more than 16 local > > storage maps, then traversing the list and getting SDATA(selem)->smap > > will be very expensive, in addition to memory accesses in fast path. > > > > I suppose here we mostly care about socket local storage since it is > > totally possible for a production workload to have millions of sockets. > > To improve performance, fast path should hit in most cases. > > If there are too many sk local storage maps, some kind of sharing > > can be done so multiple applications might be using a single sk > > local storage. > > > > Your above inlining/outlining analysis also show how tricky it is > > for compilation optimization. Without profiling, it is totally > > possible that compiler might do optimization differently in > > the future. > > Sure, but it's usually the case that we have to help the compiler a > little to produce more optimal code - if the compiler becomes stupid > in future, we need either fix the compiler or help it some more. > > > So here is my suggestion, let us do inlining > > for fast path and focus on performance of fast path. > > The slow-path (iterate list w/o cache insertion) is still relatively > small (it's a pointer-chasing loop and a compare), and I decided that > it can be justified inlining it. Martin asked in v1 why there were > slowdowns above 16 local maps, and I analyzed, and concluded that > inlining most is needed to fix and does not hurt performance: in fact, > the current version is better than v1 in all cases (even for 16 maps > or below). > > Let me know which version you prefer, and I'll change it. However, > based on the results, I would prefer the current version. FTR, these were the results going from v1 (before) -> v2 (after): +---- Local Storage ---------------------- | | + num_maps: 1 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 38.593 M ops/s | 39.068 M ops/s (+1.2%) | +- hits latency | 25.913 ns/op | 25.598 ns/op (-1.2%) | +- important_hits throughput | 38.593 M ops/s | 39.068 M ops/s (+1.2%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 44.406 M ops/s | 44.926 M ops/s (+1.2%) | +- hits latency | 22.521 ns/op | 22.259 ns/op (-1.2%) | +- important_hits throughput | 44.406 M ops/s | 44.926 M ops/s (+1.2%) | | + num_maps: 10 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 37.583 M ops/s | 38.099 M ops/s (+1.4%) | +- hits latency | 26.609 ns/op | 26.248 ns/op (-1.4%) | +- important_hits throughput | 3.758 M ops/s | 3.810 M ops/s (+1.4%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 40.698 M ops/s | 41.145 M ops/s (+1.1%) | +- hits latency | 24.573 ns/op | 24.307 ns/op (-1.1%) | +- important_hits throughput | 14.535 M ops/s | 14.695 M ops/s (+1.1%) | | + num_maps: 16 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 38.061 M ops/s | 38.341 M ops/s ( ~ ) | +- hits latency | 26.275 ns/op | 26.083 ns/op ( ~ ) | +- important_hits throughput | 2.379 M ops/s | 2.396 M ops/s ( ~ ) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 40.890 M ops/s | 41.338 M ops/s (+1.1%) | +- hits latency | 24.458 ns/op | 24.193 ns/op (-1.1%) | +- important_hits throughput | 13.010 M ops/s | 13.153 M ops/s (+1.1%) | | + num_maps: 17 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 31.799 M ops/s | 32.756 M ops/s (+3.0%) | +- hits latency | 31.448 ns/op | 30.530 ns/op (-2.9%) | +- important_hits throughput | 1.873 M ops/s | 1.929 M ops/s (+3.0%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 35.284 M ops/s | 36.110 M ops/s (+2.3%) | +- hits latency | 28.343 ns/op | 27.697 ns/op (-2.3%) | +- important_hits throughput | 10.742 M ops/s | 10.993 M ops/s (+2.3%) | | + num_maps: 24 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 17.947 M ops/s | 19.937 M ops/s (+11.1%) | +- hits latency | 55.725 ns/op | 50.166 ns/op (-10.0%) | +- important_hits throughput | 0.748 M ops/s | 0.831 M ops/s (+11.1%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 21.379 M ops/s | 23.332 M ops/s (+9.1%) | +- hits latency | 46.775 ns/op | 42.865 ns/op (-8.4%) | +- important_hits throughput | 6.014 M ops/s | 6.564 M ops/s (+9.1%) | | + num_maps: 32 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 13.279 M ops/s | 14.626 M ops/s (+10.1%) | +- hits latency | 75.317 ns/op | 68.381 ns/op (-9.2%) | +- important_hits throughput | 0.416 M ops/s | 0.458 M ops/s (+10.2%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 16.444 M ops/s | 17.906 M ops/s (+8.9%) | +- hits latency | 60.816 ns/op | 55.865 ns/op (-8.1%) | +- important_hits throughput | 4.590 M ops/s | 4.998 M ops/s (+8.9%) | | + num_maps: 100 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 4.912 M ops/s | 5.528 M ops/s (+12.5%) | +- hits latency | 207.291 ns/op | 183.059 ns/op (-11.7%) | +- important_hits throughput | 0.049 M ops/s | 0.055 M ops/s (+12.7%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 6.039 M ops/s | 6.498 M ops/s (+7.6%) | +- hits latency | 167.325 ns/op | 152.877 ns/op (-8.6%) | +- important_hits throughput | 1.577 M ops/s | 1.697 M ops/s (+7.6%) | | + num_maps: 1000 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 0.342 M ops/s | 0.354 M ops/s (+3.6%) | +- hits latency | 2930.550 ns/op | 2827.139 ns/op (-3.5%) | +- important_hits throughput | 0.000 M ops/s | 0.000 M ops/s ( ~ ) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 0.413 M ops/s | 0.403 M ops/s (-2.5%) | +- hits latency | 2427.830 ns/op | 2487.555 ns/op (+2.5%) | +- important_hits throughput | 0.104 M ops/s | 0.101 M ops/s (-2.6%) | | Geomean: | hits throughput: 102.93% | hits latency: 97.11% | important_hits throughput: 102.77%