Re: [PATCH bpf-next] selftests/bpf: Add benchmark for local_storage get

On Sun, May 08, 2022 at 02:53:01PM -0700, Dave Marchevsky wrote:
> Add benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond the current
> local_storage implementation's cache size.
> 
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on a set of {10, 100, 1000}
> task local_storage maps, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to the normal 'hits' count of all gets. The goal
> here is to mimic a scenario where a particular program using one map -
> the important one - is running on a system where many other
> local_storage maps exist and are accessed often.
> 
> While the "sequential get" benchmark does bpf_task_storage_get for maps
> 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
> interleaves 4 bpf_task_storage_gets for the important map for every 10
> map gets. This is meant to highlight performance differences when the
> important map is accessed far more frequently than non-important maps.
> 
> Addition of this benchmark is inspired by a conversation with Alexei in
> a previous patchset's thread [0], which highlighted the need for such a
> benchmark to motivate and validate improvements to the local_storage
> implementation. My approach in that series focused on improving
> performance for explicitly-marked 'important' maps and was rejected,
> with feedback to make more generally-applicable improvements while
> avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so the effect
> of future work on both is clear.
> 
> Regarding the benchmark results: on a powerful system (Skylake, 20
> cores, 256 GB RAM):
> 
> Local Storage
> =============
>         num_maps: 10
> local_storage cache sequential  get:  hits throughput: 20.013 ± 0.818 M ops/s, hits latency: 49.967 ns/op, important_hits throughput: 2.001 ± 0.082 M ops/s
> local_storage cache interleaved get:  hits throughput: 23.149 ± 0.342 M ops/s, hits latency: 43.198 ns/op, important_hits throughput: 8.268 ± 0.122 M ops/s
> 
>         num_maps: 100
> local_storage cache sequential  get:  hits throughput: 6.149 ± 0.220 M ops/s, hits latency: 162.630 ns/op, important_hits throughput: 0.061 ± 0.002 M ops/s
> local_storage cache interleaved get:  hits throughput: 7.659 ± 0.177 M ops/s, hits latency: 130.565 ns/op, important_hits throughput: 2.243 ± 0.052 M ops/s
> 
>         num_maps: 1000
> local_storage cache sequential  get:  hits throughput: 0.917 ± 0.029 M ops/s, hits latency: 1090.711 ns/op, important_hits throughput: 0.002 ± 0.000 M ops/s
> local_storage cache interleaved get:  hits throughput: 1.121 ± 0.016 M ops/s, hits latency: 892.299 ns/op, important_hits throughput: 0.322 ± 0.005 M ops/s

Thanks for crafting a benchmark. It certainly helps to understand the cliff.
Is there a way to make it more configurable?
10, 100, 1000 are hard coded and not easy to change.
In particular I'm interested in the numbers:
1, 16, 17, 32, 100.
If my understanding of the implementation is correct, 1 and 16 should have
pretty much the same performance.
17 should see the cliff, which should grow linearly through 32 and 100.
With just two points, 100 and 1000, there is no easy way
to compute the linear degradation.
Also, could you add a hash map with key=tid?
It would be interesting to compare local storage with a hash map.
IIRC, when local storage was introduced it was 2.5 times faster than a large hashmap.
Both have changed since then.
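For concreteness, the BPF side of such a hashmap variant could look
roughly like this (an untested sketch only, to be built with the
selftest harness; the attach point, map sizing, and the global 'hits'
counter are placeholders modeled on the existing bench progs):

```c
/* Sketch of a hashmap-with-key=tid counterpart to the local_storage
 * get benchmarks. Names and sizes are illustrative, not a tested
 * program. */
long hits = 0;

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);	/* "large hashmap" */
	__type(key, __u32);		/* tid */
	__type(value, long);
} tid_hash SEC(".maps");

SEC("fentry/__x64_sys_getpgid")
int get_tid_hash(void *ctx)
{
	__u32 tid = (__u32)bpf_get_current_pid_tgid(); /* low 32 bits = tid */
	long *val = bpf_map_lookup_elem(&tid_hash, &tid);

	if (val)
		__sync_fetch_and_add(&hits, 1);
	return 0;
}
```

Running this under the same sequential/interleaved harness would give
the hashmap baseline to compare against the local_storage numbers.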


