On Thu, May 12, 2022 at 6:42 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>
> On 5/11/22 1:33 PM, Alexei Starovoitov wrote:
> > On Sun, May 08, 2022 at 02:53:01PM -0700, Dave Marchevsky wrote:
> >> Add benchmarks to demonstrate the performance cliff for local_storage
> >> get as the number of local_storage maps increases beyond the current
> >> local_storage implementation's cache size.
> >>
> >> "sequential get" and "interleaved get" benchmarks are added, both of
> >> which do many bpf_task_storage_get calls on a set of {10, 100, 1000}
> >> task local_storage maps, while considering a single specific map to
> >> be 'important' and counting task_storage_gets to the important map
> >> separately, in addition to the normal 'hits' count of all gets. The
> >> goal here is to mimic a scenario where a particular program using one
> >> map - the important one - is running on a system where many other
> >> local_storage maps exist and are accessed often.
> >>
> >> While the "sequential get" benchmark does bpf_task_storage_get for
> >> maps 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
> >> interleaves 4 bpf_task_storage_gets for the important map for every
> >> 10 map gets. This is meant to highlight performance differences when
> >> the important map is accessed far more frequently than non-important
> >> maps.
> >>
> >> The addition of this benchmark is inspired by conversation with
> >> Alexei in a previous patchset's thread [0], which highlighted the
> >> need for such a benchmark to motivate and validate improvements to
> >> the local_storage implementation. My approach in that series focused
> >> on improving performance for explicitly-marked 'important' maps and
> >> was rejected, with feedback to make more generally-applicable
> >> improvements while avoiding explicitly marking maps as important.
> >> Thus the benchmark reports both general and important-map-focused
> >> metrics, so the effect of future work on both is clear.
> >>
> >> Regarding the benchmark results, on a powerful system (Skylake, 20
> >> cores, 256gb ram):
> >>
> >> Local Storage
> >> =============
> >> num_maps: 10
> >> local_storage cache sequential get: hits throughput: 20.013 ± 0.818 M ops/s, hits latency: 49.967 ns/op, important_hits throughput: 2.001 ± 0.082 M ops/s
> >> local_storage cache interleaved get: hits throughput: 23.149 ± 0.342 M ops/s, hits latency: 43.198 ns/op, important_hits throughput: 8.268 ± 0.122 M ops/s
> >>
> >> num_maps: 100
> >> local_storage cache sequential get: hits throughput: 6.149 ± 0.220 M ops/s, hits latency: 162.630 ns/op, important_hits throughput: 0.061 ± 0.002 M ops/s
> >> local_storage cache interleaved get: hits throughput: 7.659 ± 0.177 M ops/s, hits latency: 130.565 ns/op, important_hits throughput: 2.243 ± 0.052 M ops/s
> >>
> >> num_maps: 1000
> >> local_storage cache sequential get: hits throughput: 0.917 ± 0.029 M ops/s, hits latency: 1090.711 ns/op, important_hits throughput: 0.002 ± 0.000 M ops/s
> >> local_storage cache interleaved get: hits throughput: 1.121 ± 0.016 M ops/s, hits latency: 892.299 ns/op, important_hits throughput: 0.322 ± 0.005 M ops/s
> >
> > Thanks for crafting a benchmark. It certainly helps to understand the
> > cliff. Is there a way to make it more configurable?
> > 10, 100, 1000 are hard-coded and not easy to change.
> > In particular I'm interested in the numbers: 1, 16, 17, 32, 100.
> > If my understanding of the implementation is correct, 1 and 16 should
> > have pretty much the same performance (the local_storage cache has 16
> > slots).
> > 17 should see the cliff, which should grow linearly in 32 and in 100.
> > Between just two points, 100 and 1000, there is no easy way to
> > compute the linear degradation.
>
> Agreed that being able to choose an arbitrary number of local_storage
> maps for the benchmark would be ideal. I tried to do this in an earlier
> iteration of the patch, but abandoned it as it would require a
> bcc-style approach, something like generating a .c bpf program with
> python at runtime, then running it through the libbpf loader.
>
> The easiest path forward is probably just generating bytecode and using
> the raw API. Will give that a shot; hopefully the test prog is still
> comprehensible in that form.

Can this be avoided by using ARRAY_OF_MAPS and doing bounded loops? I see
that we have inlining for ARRAY_OF_MAPS, just like for ARRAY, so it
shouldn't hurt performance (not significantly, at least). Something along
the lines of the sketch below.

> > Also could you add a hash map with key=tid?
> > It would be interesting to compare local storage with a hash map.
> > iirc when local storage was introduced it was 2.5 times faster than a
> > large hashmap. Since then both have changed.
>
> Will do.
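
To make the ARRAY_OF_MAPS idea concrete, here's a rough, untested sketch
of what I have in mind. It assumes task local_storage maps are accepted
as inner maps of ARRAY_OF_MAPS; map, variable, and attach point names
are made up for illustration. num_maps would be set from userspace
before load, and the loop stays verifier-bounded at max_entries:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* inner map template: a single task local_storage map */
struct local_storage_map {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
};

/* outer array holding up to 1000 local_storage maps; userspace
 * populates only the first num_maps slots
 */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__type(key, int);
	__type(value, int);
	__array(values, struct local_storage_map);
} storage_maps SEC(".maps");

const volatile int num_maps = 10; /* set by userspace before load */
long hits = 0;

SEC("fentry/__x64_sys_getpgid") /* example attach point */
int get_local(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *data;
	void *map;
	int i;

	/* bounded loop: at most max_entries iterations, cut off at
	 * num_maps at runtime
	 */
	for (i = 0; i < 1000; i++) {
		if (i >= num_maps)
			break;
		map = bpf_map_lookup_elem(&storage_maps, &i);
		if (!map)
			continue;
		data = bpf_task_storage_get(map, task, NULL,
					    BPF_LOCAL_STORAGE_GET_F_CREATE);
		if (data)
			__sync_fetch_and_add(&hits, 1);
	}
	return 0;
}

This way a single .c program covers any num_maps value (1, 16, 17, 32,
...) without runtime codegen.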
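
For the hash map comparison, the counterpart could be as simple as the
sketch below (again illustrative only; it assumes it lives in the same
object file as the sketch above and shares the hits counter). The key is
the lower 32 bits of bpf_get_current_pid_tgid(), i.e. the tid:

/* hash map counterpart for comparison, keyed by tid */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 131072); /* sized generously for many threads */
	__type(key, int);
	__type(value, long);
} hash_map SEC(".maps");

SEC("fentry/__x64_sys_getpgid") /* example attach point */
int get_hash(void *ctx)
{
	/* lower 32 bits of pid_tgid are the thread id */
	int tid = (int)bpf_get_current_pid_tgid();
	long *data, init = 0;

	data = bpf_map_lookup_elem(&hash_map, &tid);
	if (!data) {
		/* mimic GET_F_CREATE: insert on first access */
		bpf_map_update_elem(&hash_map, &tid, &init, BPF_NOEXIST);
		data = bpf_map_lookup_elem(&hash_map, &tid);
	}
	if (data)
		__sync_fetch_and_add(&hits, 1);
	return 0;
}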