On Thu, May 12, 2022 at 6:42 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>
> On 5/11/22 1:33 PM, Alexei Starovoitov wrote:
> > On Sun, May 08, 2022 at 02:53:01PM -0700, Dave Marchevsky wrote:
> >> Add benchmarks to demonstrate the performance cliff for local_storage
> >> get as the number of local_storage maps increases beyond the current
> >> local_storage implementation's cache size.
> >>
> >> "sequential get" and "interleaved get" benchmarks are added, both of
> >> which do many bpf_task_storage_get calls on a set of {10, 100, 1000}
> >> task local_storage maps, while considering a single specific map to
> >> be 'important' and counting task_storage_gets to the important map
> >> separately, in addition to the normal 'hits' count of all gets. The
> >> goal here is to mimic a scenario where a particular program using one
> >> map - the important one - is running on a system where many other
> >> local_storage maps exist and are accessed often.
> >>
> >> While the "sequential get" benchmark does bpf_task_storage_get for
> >> maps 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
> >> interleaves 4 bpf_task_storage_gets for the important map for every
> >> 10 map gets. This is meant to highlight performance differences when
> >> the important map is accessed far more frequently than non-important
> >> maps.
> >>
> >> The addition of this benchmark is inspired by conversation with
> >> Alexei in a previous patchset's thread [0], which highlighted the
> >> need for such a benchmark to motivate and validate improvements to
> >> the local_storage implementation. My approach in that series focused
> >> on improving performance for explicitly-marked 'important' maps and
> >> was rejected, with feedback to make more generally-applicable
> >> improvements while avoiding explicitly marking maps as important.
> >> Thus the benchmark reports both general and important-map-focused
> >> metrics, so the effect of future work on both is clear.
> >>
> >> Regarding the benchmark results, on a powerful system (Skylake, 20
> >> cores, 256gb ram):
> >>
> >> Local Storage
> >> =============
> >> num_maps: 10
> >> local_storage cache sequential get: hits throughput: 20.013 ± 0.818 M ops/s, hits latency: 49.967 ns/op, important_hits throughput: 2.001 ± 0.082 M ops/s
> >> local_storage cache interleaved get: hits throughput: 23.149 ± 0.342 M ops/s, hits latency: 43.198 ns/op, important_hits throughput: 8.268 ± 0.122 M ops/s
> >>
> >> num_maps: 100
> >> local_storage cache sequential get: hits throughput: 6.149 ± 0.220 M ops/s, hits latency: 162.630 ns/op, important_hits throughput: 0.061 ± 0.002 M ops/s
> >> local_storage cache interleaved get: hits throughput: 7.659 ± 0.177 M ops/s, hits latency: 130.565 ns/op, important_hits throughput: 2.243 ± 0.052 M ops/s
> >>
> >> num_maps: 1000
> >> local_storage cache sequential get: hits throughput: 0.917 ± 0.029 M ops/s, hits latency: 1090.711 ns/op, important_hits throughput: 0.002 ± 0.000 M ops/s
> >> local_storage cache interleaved get: hits throughput: 1.121 ± 0.016 M ops/s, hits latency: 892.299 ns/op, important_hits throughput: 0.322 ± 0.005 M ops/s
> >
> > Thanks for crafting a benchmark. It certainly helps to understand the
> > cliff. Is there a way to make it more configurable?
> > 10, 100, 1000 are hard-coded and not easy to change.
> > In particular I'm interested in the numbers: 1, 16, 17, 32, 100.
> > If my understanding of the implementation is correct, 1 and 16 should
> > have pretty much the same performance (the local_storage cache has 16
> > slots).
> > 17 should see the cliff, which should grow linearly in 32 and in 100.
> > Between just two points, 100 and 1000, there is no easy way to
> > compute the linear degradation.
>
> Agreed that being able to choose an arbitrary number of local_storage
> maps for the benchmark would be ideal. I tried to do this in an earlier
> iteration of the patch, but abandoned it as it would require a
> bcc-style approach, something like generating a .c bpf program with
> python at runtime, then running it through the libbpf loader.
>
> The easiest path forward is probably just generating bytecode and using
> the raw API. Will give that a shot; hopefully the test prog is still
> comprehensible in that form.

Can this be avoided by using ARRAY_OF_MAPS and doing bounded loops? I see
that we have inlining for ARRAY_OF_MAPS, just like for ARRAY, so it
shouldn't hurt performance (not significantly, at least). Something along
the lines of the sketch below.

> > Also could you add a hash map with key=tid?
> > It would be interesting to compare local storage with a hash map.
> > iirc when local storage was introduced it was 2.5 times faster than a
> > large hashmap. Since then both have changed.
>
> Will do.
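
To make the ARRAY_OF_MAPS idea concrete, here's a rough, untested sketch
of what I have in mind. It assumes task local_storage maps are accepted
as inner maps of ARRAY_OF_MAPS; map, variable, and attach point names
are made up for illustration. num_maps would be set from userspace
before load, and the loop stays verifier-bounded at max_entries:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* inner map template: a single task local_storage map */
struct local_storage_map {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
};

/* outer array holding up to 1000 local_storage maps; userspace
 * populates only the first num_maps slots
 */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__type(key, int);
	__type(value, int);
	__array(values, struct local_storage_map);
} storage_maps SEC(".maps");

const volatile int num_maps = 10; /* set by userspace before load */
long hits = 0;

SEC("fentry/__x64_sys_getpgid") /* example attach point */
int get_local(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *data;
	void *map;
	int i;

	/* bounded loop: at most max_entries iterations, cut off at
	 * num_maps at runtime
	 */
	for (i = 0; i < 1000; i++) {
		if (i >= num_maps)
			break;
		map = bpf_map_lookup_elem(&storage_maps, &i);
		if (!map)
			continue;
		data = bpf_task_storage_get(map, task, NULL,
					    BPF_LOCAL_STORAGE_GET_F_CREATE);
		if (data)
			__sync_fetch_and_add(&hits, 1);
	}
	return 0;
}

This way a single .c program covers any num_maps value (1, 16, 17, 32,
...) without runtime codegen.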
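
For the hash map comparison, the counterpart could be as simple as the
sketch below (again illustrative only; it assumes it lives in the same
object file as the sketch above and shares the hits counter). The key is
the lower 32 bits of bpf_get_current_pid_tgid(), i.e. the tid:

/* hash map counterpart for comparison, keyed by tid */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 131072); /* sized generously for many threads */
	__type(key, int);
	__type(value, long);
} hash_map SEC(".maps");

SEC("fentry/__x64_sys_getpgid") /* example attach point */
int get_hash(void *ctx)
{
	/* lower 32 bits of pid_tgid are the thread id */
	int tid = (int)bpf_get_current_pid_tgid();
	long *data, init = 0;

	data = bpf_map_lookup_elem(&hash_map, &tid);
	if (!data) {
		/* mimic GET_F_CREATE: insert on first access */
		bpf_map_update_elem(&hash_map, &tid, &init, BPF_NOEXIST);
		data = bpf_map_lookup_elem(&hash_map, &tid);
	}
	if (data)
		__sync_fetch_and_add(&hits, 1);
	return 0;
}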