On 6/3/22 5:26 PM, Andrii Nakryiko wrote:
> On Mon, May 30, 2022 at 1:27 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>>
>> Add benchmarks to demonstrate the performance cliff for local_storage
>> get as the number of local_storage maps increases beyond the current
>> local_storage implementation's cache size.
>>
>> "sequential get" and "interleaved get" benchmarks are added, both of
>> which do many bpf_task_storage_get calls on sets of task local_storage
>> maps of various counts, while considering a single specific map to be
>> 'important' and counting task_storage_gets to the important map
>> separately in addition to the normal 'hits' count of all gets. The goal
>> here is to mimic a scenario where a particular program using one map -
>> the important one - is running on a system where many other
>> local_storage maps exist and are accessed often.
>>
>> While the "sequential get" benchmark does bpf_task_storage_get for maps
>> 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
>> interleaves 4 bpf_task_storage_gets for the important map for every 10
>> map gets. This is meant to highlight performance differences when the
>> important map is accessed far more frequently than non-important maps.
>>
>> A "hashmap control" benchmark is also included for easy comparison of
>> standard bpf hashmap lookup vs local_storage get. The benchmark is
>> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
>> instead of local storage.
>>
>> Addition of this benchmark is inspired by a conversation with Alexei in
>> a previous patchset's thread [0], which highlighted the need for such a
>> benchmark to motivate and validate improvements to the local_storage
>> implementation. My approach in that series focused on improving
>> performance for explicitly-marked 'important' maps and was rejected
>> with feedback to make more generally-applicable improvements while
>> avoiding explicitly marking maps as important. Thus the benchmark
>> reports both general and important-map-focused metrics, so the effect
>> of future work on both is clear.
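For context when reading the numbers below, the bpf side of the benchmark
boils down to roughly the following sketch. This is simplified, not the
exact local_storage_bench.c; names like important_map_idx and the attach
point are placeholders, and the hit counters were actually moved out of
do_lookup in v4:

  /* Simplified sketch of the benchmark's bpf prog: an ARRAY_OF_MAPS of
   * task local_storage maps, with rodata knobs set from userspace
   * before load.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  struct inner_map_type {
  	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
  	__uint(map_flags, BPF_F_NO_PREALLOC);
  	__type(key, int);
  	__type(value, int);
  };

  struct {
  	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
  	__uint(max_entries, 1000);
  	__type(key, int);
  	__type(value, int);
  	__array(values, struct inner_map_type);
  } array_of_local_storage_maps SEC(".maps");

  const volatile unsigned int num_maps = 1;
  const volatile unsigned int important_map_idx = 0;
  const volatile unsigned int interleave = 0;

  long hits, important_hits;

  static void do_lookup(unsigned int elem, struct task_struct *task)
  {
  	void *map, *data;

  	map = bpf_map_lookup_elem(&array_of_local_storage_maps, &elem);
  	if (!map)
  		return;
  	/* the operation being measured: a miss in the 16-entry cache
  	 * means walking the task's list of local storages
  	 */
  	data = bpf_task_storage_get(map, task, NULL,
  				    BPF_LOCAL_STORAGE_GET_F_CREATE);
  	if (data) {
  		__sync_fetch_and_add(&hits, 1);
  		if (elem == important_map_idx)
  			__sync_fetch_and_add(&important_hits, 1);
  	}
  }

  SEC("fentry/__x64_sys_getpgid")
  int get_local(void *ctx)
  {
  	struct task_struct *task = bpf_get_current_task_btf();
  	unsigned int i;

  	for (i = 0; i < 10000; i++) {
  		/* "sequential": walk maps 0..num_maps-1 in order */
  		do_lookup(i % num_maps, task);
  		/* "interleaved": extra gets to the important map, 4
  		 * for every 10 regular gets
  		 */
  		if (interleave && (i % 10) < 4)
  			do_lookup(important_map_idx, task);
  	}
  	return 0;
  }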
>>
>> Regarding the benchmark results. On a powerful system (Skylake, 20
>> cores, 256GB RAM):
>>
>> Local Storage
>> =============
>> Hashmap Control w/ 500 maps
>> hashmap (control) sequential get: hits throughput: 48.338 ± 2.366 M ops/s, hits latency: 20.688 ns/op, important_hits throughput: 0.097 ± 0.005 M ops/s
>>
>> num_maps: 1
>> local_storage cache sequential get: hits throughput: 44.503 ± 1.080 M ops/s, hits latency: 22.470 ns/op, important_hits throughput: 44.503 ± 1.080 M ops/s
>> local_storage cache interleaved get: hits throughput: 54.963 ± 0.586 M ops/s, hits latency: 18.194 ns/op, important_hits throughput: 54.963 ± 0.586 M ops/s
>>
>> num_maps: 10
>> local_storage cache sequential get: hits throughput: 43.743 ± 0.418 M ops/s, hits latency: 22.861 ns/op, important_hits throughput: 4.374 ± 0.042 M ops/s
>> local_storage cache interleaved get: hits throughput: 50.073 ± 0.609 M ops/s, hits latency: 19.971 ns/op, important_hits throughput: 17.883 ± 0.217 M ops/s
>>
>> num_maps: 16
>> local_storage cache sequential get: hits throughput: 43.962 ± 0.525 M ops/s, hits latency: 22.747 ns/op, important_hits throughput: 2.748 ± 0.033 M ops/s
>> local_storage cache interleaved get: hits throughput: 48.166 ± 0.825 M ops/s, hits latency: 20.761 ns/op, important_hits throughput: 15.326 ± 0.263 M ops/s
>>
>> num_maps: 17
>> local_storage cache sequential get: hits throughput: 33.207 ± 0.461 M ops/s, hits latency: 30.114 ns/op, important_hits throughput: 1.956 ± 0.027 M ops/s
>> local_storage cache interleaved get: hits throughput: 43.540 ± 0.265 M ops/s, hits latency: 22.968 ns/op, important_hits throughput: 13.255 ± 0.081 M ops/s
>>
>> num_maps: 24
>> local_storage cache sequential get: hits throughput: 19.402 ± 0.348 M ops/s, hits latency: 51.542 ns/op, important_hits throughput: 0.809 ± 0.015 M ops/s
>> local_storage cache interleaved get: hits throughput: 22.981 ± 0.487 M ops/s, hits latency: 43.514 ns/op, important_hits throughput: 6.465 ± 0.137 M ops/s
>>
>> num_maps: 32
>> local_storage cache sequential get: hits throughput: 13.378 ± 0.220 M ops/s, hits latency: 74.748 ns/op, important_hits throughput: 0.419 ± 0.007 M ops/s
>> local_storage cache interleaved get: hits throughput: 16.894 ± 0.172 M ops/s, hits latency: 59.193 ns/op, important_hits throughput: 4.716 ± 0.048 M ops/s
>>
>> num_maps: 100
>> local_storage cache sequential get: hits throughput: 6.070 ± 0.140 M ops/s, hits latency: 164.745 ns/op, important_hits throughput: 0.061 ± 0.001 M ops/s
>> local_storage cache interleaved get: hits throughput: 7.323 ± 0.149 M ops/s, hits latency: 136.554 ns/op, important_hits throughput: 1.913 ± 0.039 M ops/s
>>
>> num_maps: 1000
>> local_storage cache sequential get: hits throughput: 0.438 ± 0.012 M ops/s, hits latency: 2281.369 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
>> local_storage cache interleaved get: hits throughput: 0.522 ± 0.010 M ops/s, hits latency: 1913.937 ns/op, important_hits throughput: 0.131 ± 0.003 M ops/s
>>
>> Looking at the "sequential get" results, it's clear that as the
>> number of task local_storage maps grows beyond the current cache size
>> (16), there's a significant reduction in hits throughput. Note that
>> the current local_storage implementation assigns a cache_idx to maps
>> as they are created. Since "sequential get" is creating maps 0..n in
>> order and then doing bpf_task_storage_get calls in the same order, the
>> benchmark is effectively ensuring that a map will not be in cache when
>> the program tries to access it.
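To make the cache behavior concrete, here is a userspace toy model of the
16-slot cache, assuming cache_idx is handed out round-robin at map
creation, which is effectively what creating maps 0..n in order yields.
An illustration only, not kernel code:

  /* Toy model of the cache_idx scheme described above. */
  #include <stdio.h>

  #define CACHE_SIZE 16

  int main(void)
  {
  	int cache[CACHE_SIZE], num_maps = 32, misses = 0;

  	for (int i = 0; i < CACHE_SIZE; i++)
  		cache[i] = -1; /* empty slot */

  	/* two sequential passes over all maps, like "sequential get" */
  	for (int round = 0; round < 2; round++) {
  		for (int map = 0; map < num_maps; map++) {
  			int idx = map % CACHE_SIZE; /* cache_idx */

  			if (cache[idx] != map) {
  				misses++;         /* full list walk */
  				cache[idx] = map; /* cached after miss */
  			}
  		}
  	}
  	/* with num_maps = 32, maps m and m+16 share a slot and evict
  	 * each other, so every get misses even after warmup; with
  	 * num_maps <= 16 the second pass has zero misses
  	 */
  	printf("misses: %d / %d gets\n", misses, 2 * num_maps);
  	return 0;
  }

This lines up with the cliff in the numbers above: at 17 maps only one
slot thrashes, by 32 maps every slot does.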
>>
>> For "interleaved get" results, important-map hits throughput is greatly
>> increased as the important map is more likely to be in cache by virtue
>> of being accessed far more frequently. Throughput still reduces as the
>> number of maps increases, though.
>>
>> To get a sense of the overhead of the benchmark program, I commented
>> out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c
>> and ran the benchmark on the same host as the 'real' run. Results:
>>
>> Local Storage
>> =============
>> Hashmap Control w/ 500 maps
>> hashmap (control) sequential get: hits throughput: 96.965 ± 1.346 M ops/s, hits latency: 10.313 ns/op, important_hits throughput: 0.194 ± 0.003 M ops/s
>>
>> num_maps: 1
>> local_storage cache sequential get: hits throughput: 105.792 ± 1.860 M ops/s, hits latency: 9.453 ns/op, important_hits throughput: 105.792 ± 1.860 M ops/s
>> local_storage cache interleaved get: hits throughput: 185.847 ± 4.014 M ops/s, hits latency: 5.381 ns/op, important_hits throughput: 185.847 ± 4.014 M ops/s
>>
>> num_maps: 10
>> local_storage cache sequential get: hits throughput: 109.867 ± 1.358 M ops/s, hits latency: 9.102 ns/op, important_hits throughput: 10.987 ± 0.136 M ops/s
>> local_storage cache interleaved get: hits throughput: 144.165 ± 1.256 M ops/s, hits latency: 6.936 ns/op, important_hits throughput: 51.487 ± 0.449 M ops/s
>>
>> num_maps: 16
>> local_storage cache sequential get: hits throughput: 109.258 ± 1.902 M ops/s, hits latency: 9.153 ns/op, important_hits throughput: 6.829 ± 0.119 M ops/s
>> local_storage cache interleaved get: hits throughput: 140.248 ± 1.836 M ops/s, hits latency: 7.130 ns/op, important_hits throughput: 44.624 ± 0.584 M ops/s
>>
>> num_maps: 17
>> local_storage cache sequential get: hits throughput: 116.397 ± 7.679 M ops/s, hits latency: 8.591 ns/op, important_hits throughput: 6.856 ± 0.452 M ops/s
>> local_storage cache interleaved get: hits throughput: 128.411 ± 4.927 M ops/s, hits latency: 7.787 ns/op, important_hits throughput: 39.093 ± 1.500 M ops/s
>>
>> num_maps: 24
>> local_storage cache sequential get: hits throughput: 110.890 ± 0.976 M ops/s, hits latency: 9.018 ns/op, important_hits throughput: 4.624 ± 0.041 M ops/s
>> local_storage cache interleaved get: hits throughput: 133.316 ± 1.889 M ops/s, hits latency: 7.501 ns/op, important_hits throughput: 37.503 ± 0.531 M ops/s
>>
>> num_maps: 32
>> local_storage cache sequential get: hits throughput: 112.900 ± 1.171 M ops/s, hits latency: 8.857 ns/op, important_hits throughput: 3.534 ± 0.037 M ops/s
>> local_storage cache interleaved get: hits throughput: 132.844 ± 1.207 M ops/s, hits latency: 7.528 ns/op, important_hits throughput: 37.081 ± 0.337 M ops/s
>>
>> num_maps: 100
>> local_storage cache sequential get: hits throughput: 110.025 ± 4.714 M ops/s, hits latency: 9.089 ns/op, important_hits throughput: 1.100 ± 0.047 M ops/s
>> local_storage cache interleaved get: hits throughput: 131.979 ± 5.013 M ops/s, hits latency: 7.577 ns/op, important_hits throughput: 34.472 ± 1.309 M ops/s
>>
>> num_maps: 1000
>> local_storage cache sequential get: hits throughput: 117.850 ± 2.423 M ops/s, hits latency: 8.485 ns/op, important_hits throughput: 0.118 ± 0.002 M ops/s
>> local_storage cache interleaved get: hits throughput: 141.268 ± 9.658 M ops/s, hits latency: 7.079 ns/op, important_hits throughput: 35.476 ± 2.425 M ops/s
>>
>> Adjusting for overhead, latency numbers for "hashmap control" and
>> "sequential get" are:
>>
>> hashmap_control: ~10.4ns
>> sequential_get_1: ~13.0ns
>
> So what this benchmark doesn't demonstrate is why one would use local
> storage at all if hashmap is so fast :)
>
> I think at least partially it's because of your choice to do a fixed
> hashmap lookup with a zero key. Think about how you'd replace
> local_storage with a hashmap. You'd have task/socket/whatever ID as the
> lookup key. For different tasks you'd be looking up different pids. For
> your benchmark you have the same task all the time, but the
> local_storage lookup still does all the work to find the local storage
> instance in a list of local storages for the current task, so you don't
> have to use many tasks to simulate realistic lookup overhead (well, at
> least to some extent). But it doesn't seem realistic for testing
> hashmap as an alternative to local_storage; for that I think we'd need
> to randomize key lookup a bit. Unless I misunderstand what we are
> testing for the hashmap use case.
>
> But other than that LGTM.

This makes more sense than what I'm doing now, and I think it better
fulfills Alexei's request from v1 of this series:

  Also could you add a hash map with key=tid. It would be interesting
  to compare local storage with hash map. iirc when local storage was
  introduced it was 2.5 times faster than large hashmap. Since then
  both have changed.

I tried changing the "hashmap control" benchmark to just have 1 inner
hashmap, with max_entries 4,194,304 (PID_MAX_LIMIT on my system). The
use_hashmap branch of the bpf prog does bpf_get_prandom_u32 to select
the key. Results are more promising:

Overhead: ~19.5ns (prandom call + additional modulo)
4,194,304 possible keys: ~150.5ns (~131ns w/o overhead)
100,000 possible keys: ~55ns (~35.5ns w/o overhead)
1,000 possible keys: ~31ns (~11.5ns w/o overhead)
10 possible keys: ~28ns (~8.5ns w/o overhead)

So a big hashmap with only a few keys used is faster than local_storage,
but far slower when many keys are used. Somewhere between "100k keys"
and "1k keys" is probably what performance would look like if a program
which needed an entry per running task used a hashmap instead of
local_storage. Close to Alexei's memory of a 2.5x difference.

Will send in v5.
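For concreteness, the change amounts to roughly the following sketch.
This is not the final v5 code: hash_map and possible_keys are
placeholder names, and the array-of-maps indirection is collapsed to a
single hashmap for brevity.

  struct {
  	__uint(type, BPF_MAP_TYPE_HASH);
  	__uint(max_entries, 4194304); /* PID_MAX_LIMIT on the test host */
  	__type(key, unsigned int);
  	__type(value, int);
  } hash_map SEC(".maps");

  /* set from userspace before load */
  const volatile unsigned int possible_keys = 4194304;

  long hits;

  static void do_hash_lookup(void)
  {
  	/* randomize the key so lookups spread across buckets instead
  	 * of hitting a single fixed zero key; the prandom call plus
  	 * modulo is the ~19.5ns of extra overhead measured above
  	 */
  	unsigned int key = bpf_get_prandom_u32() % possible_keys;

  	if (bpf_map_lookup_elem(&hash_map, &key))
  		__sync_fetch_and_add(&hits, 1);
  }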
>
>> sequential_get_10: ~13.8ns
>> sequential_get_16: ~13.6ns
>> sequential_get_17: ~21.5ns
>> sequential_get_24: ~42.5ns
>> sequential_get_32: ~65.9ns
>> sequential_get_100: ~155.7ns
>> sequential_get_1000: ~2270ns
>>
>> Clearly demonstrating a cliff.
>>
>> When running the benchmarks it may be necessary to bump the 'open
>> files' ulimit for a successful run.
>>
>> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
>> ---
>> Changelog:
>>
>> v3 -> v4:
>>   * Remove ifs guarding increments in measure fn (Andrii)
>>   * Refactor to use 1 bpf prog for all 3 benchmarks w/ global vars set
>>     from userspace before load to control behavior (Andrii)
>>   * Greatly reduce variance in overhead by having benchmark bpf prog
>>     loop 10k times regardless of map count (Andrii)
>>     * Also, move sync_fetch_and_incr out of do_lookup as the guaranteed
>>       second sync_fetch_and_incr call for num_maps = 1 was adding
>>       overhead
>>   * Add second patch refactoring bench.c's mean/stddev calculations
>>     in reporting helper fns
>>
>> v2 -> v3:
>>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
>>     so just use 1 program (Alexei)
>>
>> v1 -> v2:
>>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>>     configurable # of maps (Andrii)
>>   * Add hashmap benchmark (Alexei)
>>   * Add discussion of overhead
>>
>>  tools/testing/selftests/bpf/Makefile          |   4 +-
>>  tools/testing/selftests/bpf/bench.c           |  55 ++++
>>  tools/testing/selftests/bpf/bench.h           |   4 +
>>  .../bpf/benchs/bench_local_storage.c          | 250 ++++++++++++++++++
>>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>>  .../selftests/bpf/benchs/run_common.sh        |  17 ++
>>  .../selftests/bpf/progs/local_storage_bench.c |  99 +++++++
>>  7 files changed, 449 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.c
>>
>
> [...]
>
>> +
>> +static void hashmap_setup(void)
>> +{
>> +	struct local_storage_bench *skel;
>> +
>> +	setup_libbpf();
>> +
>> +	skel = local_storage_bench__open();
>> +	ctx.skel = skel;
>> +	ctx.bpf_obj = skel->obj;
>
> nit: ctx.skel->obj is the same as ctx.bpf_obj, so bpf_obj is probably
> not needed?

Will get rid of it in v5.

>
>> +	ctx.array_of_maps = skel->maps.array_of_hash_maps;
>> +	skel->rodata->use_hashmap = 1;
>> +	skel->rodata->interleave = 0;
>> +
>> +	__setup(skel->progs.get_local, true);
>> +}
>> +
>
> [...]