On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>
> Add benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond the current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to the normal 'hits' count of all gets. The goal
> here is to mimic a scenario where a particular program using one map -
> the important one - is running on a system where many other
> local_storage maps exist and are accessed often.
>
> While the "sequential get" benchmark does bpf_task_storage_get for maps
> 0, 1, ..., {9, 99, 999} in order, the "interleaved get" benchmark
> interleaves 4 bpf_task_storage_gets for the important map for every 10
> map gets. This is meant to highlight performance differences when the
> important map is accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage.
>
> Addition of this benchmark was inspired by a conversation with Alexei
> in a previous patchset's thread [0], which highlighted the need for
> such a benchmark to motivate and validate improvements to the
> local_storage implementation. My approach in that series focused on
> improving performance for explicitly-marked 'important' maps and was
> rejected with feedback to make more generally-applicable improvements
> while avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so the effect
> of future work on both is clear.
>
> Regarding the benchmark results, on a powerful system (Skylake, 20
> cores, 256GB RAM):
>
> Local Storage
> =============

[...]

>
> Looking at the "sequential get" results, it's clear that as the
> number of task local_storage maps grows beyond the current cache size
> (16), there's a significant reduction in hits throughput. Note that
> the current local_storage implementation assigns a cache_idx to maps
> as they are created. Since "sequential get" is creating maps 0..n in
> order and then doing bpf_task_storage_get calls in the same order, the
> benchmark is effectively ensuring that a map will not be in cache when
> the program tries to access it.
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still decreases as
> the number of maps increases, though.
>
> Note that the test programs need to split task_storage_get calls across
> multiple programs to work around the verifier's MAX_USED_MAPS
> limitations. As evidenced by the unintuitive-looking results for
> benchmark runs with smaller num_maps, overhead that is amortized across
> larger num_maps in other runs dominates when there are fewer maps. To
> get a sense of the overhead, I commented out
> bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and
> ran the benchmark on the same host as the 'real' run. Results:
>
> Local Storage
> =============

[...]
>
> Adjusting for overhead, latency numbers for "hashmap control" and
> "sequential get" are:
>
> hashmap_control:     ~6.8ns
> sequential_get_1:    ~15.5ns
> sequential_get_10:   ~20ns
> sequential_get_16:   ~17.8ns
> sequential_get_17:   ~21.8ns
> sequential_get_24:   ~45.2ns
> sequential_get_32:   ~69.7ns
> sequential_get_100:  ~153.3ns
> sequential_get_1000: ~2300ns
>
> Clearly demonstrating a cliff.
>
> When running the benchmarks it may be necessary to bump the 'open files'
> ulimit for a successful run.
>
> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
> ---
> Changelog:
>
> v1 -> v2:
>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>     configurable # of maps (Andrii)
>   * Add hashmap benchmark (Alexei)
>   * Add discussion of overhead
>

[...]

> +
> +/* Keep in sync w/ array of maps in bpf */
> +#define MAX_NR_MAPS 1000
> +/* Keep in sync w/ number of progs in bpf app */
> +#define MAX_NR_PROGS 20
> +
> +static struct {
> +	void (*destroy_skel)(void *obj);
> +	int (*load_skel)(void *obj);
> +	long *important_hits;
> +	long *hits;
> +	void *progs;
> +	void *skel;
> +	struct bpf_map *array_of_maps;
> +} ctx;
> +
> +int created_maps[MAX_NR_MAPS];
> +struct bpf_link *attached_links[MAX_NR_PROGS];
> +

static?

> +static void teardown(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_NR_PROGS; i++) {
> +		if (!attached_links[i])
> +			break;
> +		bpf_link__detach(attached_links[i]);
> +	}
> +
> +	if (ctx.destroy_skel && ctx.skel)
> +		ctx.destroy_skel(ctx.skel);
> +
> +	for (i = 0; i < MAX_NR_MAPS; i++) {
> +		if (!created_maps[i])
> +			break;
> +		close(created_maps[i]);
> +	}
> +}
> +

Wouldn't all this be cleaned up on bench exit anyway?.. We've been less
strict about proper cleanup for bench to keep code simpler.

[...]

> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> new file mode 100644
> index 000000000000..b5e358dee245
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> +	__uint(max_entries, 1000);
> +	__type(key, int);
> +	__type(value, int);
> +} array_of_maps SEC(".maps");

You don't need setup_inner_map_and_load and load_btf, you can just
declaratively have two ARRAY_OF_MAPS, one using an inner hashmap and
another using an inner task_local_storage. Grep for __array in selftests
to see how to declaratively define an inner map prototype, e.g., see
test_ringbuf_multi.c. With the below suggestion one of the do_lookup
flavors will use array_of_hashes and the other will use array_of_storages
explicitly. From user-space you can create and set up as many inner maps
as needed. If you need btf_id for value_type_id for the inner map, see
if bpf_map__inner_map() would be useful.
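Something along these lines, as an untested sketch (map names, sizes and
value types here are only illustrative, reusing the array_of_hashes /
array_of_storages naming from above):

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	/* inner map prototype: task local storage */
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, int);
	});
} array_of_storages SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	/* inner map prototype: plain hashmap for the control benchmark */
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 1);
		__type(key, int);
		__type(value, int);
	});
} array_of_hashes SEC(".maps");

User space can still create and populate however many actual inner maps a
given run needs and plug them into the relevant outer array.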
> +
> +long important_hits;
> +long hits;
> +
> +#ifdef LOOKUP_HASHMAP

Why #ifdef when you can have do_lookup_hashmap and do_lookup_task_storage
and choose which one to call using a read-only variable:

  const volatile bool use_hashmap;

Just set it before load and the verifier will know that one of the
do_lookup flavors is dead code (see the sketch at the bottom of this
mail).

> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> +{
> +	void *map;
> +	int zero = 0;
> +
> +	map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +	if (!map)
> +		return -1;
> +
> +	bpf_map_lookup_elem(map, &zero);
> +	__sync_add_and_fetch(&hits, 1);
> +	if (!elem)
> +		__sync_add_and_fetch(&important_hits, 1);
> +	return 0;
> +}
> +#else
> +static int do_lookup(unsigned int elem, struct task_struct *task)
> +{
> +	void *map;
> +
> +	map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +	if (!map)
> +		return -1;
> +
> +	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	__sync_add_and_fetch(&hits, 1);
> +	if (!elem)
> +		__sync_add_and_fetch(&important_hits, 1);
> +	return 0;
> +}
> +#endif /* LOOKUP_HASHMAP */
> +
> +#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave)	\
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")				\
> +int get_local_ ## start(void *ctx)					\
> +{									\
> +	struct task_struct *task;					\
> +	unsigned int i, elem;						\
> +	void *map;							\
> +									\
> +	task = bpf_get_current_task_btf();				\
> +	for (i = 0; i < 50; i++) {					\

I'm trying to understand why you didn't just do

  for (i = 0; i < 1000; i++) { ... }

and avoid all the macro stuff? What didn't work?

> +		elem = start + i;					\
> +		if (do_lookup(elem, task))				\
> +			return 0;					\
> +		if (interleave && i % 3 == 0)				\

nit: % 3 will be slow(-ish), why not pick some power of 2?

> +			do_lookup(0, task);				\
> +	}								\
> +	return 0;							\
> +}
> +
> +#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \
> +	__TASK_STORAGE_GET_LOOP_PROG(array, start, false)
> +#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \
> +	__TASK_STORAGE_GET_LOOP_PROG(array, start, true)

[...]

> +
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
> +

All these macros make me sad, I'd like to understand why it has to be
done this way.

> +char _license[] SEC("license") = "GPL";
> --
> 2.30.2
>
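P.S. To make the const volatile suggestion above concrete, here's an
untested sketch. It assumes the two declaratively-defined outer maps are
named array_of_hashes and array_of_storages as suggested earlier; the
helper names are mine, not from the patch:

const volatile bool use_hashmap;

static int do_lookup_hashmap(unsigned int elem)
{
	void *map;
	int zero = 0;

	map = bpf_map_lookup_elem(&array_of_hashes, &elem);
	if (!map)
		return -1;

	bpf_map_lookup_elem(map, &zero);
	return 0;
}

static int do_lookup_task_storage(unsigned int elem, struct task_struct *task)
{
	void *map;

	map = bpf_map_lookup_elem(&array_of_storages, &elem);
	if (!map)
		return -1;

	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	return 0;
}

static int do_lookup(unsigned int elem, struct task_struct *task)
{
	int err;

	/* use_hashmap lives in .rodata, so the verifier sees a constant
	 * here and the flavor that isn't selected becomes dead code
	 */
	if (use_hashmap)
		err = do_lookup_hashmap(elem);
	else
		err = do_lookup_task_storage(elem, task);

	if (err)
		return err;

	__sync_add_and_fetch(&hits, 1);
	if (!elem)
		__sync_add_and_fetch(&important_hits, 1);
	return 0;
}

User space would then set skel->rodata->use_hashmap before loading the
skeleton, instead of building two variants of the program with #ifdef.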