On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>
> Add benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond the current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to the normal 'hits' count of all gets. The goal
> here is to mimic a scenario where a particular program using one map -
> the important one - is running on a system where many other
> local_storage maps exist and are accessed often.
>
> While the "sequential get" benchmark does bpf_task_storage_get for maps
> 0, 1, ..., {9, 99, 999} in order, the "interleaved get" benchmark
> interleaves 4 bpf_task_storage_gets for the important map for every 10
> map gets. This is meant to highlight performance differences when the
> important map is accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage.
>
> Addition of this benchmark was inspired by a conversation with Alexei
> in a previous patchset's thread [0], which highlighted the need for
> such a benchmark to motivate and validate improvements to the
> local_storage implementation. My approach in that series focused on
> improving performance for explicitly-marked 'important' maps and was
> rejected with feedback to make more generally-applicable improvements
> while avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so the effect
> of future work on both is clear.
>
> Regarding the benchmark results, on a powerful system (Skylake, 20
> cores, 256GB RAM):
>
> Local Storage
> =============

[...]

>
> Looking at the "sequential get" results, it's clear that as the
> number of task local_storage maps grows beyond the current cache size
> (16), there's a significant reduction in hits throughput. Note that
> the current local_storage implementation assigns a cache_idx to maps
> as they are created. Since "sequential get" is creating maps 0..n in
> order and then doing bpf_task_storage_get calls in the same order, the
> benchmark is effectively ensuring that a map will not be in cache when
> the program tries to access it.
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still decreases as
> the number of maps increases, though.
>
> Note that the test programs need to split task_storage_get calls across
> multiple programs to work around the verifier's MAX_USED_MAPS
> limitations. As evidenced by the unintuitive-looking results for
> benchmark runs with smaller num_maps, overhead that is amortized across
> larger num_maps in other runs dominates when there are fewer maps. To
> get a sense of the overhead, I commented out
> bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and
> ran the benchmark on the same host as the 'real' run. Results:
>
> Local Storage
> =============

[...]
>
> Adjusting for overhead, latency numbers for "hashmap control" and
> "sequential get" are:
>
> hashmap_control:     ~6.8ns
> sequential_get_1:    ~15.5ns
> sequential_get_10:   ~20ns
> sequential_get_16:   ~17.8ns
> sequential_get_17:   ~21.8ns
> sequential_get_24:   ~45.2ns
> sequential_get_32:   ~69.7ns
> sequential_get_100:  ~153.3ns
> sequential_get_1000: ~2300ns
>
> Clearly demonstrating a cliff.
>
> When running the benchmarks it may be necessary to bump the 'open files'
> ulimit for a successful run.
>
> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
> ---
> Changelog:
>
> v1 -> v2:
>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>     configurable # of maps (Andrii)
>   * Add hashmap benchmark (Alexei)
>   * Add discussion of overhead
>

[...]

> +
> +/* Keep in sync w/ array of maps in bpf */
> +#define MAX_NR_MAPS 1000
> +/* Keep in sync w/ number of progs in bpf app */
> +#define MAX_NR_PROGS 20
> +
> +static struct {
> +	void (*destroy_skel)(void *obj);
> +	int (*load_skel)(void *obj);
> +	long *important_hits;
> +	long *hits;
> +	void *progs;
> +	void *skel;
> +	struct bpf_map *array_of_maps;
> +} ctx;
> +
> +int created_maps[MAX_NR_MAPS];
> +struct bpf_link *attached_links[MAX_NR_PROGS];
> +

static?

> +static void teardown(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_NR_PROGS; i++) {
> +		if (!attached_links[i])
> +			break;
> +		bpf_link__detach(attached_links[i]);
> +	}
> +
> +	if (ctx.destroy_skel && ctx.skel)
> +		ctx.destroy_skel(ctx.skel);
> +
> +	for (i = 0; i < MAX_NR_MAPS; i++) {
> +		if (!created_maps[i])
> +			break;
> +		close(created_maps[i]);
> +	}
> +}
> +

Wouldn't all this be cleaned up on bench exit anyway?.. We've been less
strict about proper cleanup for bench to keep code simpler.

[...]

> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> new file mode 100644
> index 000000000000..b5e358dee245
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> +	__uint(max_entries, 1000);
> +	__type(key, int);
> +	__type(value, int);
> +} array_of_maps SEC(".maps");

You don't need setup_inner_map_and_load and load_btf, you can just
declaratively have two ARRAY_OF_MAPS, one using an inner hashmap and
another using an inner task_local_storage. Grep for __array in selftests
to see how to declaratively define an inner map prototype, e.g., see
test_ringbuf_multi.c. With the below suggestion one of the do_lookup
flavors will use array_of_hashes and the other will use array_of_storages
explicitly. From user-space you can create and set up as many inner maps
as needed. If you need btf_id for value_type_id for the inner map, see
if bpf_map__inner_map() would be useful.
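Something along these lines, as an untested sketch (map names, sizes and
value types here are only illustrative, reusing the array_of_hashes /
array_of_storages naming from above):

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	/* inner map prototype: task local storage */
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, int);
	});
} array_of_storages SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	/* inner map prototype: plain hashmap for the control benchmark */
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 1);
		__type(key, int);
		__type(value, int);
	});
} array_of_hashes SEC(".maps");

User space can still create and populate however many actual inner maps a
given run needs and plug them into the relevant outer array.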
> +
> +long important_hits;
> +long hits;
> +
> +#ifdef LOOKUP_HASHMAP

Why #ifdef when you can have do_lookup_hashmap and do_lookup_task_storage
and choose which one to call using a read-only variable:

  const volatile bool use_hashmap;

Just set it before load and the verifier will know that one of the
do_lookup flavors is dead code (see the sketch at the bottom of this
mail).

> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> +{
> +	void *map;
> +	int zero = 0;
> +
> +	map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +	if (!map)
> +		return -1;
> +
> +	bpf_map_lookup_elem(map, &zero);
> +	__sync_add_and_fetch(&hits, 1);
> +	if (!elem)
> +		__sync_add_and_fetch(&important_hits, 1);
> +	return 0;
> +}
> +#else
> +static int do_lookup(unsigned int elem, struct task_struct *task)
> +{
> +	void *map;
> +
> +	map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +	if (!map)
> +		return -1;
> +
> +	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	__sync_add_and_fetch(&hits, 1);
> +	if (!elem)
> +		__sync_add_and_fetch(&important_hits, 1);
> +	return 0;
> +}
> +#endif /* LOOKUP_HASHMAP */
> +
> +#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave)	\
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")				\
> +int get_local_ ## start(void *ctx)					\
> +{									\
> +	struct task_struct *task;					\
> +	unsigned int i, elem;						\
> +	void *map;							\
> +									\
> +	task = bpf_get_current_task_btf();				\
> +	for (i = 0; i < 50; i++) {					\

I'm trying to understand why you didn't just do

  for (i = 0; i < 1000; i++) { ... }

and avoid all the macro stuff? What didn't work?

> +		elem = start + i;					\
> +		if (do_lookup(elem, task))				\
> +			return 0;					\
> +		if (interleave && i % 3 == 0)				\

nit: % 3 will be slow(-ish), why not pick some power of 2?

> +			do_lookup(0, task);				\
> +	}								\
> +	return 0;							\
> +}
> +
> +#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \
> +	__TASK_STORAGE_GET_LOOP_PROG(array, start, false)
> +#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \
> +	__TASK_STORAGE_GET_LOOP_PROG(array, start, true)

[...]

> +
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
> +

All these macros make me sad, I'd like to understand why it has to be
done this way.

> +char _license[] SEC("license") = "GPL";
> --
> 2.30.2
>
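P.S. To make the const volatile suggestion above concrete, here's an
untested sketch. It assumes the two declaratively-defined outer maps are
named array_of_hashes and array_of_storages as suggested earlier; the
helper names are mine, not from the patch:

const volatile bool use_hashmap;

static int do_lookup_hashmap(unsigned int elem)
{
	void *map;
	int zero = 0;

	map = bpf_map_lookup_elem(&array_of_hashes, &elem);
	if (!map)
		return -1;

	bpf_map_lookup_elem(map, &zero);
	return 0;
}

static int do_lookup_task_storage(unsigned int elem, struct task_struct *task)
{
	void *map;

	map = bpf_map_lookup_elem(&array_of_storages, &elem);
	if (!map)
		return -1;

	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	return 0;
}

static int do_lookup(unsigned int elem, struct task_struct *task)
{
	int err;

	/* use_hashmap lives in .rodata, so the verifier sees a constant
	 * here and the flavor that isn't selected becomes dead code
	 */
	if (use_hashmap)
		err = do_lookup_hashmap(elem);
	else
		err = do_lookup_task_storage(elem, task);

	if (err)
		return err;

	__sync_add_and_fetch(&hits, 1);
	if (!elem)
		__sync_add_and_fetch(&important_hits, 1);
	return 0;
}

User space would then set skel->rodata->use_hashmap before loading the
skeleton, instead of building two variants of the program with #ifdef.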