On 5/20/22 8:41 PM, Andrii Nakryiko wrote:
> On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>>
>> Add benchmarks to demonstrate the performance cliff for local_storage
>> get as the number of local_storage maps increases beyond the current
>> local_storage implementation's cache size.
>>
>> "sequential get" and "interleaved get" benchmarks are added, both of
>> which do many bpf_task_storage_get calls on sets of task local_storage
>> maps of various counts, while considering a single specific map to be
>> 'important' and counting task_storage_gets to the important map
>> separately in addition to the normal 'hits' count of all gets. The goal
>> here is to mimic a scenario where a particular program using one map -
>> the important one - is running on a system where many other
>> local_storage maps exist and are accessed often.
>>
>> While the "sequential get" benchmark does bpf_task_storage_get for map
>> 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
>> interleaves 4 bpf_task_storage_gets for the important map for every 10
>> map gets. This is meant to highlight performance differences when the
>> important map is accessed far more frequently than non-important maps.
>>
>> A "hashmap control" benchmark is also included for easy comparison of
>> standard bpf hashmap lookup vs local_storage get. The benchmark is
>> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
>> instead of local storage.
>>
>> Addition of this benchmark is inspired by conversation with Alexei in a
>> previous patchset's thread [0], which highlighted the need for such a
>> benchmark to motivate and validate improvements to the local_storage
>> implementation. My approach in that series focused on improving
>> performance for explicitly-marked 'important' maps and was rejected
>> with feedback to make more generally-applicable improvements while
>> avoiding explicitly marking maps as important. Thus the benchmark
>> reports both general and important-map-focused metrics, so the effect
>> of future work on both is clear.
>>
>> Regarding the benchmark results. On a powerful system (Skylake, 20
>> cores, 256GB RAM):
>>
>> Local Storage
>> =============
>
> [...]
>
>>
>> Looking at the "sequential get" results, it's clear that as the
>> number of task local_storage maps grows beyond the current cache size
>> (16), there's a significant reduction in hits throughput. Note that the
>> current local_storage implementation assigns a cache_idx to maps as
>> they are created. Since "sequential get" is creating maps 0..n in order
>> and then doing bpf_task_storage_get calls in the same order, the
>> benchmark is effectively ensuring that a map will not be in cache when
>> the program tries to access it.
>>
>> For the "interleaved get" results, important-map hits throughput is
>> greatly increased as the important map is more likely to be in cache by
>> virtue of being accessed far more frequently. Throughput still drops as
>> the number of maps increases, though.
>>
>> Note that the test programs need to split task_storage_get calls across
>> multiple programs to work around the verifier's MAX_USED_MAPS
>> limitations. As evidenced by the unintuitive-looking results for
>> smaller num_maps benchmark runs, overhead which is amortized across
>> larger num_maps in other runs dominates when there are fewer maps. To
>> get a sense of the overhead, I commented out
>> bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and
>> ran the benchmark on the same host as the 'real' run. Results:
>>
>> Local Storage
>> =============
>
> [...]
>
>>
>> Adjusting for overhead, latency numbers for "hashmap control" and
>> "sequential get" are:
>>
>>   hashmap_control:     ~6.8ns
>>   sequential_get_1:    ~15.5ns
>>   sequential_get_10:   ~20ns
>>   sequential_get_16:   ~17.8ns
>>   sequential_get_17:   ~21.8ns
>>   sequential_get_24:   ~45.2ns
>>   sequential_get_32:   ~69.7ns
>>   sequential_get_100:  ~153.3ns
>>   sequential_get_1000: ~2300ns
>>
>> This clearly demonstrates a cliff.
>>
>> When running the benchmarks it may be necessary to bump the 'open
>> files' ulimit for a successful run.
>>
>> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
>> ---
>> Changelog:
>>
>> v1 -> v2:
>> * Adopt ARRAY_OF_MAPS approach for bpf program, allowing a truly
>>   configurable # of maps (Andrii)
>> * Add hashmap benchmark (Alexei)
>> * Add discussion of overhead
>>
>
> [...]
>
>> +
>> +/* Keep in sync w/ array of maps in bpf */
>> +#define MAX_NR_MAPS 1000
>> +/* Keep in sync w/ number of progs in bpf app */
>> +#define MAX_NR_PROGS 20
>> +
>> +static struct {
>> +	void (*destroy_skel)(void *obj);
>> +	int (*load_skel)(void *obj);
>> +	long *important_hits;
>> +	long *hits;
>> +	void *progs;
>> +	void *skel;
>> +	struct bpf_map *array_of_maps;
>> +} ctx;
>> +
>> +int created_maps[MAX_NR_MAPS];
>> +struct bpf_link *attached_links[MAX_NR_PROGS];
>> +
>
> static?

I sent v3 before seeing this email, but luckily most of your comments are
addressed already. attached_links is removed in v3, and created_maps is
moved to ctx.

>
>> +static void teardown(void)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < MAX_NR_PROGS; i++) {
>> +		if (!attached_links[i])
>> +			break;
>> +		bpf_link__detach(attached_links[i]);
>> +	}
>> +
>> +	if (ctx.destroy_skel && ctx.skel)
>> +		ctx.destroy_skel(ctx.skel);
>> +
>> +	for (i = 0; i < MAX_NR_MAPS; i++) {
>> +		if (!created_maps[i])
>> +			break;
>> +		close(created_maps[i]);
>> +	}
>> +}
>> +
>
> Wouldn't all this be cleaned up on bench exit anyway?.. We've been
> less strict about proper clean up for bench to keep code simpler.

It's important to explicitly clean up created_maps because local_storage
maps are assigned a cache slot when the map is created, and a count of
"how many maps are assigned to this cache slot" is incremented. On map
free the count is decremented. So the cache behavior of subsequently
allocated maps can be affected if these maps are kept around. It's not a
big deal now, since just one bench is run and the process exits, but if
that changes in the future I don't want the benchmark to silently give
odd results.

>
> [...]
>
>> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
>> new file mode 100644
>> index 000000000000..b5e358dee245
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
>> @@ -0,0 +1,69 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
>> +	__uint(max_entries, 1000);
>> +	__type(key, int);
>> +	__type(value, int);
>> +} array_of_maps SEC(".maps");
>
> you don't need setup_inner_map_and_load and load_btf, you can just
> declaratively have two ARRAY_OF_MAPS, one using inner hashmap and
> another using inner task_local_storage. Grep for __array in selftests
> to see how to declaratively define inner map prototype, e.g., see
> test_ringbuf_multi.c. With the below suggestion one of do_lookup
> flavors will use array_of_hashes and another will use
> array_of_storages explicitly. From user-space you can create and setup
> as many inner maps as needed. If you need btf_id for value_type_id for
> inner map, see if bpf_map__inner_map() would be useful.

Declaratively specifying an inner task_local_storage map will result in
libbpf creating such a map to pass as the inner_map_fd, no? That will have
the same effect on cache_idx assignment as described in my previous
comment.
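
For reference, the declarative inner-map prototype being discussed would
look roughly like the sketch below. This is illustrative only, following
the __array(values, ...) pattern from test_ringbuf_multi.c; the map names
and value type are assumptions, not code from any version of the patch:

/* Illustrative sketch: ARRAY_OF_MAPS with a declaratively-defined inner
 * task_local_storage prototype, so no setup_inner_map_and_load/load_btf
 * step is needed. Names here are hypothetical.
 */
struct storage_map {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, int);
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__type(key, int);
	__array(values, struct storage_map);
} array_of_storages SEC(".maps");

A second ARRAY_OF_MAPS with an inner BPF_MAP_TYPE_HASH prototype (the
array_of_hashes Andrii mentions) would be defined the same way.
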
>> +
>> +long important_hits;
>> +long hits;
>> +
>> +#ifdef LOOKUP_HASHMAP
>
> why #ifdef'ing if you can have do_lookup_hashmap and
> do_lookup_task_storage and choose which one to call using read-only
> variable:
>
> const volatile bool use_hashmap;
>
> just set it before load and verifier will know that one of do_lookup
> flavors is dead code

I want it to be obvious that the hashmap part of the benchmark is not a
flavor of local_storage; the distinct separation of do_lookup flavors here
makes this easier to notice. I can do a v4 addressing this if you feel
strongly about it.
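
For reference, the read-only-flag approach Andrii describes might look
roughly like the sketch below. The array_of_hashes/array_of_storages names
come from his earlier comment; the rest is an illustrative sketch, not
code from any version of the patch:

/* Illustrative sketch: a single do_lookup() selected by a read-only flag.
 * Because use_hashmap lives in rodata, its value is known at load time
 * and the verifier treats the untaken branch as dead code, much like the
 * #ifdef split above.
 */
const volatile bool use_hashmap;

static int do_lookup(unsigned int elem, struct task_struct *task)
{
	void *map;
	int zero = 0;

	if (use_hashmap) {
		map = bpf_map_lookup_elem(&array_of_hashes, &elem);
		if (!map)
			return -1;
		bpf_map_lookup_elem(map, &zero);
	} else {
		map = bpf_map_lookup_elem(&array_of_storages, &elem);
		if (!map)
			return -1;
		bpf_task_storage_get(map, task, 0,
				     BPF_LOCAL_STORAGE_GET_F_CREATE);
	}

	__sync_add_and_fetch(&hits, 1);
	if (!elem)
		__sync_add_and_fetch(&important_hits, 1);
	return 0;
}

User space would set the flag through the skeleton's read-only data before
load, e.g. skel->rodata->use_hashmap = true.
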
>
>> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
>> +{
>> +	void *map;
>> +	int zero = 0;
>> +
>> +	map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +	if (!map)
>> +		return -1;
>> +
>> +	bpf_map_lookup_elem(map, &zero);
>> +	__sync_add_and_fetch(&hits, 1);
>> +	if (!elem)
>> +		__sync_add_and_fetch(&important_hits, 1);
>> +	return 0;
>> +}
>> +#else
>> +static int do_lookup(unsigned int elem, struct task_struct *task)
>> +{
>> +	void *map;
>> +
>> +	map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +	if (!map)
>> +		return -1;
>> +
>> +	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	__sync_add_and_fetch(&hits, 1);
>> +	if (!elem)
>> +		__sync_add_and_fetch(&important_hits, 1);
>> +	return 0;
>> +}
>> +#endif /* LOOKUP_HASHMAP */
>> +
>> +#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave) \
>> +SEC("fentry/" SYS_PREFIX "sys_getpgid") \
>> +int get_local_ ## start(void *ctx) \
>> +{ \
>> +	struct task_struct *task; \
>> +	unsigned int i, elem; \
>> +	void *map; \
>> + \
>> +	task = bpf_get_current_task_btf(); \
>> +	for (i = 0; i < 50; i++) { \
>
> I'm trying to understand why you didn't just do
>
> for (i = 0; i < 1000; i++) { ... }
>
> and avoid all the macro stuff? what didn't work?
>
>> +		elem = start + i; \
>> +		if (do_lookup(elem, task)) \
>> +			return 0; \
>> +		if (interleave && i % 3 == 0) \
>
> nit: % 3 will be slow(-ish), why not pick some power of 2?

A few reasons:

1) This results in a ratio of "get important map" to "get any other map"
   closest to v1 of this patchset, and a similar interleaving pattern.
   That made it easy to compare results after the big refactor and be
   reasonably sure I didn't break the benchmark.

2) The current local_storage cache has 16 entries, and the current cache
   slot assignment algorithm will always give the important map cache_idx
   0 since it's created first. The second created map - idx 1 in the
   ARRAY_OF_MAPS - will get cache_idx 1, etc., so for this benchmark
   cache_idx = map_of_maps_idx % 16 (assuming no other task_local_storage
   maps have been created on the system).

   We want to pick a small number because in the 'important map' scenario
   the important map is accessed significantly more often than other
   maps, so < 16. Considering the current implementation with fixed
   cache_idx, if we pick a power of 2 that's < 16, there will always be
   the same 'gap' between an important-map get and the gets of other maps
   with the same cache_idx. We care about those maps specifically since
   they'll be knocking each other out of the cache slot. If an odd number
   is used the gaps will vary, resulting in a benchmark that more closely
   mimics "a bunch of unrelated progs accessing maps in arbitrary order,
   with one important prog accessing its map very frequently".

   Probably moving to bpf_get_prandom, or some userspace nondeterminism
   feeding a list of indices at which to interleave gets of the important
   map, is the best solution, but I'm hoping to keep things simple for
   now.

>
>> +			do_lookup(0, task); \
>> +	} \
>> +	return 0; \
>> +}
>> +
>> +#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \
>> +	__TASK_STORAGE_GET_LOOP_PROG(array, start, false)
>> +#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \
>> +	__TASK_STORAGE_GET_LOOP_PROG(array, start, true)
>
> [...]
>
>> +
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
>> +
>
> all these macros make me sad, I'd like to understand why it has to be
> done this way

The macro and "why not i < 1000" are addressed in v3. v1 of this patch -
before your ARRAY_OF_MAPS suggestion - was hitting the MAX_USED_MAPS
limit and failing verification when trying to access all 1k maps in a
single program. I assumed that accessing maps from within an
ARRAY_OF_MAPS would trigger a similar limit, but this is not the case.

>> +char _license[] SEC("license") = "GPL";
>> --
>> 2.30.2
>>
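
For reference, the single-program loop discussed above - which Dave says
v3 moves to - might look roughly like the sketch below. The 1000 bound,
the program name, and the read-only interleave flag are assumptions based
on this thread, not the actual v3 code:

/* Illustrative sketch: one program iterating over all inner maps instead
 * of stamping out per-range programs with macros. do_lookup() is the
 * helper from the patch above; 'interleave' is assumed to become a
 * read-only flag set before load rather than a macro parameter.
 */
const volatile bool interleave;

SEC("fentry/" SYS_PREFIX "sys_getpgid")
int get_local(void *ctx)
{
	struct task_struct *task;
	unsigned int i;

	task = bpf_get_current_task_btf();
	for (i = 0; i < 1000; i++) {
		if (do_lookup(i, task))
			return 0;
		if (interleave && i % 3 == 0)
			do_lookup(0, task);
	}
	return 0;
}

As Dave notes, accessing the inner maps through the ARRAY_OF_MAPS does not
run into the MAX_USED_MAPS limit that motivated splitting the gets across
many programs in v1.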