On 5/23/22 5:11 PM, Andrii Nakryiko wrote: > On Fri, May 20, 2022 at 10:00 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote: >> >> Add a benchmarks to demonstrate the performance cliff for local_storage >> get as the number of local_storage maps increases beyond current >> local_storage implementation's cache size. >> >> "sequential get" and "interleaved get" benchmarks are added, both of >> which do many bpf_task_storage_get calls on sets of task local_storage >> maps of various counts, while considering a single specific map to be >> 'important' and counting task_storage_gets to the important map >> separately in addition to normal 'hits' count of all gets. Goal here is >> to mimic scenario where a particular program using one map - the >> important one - is running on a system where many other local_storage >> maps exist and are accessed often. >> >> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1, >> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4 >> bpf_task_storage_gets for the important map for every 10 map gets. This >> is meant to highlight performance differences when important map is >> accessed far more frequently than non-important maps. >> >> A "hashmap control" benchmark is also included for easy comparison of >> standard bpf hashmap lookup vs local_storage get. The benchmark is >> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH >> instead of local storage. >> >> Addition of this benchmark is inspired by conversation with Alexei in a >> previous patchset's thread [0], which highlighted the need for such a >> benchmark to motivate and validate improvements to local_storage >> implementation. My approach in that series focused on improving >> performance for explicitly-marked 'important' maps and was rejected >> with feedback to make more generally-applicable improvements while >> avoiding explicitly marking maps as important. Thus the benchmark >> reports both general and important-map-focused metrics, so effect of >> future work on both is clear. >> >> Regarding the benchmark results. On a powerful system (Skylake, 20 >> cores, 256gb ram): >> >> Local Storage >> ============= >> Hashmap Control w/ 500 maps >> hashmap (control) sequential get: hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s >> >> num_maps: 1 >> local_storage cache sequential get: hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s >> local_storage cache interleaved get: hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s > > this is huge drop in performance for num_maps is due to syscall and > fentry overhead, is that right? How about making each syscall > invocation do (roughly) the same amount of map/storage lookups per > invocation to neutralize this overhead? Something like: > > > const volatile int map_cnt; > const volatile int iter_cnt; > > ... > > > for (i = 0; i < iter_cnt; i++) { > int map_idx = i % map_cnt; > > do_lookup(map_idx, task); > > ... > > } > > User-space can calculate iter_cnt to be closest exact multiple of > map_cnt or you can just hard-code iter_cnt to fixed number (something > like 10000 or some high enough value) and just leave with slightly > uneven pattern for last round of looping. > > > But this way you make syscall/fentry overhead essentially fixed, which > will avoid these counter-intuitive numbers. > This worked! 
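Roughly, the BPF side of the sequential-get flavor ends up shaped like the sketch below (names, sizes, and the map definition here are illustrative only, not the exact v4 program; the interleaved variant just adds extra lookups of map 0, and userspace still creates the inner maps, populates the outer array, triggers the getpgid syscall in a loop, and reads the hits/important_hits counters once a second as before):

/* Rough sketch only; details differ from the actual v4 benchmark prog. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* set from userspace before load */
const volatile unsigned int iter_cnt = 10000;
const volatile unsigned int num_maps = 1;

long hits;
long important_hits;

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	/* inner map template; userspace creates the actual inner maps
	 * and fills the outer array before the benchmark starts
	 */
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, int);
	});
} array_of_maps SEC(".maps");

static int lookup_cb(__u32 i, void *unused)
{
	struct task_struct *task = bpf_get_current_task_btf();
	unsigned int map_idx = i % num_maps;
	void *map;

	map = bpf_map_lookup_elem(&array_of_maps, &map_idx);
	if (!map)
		return 1;	/* nonzero return stops bpf_loop early */

	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	__sync_add_and_fetch(&hits, 1);
	if (!map_idx)
		__sync_add_and_fetch(&important_hits, 1);
	return 0;
}

SEC("fentry/__x64_sys_getpgid")	/* x86-64; the real prog uses SYS_PREFIX */
int get_local(void *ctx)
{
	/* fixed amount of lookups per syscall, so syscall/fentry
	 * overhead is roughly constant regardless of num_maps
	 */
	bpf_loop(iter_cnt, lookup_cb, NULL, 0);
	return 0;
}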
But had to use bpf_loop as verifier didn't like various attempts to loop 10k times. Overhead is much more consistent now (see v4 for numbers). >> >> num_maps: 10 >> local_storage cache sequential get: hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s >> local_storage cache interleaved get: hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s >> >> num_maps: 16 >> local_storage cache sequential get: hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s >> local_storage cache interleaved get: hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s >> >> num_maps: 17 >> local_storage cache sequential get: hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s >> local_storage cache interleaved get: hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s >> >> num_maps: 24 >> local_storage cache sequential get: hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s >> local_storage cache interleaved get: hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s >> >> num_maps: 32 >> local_storage cache sequential get: hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s >> local_storage cache interleaved get: hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s >> >> num_maps: 100 >> local_storage cache sequential get: hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s >> local_storage cache interleaved get: hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s >> >> num_maps: 1000 >> local_storage cache sequential get: hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s >> local_storage cache interleaved get: hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s >> >> Looking at the "sequential get" results, it's clear that as the >> number of task local_storage maps grows beyond the current cache size >> (16), there's a significant reduction in hits throughput. Note that >> current local_storage implementation assigns a cache_idx to maps as they >> are created. Since "sequential get" is creating maps 0..n in order and >> then doing bpf_task_storage_get calls in the same order, the benchmark >> is effectively ensuring that a map will not be in cache when the program >> tries to access it. >> >> For "interleaved get" results, important-map hits throughput is greatly >> increased as the important map is more likely to be in cache by virtue >> of being accessed far more frequently. Throughput still reduces as # >> maps increases, though. >> >> As evidenced by the unintuitive-looking results for smaller num_maps >> benchmark runs, overhead which is amortized across larger num_maps runs >> dominates when there are fewer maps. 
To get a sense of the overhead, I >> commented out bpf_task_storage_get/bpf_map_lookup_elem in >> local_storage_bench.h and ran the benchmark on the same host as the >> 'real' run. Results: >> >> Local Storage >> ============= >> Hashmap Control w/ 500 maps >> hashmap (control) sequential get: hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s >> > > [...] > >> >> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are: >> >> hashmap_control: ~6.6ns >> sequential_get_1: ~17.9ns >> sequential_get_10: ~18.9ns >> sequential_get_16: ~19.0ns >> sequential_get_17: ~20.2ns >> sequential_get_24: ~42.2ns >> sequential_get_32: ~68.7ns >> sequential_get_100: ~163.3ns >> sequential_get_1000: ~2200ns >> >> Clearly demonstrating a cliff. >> >> When running the benchmarks it may be necessary to bump 'open files' >> ulimit for a successful run. >> >> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx >> >> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx> >> --- >> Changelog: >> >> v2 -> v3: >> * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit, >> so just use 1 program (Alexei) >> >> v1 -> v2: >> * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly >> configurable # of maps (Andrii) >> * Add hashmap benchmark (Alexei) >> * Add discussion of overhead >> >> tools/testing/selftests/bpf/Makefile | 6 +- >> tools/testing/selftests/bpf/bench.c | 57 +++ >> tools/testing/selftests/bpf/bench.h | 5 + >> .../bpf/benchs/bench_local_storage.c | 332 ++++++++++++++++++ >> .../bpf/benchs/run_bench_local_storage.sh | 21 ++ >> .../selftests/bpf/benchs/run_common.sh | 17 + >> .../selftests/bpf/progs/local_storage_bench.h | 63 ++++ >> .../bpf/progs/local_storage_bench__get_int.c | 12 + >> .../bpf/progs/local_storage_bench__get_seq.c | 12 + >> .../bpf/progs/local_storage_bench__hashmap.c | 13 + >> 10 files changed, 537 insertions(+), 1 deletion(-) >> create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c >> create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh >> create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h >> create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c >> create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c >> create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c >> >> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile >> index 4030dd6cbc34..6095f6af2ad1 100644 >> --- a/tools/testing/selftests/bpf/Makefile >> +++ b/tools/testing/selftests/bpf/Makefile >> @@ -560,6 +560,9 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \ >> $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h >> $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h >> $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h >> +$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \ >> + $(OUTPUT)/local_storage_bench__get_int.skel.h \ >> + $(OUTPUT)/local_storage_bench__hashmap.skel.h > > You really don't need 3 skeletons for this, you can parameterize > everything with 2-3 .rodata variables and have fixed code and single > skeleton header. It will also simplify your setup code, you won't need > need those callbacks that abstract specific skeleton away. Much > cleaner and simpler, IMO. 
> > Please, try to simplify this and make it easier to maintain. > > >> $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ) >> $(OUTPUT)/bench: LDLIBS += -lm >> $(OUTPUT)/bench: $(OUTPUT)/bench.o \ >> @@ -571,7 +574,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \ >> $(OUTPUT)/bench_ringbufs.o \ >> $(OUTPUT)/bench_bloom_filter_map.o \ >> $(OUTPUT)/bench_bpf_loop.o \ >> - $(OUTPUT)/bench_strncmp.o >> + $(OUTPUT)/bench_strncmp.o \ >> + $(OUTPUT)/bench_local_storage.o >> $(call msg,BINARY,,$@) >> $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@ >> >> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c >> index f061cc20e776..71271062f68d 100644 >> --- a/tools/testing/selftests/bpf/bench.c >> +++ b/tools/testing/selftests/bpf/bench.c >> @@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt) >> printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt); >> } >> >> +void local_storage_report_progress(int iter, struct bench_res *res, >> + long delta_ns) >> +{ >> + double important_hits_per_sec, hits_per_sec; >> + double delta_sec = delta_ns / 1000000000.0; >> + >> + hits_per_sec = res->hits / 1000000.0 / delta_sec; >> + important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec; >> + >> + printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0); >> + >> + printf("hits %8.3lfM/s ", hits_per_sec); >> + printf("important_hits %8.3lfM/s\n", important_hits_per_sec); >> +} >> + >> +void local_storage_report_final(struct bench_res res[], int res_cnt) >> +{ >> + double important_hits_mean = 0.0, important_hits_stddev = 0.0; >> + double hits_mean = 0.0, hits_stddev = 0.0; >> + int i; >> + >> + for (i = 0; i < res_cnt; i++) { >> + hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt); >> + important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt); >> + } >> + >> + if (res_cnt > 1) { >> + for (i = 0; i < res_cnt; i++) { >> + hits_stddev += (hits_mean - res[i].hits / 1000000.0) * >> + (hits_mean - res[i].hits / 1000000.0) / >> + (res_cnt - 1.0); >> + important_hits_stddev += >> + (important_hits_mean - res[i].important_hits / 1000000.0) * >> + (important_hits_mean - res[i].important_hits / 1000000.0) / >> + (res_cnt - 1.0); >> + } >> + >> + hits_stddev = sqrt(hits_stddev); >> + important_hits_stddev = sqrt(important_hits_stddev); >> + } >> + printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ", >> + hits_mean, hits_stddev); >> + printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean); >> + printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n", >> + important_hits_mean, important_hits_stddev); >> +} >> + > > We have hits_drops_report_progress/hits_drops_report_final which uses > "hit" and "drop" terminology (admittedly confusing for this set of > benchmarks), but if you ignore the "drop" part, it's exactly what you > need - to track two independent values (in your case hit and important > hit). You'll get rid of a good chunk of repetitive code with some > statistics in it. You post-processing scripts will further hide this > detail. > I considered doing this when working on v1 of the patch, but decided against it because 'drop' and 'hit' are independent while 'important_hit' and 'hit' are not. Every important_hit is also a hit. The repetitive mean / stddev calculations are annoying, though. Added a 2nd commit to v4 which refactors them. 
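The shared helper is roughly of this shape (a hypothetical sketch; the actual commit may structure it differently):

#include <math.h>	/* sqrt(); bench links with -lm already */

/* compute mean and sample stddev of a series, reported in M ops/s
 * like the existing *_report_final() functions
 */
static void values_mean_stddev(long *vals, int cnt,
			       double *mean, double *stddev)
{
	double m = 0.0, sd = 0.0;
	int i;

	for (i = 0; i < cnt; i++)
		m += vals[i] / 1000000.0 / cnt;

	if (cnt > 1) {
		for (i = 0; i < cnt; i++)
			sd += (m - vals[i] / 1000000.0) *
			      (m - vals[i] / 1000000.0) / (cnt - 1.0);
		sd = sqrt(sd);
	}

	*mean = m;
	*stddev = sd;
}

The report_final() functions would then just copy the relevant res[] field (res[i].hits, res[i].important_hits, ...) into a scratch array and call this, instead of open-coding the math per field.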
>> const char *argp_program_version = "benchmark"; >> const char *argp_program_bug_address = "<bpf@xxxxxxxxxxxxxxx>"; >> const char argp_program_doc[] = >> @@ -188,12 +235,14 @@ static const struct argp_option opts[] = { >> extern struct argp bench_ringbufs_argp; >> extern struct argp bench_bloom_map_argp; >> extern struct argp bench_bpf_loop_argp; >> +extern struct argp bench_local_storage_argp; >> extern struct argp bench_strncmp_argp; >> > > [...] > >> + >> +static int setup_inner_map_and_load(int inner_fd) >> +{ >> + int err, mim_fd; >> + >> + err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd); >> + if (err) >> + return -1; >> + >> + err = ctx.load_skel(ctx.skel); >> + if (err) >> + return -1; >> + >> + mim_fd = bpf_map__fd(ctx.array_of_maps); >> + if (mim_fd < 0) >> + return -1; >> + >> + return mim_fd; >> +} >> + >> +static int load_btf(void) >> +{ >> + static const char btf_str_sec[] = "\0"; >> + __u32 btf_raw_types[] = { >> + /* int */ >> + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */ >> + }; >> + struct btf_header btf_hdr = { >> + .magic = BTF_MAGIC, >> + .version = BTF_VERSION, >> + .hdr_len = sizeof(struct btf_header), >> + .type_len = sizeof(btf_raw_types), >> + .str_off = sizeof(btf_raw_types), >> + .str_len = sizeof(btf_str_sec), >> + }; >> + __u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) + >> + sizeof(btf_str_sec)]; >> + >> + memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr)); >> + memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types)); >> + memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types), >> + btf_str_sec, sizeof(btf_str_sec)); >> + >> + return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL); >> +} >> + > > please try using declarative map-in-map definition, hopefully it > doesn't influence benchmark results. It will allow to avoid this > low-level setup code completely. > Forgot to call this out in v4 changelog, but made this change. It was more frustrating than expected, though, as it was necessary to grab btf_key_type_id / value_type_id from the inner map still, and the helpers would only return valid type_id if called before prog load. >> +static void __setup(struct bpf_program *prog, bool hashmap) >> +{ >> + int i, fd, mim_fd, err; >> + int btf_fd = 0; >> + >> + LIBBPF_OPTS(bpf_map_create_opts, create_opts); >> + > > [...] > >> + >> +static void measure(struct bench_res *res) >> +{ >> + if (ctx.hits) >> + res->hits = atomic_swap(ctx.hits, 0); >> + if (ctx.important_hits) > > why these ifs? just swap, measure is called once a second, there is no > need to optimize this > >> + res->important_hits = atomic_swap(ctx.important_hits, 0); >> +} >> + >> +static inline void trigger_bpf_program(void) >> +{ >> + syscall(__NR_getpgid); >> +} >> + > > [...] > >> +#ifdef LOOKUP_HASHMAP >> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */) >> +{ >> + void *map; >> + int zero = 0; >> + >> + map = bpf_map_lookup_elem(&array_of_maps, &elem); >> + if (!map) >> + return -1; >> + >> + bpf_map_lookup_elem(map, &zero); > > shouldn't you use elem here as well to make it a bit more in line with > bpf_task_storage_get()? This fixed zero is too optimistic and > minimizes CPU cache usage, skewing results towards hashmap. It's > cheaper to go access same location in hashmap over and over again, vs > randomly jumping over N elements > Both hashmap and local_storage benchmarks are always grabbing key 0 from the map. v4 makes this more clear. 
The intent is to measure how long it takes to access the local_storage map itself, not a random element within it. Every other comment here that wasn't responded to directly was addressed in v4. >> + __sync_add_and_fetch(&hits, 1); >> + if (!elem) >> + __sync_add_and_fetch(&important_hits, 1); >> + return 0; >> +} >> +#else >> +static int do_lookup(unsigned int elem, struct task_struct *task) >> +{ >> + void *map; >> + >> + map = bpf_map_lookup_elem(&array_of_maps, &elem); >> + if (!map) >> + return -1; >> + >> + bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE); >> + __sync_add_and_fetch(&hits, 1); >> + if (!elem) >> + __sync_add_and_fetch(&important_hits, 1); >> + return 0; >> +} >> +#endif /* LOOKUP_HASHMAP */ >> + >> +#define TASK_STORAGE_GET_LOOP_PROG(interleave) \ >> +SEC("fentry/" SYS_PREFIX "sys_getpgid") \ >> +int get_local(void *ctx) \ >> +{ \ >> + struct task_struct *task; \ >> + unsigned int i; \ >> + void *map; \ >> + \ >> + task = bpf_get_current_task_btf(); \ >> + for (i = 0; i < 1000; i++) { \ >> + if (do_lookup(i, task)) \ >> + return 0; \ >> + if (interleave && i % 3 == 0) \ >> + do_lookup(0, task); \ >> + } \ >> + return 0; \ >> +} > > I think > > > const volatile use_local_storage; /* set from user-space */ > > > if (use_local_storage) { > do_lookup_storage() > } else { > do_lookup_hashmap() > } > > is as clear (if not clearer) than having three separate skeletons > built from single #include header parameterized by extra #defines. > >> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c >> new file mode 100644 > > [...]
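For completeness, fleshing out the const volatile suggestion above: a single-program do_lookup() could look roughly like the sketch below (hypothetical, not the actual v4 code; it assumes two outer ARRAY_OF_MAPS maps, array_of_hash_maps wrapping BPF_MAP_TYPE_HASH inner maps and array_of_local_storage_maps wrapping BPF_MAP_TYPE_TASK_STORAGE inner maps, plus the hits/important_hits globals quoted above):

/* set from userspace through the skeleton's rodata before load */
const volatile bool use_hashmap;

static int do_lookup(unsigned int elem, struct task_struct *task)
{
	void *map;
	int zero = 0;

	if (use_hashmap) {
		map = bpf_map_lookup_elem(&array_of_hash_maps, &elem);
		if (!map)
			return -1;
		/* always key 0: measure map access, not element spread */
		bpf_map_lookup_elem(map, &zero);
	} else {
		map = bpf_map_lookup_elem(&array_of_local_storage_maps, &elem);
		if (!map)
			return -1;
		bpf_task_storage_get(map, task, 0,
				     BPF_LOCAL_STORAGE_GET_F_CREATE);
	}

	__sync_add_and_fetch(&hits, 1);
	if (!elem)
		__sync_add_and_fetch(&important_hits, 1);
	return 0;
}

Userspace then sets use_hashmap (plus knobs like num_maps and an interleave flag) via skel->rodata before load, so one skeleton header can cover the hashmap-control, sequential, and interleaved flavors, and the verifier prunes the dead branch.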