Add a benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond current local_storage implementation's cache size. "sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to normal 'hits' count of all gets. Goal here is to mimic scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often. While "sequential get" benchmark does bpf_task_storage_get for map 0, 1, ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when important map is accessed far more frequently than non-important maps. A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. The benchmark is identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Addition of this benchmark is inspired by conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so effect of future work on both is clear. Regarding the benchmark results. On a powerful system (Skylake, 20 cores, 256gb ram): Local Storage ============= Hashmap Control w/ 500 maps hashmap (control) sequential get: hits throughput: 64.689 ± 2.806 M ops/s, hits latency: 15.459 ns/op, important_hits throughput: 0.129 ± 0.006 M ops/s num_maps: 1 local_storage cache sequential get: hits throughput: 3.793 ± 0.101 M ops/s, hits latency: 263.623 ns/op, important_hits throughput: 3.793 ± 0.101 M ops/s local_storage cache interleaved get: hits throughput: 6.539 ± 0.209 M ops/s, hits latency: 152.938 ns/op, important_hits throughput: 6.539 ± 0.209 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 20.237 ± 0.439 M ops/s, hits latency: 49.415 ns/op, important_hits throughput: 2.024 ± 0.044 M ops/s local_storage cache interleaved get: hits throughput: 22.421 ± 0.874 M ops/s, hits latency: 44.601 ns/op, important_hits throughput: 8.007 ± 0.312 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 25.582 ± 0.346 M ops/s, hits latency: 39.090 ns/op, important_hits throughput: 1.599 ± 0.022 M ops/s local_storage cache interleaved get: hits throughput: 26.615 ± 0.601 M ops/s, hits latency: 37.573 ns/op, important_hits throughput: 8.468 ± 0.191 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 22.932 ± 0.436 M ops/s, hits latency: 43.606 ns/op, important_hits throughput: 1.349 ± 0.026 M ops/s local_storage cache interleaved get: hits throughput: 24.005 ± 0.462 M ops/s, hits latency: 41.658 ns/op, important_hits throughput: 7.306 ± 0.140 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 16.025 ± 0.821 M ops/s, hits latency: 62.402 ns/op, important_hits throughput: 0.668 ± 0.034 M ops/s local_storage cache interleaved get: hits throughput: 17.691 ± 0.744 M ops/s, hits latency: 56.526 ns/op, important_hits throughput: 4.976 ± 0.209 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 11.865 ± 0.180 M ops/s, hits latency: 84.279 ns/op, important_hits throughput: 0.371 ± 0.006 M ops/s local_storage cache interleaved get: hits throughput: 14.383 ± 0.108 M ops/s, hits latency: 69.525 ns/op, important_hits throughput: 4.014 ± 0.030 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 6.105 ± 0.190 M ops/s, hits latency: 163.798 ns/op, important_hits throughput: 0.061 ± 0.002 M ops/s local_storage cache interleaved get: hits throughput: 7.055 ± 0.129 M ops/s, hits latency: 141.746 ns/op, important_hits throughput: 1.843 ± 0.034 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 0.433 ± 0.010 M ops/s, hits latency: 2309.469 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s local_storage cache interleaved get: hits throughput: 0.499 ± 0.026 M ops/s, hits latency: 2002.510 ns/op, important_hits throughput: 0.127 ± 0.007 M ops/s Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" is creating maps 0..n in order and then doing bpf_task_storage_get calls in the same order, the benchmark is effectively ensuring that a map will not be in cache when the program tries to access it. For "interleaved get" results, important-map hits throughput is greatly increased as the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still reduces as # maps increases, though. Note that the test programs need to split task_storage_get calls across multiple programs to work around the verifier's MAX_USED_MAPS limitations. As evidenced by the unintuitive-looking results for smaller num_maps benchmark runs, overhead which is amortized across larger num_maps in other runs dominates when there are fewer maps. To get a sense of the overhead, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and ran the benchmark on the same host as 'real' run. Results: Local Storage ============= Hashmap Control w/ 500 maps hashmap (control) sequential get: hits throughput: 115.812 ± 2.513 M ops/s, hits latency: 8.635 ns/op, important_hits throughput: 0.232 ± 0.005 M ops/s num_maps: 1 local_storage cache sequential get: hits throughput: 4.031 ± 0.033 M ops/s, hits latency: 248.094 ns/op, important_hits throughput: 4.031 ± 0.033 M ops/s local_storage cache interleaved get: hits throughput: 7.997 ± 0.088 M ops/s, hits latency: 125.046 ns/op, important_hits throughput: 7.997 ± 0.088 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 34.000 ± 0.077 M ops/s, hits latency: 29.412 ns/op, important_hits throughput: 3.400 ± 0.008 M ops/s local_storage cache interleaved get: hits throughput: 37.895 ± 0.670 M ops/s, hits latency: 26.389 ns/op, important_hits throughput: 13.534 ± 0.239 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 46.947 ± 0.283 M ops/s, hits latency: 21.300 ns/op, important_hits throughput: 2.934 ± 0.018 M ops/s local_storage cache interleaved get: hits throughput: 47.301 ± 1.027 M ops/s, hits latency: 21.141 ns/op, important_hits throughput: 15.050 ± 0.327 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 45.871 ± 0.414 M ops/s, hits latency: 21.800 ns/op, important_hits throughput: 2.698 ± 0.024 M ops/s local_storage cache interleaved get: hits throughput: 46.591 ± 1.969 M ops/s, hits latency: 21.463 ns/op, important_hits throughput: 14.180 ± 0.599 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 58.053 ± 1.043 M ops/s, hits latency: 17.226 ns/op, important_hits throughput: 2.419 ± 0.043 M ops/s local_storage cache interleaved get: hits throughput: 58.115 ± 0.377 M ops/s, hits latency: 17.207 ns/op, important_hits throughput: 16.345 ± 0.106 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 68.548 ± 0.820 M ops/s, hits latency: 14.588 ns/op, important_hits throughput: 2.142 ± 0.026 M ops/s local_storage cache interleaved get: hits throughput: 63.015 ± 0.963 M ops/s, hits latency: 15.869 ns/op, important_hits throughput: 17.586 ± 0.269 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 95.375 ± 1.286 M ops/s, hits latency: 10.485 ns/op, important_hits throughput: 0.954 ± 0.013 M ops/s local_storage cache interleaved get: hits throughput: 76.996 ± 2.614 M ops/s, hits latency: 12.988 ns/op, important_hits throughput: 20.111 ± 0.683 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 119.965 ± 1.386 M ops/s, hits latency: 8.336 ns/op, important_hits throughput: 0.120 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 92.665 ± 0.788 M ops/s, hits latency: 10.792 ns/op, important_hits throughput: 23.581 ± 0.200 M ops/s Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are: hashmap_control: ~6.8ns sequential_get_1: ~15.5ns sequential_get_10: ~20ns sequential_get_16: ~17.8ns sequential_get_17: ~21.8ns sequential_get_24: ~45.2ns sequential_get_32: ~69.7ns sequential_get_100: ~153.3ns sequential_get_1000: ~2300ns Clearly demonstrating a cliff. When running the benchmarks it may be necessary to bump 'open files' ulimit for a successful run. [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx> --- Changelog: v1 -> v2: * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly configurable # of maps (Andrii) * Add hashmap benchmark (Alexei) * Add discussion of overhead tools/testing/selftests/bpf/Makefile | 6 +- tools/testing/selftests/bpf/bench.c | 57 +++ tools/testing/selftests/bpf/bench.h | 5 + .../bpf/benchs/bench_local_storage.c | 345 ++++++++++++++++++ .../bpf/benchs/run_bench_local_storage.sh | 21 ++ .../selftests/bpf/benchs/run_common.sh | 17 + .../selftests/bpf/progs/local_storage_bench.h | 69 ++++ .../bpf/progs/local_storage_bench__get_int.c | 31 ++ .../bpf/progs/local_storage_bench__get_seq.c | 31 ++ .../bpf/progs/local_storage_bench__hashmap.c | 32 ++ 10 files changed, 613 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 4030dd6cbc34..6095f6af2ad1 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -560,6 +560,9 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \ $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h +$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \ + $(OUTPUT)/local_storage_bench__get_int.skel.h \ + $(OUTPUT)/local_storage_bench__hashmap.skel.h $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ) $(OUTPUT)/bench: LDLIBS += -lm $(OUTPUT)/bench: $(OUTPUT)/bench.o \ @@ -571,7 +574,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \ $(OUTPUT)/bench_ringbufs.o \ $(OUTPUT)/bench_bloom_filter_map.o \ $(OUTPUT)/bench_bpf_loop.o \ - $(OUTPUT)/bench_strncmp.o + $(OUTPUT)/bench_strncmp.o \ + $(OUTPUT)/bench_local_storage.o $(call msg,BINARY,,$@) $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@ diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c index f061cc20e776..71271062f68d 100644 --- a/tools/testing/selftests/bpf/bench.c +++ b/tools/testing/selftests/bpf/bench.c @@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt) printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt); } +void local_storage_report_progress(int iter, struct bench_res *res, + long delta_ns) +{ + double important_hits_per_sec, hits_per_sec; + double delta_sec = delta_ns / 1000000000.0; + + hits_per_sec = res->hits / 1000000.0 / delta_sec; + important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec; + + printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0); + + printf("hits %8.3lfM/s ", hits_per_sec); + printf("important_hits %8.3lfM/s\n", important_hits_per_sec); +} + +void local_storage_report_final(struct bench_res res[], int res_cnt) +{ + double important_hits_mean = 0.0, important_hits_stddev = 0.0; + double hits_mean = 0.0, hits_stddev = 0.0; + int i; + + for (i = 0; i < res_cnt; i++) { + hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt); + important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt); + } + + if (res_cnt > 1) { + for (i = 0; i < res_cnt; i++) { + hits_stddev += (hits_mean - res[i].hits / 1000000.0) * + (hits_mean - res[i].hits / 1000000.0) / + (res_cnt - 1.0); + important_hits_stddev += + (important_hits_mean - res[i].important_hits / 1000000.0) * + (important_hits_mean - res[i].important_hits / 1000000.0) / + (res_cnt - 1.0); + } + + hits_stddev = sqrt(hits_stddev); + important_hits_stddev = sqrt(important_hits_stddev); + } + printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ", + hits_mean, hits_stddev); + printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean); + printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n", + important_hits_mean, important_hits_stddev); +} + const char *argp_program_version = "benchmark"; const char *argp_program_bug_address = "<bpf@xxxxxxxxxxxxxxx>"; const char argp_program_doc[] = @@ -188,12 +235,14 @@ static const struct argp_option opts[] = { extern struct argp bench_ringbufs_argp; extern struct argp bench_bloom_map_argp; extern struct argp bench_bpf_loop_argp; +extern struct argp bench_local_storage_argp; extern struct argp bench_strncmp_argp; static const struct argp_child bench_parsers[] = { { &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 }, { &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 }, { &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 }, + { &bench_local_storage_argp, 0, "local_storage benchmark", 0 }, { &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 }, {}, }; @@ -396,6 +445,9 @@ extern const struct bench bench_hashmap_with_bloom; extern const struct bench bench_bpf_loop; extern const struct bench bench_strncmp_no_helper; extern const struct bench bench_strncmp_helper; +extern const struct bench bench_local_storage_cache_seq_get; +extern const struct bench bench_local_storage_cache_interleaved_get; +extern const struct bench bench_local_storage_cache_hashmap_control; static const struct bench *benchs[] = { &bench_count_global, @@ -430,6 +482,9 @@ static const struct bench *benchs[] = { &bench_bpf_loop, &bench_strncmp_no_helper, &bench_strncmp_helper, + &bench_local_storage_cache_seq_get, + &bench_local_storage_cache_interleaved_get, + &bench_local_storage_cache_hashmap_control, }; static void setup_benchmark() @@ -547,5 +602,7 @@ int main(int argc, char **argv) bench->report_final(state.results + env.warmup_sec, state.res_cnt - env.warmup_sec); + if (bench->teardown) + bench->teardown(); return 0; } diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h index fb3e213df3dc..0a137eedc959 100644 --- a/tools/testing/selftests/bpf/bench.h +++ b/tools/testing/selftests/bpf/bench.h @@ -34,12 +34,14 @@ struct bench_res { long hits; long drops; long false_hits; + long important_hits; }; struct bench { const char *name; void (*validate)(void); void (*setup)(void); + void (*teardown)(void); void *(*producer_thread)(void *ctx); void *(*consumer_thread)(void *ctx); void (*measure)(struct bench_res* res); @@ -61,6 +63,9 @@ void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns); void false_hits_report_final(struct bench_res res[], int res_cnt); void ops_report_progress(int iter, struct bench_res *res, long delta_ns); void ops_report_final(struct bench_res res[], int res_cnt); +void local_storage_report_progress(int iter, struct bench_res *res, + long delta_ns); +void local_storage_report_final(struct bench_res res[], int res_cnt); static inline __u64 get_time_ns(void) { diff --git a/tools/testing/selftests/bpf/benchs/bench_local_storage.c b/tools/testing/selftests/bpf/benchs/bench_local_storage.c new file mode 100644 index 000000000000..1cf041b9448a --- /dev/null +++ b/tools/testing/selftests/bpf/benchs/bench_local_storage.c @@ -0,0 +1,345 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */ + +#include <argp.h> +#include <linux/btf.h> + +#include "local_storage_bench__get_int.skel.h" +#include "local_storage_bench__get_seq.skel.h" +#include "local_storage_bench__hashmap.skel.h" +#include "bench.h" + +#include <test_btf.h> + +static struct { + __u32 nr_maps; +} args = { + .nr_maps = 100, +}; + +enum { + ARG_NR_MAPS = 6000, +}; + +static const struct argp_option opts[] = { + { "nr_maps", ARG_NR_MAPS, "NR_MAPS", 0, + "Set number of local_storage maps"}, + {}, +}; + +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + long ret; + + switch (key) { + case ARG_NR_MAPS: + ret = strtol(arg, NULL, 10); + if (ret < 1 || ret > UINT_MAX) { + fprintf(stderr, "invalid nr_maps"); + argp_usage(state); + } + args.nr_maps = ret; + break; + default: + return ARGP_ERR_UNKNOWN; + } + + return 0; +} + +const struct argp bench_local_storage_argp = { + .options = opts, + .parser = parse_arg, +}; + +static void validate(void) +{ + if (env.producer_cnt != 1) { + fprintf(stderr, "benchmark doesn't support multi-producer!\n"); + exit(1); + } + if (env.consumer_cnt != 1) { + fprintf(stderr, "benchmark doesn't support multi-consumer!\n"); + exit(1); + } + + if (args.nr_maps > 1000) { + fprintf(stderr, "nr_maps must be <= 1000\n"); + exit(1); + } +} + +/* Keep in sync w/ array of maps in bpf */ +#define MAX_NR_MAPS 1000 +/* Keep in sync w/ number of progs in bpf app */ +#define MAX_NR_PROGS 20 + +static struct { + void (*destroy_skel)(void *obj); + int (*load_skel)(void *obj); + long *important_hits; + long *hits; + void *progs; + void *skel; + struct bpf_map *array_of_maps; +} ctx; + +int created_maps[MAX_NR_MAPS]; +struct bpf_link *attached_links[MAX_NR_PROGS]; + +static void teardown(void) +{ + int i; + + for (i = 0; i < MAX_NR_PROGS; i++) { + if (!attached_links[i]) + break; + bpf_link__detach(attached_links[i]); + } + + if (ctx.destroy_skel && ctx.skel) + ctx.destroy_skel(ctx.skel); + + for (i = 0; i < MAX_NR_MAPS; i++) { + if (!created_maps[i]) + break; + close(created_maps[i]); + } +} + +static int setup_inner_map_and_load(int inner_fd) +{ + int err, mim_fd; + + err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd); + if (err) + return -1; + + err = ctx.load_skel(ctx.skel); + if (err) + return -1; + + mim_fd = bpf_map__fd(ctx.array_of_maps); + if (mim_fd < 0) + return -1; + + return mim_fd; +} + +static int load_btf(void) +{ + static const char btf_str_sec[] = "\0"; + __u32 btf_raw_types[] = { + /* int */ + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */ + }; + struct btf_header btf_hdr = { + .magic = BTF_MAGIC, + .version = BTF_VERSION, + .hdr_len = sizeof(struct btf_header), + .type_len = sizeof(btf_raw_types), + .str_off = sizeof(btf_raw_types), + .str_len = sizeof(btf_str_sec), + }; + __u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) + + sizeof(btf_str_sec)]; + + memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr)); + memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types)); + memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types), + btf_str_sec, sizeof(btf_str_sec)); + + return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL); +} + +static void __setup(bool hashmap) +{ + struct bpf_program **prog; + uint32_t progs_to_attach; + int i, fd, mim_fd, err; + int btf_fd = 0; + + LIBBPF_OPTS(bpf_map_create_opts, create_opts); + + memset(&created_maps, 0, MAX_NR_MAPS * sizeof(int)); + memset(&attached_links, 0, MAX_NR_PROGS * sizeof(void *)); + + btf_fd = load_btf(); + create_opts.btf_fd = btf_fd; + create_opts.btf_key_type_id = 1; + create_opts.btf_value_type_id = 1; + if (!hashmap) + create_opts.map_flags = BPF_F_NO_PREALLOC; + + mim_fd = 0; + for (i = 0; i < args.nr_maps; i++) { + if (hashmap) + fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(int), + sizeof(int), 65536, &create_opts); + else + fd = bpf_map_create(BPF_MAP_TYPE_TASK_STORAGE, NULL, sizeof(int), + sizeof(int), 0, &create_opts); + if (fd < 0) { + fprintf(stderr, "Error creating map %d\n", i); + goto err_out; + } + + if (i == 0) { + mim_fd = setup_inner_map_and_load(fd); + if (mim_fd < 0) { + fprintf(stderr, "Error doing setup_inner_map_and_load\n"); + goto err_out; + } + } + + err = bpf_map_update_elem(mim_fd, &i, &fd, 0); + if (err) { + fprintf(stderr, "Error updating array-of-maps w/ map %d\n", i); + goto err_out; + } + created_maps[i] = fd; + } + close(btf_fd); + + progs_to_attach = (args.nr_maps / 50); + if (args.nr_maps % 50) + progs_to_attach++; + + for (i = 0; i < progs_to_attach; i++) { + prog = ctx.progs + i * sizeof(void *); + attached_links[i] = bpf_program__attach(*prog); + } + + return; +err_out: + if (btf_fd) + close(btf_fd); + teardown(); + exit(1); +} + +static void hashmap_setup(void) +{ + struct local_storage_bench__hashmap *skel; + + setup_libbpf(); + + skel = local_storage_bench__hashmap__open(); + ctx.skel = skel; + ctx.hits = &skel->bss->hits; + ctx.important_hits = &skel->bss->important_hits; + ctx.load_skel = (int (*)(void *))local_storage_bench__hashmap__load; + ctx.progs = (void *)&skel->progs; + ctx.destroy_skel = (void (*)(void *))local_storage_bench__hashmap__destroy; + ctx.array_of_maps = skel->maps.array_of_maps; + + __setup(true); +} + +static void local_storage_cache_get_setup(void) +{ + struct local_storage_bench__get_seq *skel; + + setup_libbpf(); + + skel = local_storage_bench__get_seq__open(); + ctx.skel = skel; + ctx.hits = &skel->bss->hits; + ctx.important_hits = &skel->bss->important_hits; + ctx.load_skel = (int (*)(void *))local_storage_bench__get_seq__load; + ctx.progs = (void *)&skel->progs; + ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_seq__destroy; + ctx.array_of_maps = skel->maps.array_of_maps; + + __setup(false); +} + +static void local_storage_cache_get_interleaved_setup(void) +{ + struct local_storage_bench__get_int *skel; + + setup_libbpf(); + + skel = local_storage_bench__get_int__open(); + ctx.skel = skel; + ctx.hits = &skel->bss->hits; + ctx.important_hits = &skel->bss->important_hits; + ctx.load_skel = (int (*)(void *))local_storage_bench__get_int__load; + ctx.progs = (void *)&skel->progs; + ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_int__destroy; + ctx.array_of_maps = skel->maps.array_of_maps; + + __setup(false); +} + +static void measure(struct bench_res *res) +{ + if (ctx.hits) + res->hits = atomic_swap(ctx.hits, 0); + if (ctx.important_hits) + res->important_hits = atomic_swap(ctx.important_hits, 0); +} + +static inline void trigger_bpf_program(void) +{ + syscall(__NR_getpgid); +} + +static void *consumer(void *input) +{ + return NULL; +} + +static void *producer(void *input) +{ + while (true) + trigger_bpf_program(); + + return NULL; +} + +/* cache sequential and interleaved get benchs test local_storage get + * performance, specifically they demonstrate performance cliff of + * current list-plus-cache local_storage model. + * + * cache sequential get: call bpf_task_storage_get on n maps in order + * cache interleaved get: like "sequential get", but interleave 4 calls to the + * 'important' map (idx 0 in array_of_maps) for every 10 calls. Goal + * is to mimic environment where many progs are accessing their local_storage + * maps, with 'our' prog needing to access its map more often than others + */ +const struct bench bench_local_storage_cache_seq_get = { + .name = "local-storage-cache-seq-get", + .validate = validate, + .setup = local_storage_cache_get_setup, + .producer_thread = producer, + .consumer_thread = consumer, + .measure = measure, + .report_progress = local_storage_report_progress, + .report_final = local_storage_report_final, + .teardown = teardown, +}; + +const struct bench bench_local_storage_cache_interleaved_get = { + .name = "local-storage-cache-int-get", + .validate = validate, + .setup = local_storage_cache_get_interleaved_setup, + .producer_thread = producer, + .consumer_thread = consumer, + .measure = measure, + .report_progress = local_storage_report_progress, + .report_final = local_storage_report_final, + .teardown = teardown, +}; + +const struct bench bench_local_storage_cache_hashmap_control = { + .name = "local-storage-cache-hashmap-control", + .validate = validate, + .setup = hashmap_setup, + .producer_thread = producer, + .consumer_thread = consumer, + .measure = measure, + .report_progress = local_storage_report_progress, + .report_final = local_storage_report_final, + .teardown = teardown, +}; diff --git a/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh new file mode 100755 index 000000000000..479096c47c93 --- /dev/null +++ b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh @@ -0,0 +1,21 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +source ./benchs/run_common.sh + +set -eufo pipefail + +header "Local Storage" +subtitle "Hashmap Control w/ 500 maps" + summarize_local_storage "hashmap (control) sequential get: "\ + "$(./bench --nr_maps 500 local-storage-cache-hashmap-control)" + printf "\n" + +for i in 1 10 16 17 24 32 100 1000; do +subtitle "num_maps: $i" + summarize_local_storage "local_storage cache sequential get: "\ + "$(./bench --nr_maps $i local-storage-cache-seq-get)" + summarize_local_storage "local_storage cache interleaved get: "\ + "$(./bench --nr_maps $i local-storage-cache-int-get)" + printf "\n" +done diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh index 6c5e6023a69f..d9f40af82006 100644 --- a/tools/testing/selftests/bpf/benchs/run_common.sh +++ b/tools/testing/selftests/bpf/benchs/run_common.sh @@ -41,6 +41,16 @@ function ops() echo "$*" | sed -E "s/.*latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/" } +function local_storage() +{ + echo -n "hits throughput: " + echo -n "$*" | sed -E "s/.* hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/" + echo -n -e ", hits latency: " + echo -n "$*" | sed -E "s/.* hits latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/" + echo -n ", important_hits throughput: " + echo "$*" | sed -E "s/.*important_hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/" +} + function total() { echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/" @@ -67,6 +77,13 @@ function summarize_ops() printf "%-20s %s\n" "$bench" "$(ops $summary)" } +function summarize_local_storage() +{ + bench="$1" + summary=$(echo $2 | tail -n1) + printf "%-20s %s\n" "$bench" "$(local_storage $summary)" +} + function summarize_total() { bench="$1" diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h new file mode 100644 index 000000000000..b5e358dee245 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h @@ -0,0 +1,69 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */ + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); + __uint(max_entries, 1000); + __type(key, int); + __type(value, int); +} array_of_maps SEC(".maps"); + +long important_hits; +long hits; + +#ifdef LOOKUP_HASHMAP +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */) +{ + void *map; + int zero = 0; + + map = bpf_map_lookup_elem(&array_of_maps, &elem); + if (!map) + return -1; + + bpf_map_lookup_elem(map, &zero); + __sync_add_and_fetch(&hits, 1); + if (!elem) + __sync_add_and_fetch(&important_hits, 1); + return 0; +} +#else +static int do_lookup(unsigned int elem, struct task_struct *task) +{ + void *map; + + map = bpf_map_lookup_elem(&array_of_maps, &elem); + if (!map) + return -1; + + bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE); + __sync_add_and_fetch(&hits, 1); + if (!elem) + __sync_add_and_fetch(&important_hits, 1); + return 0; +} +#endif /* LOOKUP_HASHMAP */ + +#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave) \ +SEC("fentry/" SYS_PREFIX "sys_getpgid") \ +int get_local_ ## start(void *ctx) \ +{ \ + struct task_struct *task; \ + unsigned int i, elem; \ + void *map; \ + \ + task = bpf_get_current_task_btf(); \ + for (i = 0; i < 50; i++) { \ + elem = start + i; \ + if (do_lookup(elem, task)) \ + return 0; \ + if (interleave && i % 3 == 0) \ + do_lookup(0, task); \ + } \ + return 0; \ +} + +#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \ + __TASK_STORAGE_GET_LOOP_PROG(array, start, false) +#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \ + __TASK_STORAGE_GET_LOOP_PROG(array, start, true) diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c new file mode 100644 index 000000000000..498f0c81014b --- /dev/null +++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c @@ -0,0 +1,31 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include "bpf_misc.h" + +#include "local_storage_bench.h" + +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 0); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 50); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 100); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 150); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 200); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 250); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 300); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 350); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 400); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 450); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 500); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 550); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 600); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 650); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 700); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 750); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 800); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 850); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 900); +TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 950); + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c new file mode 100644 index 000000000000..71514576a05d --- /dev/null +++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c @@ -0,0 +1,31 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include "bpf_misc.h" + +#include "local_storage_bench.h" + +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950); + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c new file mode 100644 index 000000000000..15e4fe13d0b5 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include "bpf_misc.h" + +#define LOOKUP_HASHMAP +#include "local_storage_bench.h" + +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900); +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950); + +char _license[] SEC("license") = "GPL"; -- 2.30.2