On Fri, May 20, 2022 at 10:00 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>
> Add benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond the current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to the normal 'hits' count of all gets. The goal
> here is to mimic a scenario where a particular program using one map -
> the important one - is running on a system where many other
> local_storage maps exist and are accessed often.
>
> While the "sequential get" benchmark does bpf_task_storage_get for map
> 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
> interleaves 4 bpf_task_storage_gets for the important map for every 10
> map gets. This is meant to highlight performance differences when the
> important map is accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage.
>
> Addition of this benchmark is inspired by conversation with Alexei in a
> previous patchset's thread [0], which highlighted the need for such a
> benchmark to motivate and validate improvements to the local_storage
> implementation. My approach in that series focused on improving
> performance for explicitly-marked 'important' maps and was rejected
> with feedback to make more generally-applicable improvements while
> avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so the effect
> of future work on both is clear.
>
> Regarding the benchmark results. On a powerful system (Skylake, 20
> cores, 256gb ram):
>
> Local Storage
> =============
> Hashmap Control w/ 500 maps
> hashmap (control) sequential get: hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s
>
> num_maps: 1
> local_storage cache sequential get: hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
> local_storage cache interleaved get: hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s

This huge drop in performance for num_maps == 1 is due to syscall and
fentry overhead, is that right? How about making each syscall invocation
do (roughly) the same number of map/storage lookups, to neutralize this
overhead? Something like:

  const volatile int map_cnt;
  const volatile int iter_cnt;

  ...

  for (i = 0; i < iter_cnt; i++) {
          int map_idx = i % map_cnt;

          do_lookup(map_idx, task);
          ...
  }

User-space can calculate iter_cnt to be the closest exact multiple of
map_cnt, or you can just hard-code iter_cnt to a fixed number (something
like 10000 or some high enough value) and live with a slightly uneven
pattern for the last round of looping. Either way the syscall/fentry
overhead becomes essentially fixed, which will avoid these
counter-intuitive numbers.
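Something along these lines might work; this is just a sketch that
assumes the same do_lookup() helper and hits/important_hits counters as
local_storage_bench.h in this patch, with the .rodata constants and the
10000 default being purely illustrative:

  /* Hypothetical fixed-iteration variant of the benchmark program: every
   * syscall performs iter_cnt lookups no matter how many maps exist, so
   * syscall/fentry cost per lookup stays constant across num_maps runs.
   */
  const volatile unsigned int map_cnt = 1;
  const volatile unsigned int iter_cnt = 10000;

  SEC("fentry/" SYS_PREFIX "sys_getpgid")
  int get_local(void *ctx)
  {
          struct task_struct *task = bpf_get_current_task_btf();
          unsigned int i;

          for (i = 0; i < iter_cnt; i++) {
                  /* cycle through the maps; the last round is slightly
                   * uneven if iter_cnt isn't a multiple of map_cnt
                   */
                  if (do_lookup(i % map_cnt, task))
                          return 0;
          }
          return 0;
  }

User space would fill in map_cnt/iter_cnt before skeleton load, so the
verifier sees them as constants when checking the loop.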
>
> num_maps: 10
> local_storage cache sequential get: hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
> local_storage cache interleaved get: hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s
>
> num_maps: 16
> local_storage cache sequential get: hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
> local_storage cache interleaved get: hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s
>
> num_maps: 17
> local_storage cache sequential get: hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
> local_storage cache interleaved get: hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s
>
> num_maps: 24
> local_storage cache sequential get: hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
> local_storage cache interleaved get: hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s
>
> num_maps: 32
> local_storage cache sequential get: hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
> local_storage cache interleaved get: hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s
>
> num_maps: 100
> local_storage cache sequential get: hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
> local_storage cache interleaved get: hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s
>
> num_maps: 1000
> local_storage cache sequential get: hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
> local_storage cache interleaved get: hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s
>
> Looking at the "sequential get" results, it's clear that as the
> number of task local_storage maps grows beyond the current cache size
> (16), there's a significant reduction in hits throughput. Note that
> current local_storage implementation assigns a cache_idx to maps as they
> are created. Since "sequential get" is creating maps 0..n in order and
> then doing bpf_task_storage_get calls in the same order, the benchmark
> is effectively ensuring that a map will not be in cache when the program
> tries to access it.
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still reduces as #
> maps increases, though.
>
> As evidenced by the unintuitive-looking results for smaller num_maps
> benchmark runs, overhead which is amortized across larger num_maps runs
> dominates when there are fewer maps. To get a sense of the overhead, I
> commented out bpf_task_storage_get/bpf_map_lookup_elem in
> local_storage_bench.h and ran the benchmark on the same host as the
> 'real' run. Results:
>
> Local Storage
> =============
> Hashmap Control w/ 500 maps
> hashmap (control) sequential get: hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s
> [...]
>
> Adjusting for overhead, latency numbers for "hashmap control" and
> "sequential get" are:
>
> hashmap_control:     ~6.6ns
> sequential_get_1:    ~17.9ns
> sequential_get_10:   ~18.9ns
> sequential_get_16:   ~19.0ns
> sequential_get_17:   ~20.2ns
> sequential_get_24:   ~42.2ns
> sequential_get_32:   ~68.7ns
> sequential_get_100:  ~163.3ns
> sequential_get_1000: ~2200ns
>
> Clearly demonstrating a cliff.
>
> When running the benchmarks it may be necessary to bump the 'open files'
> ulimit for a successful run.
>
> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
> ---
> Changelog:
>
> v2 -> v3:
>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit the MAX_USED_MAPS
>     limit, so just use 1 program (Alexei)
>
> v1 -> v2:
>   * Adopt ARRAY_OF_MAPS approach for the bpf program, allowing a truly
>     configurable # of maps (Andrii)
>   * Add hashmap benchmark (Alexei)
>   * Add discussion of overhead
>
>  tools/testing/selftests/bpf/Makefile          |   6 +-
>  tools/testing/selftests/bpf/bench.c           |  57 +++
>  tools/testing/selftests/bpf/bench.h           |   5 +
>  .../bpf/benchs/bench_local_storage.c          | 332 ++++++++++++++++++
>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>  .../selftests/bpf/benchs/run_common.sh        |  17 +
>  .../selftests/bpf/progs/local_storage_bench.h |  63 ++++
>  .../bpf/progs/local_storage_bench__get_int.c  |  12 +
>  .../bpf/progs/local_storage_bench__get_seq.c  |  12 +
>  .../bpf/progs/local_storage_bench__hashmap.c  |  13 +
>  10 files changed, 537 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 4030dd6cbc34..6095f6af2ad1 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -560,6 +560,9 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
>  $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
>  $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
>  $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
> +$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
> +                                 $(OUTPUT)/local_storage_bench__get_int.skel.h \
> +                                 $(OUTPUT)/local_storage_bench__hashmap.skel.h

You really don't need 3 skeletons for this; you can parameterize
everything with 2-3 .rodata variables and have fixed code and a single
skeleton header. It will also simplify your setup code, since you won't
need those callbacks that abstract the specific skeleton away. Much
cleaner and simpler, IMO. Please try to simplify this and make it easier
to maintain.
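For illustration, the user-space side of that could be as simple as the
sketch below; the single local_storage_bench.skel.h skeleton and the
.rodata field names are hypothetical, not something this patch provides:

  /* Hypothetical single-skeleton setup, parameterized via .rodata before
   * load; skeleton and field names are assumptions for illustration.
   */
  #include <stdlib.h>
  #include <stdbool.h>
  #include "local_storage_bench.skel.h"

  static struct local_storage_bench *
  setup_skel(bool use_hashmap, bool interleave, unsigned int nr_maps)
  {
          struct local_storage_bench *skel;

          skel = local_storage_bench__open();
          if (!skel)
                  exit(1);

          /* set before load so the verifier sees fixed constants and can
           * dead-code eliminate the unused lookup flavor
           */
          skel->rodata->use_hashmap = use_hashmap;
          skel->rodata->interleave = interleave;
          skel->rodata->nr_maps = nr_maps;

          if (local_storage_bench__load(skel) ||
              local_storage_bench__attach(skel))
                  exit(1);
          return skel;
  }

One skeleton, no per-variant callbacks, and the benchmark flavor becomes
a plain runtime argument.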
>  $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
>  $(OUTPUT)/bench: LDLIBS += -lm
>  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> @@ -571,7 +574,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>                  $(OUTPUT)/bench_ringbufs.o \
>                  $(OUTPUT)/bench_bloom_filter_map.o \
>                  $(OUTPUT)/bench_bpf_loop.o \
> -                $(OUTPUT)/bench_strncmp.o
> +                $(OUTPUT)/bench_strncmp.o \
> +                $(OUTPUT)/bench_local_storage.o
>         $(call msg,BINARY,,$@)
>         $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
>
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index f061cc20e776..71271062f68d 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt)
>         printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
>  }
>
> +void local_storage_report_progress(int iter, struct bench_res *res,
> +                                   long delta_ns)
> +{
> +        double important_hits_per_sec, hits_per_sec;
> +        double delta_sec = delta_ns / 1000000000.0;
> +
> +        hits_per_sec = res->hits / 1000000.0 / delta_sec;
> +        important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
> +
> +        printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
> +
> +        printf("hits %8.3lfM/s ", hits_per_sec);
> +        printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
> +}
> +
> +void local_storage_report_final(struct bench_res res[], int res_cnt)
> +{
> +        double important_hits_mean = 0.0, important_hits_stddev = 0.0;
> +        double hits_mean = 0.0, hits_stddev = 0.0;
> +        int i;
> +
> +        for (i = 0; i < res_cnt; i++) {
> +                hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
> +                important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
> +        }
> +
> +        if (res_cnt > 1) {
> +                for (i = 0; i < res_cnt; i++) {
> +                        hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
> +                                       (hits_mean - res[i].hits / 1000000.0) /
> +                                       (res_cnt - 1.0);
> +                        important_hits_stddev +=
> +                                (important_hits_mean - res[i].important_hits / 1000000.0) *
> +                                (important_hits_mean - res[i].important_hits / 1000000.0) /
> +                                (res_cnt - 1.0);
> +                }
> +
> +                hits_stddev = sqrt(hits_stddev);
> +                important_hits_stddev = sqrt(important_hits_stddev);
> +        }
> +        printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
> +               hits_mean, hits_stddev);
> +        printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
> +        printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
> +               important_hits_mean, important_hits_stddev);
> +}
> +

We have hits_drops_report_progress/hits_drops_report_final, which use
"hit" and "drop" terminology (admittedly confusing for this set of
benchmarks), but if you ignore the "drop" part, it's exactly what you
need - tracking two independent values (in your case, hits and important
hits). You'll get rid of a good chunk of repetitive code with some
statistics in it, and your post-processing scripts will further hide
this detail (rough sketch below, after the quoted hunk).

> const char *argp_program_version = "benchmark";
> const char *argp_program_bug_address = "<bpf@xxxxxxxxxxxxxxx>";
> const char argp_program_doc[] =
> @@ -188,12 +235,14 @@ static const struct argp_option opts[] = {
>  extern struct argp bench_ringbufs_argp;
>  extern struct argp bench_bloom_map_argp;
>  extern struct argp bench_bpf_loop_argp;
> +extern struct argp bench_local_storage_argp;
>  extern struct argp bench_strncmp_argp;
> [...]
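Regarding the hits/drops reuse suggested above, a rough sketch; note
that repurposing the drops slot of struct bench_res for important hits
is my assumption, the generic helpers themselves already live in bench.c:

  /* Sketch: report important_hits through the existing "drops" counter
   * so hits_drops_report_progress/final can be reused unchanged.
   * ctx.hits and ctx.important_hits mirror this patch; the bench name
   * and the setup/producer callbacks are placeholders.
   */
  static void local_storage_measure(struct bench_res *res)
  {
          res->hits = atomic_swap(ctx.hits, 0);
          res->drops = atomic_swap(ctx.important_hits, 0);
  }

  const struct bench bench_local_storage_cache_seq_get = {
          .name = "local-storage-cache-seq-get",
          .setup = local_storage_cache_get_setup,
          .producer_thread = producer,
          .consumer_thread = consumer,
          .measure = local_storage_measure,
          .report_progress = hits_drops_report_progress,
          .report_final = hits_drops_report_final,
  };

The "drops" column then simply reads as important hits for these
benchmarks, which the post-processing scripts can relabel.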
> +
> +static int setup_inner_map_and_load(int inner_fd)
> +{
> +        int err, mim_fd;
> +
> +        err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
> +        if (err)
> +                return -1;
> +
> +        err = ctx.load_skel(ctx.skel);
> +        if (err)
> +                return -1;
> +
> +        mim_fd = bpf_map__fd(ctx.array_of_maps);
> +        if (mim_fd < 0)
> +                return -1;
> +
> +        return mim_fd;
> +}
> +
> +static int load_btf(void)
> +{
> +        static const char btf_str_sec[] = "\0";
> +        __u32 btf_raw_types[] = {
> +                /* int */
> +                BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */
> +        };
> +        struct btf_header btf_hdr = {
> +                .magic = BTF_MAGIC,
> +                .version = BTF_VERSION,
> +                .hdr_len = sizeof(struct btf_header),
> +                .type_len = sizeof(btf_raw_types),
> +                .str_off = sizeof(btf_raw_types),
> +                .str_len = sizeof(btf_str_sec),
> +        };
> +        __u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
> +                     sizeof(btf_str_sec)];
> +
> +        memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
> +        memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
> +        memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
> +               btf_str_sec, sizeof(btf_str_sec));
> +
> +        return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
> +}
> +

Please try using a declarative map-in-map definition; hopefully it
doesn't influence the benchmark results. It will let you avoid this
low-level setup code completely (sketch below, after the rest of this
quoted code).

> +static void __setup(struct bpf_program *prog, bool hashmap)
> +{
> +        int i, fd, mim_fd, err;
> +        int btf_fd = 0;
> +
> +        LIBBPF_OPTS(bpf_map_create_opts, create_opts);
> + [...]
> +
> +static void measure(struct bench_res *res)
> +{
> +        if (ctx.hits)
> +                res->hits = atomic_swap(ctx.hits, 0);
> +        if (ctx.important_hits)

Why these ifs? Just swap; measure is called once a second, there is no
need to optimize this.

> +                res->important_hits = atomic_swap(ctx.important_hits, 0);
> +}
> +
> +static inline void trigger_bpf_program(void)
> +{
> +        syscall(__NR_getpgid);
> +}
> + [...]
> +#ifdef LOOKUP_HASHMAP
> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> +{
> +        void *map;
> +        int zero = 0;
> +
> +        map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +        if (!map)
> +                return -1;
> +
> +        bpf_map_lookup_elem(map, &zero);

Shouldn't you use elem here as well, to make it a bit more in line with
bpf_task_storage_get()? This fixed zero is too optimistic and minimizes
CPU cache usage, skewing results towards the hashmap. It's cheaper to
access the same location in the hashmap over and over again vs. randomly
jumping over N elements.
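Coming back to the declarative map-in-map suggestion above, a sketch of
what the definition could look like; the inner task storage parameters
and the 1000 max_entries are illustrative assumptions, not this patch's
code:

  /* Hypothetical declarative ARRAY_OF_MAPS with a task local storage
   * inner map template; no runtime BTF/inner-map-fd plumbing needed.
   */
  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
          __uint(max_entries, 1000);
          __uint(key_size, sizeof(int));
          __uint(value_size, sizeof(int));
          __array(values, struct {
                  __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
                  __uint(map_flags, BPF_F_NO_PREALLOC);
                  __type(key, int);
                  __type(value, int);
          });
  } array_of_maps SEC(".maps");

libbpf derives the inner map definition from the template, so the
load_btf()/bpf_map__set_inner_map_fd() plumbing goes away; user space
still creates the per-index inner maps and populates the outer array
through the skeleton.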
> +        __sync_add_and_fetch(&hits, 1);
> +        if (!elem)
> +                __sync_add_and_fetch(&important_hits, 1);
> +        return 0;
> +}
> +#else
> +static int do_lookup(unsigned int elem, struct task_struct *task)
> +{
> +        void *map;
> +
> +        map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +        if (!map)
> +                return -1;
> +
> +        bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> +        __sync_add_and_fetch(&hits, 1);
> +        if (!elem)
> +                __sync_add_and_fetch(&important_hits, 1);
> +        return 0;
> +}
> +#endif /* LOOKUP_HASHMAP */
> +
> +#define TASK_STORAGE_GET_LOOP_PROG(interleave) \
> +SEC("fentry/" SYS_PREFIX "sys_getpgid") \
> +int get_local(void *ctx) \
> +{ \
> +        struct task_struct *task; \
> +        unsigned int i; \
> +        void *map; \
> + \
> +        task = bpf_get_current_task_btf(); \
> +        for (i = 0; i < 1000; i++) { \
> +                if (do_lookup(i, task)) \
> +                        return 0; \
> +                if (interleave && i % 3 == 0) \
> +                        do_lookup(0, task); \
> +        } \
> +        return 0; \
> +}

I think

  const volatile bool use_local_storage; /* set from user-space */

  if (use_local_storage)
          do_lookup_storage();
  else
          do_lookup_hashmap();

is as clear (if not clearer) than having three separate skeletons built
from a single #include'd header parameterized by extra #defines.

> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
> new file mode 100644
[...]
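To flesh out that const volatile suggestion on the BPF side, the dispatch
could look roughly like this; do_lookup_hashmap()/do_lookup_storage()
stand in for the two do_lookup() variants in this patch and the knob name
is an assumption:

  /* Hypothetical dispatch on a .rodata constant instead of the
   * LOOKUP_HASHMAP #define; set from user space before load.
   */
  const volatile bool use_hashmap = false;

  static __always_inline int do_lookup(unsigned int elem,
                                       struct task_struct *task)
  {
          if (use_hashmap)
                  return do_lookup_hashmap(elem, task);
          return do_lookup_storage(elem, task);
  }

Since use_hashmap is fixed before load, the verifier prunes the branch
that isn't taken, so the runtime check should cost nothing; the
interleave flag can be handled the same way instead of generating two
programs from the macro, and it pairs naturally with the user-space
.rodata setup sketched earlier.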