On Mon, May 30, 2022 at 1:27 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>
> Add benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond the current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to the normal 'hits' count of all gets. The goal
> here is to mimic a scenario where a particular program using one map -
> the important one - is running on a system where many other
> local_storage maps exist and are accessed often.
>
> While the "sequential get" benchmark does bpf_task_storage_get for maps
> 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
> interleaves 4 bpf_task_storage_gets for the important map with every 10
> map gets. This is meant to highlight performance differences when the
> important map is accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage.
>
> Addition of this benchmark is inspired by conversation with Alexei in a
> previous patchset's thread [0], which highlighted the need for such a
> benchmark to motivate and validate improvements to the local_storage
> implementation. My approach in that series focused on improving
> performance for explicitly-marked 'important' maps and was rejected
> with feedback to make more generally-applicable improvements while
> avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so the effect
> of future work on both is clear.
>
> Regarding the benchmark results.
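Just to make sure I'm reading the interleaving described above correctly,
my mental model of one benchmark pass is roughly the sketch below. This
is not the code from progs/local_storage_bench.c; map_for_idx(),
num_maps, interleave, important_map_idx, hits and important_hits are
made-up names, and the ARRAY_OF_MAPS plumbing is left out:

	/* Sketch only; hypothetical names, not this patch's BPF prog */
	struct task_struct *task = bpf_get_current_task_btf();
	int i;

	for (i = 0; i < num_maps; i++) {
		/* "normal" get, counted in hits */
		if (bpf_task_storage_get(map_for_idx(i), task, NULL,
					 BPF_LOCAL_STORAGE_GET_F_CREATE))
			__sync_fetch_and_add(&hits, 1);
		if (i == important_map_idx)
			__sync_fetch_and_add(&important_hits, 1);

		/* interleaved mode: every 3rd index do one extra get on
		 * the important map, i.e. 4 extra important gets per 10
		 * maps (i = 0, 3, 6, 9)
		 */
		if (interleave && i % 3 == 0 &&
		    bpf_task_storage_get(map_for_idx(important_map_idx),
					 task, NULL,
					 BPF_LOCAL_STORAGE_GET_F_CREATE)) {
			__sync_fetch_and_add(&hits, 1);
			__sync_fetch_and_add(&important_hits, 1);
		}
	}

If that's the right reading, the 10-map interleaved numbers below
(important_hits 17.883 vs hits 50.073, i.e. ~5/14) line up with 1 + 4
important gets out of 10 + 4 total gets per pass.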
> On a powerful system (Skylake, 20 cores, 256gb ram):
>
> Local Storage
> =============
> Hashmap Control w/ 500 maps
> hashmap (control) sequential get: hits throughput: 48.338 ± 2.366 M ops/s, hits latency: 20.688 ns/op, important_hits throughput: 0.097 ± 0.005 M ops/s
>
> num_maps: 1
> local_storage cache sequential get: hits throughput: 44.503 ± 1.080 M ops/s, hits latency: 22.470 ns/op, important_hits throughput: 44.503 ± 1.080 M ops/s
> local_storage cache interleaved get: hits throughput: 54.963 ± 0.586 M ops/s, hits latency: 18.194 ns/op, important_hits throughput: 54.963 ± 0.586 M ops/s
>
> num_maps: 10
> local_storage cache sequential get: hits throughput: 43.743 ± 0.418 M ops/s, hits latency: 22.861 ns/op, important_hits throughput: 4.374 ± 0.042 M ops/s
> local_storage cache interleaved get: hits throughput: 50.073 ± 0.609 M ops/s, hits latency: 19.971 ns/op, important_hits throughput: 17.883 ± 0.217 M ops/s
>
> num_maps: 16
> local_storage cache sequential get: hits throughput: 43.962 ± 0.525 M ops/s, hits latency: 22.747 ns/op, important_hits throughput: 2.748 ± 0.033 M ops/s
> local_storage cache interleaved get: hits throughput: 48.166 ± 0.825 M ops/s, hits latency: 20.761 ns/op, important_hits throughput: 15.326 ± 0.263 M ops/s
>
> num_maps: 17
> local_storage cache sequential get: hits throughput: 33.207 ± 0.461 M ops/s, hits latency: 30.114 ns/op, important_hits throughput: 1.956 ± 0.027 M ops/s
> local_storage cache interleaved get: hits throughput: 43.540 ± 0.265 M ops/s, hits latency: 22.968 ns/op, important_hits throughput: 13.255 ± 0.081 M ops/s
>
> num_maps: 24
> local_storage cache sequential get: hits throughput: 19.402 ± 0.348 M ops/s, hits latency: 51.542 ns/op, important_hits throughput: 0.809 ± 0.015 M ops/s
> local_storage cache interleaved get: hits throughput: 22.981 ± 0.487 M ops/s, hits latency: 43.514 ns/op, important_hits throughput: 6.465 ± 0.137 M ops/s
>
> num_maps: 32
> local_storage cache sequential get: hits throughput: 13.378 ± 0.220 M ops/s, hits latency: 74.748 ns/op, important_hits throughput: 0.419 ± 0.007 M ops/s
> local_storage cache interleaved get: hits throughput: 16.894 ± 0.172 M ops/s, hits latency: 59.193 ns/op, important_hits throughput: 4.716 ± 0.048 M ops/s
>
> num_maps: 100
> local_storage cache sequential get: hits throughput: 6.070 ± 0.140 M ops/s, hits latency: 164.745 ns/op, important_hits throughput: 0.061 ± 0.001 M ops/s
> local_storage cache interleaved get: hits throughput: 7.323 ± 0.149 M ops/s, hits latency: 136.554 ns/op, important_hits throughput: 1.913 ± 0.039 M ops/s
>
> num_maps: 1000
> local_storage cache sequential get: hits throughput: 0.438 ± 0.012 M ops/s, hits latency: 2281.369 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
> local_storage cache interleaved get: hits throughput: 0.522 ± 0.010 M ops/s, hits latency: 1913.937 ns/op, important_hits throughput: 0.131 ± 0.003 M ops/s
>
> Looking at the "sequential get" results, it's clear that as the number
> of task local_storage maps grows beyond the current cache size (16),
> there's a significant reduction in hits throughput. Note that current
> local_storage implementation assigns a cache_idx to maps as they are
> created. Since "sequential get" is creating maps 0..n in order and then
> doing bpf_task_storage_get calls in the same order, the benchmark is
> effectively ensuring that a map will not be in cache when the program
> tries to access it.
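For anyone building intuition for that cliff: IIUC map creation picks
the least-used cache_idx, so maps created back-to-back end up with
effectively round-robin slots (map i and map i+16 share one). A toy
userspace model of that behavior (not kernel code; the real cache is a
per-object array of recently used selems) shows why creation-order
access stops hitting the cache once the map count exceeds 16:

	/* Toy model: 16 slots, map i uses slot i % 16, a get hits only
	 * if the slot still points at that map, and a get installs the
	 * map into its slot (evicting whatever shared it).
	 */
	#include <stdio.h>

	#define CACHE_SLOTS 16

	int main(void)
	{
		int slot[CACHE_SLOTS];
		int num_maps = 32, hits = 0, misses = 0;

		for (int i = 0; i < CACHE_SLOTS; i++)
			slot[i] = -1;

		for (int pass = 0; pass < 1000; pass++) {
			for (int map = 0; map < num_maps; map++) {
				int idx = map % CACHE_SLOTS;

				if (slot[idx] == map)
					hits++;
				else
					misses++;
				slot[idx] = map;
			}
		}
		/* num_maps <= 16: ~all hits; num_maps >= 32: all misses */
		printf("%d maps: %d hits, %d misses\n",
		       num_maps, hits, misses);
		return 0;
	}

Once maps start sharing slots, every lookup falls back to walking the
per-task list of local storages, which is what the numbers above show.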
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still reduces as #
> maps increases, though.
>
> To get a sense of the overhead of the benchmark program, I commented
> out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c
> and ran the benchmark on the same host as the 'real' run. Results:
>
> Local Storage
> =============
> Hashmap Control w/ 500 maps
> hashmap (control) sequential get: hits throughput: 96.965 ± 1.346 M ops/s, hits latency: 10.313 ns/op, important_hits throughput: 0.194 ± 0.003 M ops/s
>
> num_maps: 1
> local_storage cache sequential get: hits throughput: 105.792 ± 1.860 M ops/s, hits latency: 9.453 ns/op, important_hits throughput: 105.792 ± 1.860 M ops/s
> local_storage cache interleaved get: hits throughput: 185.847 ± 4.014 M ops/s, hits latency: 5.381 ns/op, important_hits throughput: 185.847 ± 4.014 M ops/s
>
> num_maps: 10
> local_storage cache sequential get: hits throughput: 109.867 ± 1.358 M ops/s, hits latency: 9.102 ns/op, important_hits throughput: 10.987 ± 0.136 M ops/s
> local_storage cache interleaved get: hits throughput: 144.165 ± 1.256 M ops/s, hits latency: 6.936 ns/op, important_hits throughput: 51.487 ± 0.449 M ops/s
>
> num_maps: 16
> local_storage cache sequential get: hits throughput: 109.258 ± 1.902 M ops/s, hits latency: 9.153 ns/op, important_hits throughput: 6.829 ± 0.119 M ops/s
> local_storage cache interleaved get: hits throughput: 140.248 ± 1.836 M ops/s, hits latency: 7.130 ns/op, important_hits throughput: 44.624 ± 0.584 M ops/s
>
> num_maps: 17
> local_storage cache sequential get: hits throughput: 116.397 ± 7.679 M ops/s, hits latency: 8.591 ns/op, important_hits throughput: 6.856 ± 0.452 M ops/s
> local_storage cache interleaved get: hits throughput: 128.411 ± 4.927 M ops/s, hits latency: 7.787 ns/op, important_hits throughput: 39.093 ± 1.500 M ops/s
>
> num_maps: 24
> local_storage cache sequential get: hits throughput: 110.890 ± 0.976 M ops/s, hits latency: 9.018 ns/op, important_hits throughput: 4.624 ± 0.041 M ops/s
> local_storage cache interleaved get: hits throughput: 133.316 ± 1.889 M ops/s, hits latency: 7.501 ns/op, important_hits throughput: 37.503 ± 0.531 M ops/s
>
> num_maps: 32
> local_storage cache sequential get: hits throughput: 112.900 ± 1.171 M ops/s, hits latency: 8.857 ns/op, important_hits throughput: 3.534 ± 0.037 M ops/s
> local_storage cache interleaved get: hits throughput: 132.844 ± 1.207 M ops/s, hits latency: 7.528 ns/op, important_hits throughput: 37.081 ± 0.337 M ops/s
>
> num_maps: 100
> local_storage cache sequential get: hits throughput: 110.025 ± 4.714 M ops/s, hits latency: 9.089 ns/op, important_hits throughput: 1.100 ± 0.047 M ops/s
> local_storage cache interleaved get: hits throughput: 131.979 ± 5.013 M ops/s, hits latency: 7.577 ns/op, important_hits throughput: 34.472 ± 1.309 M ops/s
>
> num_maps: 1000
> local_storage cache sequential get: hits throughput: 117.850 ± 2.423 M ops/s, hits latency: 8.485 ns/op, important_hits throughput: 0.118 ± 0.002 M ops/s
> local_storage cache interleaved get: hits throughput: 141.268 ± 9.658 M ops/s, hits latency: 7.079 ns/op, important_hits throughput: 35.476 ± 2.425 M ops/s
>
> Adjusting for overhead, latency numbers for "hashmap control" and
> "sequential get" are:
>
> hashmap_control: ~10.4ns
> sequential_get_1: ~13.0ns

So what this benchmark doesn't demonstrate is why one would use local
storage at all if hashmap is so fast :) I think at least partially it's
because of your choice to do a fixed hashmap lookup with a zero key.
Think about how you'd replace local_storage with a hashmap: you'd have a
task/socket/whatever ID as the lookup key, so for different tasks you'd
be looking up different pids. For your benchmark you have the same task
all the time, but a local_storage lookup still does all the work to find
the local storage instance in the list of local storages for the current
task, so you don't have to use many tasks to simulate realistic lookup
overhead (well, at least to some extent). But it doesn't seem realistic
for testing hashmap as an alternative to local_storage; for that I think
we'd need to randomize the key lookup a bit (a quick sketch of what I
mean is at the bottom of this mail). Unless I misunderstand what we are
testing for the hashmap use case.

But other than that LGTM.

> sequential_get_10: ~13.8ns
> sequential_get_16: ~13.6ns
> sequential_get_17: ~21.5ns
> sequential_get_24: ~42.5ns
> sequential_get_32: ~65.9ns
> sequential_get_100: ~155.7ns
> sequential_get_1000: ~2270ns
>
> Clearly demonstrating a cliff.
>
> When running the benchmarks it may be necessary to bump the 'open
> files' ulimit for a successful run.
>
> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
> ---
> Changelog:
>
> v3 -> v4:
>   * Remove ifs guarding increments in measure fn (Andrii)
>   * Refactor to use 1 bpf prog for all 3 benchmarks w/ global vars set
>     from userspace before load to control behavior (Andrii)
>   * Greatly reduce variance in overhead by having benchmark bpf prog
>     loop 10k times regardless of map count (Andrii)
>   * Also, move sync_fetch_and_incr out of do_lookup as the guaranteed
>     second sync_fetch_and_incr call for num_maps = 1 was adding
>     overhead
>   * Add second patch refactoring bench.c's mean/stddev calculations
>     in reporting helper fns
>
> v2 -> v3:
>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
>     so just use 1 program (Alexei)
>
> v1 -> v2:
>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>     configurable # of maps (Andrii)
>   * Add hashmap benchmark (Alexei)
>   * Add discussion of overhead
>
>  tools/testing/selftests/bpf/Makefile          |   4 +-
>  tools/testing/selftests/bpf/bench.c           |  55 ++++
>  tools/testing/selftests/bpf/bench.h           |   4 +
>  .../bpf/benchs/bench_local_storage.c          | 250 ++++++++++++++++++
>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>  .../selftests/bpf/benchs/run_common.sh        |  17 ++
>  .../selftests/bpf/progs/local_storage_bench.c |  99 +++++++
>  7 files changed, 449 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.c

[...]

> +
> +static void hashmap_setup(void)
> +{
> +	struct local_storage_bench *skel;
> +
> +	setup_libbpf();
> +
> +	skel = local_storage_bench__open();
> +	ctx.skel = skel;
> +	ctx.bpf_obj = skel->obj;

nit: ctx.skel->obj is the same as ctx.bpf_obj, so bpf_obj is probably
not needed?

> +	ctx.array_of_maps = skel->maps.array_of_hash_maps;
> +	skel->rodata->use_hashmap = 1;
> +	skel->rodata->interleave = 0;
> +
> +	__setup(skel->progs.get_local, true);
> +}
> +

[...]
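Re: "randomize the key lookup" above, a rough sketch of what I have in
mind for the hashmap path of the benchmark prog. Illustrative only, not
code from this patch; inner_map and num_keys are made-up names, and it
assumes userspace pre-populates num_keys entries in each hashmap:

	/* pick a random pre-populated key instead of always key 0, so
	 * the lookup actually exercises hashing/bucket traversal the
	 * way a pid-keyed map would
	 */
	int key = bpf_get_prandom_u32() % num_keys;
	void *val;

	val = bpf_map_lookup_elem(inner_map, &key);
	if (val)
		__sync_fetch_and_add(&hits, 1);

Cycling through a fixed set of keys would work just as well; the point
is only that the hashmap key shouldn't be constant.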