On 6/3/22 5:26 PM, Andrii Nakryiko wrote:
> On Mon, May 30, 2022 at 1:27 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>>
>> Add benchmarks to demonstrate the performance cliff for local_storage
>> get as the number of local_storage maps increases beyond the current
>> local_storage implementation's cache size.
>>
>> "sequential get" and "interleaved get" benchmarks are added, both of
>> which do many bpf_task_storage_get calls on sets of task local_storage
>> maps of various counts, while considering a single specific map to be
>> 'important' and counting task_storage_gets to the important map
>> separately in addition to the normal 'hits' count of all gets. The goal
>> here is to mimic a scenario where a particular program using one map -
>> the important one - is running on a system where many other
>> local_storage maps exist and are accessed often.
>>
>> While the "sequential get" benchmark does bpf_task_storage_get for maps
>> 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
>> interleaves 4 bpf_task_storage_gets for the important map for every 10
>> map gets. This is meant to highlight performance differences when the
>> important map is accessed far more frequently than non-important maps.
>>
>> A "hashmap control" benchmark is also included for easy comparison of
>> standard bpf hashmap lookup vs local_storage get. The benchmark is
>> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
>> instead of local storage.
>>
>> Addition of this benchmark is inspired by a conversation with Alexei in
>> a previous patchset's thread [0], which highlighted the need for such a
>> benchmark to motivate and validate improvements to the local_storage
>> implementation. My approach in that series focused on improving
>> performance for explicitly-marked 'important' maps and was rejected
>> with feedback to make more generally-applicable improvements while
>> avoiding explicitly marking maps as important. Thus the benchmark
>> reports both general and important-map-focused metrics, so the effect
>> of future work on both is clear.
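For context when reading the numbers below, the bpf side of the benchmark
boils down to roughly the following sketch. This is simplified, not the
exact local_storage_bench.c; names like important_map_idx and the attach
point are placeholders, and the hit counters were actually moved out of
do_lookup in v4:

  /* Simplified sketch of the benchmark's bpf prog: an ARRAY_OF_MAPS of
   * task local_storage maps, with rodata knobs set from userspace
   * before load.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  struct inner_map_type {
  	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
  	__uint(map_flags, BPF_F_NO_PREALLOC);
  	__type(key, int);
  	__type(value, int);
  };

  struct {
  	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
  	__uint(max_entries, 1000);
  	__type(key, int);
  	__type(value, int);
  	__array(values, struct inner_map_type);
  } array_of_local_storage_maps SEC(".maps");

  const volatile unsigned int num_maps = 1;
  const volatile unsigned int important_map_idx = 0;
  const volatile unsigned int interleave = 0;

  long hits, important_hits;

  static void do_lookup(unsigned int elem, struct task_struct *task)
  {
  	void *map, *data;

  	map = bpf_map_lookup_elem(&array_of_local_storage_maps, &elem);
  	if (!map)
  		return;
  	/* the operation being measured: a miss in the 16-entry cache
  	 * means walking the task's list of local storages
  	 */
  	data = bpf_task_storage_get(map, task, NULL,
  				    BPF_LOCAL_STORAGE_GET_F_CREATE);
  	if (data) {
  		__sync_fetch_and_add(&hits, 1);
  		if (elem == important_map_idx)
  			__sync_fetch_and_add(&important_hits, 1);
  	}
  }

  SEC("fentry/__x64_sys_getpgid")
  int get_local(void *ctx)
  {
  	struct task_struct *task = bpf_get_current_task_btf();
  	unsigned int i;

  	for (i = 0; i < 10000; i++) {
  		/* "sequential": walk maps 0..num_maps-1 in order */
  		do_lookup(i % num_maps, task);
  		/* "interleaved": extra gets to the important map, 4
  		 * for every 10 regular gets
  		 */
  		if (interleave && (i % 10) < 4)
  			do_lookup(important_map_idx, task);
  	}
  	return 0;
  }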
>>
>> Regarding the benchmark results. On a powerful system (Skylake, 20
>> cores, 256GB RAM):
>>
>> Local Storage
>> =============
>> Hashmap Control w/ 500 maps
>> hashmap (control) sequential get: hits throughput: 48.338 ± 2.366 M ops/s, hits latency: 20.688 ns/op, important_hits throughput: 0.097 ± 0.005 M ops/s
>>
>> num_maps: 1
>> local_storage cache sequential get: hits throughput: 44.503 ± 1.080 M ops/s, hits latency: 22.470 ns/op, important_hits throughput: 44.503 ± 1.080 M ops/s
>> local_storage cache interleaved get: hits throughput: 54.963 ± 0.586 M ops/s, hits latency: 18.194 ns/op, important_hits throughput: 54.963 ± 0.586 M ops/s
>>
>> num_maps: 10
>> local_storage cache sequential get: hits throughput: 43.743 ± 0.418 M ops/s, hits latency: 22.861 ns/op, important_hits throughput: 4.374 ± 0.042 M ops/s
>> local_storage cache interleaved get: hits throughput: 50.073 ± 0.609 M ops/s, hits latency: 19.971 ns/op, important_hits throughput: 17.883 ± 0.217 M ops/s
>>
>> num_maps: 16
>> local_storage cache sequential get: hits throughput: 43.962 ± 0.525 M ops/s, hits latency: 22.747 ns/op, important_hits throughput: 2.748 ± 0.033 M ops/s
>> local_storage cache interleaved get: hits throughput: 48.166 ± 0.825 M ops/s, hits latency: 20.761 ns/op, important_hits throughput: 15.326 ± 0.263 M ops/s
>>
>> num_maps: 17
>> local_storage cache sequential get: hits throughput: 33.207 ± 0.461 M ops/s, hits latency: 30.114 ns/op, important_hits throughput: 1.956 ± 0.027 M ops/s
>> local_storage cache interleaved get: hits throughput: 43.540 ± 0.265 M ops/s, hits latency: 22.968 ns/op, important_hits throughput: 13.255 ± 0.081 M ops/s
>>
>> num_maps: 24
>> local_storage cache sequential get: hits throughput: 19.402 ± 0.348 M ops/s, hits latency: 51.542 ns/op, important_hits throughput: 0.809 ± 0.015 M ops/s
>> local_storage cache interleaved get: hits throughput: 22.981 ± 0.487 M ops/s, hits latency: 43.514 ns/op, important_hits throughput: 6.465 ± 0.137 M ops/s
>>
>> num_maps: 32
>> local_storage cache sequential get: hits throughput: 13.378 ± 0.220 M ops/s, hits latency: 74.748 ns/op, important_hits throughput: 0.419 ± 0.007 M ops/s
>> local_storage cache interleaved get: hits throughput: 16.894 ± 0.172 M ops/s, hits latency: 59.193 ns/op, important_hits throughput: 4.716 ± 0.048 M ops/s
>>
>> num_maps: 100
>> local_storage cache sequential get: hits throughput: 6.070 ± 0.140 M ops/s, hits latency: 164.745 ns/op, important_hits throughput: 0.061 ± 0.001 M ops/s
>> local_storage cache interleaved get: hits throughput: 7.323 ± 0.149 M ops/s, hits latency: 136.554 ns/op, important_hits throughput: 1.913 ± 0.039 M ops/s
>>
>> num_maps: 1000
>> local_storage cache sequential get: hits throughput: 0.438 ± 0.012 M ops/s, hits latency: 2281.369 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
>> local_storage cache interleaved get: hits throughput: 0.522 ± 0.010 M ops/s, hits latency: 1913.937 ns/op, important_hits throughput: 0.131 ± 0.003 M ops/s
>>
>> Looking at the "sequential get" results, it's clear that as the
>> number of task local_storage maps grows beyond the current cache size
>> (16), there's a significant reduction in hits throughput. Note that
>> the current local_storage implementation assigns a cache_idx to maps
>> as they are created. Since "sequential get" is creating maps 0..n in
>> order and then doing bpf_task_storage_get calls in the same order, the
>> benchmark is effectively ensuring that a map will not be in cache when
>> the program tries to access it.
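To make the cache behavior concrete, here is a userspace toy model of the
16-slot cache, assuming cache_idx is handed out round-robin at map
creation, which is effectively what creating maps 0..n in order yields.
An illustration only, not kernel code:

  /* Toy model of the cache_idx scheme described above. */
  #include <stdio.h>

  #define CACHE_SIZE 16

  int main(void)
  {
  	int cache[CACHE_SIZE], num_maps = 32, misses = 0;

  	for (int i = 0; i < CACHE_SIZE; i++)
  		cache[i] = -1; /* empty slot */

  	/* two sequential passes over all maps, like "sequential get" */
  	for (int round = 0; round < 2; round++) {
  		for (int map = 0; map < num_maps; map++) {
  			int idx = map % CACHE_SIZE; /* cache_idx */

  			if (cache[idx] != map) {
  				misses++;         /* full list walk */
  				cache[idx] = map; /* cached after miss */
  			}
  		}
  	}
  	/* with num_maps = 32, maps m and m+16 share a slot and evict
  	 * each other, so every get misses even after warmup; with
  	 * num_maps <= 16 the second pass has zero misses
  	 */
  	printf("misses: %d / %d gets\n", misses, 2 * num_maps);
  	return 0;
  }

This lines up with the cliff in the numbers above: at 17 maps only one
slot thrashes, by 32 maps every slot does.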
>>
>> For "interleaved get" results, important-map hits throughput is greatly
>> increased as the important map is more likely to be in cache by virtue
>> of being accessed far more frequently. Throughput still reduces as the
>> number of maps increases, though.
>>
>> To get a sense of the overhead of the benchmark program, I commented
>> out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c
>> and ran the benchmark on the same host as the 'real' run. Results:
>>
>> Local Storage
>> =============
>> Hashmap Control w/ 500 maps
>> hashmap (control) sequential get: hits throughput: 96.965 ± 1.346 M ops/s, hits latency: 10.313 ns/op, important_hits throughput: 0.194 ± 0.003 M ops/s
>>
>> num_maps: 1
>> local_storage cache sequential get: hits throughput: 105.792 ± 1.860 M ops/s, hits latency: 9.453 ns/op, important_hits throughput: 105.792 ± 1.860 M ops/s
>> local_storage cache interleaved get: hits throughput: 185.847 ± 4.014 M ops/s, hits latency: 5.381 ns/op, important_hits throughput: 185.847 ± 4.014 M ops/s
>>
>> num_maps: 10
>> local_storage cache sequential get: hits throughput: 109.867 ± 1.358 M ops/s, hits latency: 9.102 ns/op, important_hits throughput: 10.987 ± 0.136 M ops/s
>> local_storage cache interleaved get: hits throughput: 144.165 ± 1.256 M ops/s, hits latency: 6.936 ns/op, important_hits throughput: 51.487 ± 0.449 M ops/s
>>
>> num_maps: 16
>> local_storage cache sequential get: hits throughput: 109.258 ± 1.902 M ops/s, hits latency: 9.153 ns/op, important_hits throughput: 6.829 ± 0.119 M ops/s
>> local_storage cache interleaved get: hits throughput: 140.248 ± 1.836 M ops/s, hits latency: 7.130 ns/op, important_hits throughput: 44.624 ± 0.584 M ops/s
>>
>> num_maps: 17
>> local_storage cache sequential get: hits throughput: 116.397 ± 7.679 M ops/s, hits latency: 8.591 ns/op, important_hits throughput: 6.856 ± 0.452 M ops/s
>> local_storage cache interleaved get: hits throughput: 128.411 ± 4.927 M ops/s, hits latency: 7.787 ns/op, important_hits throughput: 39.093 ± 1.500 M ops/s
>>
>> num_maps: 24
>> local_storage cache sequential get: hits throughput: 110.890 ± 0.976 M ops/s, hits latency: 9.018 ns/op, important_hits throughput: 4.624 ± 0.041 M ops/s
>> local_storage cache interleaved get: hits throughput: 133.316 ± 1.889 M ops/s, hits latency: 7.501 ns/op, important_hits throughput: 37.503 ± 0.531 M ops/s
>>
>> num_maps: 32
>> local_storage cache sequential get: hits throughput: 112.900 ± 1.171 M ops/s, hits latency: 8.857 ns/op, important_hits throughput: 3.534 ± 0.037 M ops/s
>> local_storage cache interleaved get: hits throughput: 132.844 ± 1.207 M ops/s, hits latency: 7.528 ns/op, important_hits throughput: 37.081 ± 0.337 M ops/s
>>
>> num_maps: 100
>> local_storage cache sequential get: hits throughput: 110.025 ± 4.714 M ops/s, hits latency: 9.089 ns/op, important_hits throughput: 1.100 ± 0.047 M ops/s
>> local_storage cache interleaved get: hits throughput: 131.979 ± 5.013 M ops/s, hits latency: 7.577 ns/op, important_hits throughput: 34.472 ± 1.309 M ops/s
>>
>> num_maps: 1000
>> local_storage cache sequential get: hits throughput: 117.850 ± 2.423 M ops/s, hits latency: 8.485 ns/op, important_hits throughput: 0.118 ± 0.002 M ops/s
>> local_storage cache interleaved get: hits throughput: 141.268 ± 9.658 M ops/s, hits latency: 7.079 ns/op, important_hits throughput: 35.476 ± 2.425 M ops/s
>>
>> Adjusting for overhead, latency numbers for "hashmap control" and
>> "sequential get" are:
>>
>> hashmap_control: ~10.4ns
>> sequential_get_1: ~13.0ns
>
> So what this benchmark doesn't demonstrate is why one would use local
> storage at all if hashmap is so fast :)
>
> I think at least partially it's because of your choice to do a fixed
> hashmap lookup with a zero key. Think about how you'd replace
> local_storage with a hashmap. You'd have task/socket/whatever ID as the
> lookup key. For different tasks you'd be looking up different pids. For
> your benchmark you have the same task all the time, but the
> local_storage lookup still does all the work to find the local storage
> instance in a list of local storages for the current task, so you don't
> have to use many tasks to simulate realistic lookup overhead (well, at
> least to some extent). But it doesn't seem realistic for testing
> hashmap as an alternative to local_storage; for that I think we'd need
> to randomize key lookup a bit. Unless I misunderstand what we are
> testing for the hashmap use case.
>
> But other than that LGTM.

This makes more sense than what I'm doing now, and I think it better
fulfills Alexei's request from v1 of this series:

  Also could you add a hash map with key=tid. It would be interesting
  to compare local storage with hash map. iirc when local storage was
  introduced it was 2.5 times faster than large hashmap. Since then
  both have changed.

I tried changing the "hashmap control" benchmark to just have 1 inner
hashmap, with max_entries 4,194,304 (PID_MAX_LIMIT on my system). The
use_hashmap branch of the bpf prog does bpf_get_prandom_u32 to select
the key. Results are more promising:

Overhead: ~19.5ns (prandom call + additional modulo)
4,194,304 possible keys: ~150.5ns (~131ns w/o overhead)
100,000 possible keys: ~55ns (~35.5ns w/o overhead)
1,000 possible keys: ~31ns (~11.5ns w/o overhead)
10 possible keys: ~28ns (~8.5ns w/o overhead)

So a big hashmap with only a few keys used is faster than local_storage,
but far slower when many keys are used. Somewhere between "100k keys"
and "1k keys" is probably what performance would look like if a program
which needed an entry per running task used a hashmap instead of
local_storage. Close to Alexei's memory of a 2.5x difference.

Will send in v5.
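For concreteness, the change amounts to roughly the following sketch.
This is not the final v5 code: hash_map and possible_keys are
placeholder names, and the array-of-maps indirection is collapsed to a
single hashmap for brevity.

  struct {
  	__uint(type, BPF_MAP_TYPE_HASH);
  	__uint(max_entries, 4194304); /* PID_MAX_LIMIT on the test host */
  	__type(key, unsigned int);
  	__type(value, int);
  } hash_map SEC(".maps");

  /* set from userspace before load */
  const volatile unsigned int possible_keys = 4194304;

  long hits;

  static void do_hash_lookup(void)
  {
  	/* randomize the key so lookups spread across buckets instead
  	 * of hitting a single fixed zero key; the prandom call plus
  	 * modulo is the ~19.5ns of extra overhead measured above
  	 */
  	unsigned int key = bpf_get_prandom_u32() % possible_keys;

  	if (bpf_map_lookup_elem(&hash_map, &key))
  		__sync_fetch_and_add(&hits, 1);
  }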
>
>> sequential_get_10: ~13.8ns
>> sequential_get_16: ~13.6ns
>> sequential_get_17: ~21.5ns
>> sequential_get_24: ~42.5ns
>> sequential_get_32: ~65.9ns
>> sequential_get_100: ~155.7ns
>> sequential_get_1000: ~2270ns
>>
>> Clearly demonstrating a cliff.
>>
>> When running the benchmarks it may be necessary to bump the 'open
>> files' ulimit for a successful run.
>>
>> [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@xxxxxx
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
>> ---
>> Changelog:
>>
>> v3 -> v4:
>>   * Remove ifs guarding increments in measure fn (Andrii)
>>   * Refactor to use 1 bpf prog for all 3 benchmarks w/ global vars set
>>     from userspace before load to control behavior (Andrii)
>>   * Greatly reduce variance in overhead by having benchmark bpf prog
>>     loop 10k times regardless of map count (Andrii)
>>     * Also, move sync_fetch_and_incr out of do_lookup as the guaranteed
>>       second sync_fetch_and_incr call for num_maps = 1 was adding
>>       overhead
>>   * Add second patch refactoring bench.c's mean/stddev calculations
>>     in reporting helper fns
>>
>> v2 -> v3:
>>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
>>     so just use 1 program (Alexei)
>>
>> v1 -> v2:
>>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>>     configurable # of maps (Andrii)
>>   * Add hashmap benchmark (Alexei)
>>   * Add discussion of overhead
>>
>>  tools/testing/selftests/bpf/Makefile          |   4 +-
>>  tools/testing/selftests/bpf/bench.c           |  55 ++++
>>  tools/testing/selftests/bpf/bench.h           |   4 +
>>  .../bpf/benchs/bench_local_storage.c          | 250 ++++++++++++++++++
>>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>>  .../selftests/bpf/benchs/run_common.sh        |  17 ++
>>  .../selftests/bpf/progs/local_storage_bench.c |  99 +++++++
>>  7 files changed, 449 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.c
>>
>
> [...]
>
>> +
>> +static void hashmap_setup(void)
>> +{
>> +	struct local_storage_bench *skel;
>> +
>> +	setup_libbpf();
>> +
>> +	skel = local_storage_bench__open();
>> +	ctx.skel = skel;
>> +	ctx.bpf_obj = skel->obj;
>
> nit: ctx.skel->obj is the same as ctx.bpf_obj, so bpf_obj is probably
> not needed?

Will get rid of it in v5.

>
>> +	ctx.array_of_maps = skel->maps.array_of_hash_maps;
>> +	skel->rodata->use_hashmap = 1;
>> +	skel->rodata->interleave = 0;
>> +
>> +	__setup(skel->progs.get_local, true);
>> +}
>> +
>
> [...]