On 1/9/2024 5:45 AM, Song Liu wrote:
> On Thu, Dec 21, 2023 at 6:14 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>
> [...]
>> The following is the test results conducted on a 8-CPU VM with 16GB memory:
>>
>> $ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a; done |grep Summary
>> Summary: per-prod alloc 11.29 ± 0.14M/s free 33.76 ± 0.33M/s, total memory usage    0.01 ± 0.00MiB
>> Summary: per-prod alloc  7.49 ± 0.12M/s free 34.42 ± 0.56M/s, total memory usage    0.03 ± 0.00MiB
>> Summary: per-prod alloc  6.66 ± 0.08M/s free 34.27 ± 0.41M/s, total memory usage    0.06 ± 0.00MiB
>>
>> $ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done |grep Summary
>> Summary: per-prod alloc 14.64 ± 0.60M/s free 36.94 ± 0.35M/s, total memory usage  188.02 ± 7.43MiB
>> Summary: per-prod alloc 12.39 ± 1.32M/s free 36.40 ± 0.38M/s, total memory usage  808.90 ± 25.56MiB
>> Summary: per-prod alloc 10.80 ± 0.17M/s free 35.45 ± 0.25M/s, total memory usage 2330.24 ± 480.56MiB
> This is not likely related to this patch, but do we expect this much
> memory usage?
> I guess the 2.3GiB number is from bigger ALLOC_OBJ_SIZE and
> ALLOC_BATCH_CNT? I am getting 0 MiB with this test on my VM.

I think the reduction in memory usage is due to the merge of the patch set
"bpf: Reduce memory usage for bpf_global_percpu_ma", because I got a
similar result to yours after applying that patch set [1].

1: https://lore.kernel.org/bpf/cb8edf4b-f585-4e3e-9bed-10f5b36e427c@xxxxxxxxxxxxxxx/

>
>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>> ---
>>  tools/testing/selftests/bpf/Makefile          |   2 +
>>  tools/testing/selftests/bpf/bench.c           |   4 +
>>  tools/testing/selftests/bpf/bench.h           |   7 +
>>  .../selftests/bpf/benchs/bench_bpf_ma.c       | 273 ++++++++++++++++++
>>  .../selftests/bpf/progs/bench_bpf_ma.c        | 222 ++++++++++++++
> Maybe add a run_bench_bpf_ma.sh script in selftests/bpf/benchs?

I forgot that. Will add it in v2.

>
>>  5 files changed, 508 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_ma.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/bench_bpf_ma.c
>>
> [...]
>> diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
>> index a6fcf111221f..206cf3de5df2 100644
>> --- a/tools/testing/selftests/bpf/bench.h
>> +++ b/tools/testing/selftests/bpf/bench.h
>> @@ -53,6 +53,13 @@ struct bench_res {
>>  			unsigned long gp_ct;
>>  			unsigned int stime;
>>  		} rcu;
>> +		struct {
>> +			unsigned long alloc;
>> +			unsigned long free;
> nit: maybe add _ct or _cnt postfix to match "rcu" above or the skel?

Will do in v2.

>
>> +			unsigned long alloc_ns;
>> +			unsigned long free_ns;
>> +			unsigned long mem_bytes;
>> +		} ma;
>>  	};
>>  };
>>
> [...]
>> +
>> +static void bpf_ma_validate(void)
>> +{
>> +}
> Empty validate() function seems not necessary.

Yes. Will remove it.

>
> [...]
>
>> +
>> +static void bpf_ma_report_final(struct bench_res res[], int res_cnt)
>> +{
>> +	double mem_mean = 0.0, mem_stddev = 0.0;
>> +	double alloc_mean = 0.0, alloc_stddev = 0.0;
>> +	double free_mean = 0.0, free_stddev = 0.0;
>> +	double alloc_ns = 0.0, free_ns = 0.0;
>> +	int i;
>> +
>> +	for (i = 0; i < res_cnt; i++) {
>> +		alloc_ns += res[i].ma.alloc_ns;
>> +		free_ns += res[i].ma.free_ns;
>> +	}
>> +	for (i = 0; i < res_cnt; i++) {
>> +		if (alloc_ns)
>> +			alloc_mean += res[i].ma.alloc * 1000.0 / alloc_ns;
>> +		if (free_ns)
>> +			free_mean += res[i].ma.free * 1000.0 / free_ns;
>> +		mem_mean += res[i].ma.mem_bytes / 1048576.0 / (0.0 + res_cnt);
>> +	}
>> +	if (res_cnt > 1) {
>> +		for (i = 0; i < res_cnt; i++) {
>> +			double sample;
>> +
>> +			sample = res[i].ma.alloc_ns ? res[i].ma.alloc * 1000.0 /
>> +				 res[i].ma.alloc_ns : 0.0;
>> +			alloc_stddev += (alloc_mean - sample) * (alloc_mean - sample) /
>> +					(res_cnt - 1.0);
>> +
>> +			sample = res[i].ma.free_ns ? res[i].ma.free * 1000.0 /
>> +				 res[i].ma.free_ns : 0.0;
>> +			free_stddev += (free_mean - sample) * (free_mean - sample) /
>> +				       (res_cnt - 1.0);
>> +
>> +			sample = res[i].ma.mem_bytes / 1048576.0;
>> +			mem_stddev += (mem_mean - sample) * (mem_mean - sample) /
>> +				      (res_cnt - 1.0);
>> +		}
> nit: We can probably refactor common code for stddev calculation into
> some helpers.

Will try. The calculation of the alloc and free rates is the same, but the
names of the related fields are different, so we may need to define
alloc_cnt/free_cnt as an array first (there is a rough sketch at the end of
this mail).

>
>> +		alloc_stddev = sqrt(alloc_stddev);
>> +		free_stddev = sqrt(free_stddev);
>> +		mem_stddev = sqrt(mem_stddev);
>> +	}
>> +
>> +	printf("Summary: per-prod alloc %7.2lf \u00B1 %3.2lfM/s free %7.2lf \u00B1 %3.2lfM/s, "
>> +	       "total memory usage %7.2lf \u00B1 %3.2lfMiB\n",
>> +	       alloc_mean, alloc_stddev, free_mean, free_stddev,
>> +	       mem_mean, mem_stddev);
>> +}
>> +
>> +const struct bench bench_bpf_mem_alloc = {
>> +	.name = "bpf_ma",
>> +	.argp = &bench_bpf_mem_alloc_argp,
>> +	.validate = bpf_ma_validate,
>> +	.setup = bpf_ma_setup,
>> +	.producer_thread = bpf_ma_producer,
>> +	.measure = bpf_ma_measure,
>> +	.report_progress = bpf_ma_report_progress,
>> +	.report_final = bpf_ma_report_final,
>> +};
>> diff --git a/tools/testing/selftests/bpf/progs/bench_bpf_ma.c b/tools/testing/selftests/bpf/progs/bench_bpf_ma.c
> [...]
>
>> + SNIP
>> +/* Return the number of freed objects */
>> +static __always_inline unsigned int batch_percpu_free(struct bpf_map *map)
>> +{
>> +	struct percpu_map_value *value;
>> +	unsigned int i, key;
>> +	void *old;
>> +
>> +	for (i = 0; i < ALLOC_BATCH_CNT; i++) {
>> +		key = i;
>> +		value = bpf_map_lookup_elem(map, &key);
>> +		if (!value)
>> +			return i;
>> +
>> +		old = bpf_kptr_xchg(&value->data, NULL);
>> +		if (!old)
>> +			return i;
>> +		bpf_percpu_obj_drop(old);
>> +	}
>> +
>> +	return ALLOC_BATCH_CNT;
>> +}
> nit: These four functions have quite duplicated code. We can probably
> refactor them a bit.

Will do. The main difference is that these functions use different helpers
to allocate and free the memory, so I think we could pass a bool to common
allocation and free inline functions (see the sketch at the end of this
mail).

>
>> +
>> +SEC("?fentry/" SYS_PREFIX "sys_getpgid")
>> +int bench_batch_alloc_free(void *ctx)
>> +{
>> +	u64 start, delta;
>> +	unsigned int cnt;
>> +	void *map;
> s/void */struct bpf_map */?

Will fix it in v2.
>
>> +	int key;
>> +
>> +	key = bpf_get_smp_processor_id();
>> +	map = bpf_map_lookup_elem((void *)&outer_array, &key);
>> +	if (!map)
>> +		return 0;
>> +
>> +	start = bpf_ktime_get_boot_ns();
>> +	cnt = batch_alloc(map);
>> +	delta = bpf_ktime_get_boot_ns() - start;
>> +	__sync_fetch_and_add(&alloc_cnt, cnt);
>> +	__sync_fetch_and_add(&alloc_ns, delta);
>> +
>> +	start = bpf_ktime_get_boot_ns();
>> +	cnt = batch_free(map);
>> +	delta = bpf_ktime_get_boot_ns() - start;
>> +	__sync_fetch_and_add(&free_cnt, cnt);
>> +	__sync_fetch_and_add(&free_ns, delta);
>> +
>> +	return 0;
>> +}
>> +
>> +SEC("?fentry/" SYS_PREFIX "sys_getpgid")
>> +int bench_batch_percpu_alloc_free(void *ctx)
>> +{
>> +	u64 start, delta;
>> +	unsigned int cnt;
>> +	void *map;
> ditto

Will update in v2.

>
>> +	int key;
>> +
>> +	key = bpf_get_smp_processor_id();
>> +	map = bpf_map_lookup_elem((void *)&percpu_outer_array, &key);
>> +	if (!map)
>> +		return 0;
>> +
>> +	start = bpf_ktime_get_boot_ns();
>> +	cnt = batch_percpu_alloc(map);
>> +	delta = bpf_ktime_get_boot_ns() - start;
>> +	__sync_fetch_and_add(&alloc_cnt, cnt);
>> +	__sync_fetch_and_add(&alloc_ns, delta);
>> +
>> +	start = bpf_ktime_get_boot_ns();
>> +	cnt = batch_percpu_free(map);
>> +	delta = bpf_ktime_get_boot_ns() - start;
>> +	__sync_fetch_and_add(&free_cnt, cnt);
>> +	__sync_fetch_and_add(&free_ns, delta);
>> +
>> +	return 0;
>> +}
> nit: ditto duplicated code.

Will factor it out as a common function (see the last sketch below).

Thanks for all these suggestions.
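
As mentioned above, here are some rough sketches of what I have in mind for
v2; the helper names are tentative and may change. For the stddev refactoring
in bpf_ma_report_final(), something along the following lines, where the
callers would first collect the per-producer alloc/free rates into arrays:

/* Rate of one sample in M/s (count * 1000.0 / ns), shared by the mean
 * and stddev calculations for both alloc and free.
 */
static double ma_sample_rate(unsigned long cnt, unsigned long ns)
{
	return ns ? cnt * 1000.0 / ns : 0.0;
}

/* Sample standard deviation of @samples around a precomputed @mean */
static double ma_stddev(const double *samples, int res_cnt, double mean)
{
	double stddev = 0.0;
	int i;

	if (res_cnt <= 1)
		return 0.0;
	for (i = 0; i < res_cnt; i++)
		stddev += (mean - samples[i]) * (mean - samples[i]) /
			  (res_cnt - 1.0);
	return sqrt(stddev);
}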
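
For the duplicated batch helpers in progs/bench_bpf_ma.c, passing a bool to
a common inline function would look roughly like below. Only the free side
is shown and the alloc side would follow the same pattern; "struct map_value"
is assumed to be the non-percpu counterpart of struct percpu_map_value, and
because the "percpu" flag is a compile-time constant at each call site, the
compiler should drop the unused branch after inlining:

/* Return the number of freed objects */
static __always_inline unsigned int batch_free_common(struct bpf_map *map, bool percpu)
{
	unsigned int i, key;
	void *old;

	for (i = 0; i < ALLOC_BATCH_CNT; i++) {
		key = i;
		if (percpu) {
			struct percpu_map_value *value;

			value = bpf_map_lookup_elem(map, &key);
			if (!value)
				return i;
			old = bpf_kptr_xchg(&value->data, NULL);
			if (!old)
				return i;
			bpf_percpu_obj_drop(old);
		} else {
			struct map_value *value;

			value = bpf_map_lookup_elem(map, &key);
			if (!value)
				return i;
			old = bpf_kptr_xchg(&value->data, NULL);
			if (!old)
				return i;
			bpf_obj_drop(old);
		}
	}

	return ALLOC_BATCH_CNT;
}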
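
And since the two fentry programs only differ in which outer array and which
batch helpers they use, they could share one always-inline body, e.g. like
below (do_batch_bench() is a tentative name; it still calls the existing
batch_{alloc,free}() and batch_percpu_{alloc,free}() helpers, and the dead
branches should again be removed after inlining):

static __always_inline int do_batch_bench(void *outer, bool percpu)
{
	u64 start, delta;
	unsigned int cnt;
	struct bpf_map *map;
	int key;

	key = bpf_get_smp_processor_id();
	map = bpf_map_lookup_elem(outer, &key);
	if (!map)
		return 0;

	start = bpf_ktime_get_boot_ns();
	cnt = percpu ? batch_percpu_alloc(map) : batch_alloc(map);
	delta = bpf_ktime_get_boot_ns() - start;
	__sync_fetch_and_add(&alloc_cnt, cnt);
	__sync_fetch_and_add(&alloc_ns, delta);

	start = bpf_ktime_get_boot_ns();
	cnt = percpu ? batch_percpu_free(map) : batch_free(map);
	delta = bpf_ktime_get_boot_ns() - start;
	__sync_fetch_and_add(&free_cnt, cnt);
	__sync_fetch_and_add(&free_ns, delta);

	return 0;
}

SEC("?fentry/" SYS_PREFIX "sys_getpgid")
int bench_batch_alloc_free(void *ctx)
{
	return do_batch_bench((void *)&outer_array, false);
}

SEC("?fentry/" SYS_PREFIX "sys_getpgid")
int bench_batch_percpu_alloc_free(void *ctx)
{
	return do_batch_bench((void *)&percpu_outer_array, true);
}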