Hou Tao wrote:
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> The benchmark can be used to compare the performance of hash map
> operations and the memory usage between different flavors of bpf memory
> allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It can
> also be used to check the performance improvement or the memory saving
> provided by an optimization.
>
> The benchmark creates a non-preallocated hash map which uses the bpf
> memory allocator and shows the operation performance and the memory
> usage of the hash map under different use cases:
>
> (1) overwrite
>     Each CPU overwrites a nonoverlapping part of the hash map. Each time
>     a CPU completes overwriting 64 elements in the hash map, it
>     increases op_count.
> (2) batch_add_batch_del
>     Each CPU adds then deletes a nonoverlapping part of the hash map in
>     batch. Each time a CPU adds and deletes 64 elements in the hash map,
>     it increases op_count twice.
> (3) add_del_on_diff_cpu
>     Each pair of CPUs adds and deletes a nonoverlapping part of the map
>     cooperatively. Each time a CPU adds or deletes 64 elements in the
>     hash map, it increases op_count.
>
> The following are the benchmark results when comparing the different
> flavors of bpf memory allocator. The tests were conducted on a KVM guest
> with 8 CPUs and 16 GB of memory. The command line below was used for all
> of the following benchmarks:
>
>   ./bench htab-mem --use-case $name ${OPTS} -w3 -d10 -a -p8
>
> These results show that the preallocated hash map has both better
> performance and a smaller memory footprint.
>
> (1) non-preallocated + no bpf memory allocator (v6.0.19)
>     use kmalloc() + call_rcu
>
> overwrite            per-prod-op: 11.24 ± 0.07k/s,  avg mem: 82.64 ± 26.32MiB, peak mem: 119.18MiB
> batch_add_batch_del  per-prod-op: 18.45 ± 0.10k/s,  avg mem: 50.47 ± 14.51MiB, peak mem: 94.96MiB
> add_del_on_diff_cpu  per-prod-op: 14.50 ± 0.03k/s,  avg mem: 4.64 ± 0.73MiB,   peak mem: 7.20MiB
>
> (2) preallocated
>     OPTS=--preallocated
>
> overwrite            per-prod-op: 191.92 ± 0.07k/s, avg mem: 1.23 ± 0.00MiB,   peak mem: 1.49MiB
> batch_add_batch_del  per-prod-op: 218.10 ± 0.25k/s, avg mem: 1.23 ± 0.00MiB,   peak mem: 1.49MiB
> add_del_on_diff_cpu  per-prod-op: 39.59 ± 0.41k/s,  avg mem: 1.48 ± 0.11MiB,   peak mem: 1.74MiB
>
> (3) normal bpf memory allocator
>
> overwrite            per-prod-op: 134.81 ± 0.22k/s, avg mem: 1.67 ± 0.12MiB,   peak mem: 2.74MiB
> batch_add_batch_del  per-prod-op: 90.44 ± 0.34k/s,  avg mem: 2.27 ± 0.00MiB,   peak mem: 2.74MiB
> add_del_on_diff_cpu  per-prod-op: 28.20 ± 0.15k/s,  avg mem: 1.73 ± 0.17MiB,   peak mem: 2.06MiB

Acked-by: John Fastabend <john.fastabend@xxxxxxxxx>

> +
> +static error_t htab_mem_parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +	switch (key) {
> +	case ARG_VALUE_SIZE:
> +		args.value_size = strtoul(arg, NULL, 10);
> +		if (args.value_size > 4096) {
> +			fprintf(stderr, "too big value size %u\n", args.value_size);
> +			argp_usage(state);
> +		}
> +		break;
> +	case ARG_USE_CASE:
> +		args.use_case = strdup(arg);

Might be worth checking for NULL here and returning an error? Only matters
if we run from CI or something, and then this looks like a flake.

> +		break;
> +	case ARG_PREALLOCATED:
> +		args.preallocated = true;
> +		break;
> +	default:
> +		return ARGP_ERR_UNKNOWN;
> +	}
> +
> +	return 0;
> +}