This patch introduces a module parameter, n_prefetch, which enables
prefetching within bpf_map_lookup_batch for faster lookups. The benefit
depends on the platform, the density of the map
(num_entries / capacity_of_map), and the value of n_prefetch, as
described below.

On multiprocessor machines, each key in a per-cpu BPF map has one value
per cpu. With this patch, while bpf_map_lookup_batch copies these
per-cpu values to userspace, the value belonging to the cpu n_prefetch
positions ahead in the iteration is prefetched.

MEASUREMENTS:

The benchmark test added in this patch series was used to measure the
effect of prefetching and to determine a good setting of n_prefetch.
The test was run on several platforms (with varying numbers of cpus),
with a range of n_prefetch settings, and with saturated, dense, and
sparse maps. The benchmark reports the average time for a single entry
lookup (t = total_time / num_entries_looked_up). The overhead of the
bpf_map_lookup_batch syscall introduces some error.

Experimental results:

AMD machine with 256 cores (Rome, Zen 2)

Density of map    n_prefetch    single entry lookup time (ns/op)
--------------------------------------------------------------------
40k / 40k              0                 16176.471
                       1                 13095.238
                       5                  7432.432
                      12                  5188.679
                      20                  9482.759

10k / 40k              0                 13253.012
                       5                  7482.993
                      12                  5164.319
                      20                  9649.123

2.5k / 40k             0                  7394.958
                       5                  7201.309
                      13                  4721.030
                      20                  8118.081

For denser maps, the experiments suggest that increasing n_prefetch
yields a significant time benefit (~66% decrease) up to a point, beyond
which the benefit shrinks again. For sparser maps, the speedup from
prefetching is less pronounced. On this machine the optimal n_prefetch
range appears to be 12-13, but a setting of n_prefetch = 5 still
improves the single entry lookup time.

Intel Skylake machine with 112 cores

Density of map    n_prefetch    single entry lookup time (ns/op)
------------------------------------------------------------------
40k / 40k              0                  5729.167
                       1                  5092.593
                       5                  3395.062
                      20                  6875.000

10k / 40k              0                  2029.520
                       5                  2989.130
                      20                  5820.106

2.5k / 40k             0                  1598.256
                       5                  2935.290
                      20                  4867.257

On this machine, the results suggest that prefetching only provides a
significant benefit for denser maps.

Prefetching within bpf_map_lookup_batch can therefore provide a
significant benefit depending on the use case. Across the platforms
tested, a setting of n_prefetch = 5, while not the optimal setting,
significantly decreased the single entry lookup time for denser maps.
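For reference, roughly how such a measurement can be driven from
userspace is sketched below. This sketch is not part of the patch and
is not the benchmark added in this series; it only assumes libbpf's
bpf_map_create(), bpf_map_update_elem(), bpf_map_lookup_batch() and
libbpf_num_possible_cpus(), and the map capacity, value size and batch
size are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#define MAX_ENTRIES	40000	/* illustrative map capacity */
#define BATCH_SZ	4096	/* illustrative batch size */

int main(void)
{
	int ncpus = libbpf_num_possible_cpus();
	__u32 in_batch, out_batch, count, total = 0;
	struct timespec start, end;
	__u32 *keys;
	__u64 *vals, *pcpu_val;
	__u64 ns;
	int fd, i, err;

	/* per-cpu hash map: one 8-byte value per possible cpu for each key */
	fd = bpf_map_create(BPF_MAP_TYPE_PERCPU_HASH, "lookup_bench",
			    sizeof(__u32), sizeof(__u64), MAX_ENTRIES, NULL);
	if (fd < 0 || ncpus <= 0)
		exit(1);

	pcpu_val = calloc(ncpus, sizeof(__u64));
	for (i = 0; i < MAX_ENTRIES; i++) {
		__u32 key = i;

		if (bpf_map_update_elem(fd, &key, pcpu_val, BPF_ANY))
			exit(1);
	}

	keys = calloc(BATCH_SZ, sizeof(__u32));
	vals = calloc((size_t)BATCH_SZ * ncpus, sizeof(__u64));

	/* walk the whole map in batches and time it */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (count = BATCH_SZ, err = 0; !err; count = BATCH_SZ) {
		err = bpf_map_lookup_batch(fd, total ? &in_batch : NULL,
					   &out_batch, keys, vals, &count,
					   NULL);
		total += count;
		in_batch = out_batch;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
	     (end.tv_nsec - start.tv_nsec);
	printf("%u entries looked up, %.3f ns/op\n",
	       total, total ? (double)ns / total : 0.0);
	return 0;
}

The per-entry time reported this way includes the syscall and copy
overhead mentioned above, so absolute numbers are only comparable
within one machine.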
Signed-off-by: Sagarika Sharma <sharmasagarika@xxxxxxxxxx>
---
 kernel/bpf/hashtab.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8392f7f8a8ac..eb70c4bbe246 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -15,6 +15,9 @@
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
 
+static uint n_prefetch;
+module_param(n_prefetch, uint, 0644);
+
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
 	 BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
@@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		if (is_percpu) {
 			int off = 0, cpu;
 			void __percpu *pptr;
+			int num_cpus = num_possible_cpus();
 
 			pptr = htab_elem_get_ptr(l, map->key_size);
 			for_each_possible_cpu(cpu) {
+				if (n_prefetch > 0 && (cpu + n_prefetch) < num_cpus)
+					prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
+
 				bpf_long_memcpy(dst_val + off,
 						per_cpu_ptr(pptr, cpu), size);
 				off += size;
-- 
2.37.1.595.g718a3a8f04-goog
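The prefetch-ahead pattern in the loop above can also be reproduced in
a small standalone userspace sketch for experimenting with the distance
outside the kernel. This is not part of the patch; it uses the
compiler's __builtin_prefetch() in place of the kernel's prefetch(),
and the function name and array-of-pointers layout are purely
illustrative.

#include <stddef.h>
#include <string.h>

/*
 * Copy one value per cpu into a contiguous destination buffer while
 * hinting the cache for the value n_prefetch cpus ahead, mirroring the
 * per-cpu copy loop in __htab_map_lookup_and_delete_batch().
 */
static void copy_percpu_values(char *dst, char *const *cpu_vals,
			       unsigned int num_cpus, size_t size,
			       unsigned int n_prefetch)
{
	unsigned int cpu;

	for (cpu = 0; cpu < num_cpus; cpu++) {
		if (n_prefetch > 0 && cpu + n_prefetch < num_cpus)
			__builtin_prefetch(cpu_vals[cpu + n_prefetch]);

		memcpy(dst + (size_t)cpu * size, cpu_vals[cpu], size);
	}
}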