This patch introduces a module parameter, n_prefetch, which enables
prefetching within bpf_map_lookup_batch for faster lookups. The benefit
depends on the platform, the density of the map
(num_entries / capacity_of_map), and the value of n_prefetch, as
described below.

On multiprocessor machines, each key in a per-cpu BPF map has one value
per cpu. With this patch, while bpf_map_lookup_batch copies these
per-cpu values to userspace, the value belonging to the cpu n_prefetch
positions ahead in the iteration is prefetched.

MEASUREMENTS:

The benchmark test added in this patch series was used to measure the
effect of prefetching and to determine a good setting of n_prefetch.
The test was run on several platforms (with varying numbers of cpus),
with a range of n_prefetch settings, and with saturated, dense, and
sparse maps. The benchmark reports the average time for a single entry
lookup (t = total_time / num_entries_looked_up). The overhead of the
bpf_map_lookup_batch syscall introduces some error.

Experimental results:

AMD machine with 256 cores (Rome, Zen 2)

Density of map    n_prefetch    single entry lookup time (ns/op)
--------------------------------------------------------------------
40k / 40k              0                 16176.471
                       1                 13095.238
                       5                  7432.432
                      12                  5188.679
                      20                  9482.759

10k / 40k              0                 13253.012
                       5                  7482.993
                      12                  5164.319
                      20                  9649.123

2.5k / 40k             0                  7394.958
                       5                  7201.309
                      13                  4721.030
                      20                  8118.081

For denser maps, the experiments suggest that increasing n_prefetch
yields a significant time benefit (~66% decrease) up to a point, beyond
which the benefit shrinks again. For sparser maps, the speedup from
prefetching is less pronounced. On this machine the optimal n_prefetch
range appears to be 12-13, but a setting of n_prefetch = 5 still
improves the single entry lookup time.

Intel Skylake machine with 112 cores

Density of map    n_prefetch    single entry lookup time (ns/op)
------------------------------------------------------------------
40k / 40k              0                  5729.167
                       1                  5092.593
                       5                  3395.062
                      20                  6875.000

10k / 40k              0                  2029.520
                       5                  2989.130
                      20                  5820.106

2.5k / 40k             0                  1598.256
                       5                  2935.290
                      20                  4867.257

On this machine, the results suggest that prefetching only provides a
significant benefit for denser maps.

Prefetching within bpf_map_lookup_batch can therefore provide a
significant benefit depending on the use case. Across the platforms
tested, a setting of n_prefetch = 5, while not the optimal setting,
significantly decreased the single entry lookup time for denser maps.
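For reference, roughly how such a measurement can be driven from
userspace is sketched below. This sketch is not part of the patch and
is not the benchmark added in this series; it only assumes libbpf's
bpf_map_create(), bpf_map_update_elem(), bpf_map_lookup_batch() and
libbpf_num_possible_cpus(), and the map capacity, value size and batch
size are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#define MAX_ENTRIES	40000	/* illustrative map capacity */
#define BATCH_SZ	4096	/* illustrative batch size */

int main(void)
{
	int ncpus = libbpf_num_possible_cpus();
	__u32 in_batch, out_batch, count, total = 0;
	struct timespec start, end;
	__u32 *keys;
	__u64 *vals, *pcpu_val;
	__u64 ns;
	int fd, i, err;

	/* per-cpu hash map: one 8-byte value per possible cpu for each key */
	fd = bpf_map_create(BPF_MAP_TYPE_PERCPU_HASH, "lookup_bench",
			    sizeof(__u32), sizeof(__u64), MAX_ENTRIES, NULL);
	if (fd < 0 || ncpus <= 0)
		exit(1);

	pcpu_val = calloc(ncpus, sizeof(__u64));
	for (i = 0; i < MAX_ENTRIES; i++) {
		__u32 key = i;

		if (bpf_map_update_elem(fd, &key, pcpu_val, BPF_ANY))
			exit(1);
	}

	keys = calloc(BATCH_SZ, sizeof(__u32));
	vals = calloc((size_t)BATCH_SZ * ncpus, sizeof(__u64));

	/* walk the whole map in batches and time it */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (count = BATCH_SZ, err = 0; !err; count = BATCH_SZ) {
		err = bpf_map_lookup_batch(fd, total ? &in_batch : NULL,
					   &out_batch, keys, vals, &count,
					   NULL);
		total += count;
		in_batch = out_batch;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
	     (end.tv_nsec - start.tv_nsec);
	printf("%u entries looked up, %.3f ns/op\n",
	       total, total ? (double)ns / total : 0.0);
	return 0;
}

The per-entry time reported this way includes the syscall and copy
overhead mentioned above, so absolute numbers are only comparable
within one machine.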
Signed-off-by: Sagarika Sharma <sharmasagarika@xxxxxxxxxx>
---
 kernel/bpf/hashtab.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8392f7f8a8ac..eb70c4bbe246 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -15,6 +15,9 @@
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
 
+static uint n_prefetch;
+module_param(n_prefetch, uint, 0644);
+
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
 	 BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
@@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		if (is_percpu) {
 			int off = 0, cpu;
 			void __percpu *pptr;
+			int num_cpus = num_possible_cpus();
 
 			pptr = htab_elem_get_ptr(l, map->key_size);
 			for_each_possible_cpu(cpu) {
+				if (n_prefetch > 0 && (cpu + n_prefetch) < num_cpus)
+					prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
+
 				bpf_long_memcpy(dst_val + off,
 						per_cpu_ptr(pptr, cpu), size);
 				off += size;
-- 
2.37.1.595.g718a3a8f04-goog
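The prefetch-ahead pattern in the loop above can also be reproduced in
a small standalone userspace sketch for experimenting with the distance
outside the kernel. This is not part of the patch; it uses the
compiler's __builtin_prefetch() in place of the kernel's prefetch(),
and the function name and array-of-pointers layout are purely
illustrative.

#include <stddef.h>
#include <string.h>

/*
 * Copy one value per cpu into a contiguous destination buffer while
 * hinting the cache for the value n_prefetch cpus ahead, mirroring the
 * per-cpu copy loop in __htab_map_lookup_and_delete_batch().
 */
static void copy_percpu_values(char *dst, char *const *cpu_vals,
			       unsigned int num_cpus, size_t size,
			       unsigned int n_prefetch)
{
	unsigned int cpu;

	for (cpu = 0; cpu < num_cpus; cpu++) {
		if (n_prefetch > 0 && cpu + n_prefetch < num_cpus)
			__builtin_prefetch(cpu_vals[cpu + n_prefetch]);

		memcpy(dst + (size_t)cpu * size, cpu_vals[cpu], size);
	}
}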