On Tue, Aug 16, 2022 at 8:35 AM Sagarika Sharma <sharmasagarika@xxxxxxxxxx> wrote:
>
> This patch introduces a module parameter, n_prefetch, which enables
> prefetching within the bpf_map_lookup_batch function for faster
> lookups. The benefit depends on the platform, the relative density
> of the map, and the setting of the module parameter, as described
> below.
>
> On multiprocessor machines, each cpu has its own value for a
> particular key in a per-cpu bpf map. This patch's change is as
> follows: when copying these values to userspace in
> bpf_map_lookup_batch, the value for the cpu n_prefetch positions
> ahead is prefetched.
>
> MEASUREMENTS:
> The benchmark test added in this patch series was used to measure
> the effect of prefetching and to determine the optimal setting of
> n_prefetch under different parameters: the test was run on several
> platforms (with varying numbers of cpus), with a range of n_prefetch
> settings, and with saturated, dense, and sparse maps
> (num_entries / capacity_of_map). The benchmark measures the average
> time for a single entry lookup (t = total_time / num_entries_looked_up)
> under the varied factors mentioned above. The overhead of the
> bpf_map_lookup_batch syscall introduces some error.
>
> Here are the experimental results:
>
> AMD machine with 256 cores (Rome, Zen 2)
> Density of map    n_prefetch    single entry lookup time (ns/op)
> --------------------------------------------------------------------
> 40k / 40k         0             16176.471
>                   1             13095.238
>                   5              7432.432
>                   12             5188.679
>                   20             9482.759
>
> 10k / 40k         0             13253.012
>                   5              7482.993
>                   12             5164.319
>                   20             9649.123
>
> 2.5k / 40k        0              7394.958
>                   5              7201.309
>                   13             4721.030
>                   20             8118.081
>
> For denser maps, the experiments suggest that as n_prefetch
> increases there is a significant time benefit (~66% decrease) up to
> a certain point, after which the benefit begins to shrink. For
> sparser maps, the speedup from prefetching is less pronounced.
> Additionally, this experiment suggests the optimal n_prefetch range
> on this particular machine is 12-13, although a setting of
> n_prefetch = 5 still improves the single entry lookup time.
>
> Intel Skylake (with 112 cores)
> Density of map    n_prefetch    single entry lookup time (ns/op)
> ------------------------------------------------------------------
> 40k / 40k         0             5729.167
>                   1             5092.593
>                   5             3395.062
>                   20            6875.000
>
> 10k / 40k         0             2029.520
>                   5             2989.130
>                   20            5820.106
>
> 2.5k / 40k        0             1598.256
>                   5             2935.290
>                   20            4867.257
>
> For this particular machine, the experimental results suggest that
> prefetching only provides a significant benefit for denser maps.
> Prefetching within bpf_map_lookup_batch can provide a significant
> benefit depending on the use case. Across the many platforms the
> experiments were performed on, a setting of n_prefetch = 5, although
> not the optimal setting, significantly decreased the single entry
> lookup time for denser maps.
>
> Signed-off-by: Sagarika Sharma <sharmasagarika@xxxxxxxxxx>
> ---
>  kernel/bpf/hashtab.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 8392f7f8a8ac..eb70c4bbe246 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -15,6 +15,9 @@
>  #include "bpf_lru_list.h"
>  #include "map_in_map.h"
>
> +static uint n_prefetch;
> +module_param(n_prefetch, uint, 0644);

module_param is no go. sorry.

> +
>  #define HTAB_CREATE_FLAG_MASK \
>         (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
>          BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
> @@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>                 if (is_percpu) {
>                         int off = 0, cpu;
>                         void __percpu *pptr;
> +                       int num_cpus = num_possible_cpus();
>
>                         pptr = htab_elem_get_ptr(l, map->key_size);
>                         for_each_possible_cpu(cpu) {
> +                               if (n_prefetch > 0 && (cpu + n_prefetch) <= num_cpus)
> +                                       prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
> +

prefetch is a decent technique, but doesn't look like it helps in all
cases. Your numbers suggest that it may hurt too.
Whatever you're doing with map lookups, you need a different strategy
than micro-optimizing this loop with prefetch.
Have you considered using the for_each bpf side helper or a user space
side map iterator to aggregate and copy values?
Or iterating over all map elements on one cpu first and then on the
other cpus? Essentially swapping hlist_nulls_for_each_entry_safe and
for_each_possible_cpu?
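
For reference, a minimal user-space sketch of the "iterate in user
space and aggregate" idea mentioned above (only a sketch, not from the
patch; it assumes a BPF_MAP_TYPE_PERCPU_HASH with u32 keys and u64
values, and a hypothetical map_fd obtained elsewhere, e.g. from
bpf_obj_get()):

/*
 * Walk every key with bpf_map_get_next_key(); for a per-cpu map a
 * single bpf_map_lookup_elem() call fills one 8-byte-aligned slot per
 * possible cpu, which user space can then sum per key.
 */
#include <stdio.h>
#include <stdlib.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int sum_percpu_hash(int map_fd)
{
	int ncpus = libbpf_num_possible_cpus();
	__u32 cur_key, next_key;
	__u32 *prev_key = NULL;
	__u64 *values;

	if (ncpus < 0)
		return ncpus;

	/* u64 values are already 8-byte aligned: one slot per cpu */
	values = calloc(ncpus, sizeof(__u64));
	if (!values)
		return -1;

	/* prev_key == NULL returns the first key in the map */
	while (bpf_map_get_next_key(map_fd, prev_key, &next_key) == 0) {
		__u64 sum = 0;
		int cpu;

		if (bpf_map_lookup_elem(map_fd, &next_key, values) == 0) {
			for (cpu = 0; cpu < ncpus; cpu++)
				sum += values[cpu];
			printf("key %u -> %llu\n", next_key,
			       (unsigned long long)sum);
		}
		cur_key = next_key;
		prev_key = &cur_key;
	}

	free(values);
	return 0;
}

Whether something like this (or swapping the cpu and element loops on
the kernel side) actually helps will depend on the real access
pattern, so it is only meant as a starting point.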