Hi,

On 2/6/2025 12:57 AM, Yan Zhai wrote:
> The generic_map_lookup_batch currently returns EINTR if it fails with
> ENOENT and retries several times on bpf_map_copy_value. The next batch
> would start from the same location, presuming it's a transient issue.
> This is incorrect if a map can actually have "holes", i.e.
> "get_next_key" can return a key that does not point to a valid value. At
> least the array of maps type may contain such holes legitly. Right now
> these holes show up, generic batch lookup cannot proceed any more. It
> will always fail with EINTR errors.
>
> Rather, do not retry in generic_map_lookup_batch. If it finds a non
> existing element, skip to the next key.
>
> Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op")
> Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> ---
>  kernel/bpf/syscall.c                          | 16 ++----
>  .../bpf/map_tests/map_in_map_batch_ops.c      | 54 ++++++++++++++-----
>  2 files changed, 45 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c420edbfb7c8..5691fc0d370d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1979,7 +1979,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  	void __user *values = u64_to_user_ptr(attr->batch.values);
>  	void __user *keys = u64_to_user_ptr(attr->batch.keys);
>  	void *buf, *buf_prevkey, *prev_key, *key, *value;
> -	int err, retry = MAP_LOOKUP_RETRIES;
> +	int err;
>  	u32 value_size, cp, max_count;
>
>  	if (attr->batch.elem_flags & ~BPF_F_LOCK)
> @@ -2026,14 +2026,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  		err = bpf_map_copy_value(map, key, value,
>  					 attr->batch.elem_flags);
>
> -		if (err == -ENOENT) {
> -			if (retry) {
> -				retry--;
> -				continue;
> -			}
> -			err = -EINTR;
> -			break;
> -		}
> +		if (err == -ENOENT)
> +			goto next_key;
>
>  		if (err)
>  			goto free_buf;
> @@ -2048,12 +2042,12 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  			goto free_buf;
>  		}
>
> +		cp++;
> +next_key:
>  		if (!prev_key)
>  			prev_key = buf_prevkey;
>
>  		swap(prev_key, key);
> -		retry = MAP_LOOKUP_RETRIES;
> -		cp++;
>  		cond_resched();
>  	}
>

Let's move to this new thread for further discussion.

> We are probably not on the same page. Let me clarify:
> By "retry logic" I mean this code snippet:
>     if (err == -ENOENT) {
>         if (retry) {
>             retry--;
>             continue;
>         }
>         err = -EINTR;
>         break;
>     }

Yes, the retry logic doesn't change the previous key. Thanks for the
clarification.

> And by "skipping to the next key", it's simply
>
>     if (err == -ENOENT)
>         goto next_key;
>
> Note the "next_key" label was not in the current codebase. It is only
> in my posted patch. I don't think this would break lpm_trie unless I
> missed something.

I was trying to say that the proposed patch may break the lookup_batch
operation for lpm_trie. Let me explain step by step.

For an LPM trie map:

(1) ->map_get_next_key(map, prev_key, key) returns a valid key.
(2) bpf_map_copy_value() returns -ENOENT.
    It means the key must have been deleted concurrently.
(3) goto next_key.
    It swaps prev_key and key.
(4) ->map_get_next_key(map, prev_key, key) is called again.
    prev_key now points to a non-existent key. For LPM trie this is
    treated just like the prev_key=NULL case, so the returned key will
    be a duplicate.
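To make step (4) easier to see outside the batch-lookup path, below is a
small stand-alone user-space sketch of mine (not part of the posted patch;
the /32 IPv4 keys, the map size and the lack of error handling are purely
illustrative, and it needs root plus -lbpf). It walks two keys of an LPM
trie with bpf_map_get_next_key(), deletes the key the iterator is standing
on, and then asks for the next key. If the trie indeed falls back to the
prev_key=NULL case, the returned key is one we already visited, which is
exactly what would make generic_map_lookup_batch() copy duplicates after
the proposed "goto next_key":

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <bpf/bpf.h>

struct lpm_key {
	__u32 prefixlen;
	__u32 addr;		/* IPv4 address, network byte order */
};

int main(void)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_NO_PREALLOC);
	struct lpm_key keys[] = {
		{ 32, htonl(0x0a000001) },	/* 10.0.0.1/32 */
		{ 32, htonl(0x0a000002) },	/* 10.0.0.2/32 */
		{ 32, htonl(0x0a000003) },	/* 10.0.0.3/32 */
	};
	struct lpm_key first, second, next;
	__u32 val = 0;
	int fd, i;

	fd = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE, NULL, sizeof(keys[0]),
			    sizeof(val), 16, &opts);
	for (i = 0; i < 3; i++)
		bpf_map_update_elem(fd, &keys[i], &val, BPF_ANY);

	/* Visit the first two keys in iteration order, as the batch
	 * lookup loop would.
	 */
	bpf_map_get_next_key(fd, NULL, &first);
	bpf_map_get_next_key(fd, &first, &second);

	/* Emulate a concurrent delete of the element the iterator is
	 * standing on.
	 */
	bpf_map_delete_elem(fd, &second);

	/* prev_key no longer exists, so per step (4) above the trie
	 * starts over and may hand back an already-visited key.
	 */
	bpf_map_get_next_key(fd, &second, &next);
	printf("iteration restarted at an already-visited key: %s\n",
	       memcmp(&next, &first, sizeof(next)) ? "no" : "yes");
	return 0;
}

Nothing here depends on map-in-map; it only relies on how the trie handles
a prev_key that no longer exists.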
> diff --git a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> index 66191ae9863c..b38be71f06be 100644

It is better to split the update of map_tests into a separate patch; that
would also be more friendly for a stable backport.

> --- a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> +++ b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> @@ -120,11 +120,12 @@ static void validate_fetch_results(int outer_map_fd,
>
>  static void fetch_and_validate(int outer_map_fd,
>  			       struct bpf_map_batch_opts *opts,
> -			       __u32 batch_size, bool delete_entries)
> +			       __u32 batch_size, bool delete_entries,
> +			       bool has_holes)
>  {
> +	int err, max_entries = OUTER_MAP_ENTRIES - !!has_holes;
>  	__u32 *fetched_keys, *fetched_values, total_fetched = 0;
>  	__u32 batch_key = 0, fetch_count, step_size;
> -	int err, max_entries = OUTER_MAP_ENTRIES;
>  	__u32 value_size = sizeof(__u32);
>
>  	/* Total entries needs to be fetched */
> @@ -135,9 +136,9 @@ static void fetch_and_validate(int outer_map_fd,
>  	      "error=%s\n", strerror(errno));
>
>  	for (step_size = batch_size;
> -	     step_size <= max_entries;
> +	     step_size < max_entries + batch_size; /* allow read partial */
>  	     step_size += batch_size) {
> -		fetch_count = step_size;
> +		fetch_count = batch_size;

The change "fetch_count = batch_size" may break the lookup batch operation
on a hash table, because the number of elements in one bucket may be greater
than batch_size, and the lookup will then return -ENOSPC constantly. And it
seems the original implementation of fetch_and_validate() is buggy, because
for a hash map the returned fetch_count may be less than the passed count
when there are too many elements in the same bucket. I don't know why the
bug doesn't show up. (A sketch of one way to handle the -ENOSPC case is at
the end of this mail.)

>  		err = delete_entries
>  		      ? bpf_map_lookup_and_delete_batch(outer_map_fd,
>  					total_fetched ? &batch_key : NULL,
> @@ -184,18 +185,19 @@ static void fetch_and_validate(int outer_map_fd,
>  }
>
>  static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> -				   enum bpf_map_type inner_map_type)
> +				   enum bpf_map_type inner_map_type,
> +				   bool has_holes)
>  {
> +	__u32 max_entries = OUTER_MAP_ENTRIES - !!has_holes;
>  	__u32 *outer_map_keys, *inner_map_fds;
> -	__u32 max_entries = OUTER_MAP_ENTRIES;
>  	LIBBPF_OPTS(bpf_map_batch_opts, opts);
>  	__u32 value_size = sizeof(__u32);
>  	int batch_size[2] = {5, 10};
>  	__u32 map_index, op_index;
>  	int outer_map_fd, ret;
>
> -	outer_map_keys = calloc(max_entries, value_size);
> -	inner_map_fds = calloc(max_entries, value_size);
> +	outer_map_keys = calloc(OUTER_MAP_ENTRIES, value_size);
> +	inner_map_fds = calloc(OUTER_MAP_ENTRIES, value_size);
>  	CHECK((!outer_map_keys || !inner_map_fds),
>  	      "Memory allocation failed for outer_map_keys or inner_map_fds",
>  	      "error=%s\n", strerror(errno));
> @@ -209,6 +211,24 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
>  			((outer_map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
>  			 ? 9 : 1000) - map_index;
>
> +	/* This condition is only meaningful for array of maps.
> +	 *
> +	 * max_entries == OUTER_MAP_ENTRIES - 1 if it is true. Say
> +	 * max_entries is short for n, then outer_map_keys looks like:
> +	 *
> +	 * [n, n-1, ... 2, 1]
> +	 *
> +	 * We change it to
> +	 *
> +	 * [n, n-1, ... 2, 0]
> +	 *
> +	 * So it will leave key 1 as a hole. It will serve to test the
> +	 * correctness when batch on an array: a "non-exist" key might be
> +	 * actually allocated and returned from key iteration.
> +	 */
> +	if (has_holes)
> +		outer_map_keys[max_entries - 1]--;
> +
>  	/* batch operation - map_update */
>  	ret = bpf_map_update_batch(outer_map_fd, outer_map_keys,
>  				   inner_map_fds, &max_entries, &opts);
> @@ -219,12 +239,14 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
>  	/* batch operation - map_lookup */
>  	for (op_index = 0; op_index < 2; ++op_index)
>  		fetch_and_validate(outer_map_fd, &opts,
> -				   batch_size[op_index], false);
> +				   batch_size[op_index], false,
> +				   has_holes);
>
>  	/* batch operation - map_lookup_delete */
>  	if (outer_map_type == BPF_MAP_TYPE_HASH_OF_MAPS)
>  		fetch_and_validate(outer_map_fd, &opts,
> -				   max_entries, true /*delete*/);
> +				   max_entries, true /*delete*/,
> +				   has_holes);
>
>  	/* close all map fds */
>  	for (map_index = 0; map_index < max_entries; map_index++)

OUTER_MAP_ENTRIES instead of max_entries?

> @@ -237,16 +259,20 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
>
>  void test_map_in_map_batch_ops_array(void)
>  {
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
>  	printf("%s:PASS with inner ARRAY map\n", __func__);
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, false);
>  	printf("%s:PASS with inner HASH map\n", __func__);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, true);
> +	printf("%s:PASS with inner ARRAY map with holes\n", __func__);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, true);
> +	printf("%s:PASS with inner HASH map with holes\n", __func__);
>  }
>
>  void test_map_in_map_batch_ops_hash(void)
>  {
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
>  	printf("%s:PASS with inner ARRAY map\n", __func__);
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH, false);
>  	printf("%s:PASS with inner HASH map\n", __func__);
>  }
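Coming back to the -ENOSPC point above: purely as an illustration (the
helper name, the doubling policy and the "room" bound are mine, not
something in the patch or the existing selftest), the fetch loop could
cope with an oversized hash bucket by retrying the same chunk with a
larger count whenever bpf_map_lookup_batch() fails with ENOSPC, assuming
the keys/values buffers were allocated for at least "room" entries:

#include <errno.h>
#include <bpf/bpf.h>

/* Returns the number of fetched entries, or a negative errno. */
static int fetch_chunk(int map_fd, void *in_batch, void *out_batch,
		       void *keys, void *values, __u32 want, __u32 room,
		       const struct bpf_map_batch_opts *opts)
{
	__u32 count;
	int err;

	for (count = want; count <= room; count *= 2) {
		__u32 got = count;

		err = bpf_map_lookup_batch(map_fd, in_batch, out_batch,
					   keys, values, &got, opts);
		if (!err || errno == ENOENT)
			/* Success, or end of iteration: "got" holds how
			 * many entries were actually copied out.
			 */
			return got;
		if (errno != ENOSPC)
			return -errno;
		/* ENOSPC: one hash bucket holds more elements than the
		 * requested count; retry the same chunk (same in_batch
		 * position, same buffers) with a larger count.
		 */
	}
	return -ENOSPC;
}

Whether the selftest wants something like this, or should simply keep
passing the full remaining space as before, is up to you; the point is
only that a fixed fetch_count equal to batch_size can keep hitting
-ENOSPC on a bucket that holds more than batch_size elements.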