On 2/7/2025 1:45 PM, Yan Zhai wrote:
> The generic_map_lookup_batch currently returns EINTR if it fails with
> ENOENT and retries several times on bpf_map_copy_value. The next batch
> would start from the same location, presuming it's a transient issue.
> This is incorrect if a map can actually have "holes", i.e.
> "get_next_key" can return a key that does not point to a valid value.
> At least the array of maps type may contain such holes legitimately.
> Right now, once these holes show up, generic batch lookup cannot
> proceed any more. It will always fail with EINTR errors.
>
> Rather, do not retry in generic_map_lookup_batch. If it finds a
> non-existent element, skip to the next key. This simple solution comes
> at a price: transient errors may not be recovered from, and the
> iteration might cycle back to the first key under parallel deletion.
> For example, Hou Tao <houtao@xxxxxxxxxxxxxxx> pointed out the
> following scenario:
>
> For LPM trie map:
> (1) ->map_get_next_key(map, prev_key, key) returns a valid key
>
> (2) bpf_map_copy_value() returns -ENOENT
> It means the key must have been deleted concurrently.
>
> (3) goto next_key
> It swaps the prev_key and key
>
> (4) ->map_get_next_key(map, prev_key, key) again
> prev_key points to a non-existent key; for LPM trie this is treated
> just like the prev_key=NULL case, so the returned key will be
> duplicated.
>
> With the retry logic, the iteration can continue to the key next to
> the deleted one. But if we directly skip to the next key, the
> iteration loop would restart from the first key for the lpm_trie type.
>
> However, not all races may be recovered. For example, if the current
> key is deleted after instead of before bpf_map_copy_value, or if
> prev_key also gets deleted, then the loop will still restart from the
> first key for lpm_trie anyway. For generic lookup it might be better
> to stay simple, i.e. just skip to the next key. To guarantee that the
> output keys are not duplicated, it is better to implement map type
> specific batch operations, which can properly lock the trie and
> synchronize with concurrent mutators.

Makes sense.

> Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op")
> Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>

Acked-by: Hou Tao <houtao1@xxxxxxxxxx>
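For anyone who wants to see the "hole" case from user space, below is a
minimal sketch (untested; it assumes libbpf's bpf_map_create(),
bpf_map_update_elem() and bpf_map_lookup_batch() helpers, and the map
names and sizes are arbitrary). It builds an array-of-maps where only
slot 0 is populated, so get_next_key() keeps returning indices whose
values do not exist:

#include <bpf/bpf.h>
#include <stdio.h>

int main(void)
{
	__u32 zero = 0, keys[4], vals[4], count = 4, out_batch;
	int inner_fd, outer_fd, err;

	/* The inner map only serves as a template/value for the outer map. */
	inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "inner",
				  sizeof(__u32), sizeof(__u32), 1, NULL);

	LIBBPF_OPTS(bpf_map_create_opts, copts, .inner_map_fd = inner_fd);
	outer_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY_OF_MAPS, "outer",
				  sizeof(__u32), sizeof(__u32), 4, &copts);

	/* Populate only index 0; indices 1-3 remain "holes". */
	bpf_map_update_elem(outer_fd, &zero, &inner_fd, BPF_ANY);

	LIBBPF_OPTS(bpf_map_batch_opts, bopts);
	err = bpf_map_lookup_batch(outer_fd, NULL, &out_batch,
				   keys, vals, &count, &bopts);

	/* Before the patch the walk gets stuck on the first hole and the
	 * batch lookup keeps failing with EINTR; with the patch the holes
	 * are skipped and count reports only the populated slots. */
	printf("err=%d count=%u\n", err, count);
	return 0;
}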
> ---
> v1->v2: incorporate more useful information inside commit message.
> v1: https://lore.kernel.org/bpf/Z6OYbS4WqQnmzi2z@debian.debian/
> ---
>  kernel/bpf/syscall.c | 16 +++++-----------
>  1 file changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c420edbfb7c8..5d0a4db6fb85 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1979,8 +1979,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  	void __user *values = u64_to_user_ptr(attr->batch.values);
>  	void __user *keys = u64_to_user_ptr(attr->batch.keys);
>  	void *buf, *buf_prevkey, *prev_key, *key, *value;
> -	int err, retry = MAP_LOOKUP_RETRIES;
>  	u32 value_size, cp, max_count;
> +	int err;
>  
>  	if (attr->batch.elem_flags & ~BPF_F_LOCK)
>  		return -EINVAL;
> @@ -2026,14 +2026,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  		err = bpf_map_copy_value(map, key, value,
>  					 attr->batch.elem_flags);
>  
> -		if (err == -ENOENT) {
> -			if (retry) {
> -				retry--;
> -				continue;
> -			}
> -			err = -EINTR;
> -			break;
> -		}
> +		if (err == -ENOENT)
> +			goto next_key;
>  
>  		if (err)
>  			goto free_buf;
> @@ -2048,12 +2042,12 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  			goto free_buf;
>  		}
>  
> +		cp++;
> +next_key:
>  		if (!prev_key)
>  			prev_key = buf_prevkey;
>  
>  		swap(prev_key, key);
> -		retry = MAP_LOOKUP_RETRIES;
> -		cp++;
>  		cond_resched();
>  	}
>
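The resulting loop shape is easy to lose in the +/- noise, so here is a
condensed sketch of the lookup loop after this patch, stitched together
from the hunks above (buffer setup, the key/value copy_to_user calls and
the function tail are elided; treat it as an illustration, not the
verbatim kernel code):

	for (cp = 0; cp < max_count;) {
		/* ... ->map_get_next_key() fills 'key' based on 'prev_key' ... */
		err = bpf_map_copy_value(map, key, value,
					 attr->batch.elem_flags);
		if (err == -ENOENT)
			goto next_key;	/* hole or concurrently deleted key: skip it */
		if (err)
			goto free_buf;	/* real error: abort the batch */

		/* ... copy key and value out to user space ... */

		cp++;			/* count only entries actually returned */
next_key:
		if (!prev_key)
			prev_key = buf_prevkey;

		swap(prev_key, key);	/* advance even when the value was missing */
		cond_resched();
	}

The important bit is that -ENOENT now skips the cp++ accounting but
still swaps prev_key and key, so the walk makes forward progress instead
of burning retries and returning EINTR.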