Hi,

On 2/6/2025 12:57 AM, Yan Zhai wrote:
> The generic_map_lookup_batch currently returns EINTR if it fails with
> ENOENT and retries several times on bpf_map_copy_value. The next batch
> would start from the same location, presuming it's a transient issue.
> This is incorrect if a map can actually have "holes", i.e.
> "get_next_key" can return a key that does not point to a valid value. At
> least the array of maps type may contain such holes legitly. Right now
> these holes show up, generic batch lookup cannot proceed any more. It
> will always fail with EINTR errors.
>
> Rather, do not retry in generic_map_lookup_batch. If it finds a non
> existing element, skip to the next key.
>
> Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op")
> Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> ---
>  kernel/bpf/syscall.c                          | 16 ++----
>  .../bpf/map_tests/map_in_map_batch_ops.c      | 54 ++++++++++++++-----
>  2 files changed, 45 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c420edbfb7c8..5691fc0d370d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1979,7 +1979,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  	void __user *values = u64_to_user_ptr(attr->batch.values);
>  	void __user *keys = u64_to_user_ptr(attr->batch.keys);
>  	void *buf, *buf_prevkey, *prev_key, *key, *value;
> -	int err, retry = MAP_LOOKUP_RETRIES;
> +	int err;
>  	u32 value_size, cp, max_count;
>
>  	if (attr->batch.elem_flags & ~BPF_F_LOCK)
> @@ -2026,14 +2026,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  		err = bpf_map_copy_value(map, key, value,
>  					 attr->batch.elem_flags);
>
> -		if (err == -ENOENT) {
> -			if (retry) {
> -				retry--;
> -				continue;
> -			}
> -			err = -EINTR;
> -			break;
> -		}
> +		if (err == -ENOENT)
> +			goto next_key;
>
>  		if (err)
>  			goto free_buf;
> @@ -2048,12 +2042,12 @@ int generic_map_lookup_batch(struct bpf_map *map,
>  			goto free_buf;
>  		}
>
> +		cp++;
> +next_key:
>  		if (!prev_key)
>  			prev_key = buf_prevkey;
>
>  		swap(prev_key, key);
> -		retry = MAP_LOOKUP_RETRIES;
> -		cp++;
>  		cond_resched();
>  	}
>

Let's move to this new thread for further discussion.

> We are probably not on the same page. Let me clarify:
> By "retry logic" I mean this code snippet:
>     if (err == -ENOENT) {
>         if (retry) {
>             retry--;
>             continue;
>         }
>         err = -EINTR;
>         break;
>     }

Yes, the retry logic doesn't change the previous key. Thanks for the
clarification.

> And by "skipping to the next key", it's simply
>
>     if (err == -ENOENT)
>         goto next_key;
>
> Note the "next_key" label was not in the current codebase. It is only
> in my posted patch. I don't think this would break lpm_trie unless I
> missed something.

I was trying to say that the proposed patch may break the lookup_batch
operation for lpm_trie. Let me explain step by step.

For an LPM trie map:

(1) ->map_get_next_key(map, prev_key, key) returns a valid key.
(2) bpf_map_copy_value() returns -ENOENT.
    It means the key must have been deleted concurrently.
(3) goto next_key.
    It swaps prev_key and key.
(4) ->map_get_next_key(map, prev_key, key) is called again.
    prev_key now points to a non-existent key. For LPM trie this is
    treated just like the prev_key=NULL case, so the returned key will
    be a duplicate.
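To make step (4) easier to see outside the batch-lookup path, below is a
small stand-alone user-space sketch of mine (not part of the posted patch;
the /32 IPv4 keys, the map size and the lack of error handling are purely
illustrative, and it needs root plus -lbpf). It walks two keys of an LPM
trie with bpf_map_get_next_key(), deletes the key the iterator is standing
on, and then asks for the next key. If the trie indeed falls back to the
prev_key=NULL case, the returned key is one we already visited, which is
exactly what would make generic_map_lookup_batch() copy duplicates after
the proposed "goto next_key":

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <bpf/bpf.h>

struct lpm_key {
	__u32 prefixlen;
	__u32 addr;		/* IPv4 address, network byte order */
};

int main(void)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_NO_PREALLOC);
	struct lpm_key keys[] = {
		{ 32, htonl(0x0a000001) },	/* 10.0.0.1/32 */
		{ 32, htonl(0x0a000002) },	/* 10.0.0.2/32 */
		{ 32, htonl(0x0a000003) },	/* 10.0.0.3/32 */
	};
	struct lpm_key first, second, next;
	__u32 val = 0;
	int fd, i;

	fd = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE, NULL, sizeof(keys[0]),
			    sizeof(val), 16, &opts);
	for (i = 0; i < 3; i++)
		bpf_map_update_elem(fd, &keys[i], &val, BPF_ANY);

	/* Visit the first two keys in iteration order, as the batch
	 * lookup loop would.
	 */
	bpf_map_get_next_key(fd, NULL, &first);
	bpf_map_get_next_key(fd, &first, &second);

	/* Emulate a concurrent delete of the element the iterator is
	 * standing on.
	 */
	bpf_map_delete_elem(fd, &second);

	/* prev_key no longer exists, so per step (4) above the trie
	 * starts over and may hand back an already-visited key.
	 */
	bpf_map_get_next_key(fd, &second, &next);
	printf("iteration restarted at an already-visited key: %s\n",
	       memcmp(&next, &first, sizeof(next)) ? "no" : "yes");
	return 0;
}

Nothing here depends on map-in-map; it only relies on how the trie handles
a prev_key that no longer exists.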
> diff --git a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> index 66191ae9863c..b38be71f06be 100644

It is better to split the update of map_tests into a separate patch; that
would also be more friendly for a stable backport.

> --- a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> +++ b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> @@ -120,11 +120,12 @@ static void validate_fetch_results(int outer_map_fd,
>
>  static void fetch_and_validate(int outer_map_fd,
>  			       struct bpf_map_batch_opts *opts,
> -			       __u32 batch_size, bool delete_entries)
> +			       __u32 batch_size, bool delete_entries,
> +			       bool has_holes)
>  {
> +	int err, max_entries = OUTER_MAP_ENTRIES - !!has_holes;
>  	__u32 *fetched_keys, *fetched_values, total_fetched = 0;
>  	__u32 batch_key = 0, fetch_count, step_size;
> -	int err, max_entries = OUTER_MAP_ENTRIES;
>  	__u32 value_size = sizeof(__u32);
>
>  	/* Total entries needs to be fetched */
> @@ -135,9 +136,9 @@ static void fetch_and_validate(int outer_map_fd,
>  	      "error=%s\n", strerror(errno));
>
>  	for (step_size = batch_size;
> -	     step_size <= max_entries;
> +	     step_size < max_entries + batch_size; /* allow read partial */
>  	     step_size += batch_size) {
> -		fetch_count = step_size;
> +		fetch_count = batch_size;

The change "fetch_count = batch_size" may break the lookup batch operation
on a hash table, because the number of elements in one bucket may be greater
than batch_size, and the lookup will then return -ENOSPC constantly. And it
seems the original implementation of fetch_and_validate() is buggy, because
for a hash map the returned fetch_count may be less than the passed count
when there are too many elements in the same bucket. I don't know why the
bug doesn't show up. (A sketch of one way to handle the -ENOSPC case is at
the end of this mail.)

>  		err = delete_entries
>  		      ? bpf_map_lookup_and_delete_batch(outer_map_fd,
>  					total_fetched ? &batch_key : NULL,
> @@ -184,18 +185,19 @@ static void fetch_and_validate(int outer_map_fd,
>  }
>
>  static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> -				   enum bpf_map_type inner_map_type)
> +				   enum bpf_map_type inner_map_type,
> +				   bool has_holes)
>  {
> +	__u32 max_entries = OUTER_MAP_ENTRIES - !!has_holes;
>  	__u32 *outer_map_keys, *inner_map_fds;
> -	__u32 max_entries = OUTER_MAP_ENTRIES;
>  	LIBBPF_OPTS(bpf_map_batch_opts, opts);
>  	__u32 value_size = sizeof(__u32);
>  	int batch_size[2] = {5, 10};
>  	__u32 map_index, op_index;
>  	int outer_map_fd, ret;
>
> -	outer_map_keys = calloc(max_entries, value_size);
> -	inner_map_fds = calloc(max_entries, value_size);
> +	outer_map_keys = calloc(OUTER_MAP_ENTRIES, value_size);
> +	inner_map_fds = calloc(OUTER_MAP_ENTRIES, value_size);
>  	CHECK((!outer_map_keys || !inner_map_fds),
>  	      "Memory allocation failed for outer_map_keys or inner_map_fds",
>  	      "error=%s\n", strerror(errno));
> @@ -209,6 +211,24 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
>  			((outer_map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
>  			 ? 9 : 1000) - map_index;
>
> +	/* This condition is only meaningful for array of maps.
> +	 *
> +	 * max_entries == OUTER_MAP_ENTRIES - 1 if it is true. Say
> +	 * max_entries is short for n, then outer_map_keys looks like:
> +	 *
> +	 * [n, n-1, ... 2, 1]
> +	 *
> +	 * We change it to
> +	 *
> +	 * [n, n-1, ... 2, 0]
> +	 *
> +	 * So it will leave key 1 as a hole. It will serve to test the
> +	 * correctness when batch on an array: a "non-exist" key might be
> +	 * actually allocated and returned from key iteration.
> +	 */
> +	if (has_holes)
> +		outer_map_keys[max_entries - 1]--;
> +
>  	/* batch operation - map_update */
>  	ret = bpf_map_update_batch(outer_map_fd, outer_map_keys,
>  				   inner_map_fds, &max_entries, &opts);
> @@ -219,12 +239,14 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
>  	/* batch operation - map_lookup */
>  	for (op_index = 0; op_index < 2; ++op_index)
>  		fetch_and_validate(outer_map_fd, &opts,
> -				   batch_size[op_index], false);
> +				   batch_size[op_index], false,
> +				   has_holes);
>
>  	/* batch operation - map_lookup_delete */
>  	if (outer_map_type == BPF_MAP_TYPE_HASH_OF_MAPS)
>  		fetch_and_validate(outer_map_fd, &opts,
> -				   max_entries, true /*delete*/);
> +				   max_entries, true /*delete*/,
> +				   has_holes);
>
>  	/* close all map fds */
>  	for (map_index = 0; map_index < max_entries; map_index++)

OUTER_MAP_ENTRIES instead of max_entries?

> @@ -237,16 +259,20 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
>
>  void test_map_in_map_batch_ops_array(void)
>  {
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
>  	printf("%s:PASS with inner ARRAY map\n", __func__);
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, false);
>  	printf("%s:PASS with inner HASH map\n", __func__);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, true);
> +	printf("%s:PASS with inner ARRAY map with holes\n", __func__);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, true);
> +	printf("%s:PASS with inner HASH map with holes\n", __func__);
>  }
>
>  void test_map_in_map_batch_ops_hash(void)
>  {
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
>  	printf("%s:PASS with inner ARRAY map\n", __func__);
> -	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH);
> +	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH, false);
>  	printf("%s:PASS with inner HASH map\n", __func__);
>  }
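Coming back to the -ENOSPC point above: purely as an illustration (the
helper name, the doubling policy and the "room" bound are mine, not
something in the patch or the existing selftest), the fetch loop could
cope with an oversized hash bucket by retrying the same chunk with a
larger count whenever bpf_map_lookup_batch() fails with ENOSPC, assuming
the keys/values buffers were allocated for at least "room" entries:

#include <errno.h>
#include <bpf/bpf.h>

/* Returns the number of fetched entries, or a negative errno. */
static int fetch_chunk(int map_fd, void *in_batch, void *out_batch,
		       void *keys, void *values, __u32 want, __u32 room,
		       const struct bpf_map_batch_opts *opts)
{
	__u32 count;
	int err;

	for (count = want; count <= room; count *= 2) {
		__u32 got = count;

		err = bpf_map_lookup_batch(map_fd, in_batch, out_batch,
					   keys, values, &got, opts);
		if (!err || errno == ENOENT)
			/* Success, or end of iteration: "got" holds how
			 * many entries were actually copied out.
			 */
			return got;
		if (errno != ENOSPC)
			return -errno;
		/* ENOSPC: one hash bucket holds more elements than the
		 * requested count; retry the same chunk (same in_batch
		 * position, same buffers) with a larger count.
		 */
	}
	return -ENOSPC;
}

Whether the selftest wants something like this, or should simply keep
passing the full remaining space as before, is up to you; the point is
only that a fixed fetch_count equal to batch_size can keep hitting
-ENOSPC on a bucket that holds more than batch_size elements.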