Re: [PATCH v4 bpf-next 5/6] selftests/bpf: test map percpu stats

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Fri, 7 Jul 2023 17:40:28 +0800



Hi,

On 7/7/2023 3:28 PM, Anton Protopopov wrote:
> On Fri, Jul 07, 2023 at 09:41:03AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 7/6/2023 8:57 PM, Anton Protopopov wrote:
>>> On Thu, Jul 06, 2023 at 08:21:17PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 7/6/2023 7:54 PM, Anton Protopopov wrote:
>>>>
>> SNIP
>>>>> +static void delete_and_lookup_batch(int map_fd, void *keys, __u32 count)
>>>>> +{
>>>>> +	static __u8 values[(8 << 10) * MAX_ENTRIES];
>>>>> +	void *in_batch = NULL, *out_batch;
>>>>> +	__u32 save_count = count;
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = bpf_map_lookup_and_delete_batch(map_fd,
>>>>> +					      &in_batch, &out_batch,
>>>>> +					      keys, values, &count,
>>>>> +					      NULL);
>>>>> +
>>>>> +	/*
>>>>> +	 * Despite what uapi header says, lookup_and_delete_batch will return
>>>>> +	 * -ENOENT in case we successfully have deleted all elements, so check
>>>>> +	 * this separately
>>>>> +	 */
>>>>>> It seems it is a bug in __htab_map_lookup_and_delete_batch(). I could
>>>>>> post a patch to fix it if you don't plan to do that by yourself.
>>>>> This should be as simple as
>>>>>
>>>>> @@ -1876,7 +1876,8 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>>>>>         total += bucket_cnt;
>>>>>         batch++;
>>>>>         if (batch >= htab->n_buckets) {
>>>>> -               ret = -ENOENT;
>>>>> +               if (!total)
>>>>> +                       ret = -ENOENT;
>>>>>                 goto after_loop;
>>>>>         }
>>>>>         goto again;
>>>> No. I think changing it to "if (max_count > total) ret = -ENOENT;" will
>>>> be more appropriate, because it means the requested count couldn't been
>>>> fulfilled and it is also consistent with the comments in 
>>>> include/uapi/linux/bpf.h
>>> Say, I have a map of size N and I don't know how many entries there are.
>>> Then I will do
>>>
>>>     count=N
>>>     lookup_and_delete(&count)
>>>
>>> In this case we will walk through the whole map, reach the 'batch >=
>>> htab->n_buckets', and set the count to the number of elements we read.
>>>
>>> (If, in opposite, there's no space to read a whole bucket, then we check this
>>> above and return -ENOSPC.)
>>>
>>>>> However, this might be already utilized by some apps to check that they've read
>>>>> all entries. Two local examples are map_tests/map_in_map_batch_ops.c and
>>>>> map_tests/htab_map_batch_ops.c. Another example I know is from BCC tools:
>>>>> https://github.com/iovisor/bcc/blob/master/libbpf-tools/map_helpers.c#L58
>>>> I think these use cases will be fine. Because when the last element has
>>>> been successfully iterated and returned, the out_batch is also updated,
>>>> so if the batch op is called again, -ENOENT will be returned.
>>>>> Can we update comments in include/uapi/linux/bpf.h?
>>>> I think the comments are correct.
>>> Currently we return -ENOENT as an indicator that (a) 'in_batch' is out of
>>> bounds (b) we reached the end of map. So technically, this is an optimization,
>>> as if we read elements in a loop by passing 'in_batch', 'out_batch', even if we
>>> return 0 in case (b), the next syscall would return -ENOENT, because the new
>>> 'in_batch' would point to out of bounds.
>>>
>>> This also makes sense for a map which is empty: we reached the end of map,
>>> didn't find any elements, so we're returning -ENOENT (in contrast with saying
>>> "all is ok, we read 0 elements").
>>>
>>> So from my point of view -ENOENT makes sense. However, comments say "Returns
>>> zero on success" which doesn't look true to me as I think that reading the
>>> whole map in one syscall is a success :)
>> I get your point. The current implementation of BPF_MAP_LOOKUP_BATCH
>> does the following two things:
>> 1) returns 0 when the whole map has not been iterated but there is no
>> space for current bucket.
> The algorithm works per bucket. For a bucket number X it checks if there is
> enough space in the output buffer to store all bucket elements. If there is,
> ok, go to the next bucket. If not, then it checks if any elements were written
> already [from previous buckets]. If not, then it returns -ENOSPC, meaning,
> "you've asked to copy at most N elements, but I can only copy M > N, not less,
> please provide a bigger buffer."
Yes.
>
>> 2) doesn't return 0 when the whole map has been iterated successfully
>> (and the requested count is fulfilled)
>>
>> For 1) I prefer to update the comments in uapi. If instead we fix the
>> implementation, we may break the existed users which need to check
>> ENOSPC to continue the batch op.
>> For 2) I don't have a preference. Both updating the comments and
>> implementation are fine to me.
>>
>> WDYT ?
>>
> I think that (1) is perfectly fine, -ENOSPC is returned only when we can't copy
> elements, which is an error.

Maybe I misinterpreted the comments in bpf.h. As said in the comment:
"On success, *count* elements from the map are copied into the user
buffer",  I think the count here means the value of count which is used
as input instead of output.
>
> The (2) requires updating docs. The API is similar to get_next_key, and docs
> can be updated in the same way. By updating docs we're not changing any uapi,
> right?
I think it is fine.