Hi, On 7/7/2023 3:28 PM, Anton Protopopov wrote: > On Fri, Jul 07, 2023 at 09:41:03AM +0800, Hou Tao wrote: >> Hi, >> >> On 7/6/2023 8:57 PM, Anton Protopopov wrote: >>> On Thu, Jul 06, 2023 at 08:21:17PM +0800, Hou Tao wrote: >>>> Hi, >>>> >>>> On 7/6/2023 7:54 PM, Anton Protopopov wrote: >>>> >> SNIP >>>>> +static void delete_and_lookup_batch(int map_fd, void *keys, __u32 count) >>>>> +{ >>>>> + static __u8 values[(8 << 10) * MAX_ENTRIES]; >>>>> + void *in_batch = NULL, *out_batch; >>>>> + __u32 save_count = count; >>>>> + int ret; >>>>> + >>>>> + ret = bpf_map_lookup_and_delete_batch(map_fd, >>>>> + &in_batch, &out_batch, >>>>> + keys, values, &count, >>>>> + NULL); >>>>> + >>>>> + /* >>>>> + * Despite what uapi header says, lookup_and_delete_batch will return >>>>> + * -ENOENT in case we successfully have deleted all elements, so check >>>>> + * this separately >>>>> + */ >>>>>> It seems it is a bug in __htab_map_lookup_and_delete_batch(). I could >>>>>> post a patch to fix it if you don't plan to do that by yourself. >>>>> This should be as simple as >>>>> >>>>> @@ -1876,7 +1876,8 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map, >>>>> total += bucket_cnt; >>>>> batch++; >>>>> if (batch >= htab->n_buckets) { >>>>> - ret = -ENOENT; >>>>> + if (!total) >>>>> + ret = -ENOENT; >>>>> goto after_loop; >>>>> } >>>>> goto again; >>>> No. I think changing it to "if (max_count > total) ret = -ENOENT;" will >>>> be more appropriate, because it means the requested count couldn't been >>>> fulfilled and it is also consistent with the comments in >>>> include/uapi/linux/bpf.h >>> Say, I have a map of size N and I don't know how many entries there are. >>> Then I will do >>> >>> count=N >>> lookup_and_delete(&count) >>> >>> In this case we will walk through the whole map, reach the 'batch >= >>> htab->n_buckets', and set the count to the number of elements we read. >>> >>> (If, in opposite, there's no space to read a whole bucket, then we check this >>> above and return -ENOSPC.) >>> >>>>> However, this might be already utilized by some apps to check that they've read >>>>> all entries. Two local examples are map_tests/map_in_map_batch_ops.c and >>>>> map_tests/htab_map_batch_ops.c. Another example I know is from BCC tools: >>>>> https://github.com/iovisor/bcc/blob/master/libbpf-tools/map_helpers.c#L58 >>>> I think these use cases will be fine. Because when the last element has >>>> been successfully iterated and returned, the out_batch is also updated, >>>> so if the batch op is called again, -ENOENT will be returned. >>>>> Can we update comments in include/uapi/linux/bpf.h? >>>> I think the comments are correct. >>> Currently we return -ENOENT as an indicator that (a) 'in_batch' is out of >>> bounds (b) we reached the end of map. So technically, this is an optimization, >>> as if we read elements in a loop by passing 'in_batch', 'out_batch', even if we >>> return 0 in case (b), the next syscall would return -ENOENT, because the new >>> 'in_batch' would point to out of bounds. >>> >>> This also makes sense for a map which is empty: we reached the end of map, >>> didn't find any elements, so we're returning -ENOENT (in contrast with saying >>> "all is ok, we read 0 elements"). >>> >>> So from my point of view -ENOENT makes sense. However, comments say "Returns >>> zero on success" which doesn't look true to me as I think that reading the >>> whole map in one syscall is a success :) >> I get your point. The current implementation of BPF_MAP_LOOKUP_BATCH >> does the following two things: >> 1) returns 0 when the whole map has not been iterated but there is no >> space for current bucket. > The algorithm works per bucket. For a bucket number X it checks if there is > enough space in the output buffer to store all bucket elements. If there is, > ok, go to the next bucket. If not, then it checks if any elements were written > already [from previous buckets]. If not, then it returns -ENOSPC, meaning, > "you've asked to copy at most N elements, but I can only copy M > N, not less, > please provide a bigger buffer." Yes. > >> 2) doesn't return 0 when the whole map has been iterated successfully >> (and the requested count is fulfilled) >> >> For 1) I prefer to update the comments in uapi. If instead we fix the >> implementation, we may break the existed users which need to check >> ENOSPC to continue the batch op. >> For 2) I don't have a preference. Both updating the comments and >> implementation are fine to me. >> >> WDYT ? >> > I think that (1) is perfectly fine, -ENOSPC is returned only when we can't copy > elements, which is an error. Maybe I misinterpreted the comments in bpf.h. As said in the comment: "On success, *count* elements from the map are copied into the user buffer", I think the count here means the value of count which is used as input instead of output. > > The (2) requires updating docs. The API is similar to get_next_key, and docs > can be updated in the same way. By updating docs we're not changing any uapi, > right? I think it is fine.