On Fri, Jul 07, 2023 at 09:41:03AM +0800, Hou Tao wrote: > Hi, > > On 7/6/2023 8:57 PM, Anton Protopopov wrote: > > On Thu, Jul 06, 2023 at 08:21:17PM +0800, Hou Tao wrote: > >> Hi, > >> > >> On 7/6/2023 7:54 PM, Anton Protopopov wrote: > >> > SNIP > >>> +static void delete_and_lookup_batch(int map_fd, void *keys, __u32 count) > >>> +{ > >>> + static __u8 values[(8 << 10) * MAX_ENTRIES]; > >>> + void *in_batch = NULL, *out_batch; > >>> + __u32 save_count = count; > >>> + int ret; > >>> + > >>> + ret = bpf_map_lookup_and_delete_batch(map_fd, > >>> + &in_batch, &out_batch, > >>> + keys, values, &count, > >>> + NULL); > >>> + > >>> + /* > >>> + * Despite what uapi header says, lookup_and_delete_batch will return > >>> + * -ENOENT in case we successfully have deleted all elements, so check > >>> + * this separately > >>> + */ > >>>> It seems it is a bug in __htab_map_lookup_and_delete_batch(). I could > >>>> post a patch to fix it if you don't plan to do that by yourself. > >>> This should be as simple as > >>> > >>> @@ -1876,7 +1876,8 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map, > >>> total += bucket_cnt; > >>> batch++; > >>> if (batch >= htab->n_buckets) { > >>> - ret = -ENOENT; > >>> + if (!total) > >>> + ret = -ENOENT; > >>> goto after_loop; > >>> } > >>> goto again; > >> No. I think changing it to "if (max_count > total) ret = -ENOENT;" will > >> be more appropriate, because it means the requested count couldn't been > >> fulfilled and it is also consistent with the comments in > >> include/uapi/linux/bpf.h > > Say, I have a map of size N and I don't know how many entries there are. > > Then I will do > > > > count=N > > lookup_and_delete(&count) > > > > In this case we will walk through the whole map, reach the 'batch >= > > htab->n_buckets', and set the count to the number of elements we read. > > > > (If, in opposite, there's no space to read a whole bucket, then we check this > > above and return -ENOSPC.) > > > >>> However, this might be already utilized by some apps to check that they've read > >>> all entries. Two local examples are map_tests/map_in_map_batch_ops.c and > >>> map_tests/htab_map_batch_ops.c. Another example I know is from BCC tools: > >>> https://github.com/iovisor/bcc/blob/master/libbpf-tools/map_helpers.c#L58 > >> I think these use cases will be fine. Because when the last element has > >> been successfully iterated and returned, the out_batch is also updated, > >> so if the batch op is called again, -ENOENT will be returned. > >>> Can we update comments in include/uapi/linux/bpf.h? > >> I think the comments are correct. > > Currently we return -ENOENT as an indicator that (a) 'in_batch' is out of > > bounds (b) we reached the end of map. So technically, this is an optimization, > > as if we read elements in a loop by passing 'in_batch', 'out_batch', even if we > > return 0 in case (b), the next syscall would return -ENOENT, because the new > > 'in_batch' would point to out of bounds. > > > > This also makes sense for a map which is empty: we reached the end of map, > > didn't find any elements, so we're returning -ENOENT (in contrast with saying > > "all is ok, we read 0 elements"). > > > > So from my point of view -ENOENT makes sense. However, comments say "Returns > > zero on success" which doesn't look true to me as I think that reading the > > whole map in one syscall is a success :) > > I get your point. The current implementation of BPF_MAP_LOOKUP_BATCH > does the following two things: > 1) returns 0 when the whole map has not been iterated but there is no > space for current bucket. The algorithm works per bucket. For a bucket number X it checks if there is enough space in the output buffer to store all bucket elements. If there is, ok, go to the next bucket. If not, then it checks if any elements were written already [from previous buckets]. If not, then it returns -ENOSPC, meaning, "you've asked to copy at most N elements, but I can only copy M > N, not less, please provide a bigger buffer." > 2) doesn't return 0 when the whole map has been iterated successfully > (and the requested count is fulfilled) > > For 1) I prefer to update the comments in uapi. If instead we fix the > implementation, we may break the existed users which need to check > ENOSPC to continue the batch op. > For 2) I don't have a preference. Both updating the comments and > implementation are fine to me. > > WDYT ? > I think that (1) is perfectly fine, -ENOSPC is returned only when we can't copy elements, which is an error. The (2) requires updating docs. The API is similar to get_next_key, and docs can be updated in the same way. By updating docs we're not changing any uapi, right?