On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@xxxxxxxxx> wrote:
>
> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > > If prev_key is deleted before map_get_next_key(), we get the
> > > > > > first key again. This is pretty weird.
> > > > >
> > > > > Yes, I know. But note that this scenario happens even with the
> > > > > old interface (imagine you are walking a map from userspace and
> > > > > you call get_next_key after prev_key was removed: you will start
> > > > > again from the beginning without noticing it).
> > > > > I tried to send a patch for this in the past but I was missing
> > > > > some context: before NULL was used to get the very first_key,
> > > > > the interface relied on a random (non-existent) key to retrieve
> > > > > the first_key in the map, and I was told that we still have to
> > > > > support that scenario.
> > > >
> > > > BPF_MAP_DUMP is slightly different, as you may return the first
> > > > key multiple times in the same call. Also, BPF_MAP_DUMP is new,
> > > > so we don't have to support legacy scenarios.
> > > >
> > > > Since BPF_MAP_DUMP keeps a list of elements, it is possible to
> > > > try to look up previous keys. Would something in this direction
> > > > work?
> > >
> > > I've been thinking about it, and I think first we need a way to
> > > detect that, because the key was not present, we got the first_key
> > > instead:
> > >
> > > - One solution I had in mind was to explicitly ask for the first
> > > key with map_get_next_key(map, NULL, first_key) and, while walking
> > > the map, check that map_get_next_key(map, prev_key, key) doesn't
> > > return the same key. This could be done using memcmp.
> > > - Discussing with Stan, he mentioned that another option is to
> > > support a flag in map_get_next_key to let it know that we want an
> > > error instead of the first_key.
> > >
> > > After detecting the problem we also need to define what we want to
> > > do. Here are some options:
> > >
> > > a) Return the error to the caller.
> > > b) Try with previous keys, if any (which would be limited to the
> > > keys that we have traversed so far in this dump call).
> > > c) Continue with the next entries in the map. Array is easy: just
> > > get the next valid key (starting at i+1). But hmap might be
> > > difficult, since starting at the next bucket could potentially skip
> > > some keys that were concurrently added to the same bucket where the
> > > key used to be, and starting at the same bucket could lead us to
> > > return repeated elements.
> > >
> > > Or maybe we could support those 3 cases via flags and let the
> > > caller decide which one to use?
> >
> > this type of indecision is the reason why I wasn't excited about
> > batch dumping in the first place and gave a 'soft yes' when Stan
> > mentioned it during the lsf/mm/bpf uconf.
> > We probably shouldn't do it.
> > It feels like this map_dump makes the api more complex and doesn't
> > really give much benefit to the user other than making large map
> > dumps faster.
> > I think we gotta solve this problem differently.
>
> Some users are working around the dumping problems with the existing
> api by creating a bpf_map_get_next_key_and_delete userspace function
> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/),
> which in my opinion is actually a good idea. The only problem with
> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
> bpf_map_delete_elem(fd, key) from userspace races with kernel code
> and it might lose some information when deleting.
> We could then do map_dump_and_delete using that idea, but in the
> kernel, where we could better handle the race condition. In that
> scenario, even if we retrieve the same key it will contain different
> info (the delta between the old and new value). Would that work?
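For context, the userspace workaround being described boils down to a
loop like the one below. This is only a minimal sketch, not taken from
the blog post: the bpf_map_get_next_key/bpf_map_lookup_elem/
bpf_map_delete_elem calls are the existing libbpf wrappers, while
drain_map() and the fixed __u64 key/value types are assumptions made
purely for illustration; the comments mark where the race sits.

/* Minimal sketch of the userspace get_next_key + lookup + delete drain
 * loop. drain_map() and the __u64 key/value types are hypothetical. */
#include <errno.h>
#include <stdio.h>
#include <linux/types.h>
#include <bpf/bpf.h>

static int drain_map(int map_fd)
{
	__u64 key, next_key, value;
	int err;

	/* A NULL prev_key asks for the first key in the map. */
	err = bpf_map_get_next_key(map_fd, NULL, &next_key);
	while (!err) {
		key = next_key;

		/* Advance before deleting: once 'key' is gone,
		 * get_next_key(key) silently restarts from the first
		 * element, which is exactly the pitfall discussed above. */
		err = bpf_map_get_next_key(map_fd, &key, &next_key);

		if (!bpf_map_lookup_elem(map_fd, &key, &value))
			printf("key %llu -> %llu\n",
			       (unsigned long long)key,
			       (unsigned long long)value);

		/* Race: the kernel may update this element between the
		 * lookup and the delete; that update is silently lost. */
		bpf_map_delete_elem(map_fd, &key);
	}

	/* A non-zero err here normally just means "no more keys";
	 * real code would distinguish that from genuine failures. */
	return 0;
}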
you mean get_next+lookup+delete at once?
Sounds useful.
Yonghong has been thinking about a batching api as well.
I think if we cannot figure out how to make a batch of two commands
(get_next + lookup) work correctly, then we need to identify/invent
one command and make batching more generic.
Like make one jumbo/compound/atomic command that is get_next+lookup+delete.
Define the semantics of this single compound command.
And then let batching be a multiplier of such a command.
In the sense that multiplier 1 or N should behave the same way.
No extra flags to alter the batching.
The high level description of the batch would be:
pls execute get_next,lookup,delete and repeat it N times.
or
pls execute get_next,lookup and repeat N times.
where each command action is defined to be composable.
Just a rough idea.
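To make the "one composable command, batching is just a multiplier"
idea a bit more concrete, here is one possible shape for it. Everything
below is hypothetical -- none of these names or fields exist in the
uapi; it only illustrates the semantics sketched above.

#include <linux/types.h>

/* Hypothetical sketch: "get_next + lookup (+ delete) as one compound
 * command, repeated N times". Not an existing or proposed uapi. */
struct map_batch_request {
	__u32         map_fd;
	__u32         count;      /* in: N compound steps; out: steps done */
	__aligned_u64 start_key;  /* 0 (NULL) means start from the first key */
	__aligned_u64 keys;       /* user buffer: count * key_size bytes */
	__aligned_u64 values;     /* user buffer: count * value_size bytes */
};

/*
 * Intended kernel-side semantics, in pseudocode:
 *
 *	prev_key = req->start_key;
 *	for (i = 0; i < req->count; i++) {
 *		lock the element/bucket;          // one atomic step
 *		key   = get_next(map, prev_key);  // no more keys -> stop, report i
 *		value = lookup(map, key);
 *		// the "+delete" variant would also delete(map, key) here
 *		unlock;
 *		copy key/value out to keys[i] / values[i];
 *		prev_key = key;
 *	}
 *
 * Because each step is the same composable command done atomically in
 * the kernel, count == 1 and count == N behave the same way per
 * element, no extra flags are needed to describe the batching itself,
 * and the userspace get_next/delete race from the workaround above
 * goes away.
 */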