On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@xxxxxxxxx> wrote:
> >>
> >> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
> >> <alexei.starovoitov@xxxxxxxxx> wrote:
> >>>
> >>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> >>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
> >>>>>>> again. This is pretty weird.
> >>>>>>
> >>>>>> Yes, I know. But note that the current scenario happens even for the
> >>>>>> old interface (imagine you are walking a map from userspace, you call
> >>>>>> get_next_key, and the prev_key was removed: you will start again from
> >>>>>> the beginning without noticing it).
> >>>>>> I tried to send a patch in the past but I was missing some context:
> >>>>>> before NULL was used to get the very first_key, the interface relied on
> >>>>>> a random (non-existent) key to retrieve the first_key in the map, and
> >>>>>> I was told that we still have to support that scenario.
> >>>>>
> >>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
> >>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> >>>>> don't have to support legacy scenarios.
> >>>>>
> >>>>> Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
> >>>>> to look up previous keys. Would something down this direction work?
> >>>>
> >>>> I've been thinking about it, and I think first we need a way to detect
> >>>> that, because the key was not present, we got the first_key instead:
> >>>>
> >>>> - One solution I had in mind was to explicitly ask for the first key
> >>>>   with map_get_next_key(map, NULL, first_key) and, while walking the
> >>>>   map, check that map_get_next_key(map, prev_key, key) doesn't return
> >>>>   the same key. This could be done using memcmp.
> >>>> - Discussing with Stan, he mentioned that another option is to support
> >>>>   a flag in map_get_next_key to let it know that we want an error
> >>>>   instead of the first_key.
> >>>>
> >>>> After detecting the problem, we also need to define what we want to
> >>>> do. Here are some options:
> >>>>
> >>>> a) Return the error to the caller.
> >>>> b) Try with previous keys, if any (this would be limited to the keys
> >>>>    that we have traversed so far in this dump call).
> >>>> c) Continue with the next entries in the map. Array is easy: just get
> >>>>    the next valid key (starting at i+1). But hmap might be difficult,
> >>>>    since starting on the next bucket could potentially skip some keys
> >>>>    that were concurrently added to the same bucket where the key used
> >>>>    to be, and starting on the same bucket could lead us to return
> >>>>    repeated elements.
> >>>>
> >>>> Or maybe we could support those 3 cases via flags and let the caller
> >>>> decide which one to use?
> >>>
> >>> this type of indecision is the reason why I wasn't excited about
> >>> batch dumping in the first place and gave a 'soft yes' when Stan
> >>> mentioned it during the lsf/mm/bpf uconf.
> >>> We probably shouldn't do it.
> >>> It feels this map_dump makes the api more complex and doesn't really
> >>> give much benefit to the user other than making large map dumps faster.
> >>> I think we gotta solve this problem differently.
> >>
> >> Some users are working around the dumping problems with the existing
> >> api by creating a bpf_map_get_next_key_and_delete userspace function
> >> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/),
> >> which in my opinion is actually a good idea.
> >> The only problem with that is that calling
> >> bpf_map_get_next_key(fd, key, next_key) and then
> >> bpf_map_delete_elem(fd, key) from userspace races with kernel code,
> >> and it might lose some information when deleting.
> >> We could then do map_dump_and_delete using that idea, but in the
> >> kernel, where we could better handle the race condition. In that
> >> scenario, even if we retrieve the same key it will contain different
> >> info (the delta between the old and new value). Would that work?
> >
> > you mean get_next+lookup+delete at once?
> > Sounds useful.
> > Yonghong has been thinking about a batching api as well.
>
> In bcc, we have many instances like this:
>   get all (key, value) pairs, do some analysis and output,
>   delete all keys
>
> The implementation is typically like:
>   /* to get all (key, value) pairs */
>   while (bpf_get_next_key() == 0)
>       bpf_map_lookup()
>   /* do analysis and output */
>   for (all keys)
>       bpf_map_delete()

If you do that in a map that is being modified while you are doing the
analysis and output, you will lose some new data by deleting the keys,
right?

> get_next+lookup+delete will definitely be useful.
> batching will be even better to save the number of syscalls.
>
> An alternative is to do batch get_next+lookup and batch delete
> to achieve a similar goal as the above code.

What I mentioned above is what makes me think that, with the deletion,
it'd be better if we performed these 3 operations at once
(get_next+lookup+delete) as a jumbo/atomic command, and batched them
later.

> There is a minor difference between this approach
> and the above get_next+lookup+delete.
> While scanning the hash map, get_next+lookup may get fewer
> elements than get_next+lookup+delete, as the latter
> may see more later-inserted hash elements after the operation
> starts. But both are inaccurate, so probably the difference
> is minor.
>
> >
> > I think if we cannot figure out how to make a batch of two commands,
> > get_next + lookup, work correctly, then we need to identify/invent one
> > command and make batching more generic.
>
> not 100% sure. It will be hard to define what is "correctly".

I agree, it'll be hard to define what the right behavior is.

> For a map that is not changing, a loop of (get_next, lookup) and batch
> get_next+lookup should have the same results.

This is true for the api I'm presenting. The only thing that I was
missing was what to do for changing maps, to avoid the weird scenario
(getting the first key due to a concurrent deletion). And, in my
opinion, the way to go should be what Willem also supported: return the
error to the caller and restart the dumping. I could do this with the
existing code just by detecting that we did provide a prev_key and got
the first_key instead of the next_key, or even implement a new function
if you want.

> For constantly changing maps, I am not sure how to define which one
> is correct. If users have concerns, they may need to just pick the one
> which gives them more comfort.
>
> > Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
> > Define the semantics of this single compound command.
> > And then let batching be a multiplier of such a command.
> > In the sense that multiplier 1 or N should behave the same way.
> > No extra flags to alter the batching.
> > The high level description of the batch would be:
> > pls execute get_next,lookup,delete and repeat it N times.
> > or
> > pls execute get_next,lookup and repeat N times.

But any attempt to do get_next+lookup will have the same problem with
deletions, right? I don't see how we could make it more consistent than
what I'm proposing. Let's just support one case: report an error if the
prev_key was not found, instead of retrieving the first_key. Would that
work?

> > where each command action is defined to be composable.
> >
> > Just a rough idea.
> >