On 7/26/19 4:36 PM, Brian Vazquez wrote:
> On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@xxxxxx> wrote:
>>
>> On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
>>> On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@xxxxxxxxx> wrote:
>>>>
>>>> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
>>>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>>>
>>>>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
>>>>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
>>>>>>>>> again. This is pretty weird.
>>>>>>>>
>>>>>>>> Yes, I know. But note that this scenario happens even with the
>>>>>>>> old interface (imagine you are walking a map from userspace and you
>>>>>>>> call get_next_key after the prev_key was removed: you will start again
>>>>>>>> from the beginning without noticing it).
>>>>>>>> I tried to send a patch for this in the past but I was missing some
>>>>>>>> context: before NULL was used to get the very first key, the interface
>>>>>>>> relied on a random (non-existent) key to retrieve the first key in the
>>>>>>>> map, and I was told that we still have to support that scenario.
>>>>>>>
>>>>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
>>>>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
>>>>>>> don't have to support legacy scenarios.
>>>>>>>
>>>>>>> Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
>>>>>>> to look up previous keys. Would something along this direction work?
>>>>>>
>>>>>> I've been thinking about it, and I think first we need a way to detect
>>>>>> that, because the key was not present, we got the first_key instead:
>>>>>>
>>>>>> - One solution I had in mind was to explicitly ask for the first key
>>>>>> with map_get_next_key(map, NULL, first_key), and while walking the map
>>>>>> check that map_get_next_key(map, prev_key, key) doesn't return that
>>>>>> same first key. This could be done using memcmp.
>>>>>> - Discussing with Stan, he mentioned that another option is to support
>>>>>> a flag in map_get_next_key to let it know that we want an error
>>>>>> instead of the first_key.
>>>>>>
>>>>>> After detecting the problem we also need to define what we want to do.
>>>>>> Here are some options:
>>>>>>
>>>>>> a) Return the error to the caller.
>>>>>> b) Retry with previous keys, if any (which would be limited to the keys
>>>>>> that we have traversed so far in this dump call).
>>>>>> c) Continue with the next entries in the map. An array is easy: just get
>>>>>> the next valid key (starting at i+1). But an hmap might be difficult,
>>>>>> since starting at the next bucket could potentially skip some keys that
>>>>>> were concurrently added to the same bucket where the key used to be, and
>>>>>> starting at the same bucket could lead us to return repeated elements.
>>>>>>
>>>>>> Or maybe we could support those 3 cases via flags and let the caller
>>>>>> decide which one to use?
>>>>>
>>>>> This type of indecision is the reason why I wasn't excited about
>>>>> batch dumping in the first place and gave a 'soft yes' when Stan
>>>>> mentioned it during the lsf/mm/bpf uconf.
>>>>> We probably shouldn't do it.
>>>>> It feels like this map_dump makes the api more complex and doesn't
>>>>> really give much benefit to the user other than making a large map
>>>>> dump faster. I think we gotta solve this problem differently.
>>>>
>>>> Some users are working around the dumping problems with the existing
>>>> api by creating a bpf_map_get_next_key_and_delete userspace function
>>>> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/),
>>>> which in my opinion is actually a good idea.
The only problem with
>>>> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
>>>> bpf_map_delete_elem(fd, key) from userspace races with kernel code,
>>>> and it might lose some information when deleting.
>>>> We could then do map_dump_and_delete using that idea, but in the
>>>> kernel, where we could better handle the race condition. In that
>>>> scenario, even if we retrieve the same key, it will contain different
>>>> info (the delta between the old and new value). Would that work?
>>>
>>> You mean get_next+lookup+delete at once?
>>> Sounds useful.
>>> Yonghong has been thinking about a batching api as well.
>>
>> In bcc, we have many instances like this:
>> get all (key, value) pairs, do some analysis and output,
>> then delete all keys.
>>
>> The implementation is typically like:
>>
>>   /* to get all (key, value) pairs */
>>   while (bpf_get_next_key() == 0)
>>       bpf_map_lookup()
>>   /* do analysis and output */
>>   for (all keys)
>>       bpf_map_delete()
>
> If you do that in a map that is being modified while you are doing the
> analysis and output, you will lose some new data by deleting the keys,
> right?

Agreed. It is possible that if the same keys are reused to generate data
during the analysis and output period, we will miss that new data by
deleting the keys. From that perspective, your above approach

  while (bpf_get_next_key())
      bpf_map_delete(prev_key)
      bpf_map_lookup()
      reset prev_key

should provide a better alternative.

>
>> get_next+lookup+delete will definitely be useful.
>> Batching will be even better, to save the number of syscalls.
>>
>> An alternative is to do batch get_next+lookup and batch delete
>> to achieve a similar goal as the above code.
>
> What I mentioned above is what makes me think that, with deletion, it'd
> be better if we perform these 3 operations at once:
> get_next+lookup+delete as a jumbo/atomic command, and batch them later?

Agree. This is indeed the one most useful for the bcc use case as well.
>
>>
>> There is a minor difference between this approach
>> and the above get_next+lookup+delete.
>> While scanning the hash map, get_next+lookup may get fewer
>> elements than get_next+lookup+delete, as the latter
>> may see more later-inserted hash elements after the operation
>> starts. But both are inaccurate, so probably the difference
>> is minor.
>>
>>>
>>> I think if we cannot figure out how to make a batch of the two commands
>>> get_next + lookup work correctly, then we need to identify/invent one
>>> command and make batching more generic.
>>
>> Not 100% sure. It will be hard to define what "correctly" means.
>
> I agree, it'll be hard to define what the right behavior is.
>
>> For a non-changing map, looping over (get_next, lookup) and batch
>> get_next+lookup should have the same results.
>
> This is true for the api I'm presenting. The only thing that I was
> missing was what to do for changing maps to avoid the weird scenario
> (getting the first key due to a concurrent deletion). And, in my
> opinion, the way to go should be what Willem also supported: return the
> error to the caller and restart the dumping. I could do this with the
> existing code just by detecting that we did provide a prev_key and got
> the first_key instead of the next_key, or even implement a new function
> if you want.

Always starting from the first key has its drawbacks, as we keep getting
the new elements if they are constantly populated. This may skew the
results for a large hash table.

Maybe we can just do lookup+delete or batch lookup+delete? The user
passing NULL means starting from the first key to lookup/delete. Every
(batch) lookup+delete deletes one key or a set of keys; the set of keys
is retrieved using an internal get_next. The (batch) lookup+delete will
return the next available key, which the user can use for the next
(batch) lookup+delete. If the user-provided key does not match, the user
can provide a flag to go to the first key, or get an error back.
>
>> For constantly changing maps, I am not sure how to define which one
>> is correct. If users have concerns, they may need to just pick the one
>> which gives them more comfort.
>>
>>> Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
>>> Define the semantics of this single compound command.
>>> And then let batching be a multiplier of such a command.
>>> In the sense that multiplier 1 or N should behave the same way.
>>> No extra flags to alter the batching.
>>> The high level description of the batch would be:
>>> pls execute get_next,lookup,delete and repeat it N times.
>>> or
>>> pls execute get_next,lookup and repeat N times.
>
> But any attempt to do get_next+lookup will have the same problem with
> deletions, right?
>
> I don't see how we could do it more consistently than what I'm
> proposing. Let's just support one case: report an error if the
> prev_key was not found, instead of retrieving the first_key. Would that
> work?
>
>>> where each command action is defined to be composable.
>>>
>>> Just a rough idea.
>>>