Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically

Hi,

On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi Toke,
>>
>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes:
>>>
>>>> +cc Cody Haas
>>>>
>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>
>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>> Currently, the update of existing element in hash map involves two
>>>>> steps:
>>>>> 1) insert the new element at the head of the hash list
>>>>> 2) remove the old element
>>>>>
>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>> either the old element or the new element if the lookup operation starts
>>>>> before the addition and continues after the removal.
>>>>>
>>>>> Therefore, replace the two-step update with an atomic update. After
>>>>> the change, the update will be atomic from the perspective of the
>>>>> lookup operation: it will find either the old element or the new
>>>>> element.
> I'm missing the point.
> This "atomic" replacement doesn't really solve anything.
> lookup will see one element.
> That element could be deleted by another thread.
> bucket lock and either two step update or single step
> don't change anything from the pov of bpf prog doing lookup.

The point is that overwriting an existing element may cause concurrent
lookups to return ENOENT, as demonstrated by the added selftest, and the
patch tries to "fix" that. However, it seems that using
hlist_nulls_replace_rcu() for the overwriting update is still not
enough: even when the lookup procedure finds the old element, that
element may already be undergoing reuse, so the comparison of the map
key may fail and the lookup may still return ENOENT.
>
>>>>> Signed-off-by: Hou Tao <hotforest@xxxxxxxxx>
>>>>> ---
>>>>>  kernel/bpf/hashtab.c | 14 ++++++++------
>>>>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>> index 4a9eeb7aef85..a28b11ce74c6 100644
>>>>> --- a/kernel/bpf/hashtab.c
>>>>> +++ b/kernel/bpf/hashtab.c
>>>>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>>>>             goto err;
>>>>>     }
>>>>>
>>>>> -   /* add new element to the head of the list, so that
>>>>> -    * concurrent search will find it before old elem
>>>>> -    */
>>>>> -   hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> -   if (l_old) {
>>>>> -           hlist_nulls_del_rcu(&l_old->hash_node);
>>>>> +   if (!l_old) {
>>>>> +           hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> +   } else {
>>>>> +           /* Replace the old element atomically, so that
>>>>> +            * concurrent search will find either the new element or
>>>>> +            * the old element.
>>>>> +            */
>>>>> +           hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>>>>
>>>>>             /* l_old has already been stashed in htab->extra_elems, free
>>>>>              * its special fields before it is available for reuse. Also
>>>>>
>>>> After thinking about it a second time, the atomic list replacement on
>>>> the update side is enough to make the lookup operation always find the
>>>> existing element. However, due to the immediate reuse, the lookup may
>>>> find an unexpected value. Maybe we should disable the immediate reuse
>>>> for specific maps (e.g., htab of maps).
>>> Hmm, in an RCU-protected data structure, reusing the memory before an
>>> RCU grace period has elapsed is just as wrong as freeing it, isn't it?
>>> I.e., the reuse logic should have some kind of call_rcu redirection to
>>> be completely correct?
>> Not for all cases. There are SLAB_TYPESAFE_BY_RCU-typed slabs. For the
>> hash map, the reuse is also tricky (e.g., the "goto again" case in
>> lookup_nulls_elem_raw), but that cannot prevent the lookup procedure
>> from returning an unexpected value. I had posted a patch set [1] to
>> "fix" that, but Alexei said it is "a known quirk". Here I am not sure
>> whether it is reasonable to disable the reuse for htab of maps only. I
>> will post a v2 for the patch set.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@xxxxxxxxxxxxxxx/
> yes. we still have to keep prealloc as default for now :(
> Eventually bpf_mem_alloc is replaced with fully re-entrant
> and safe kmalloc, then we can do fully re-entrant and safe
> kfree_rcu. Then we can talk about closing this quirk.
> Until then the prog has to deal with immediate reuse.
> That was the case for a decade already.

I see. In v2 I will fall back to the original idea: adding a standalone
update procedure for htab of maps, in which the map_ptr is overwritten
atomically, just like array of maps does.




