Hi, On 2/27/2025 7:17 AM, Zvi Effron wrote: > On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: >> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: >>> Hi, >>> >>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote: >>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: >>>>> Hi Toke, >>>>> >>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: >>>>>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes: >>>>>> >>>>>>> +cc Cody Haas >>>>>>> >>>>>>> Sorry for the resend. I sent the reply in the HTML format. >>>>>>> >>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote: >>>>>>>> Currently, the update of existing element in hash map involves two >>>>>>>> steps: >>>>>>>> 1) insert the new element at the head of the hash list >>>>>>>> 2) remove the old element >>>>>>>> >>>>>>>> It is possible that the concurrent lookup operation may fail to find >>>>>>>> either the old element or the new element if the lookup operation starts >>>>>>>> before the addition and continues after the removal. >>>>>>>> >>>>>>>> Therefore, replacing the two-step update with an atomic update. After >>>>>>>> the change, the update will be atomic in the perspective of the lookup >>>>>>>> operation: it will either find the old element or the new element. >>>> I'm missing the point. >>>> This "atomic" replacement doesn't really solve anything. >>>> lookup will see one element. >>>> That element could be deleted by another thread. >>>> bucket lock and either two step update or single step >>>> don't change anything from the pov of bpf prog doing lookup. >>> The point is that overwriting an existed element may lead to concurrent >>> lookups return ENOENT as demonstrated by the added selftest and the >>> patch tried to "fix" that. However, it seems using >>> hlist_nulls_replace_rcu() for the overwriting update is still not >>> enough. Because when the lookup procedure found the old element, the old >>> element may be reusing, the comparison of the map key may fail, and the >>> lookup procedure may still return ENOENT. >> you mean l_old == l_new ? I don't think it's possible >> within htab_map_update_elem(), >> but htab_map_delete_elem() doing hlist_nulls_del_rcu() >> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu >> and just deleted elem is reused to be inserted >> into another bucket. No. I mean the following procedure in which the lookup procedure finds the old element and tries to match its key, one update procedure has already deleted the old element, and another update procedure is reusing the old element: lookup procedure A A: find the old element (instead of the new old) update procedure B B: delete the old element update procedure C on the same CPU: C: reuse the old element (overwrite its key and insert in the same bucket) A: the key is mismatched and return -ENOENT. It can be reproduced by increasing ctx.loop in the selftest. >> >> I'm not sure whether this new hlist_nulls_replace_rcu() >> primitive works with nulls logic. >> >> So back to the problem statement.. >> Are you saying that adding new to a head while lookup is in the middle >> is causing it to miss an element that >> is supposed to be updated assuming atomicity of the update? >> While now update_elem() is more like a sequence of delete + insert? >> >> Hmm. > Yes, exactly that. Because update_elem is actually a delete + insert (actually > an insert + delete, I think?), it is possible for a concurrent lookup to see no > element instead of either the old or new value. Yep. > >>> I see. In v2 I will fallback to the original idea: adding a standalone >>> update procedure for htab of maps in which it will atomically overwrite >>> the map_ptr just like array of maps does. >> hold on. is this only for hash-of-maps ? > I believe this was also replicated for hash as well as hash-of-maps. Cody can > confirm, or use the reproducer he has to test that case. The fix for hash-of-maps will be much simpler and the returned map pointer will be correct. For other types of hash map, beside switching to hlist_nulls_replace_rcu(), I think we also need to detect the reuse of element during the loop: if the element has been reused, the lookup should restart from the head again. However, even withing the above two fixes, the value returned by lookup procedure for other types of hash map may still be incorrect (as shown long time ago [1]), so I think maybe for other types of map, the atomic update doesn't matter too much. [1]: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@xxxxxxxxxxxxxxx/ > >> How that non-atomic update manifested in real production? >> > See [1] (in the cover letter for this series, also replicated below). > > [1] : https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@xxxxxxxxxxxxxxx/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6