Hi, On 2/5/2025 8:58 PM, Ritesh Oedayrajsingh Varma wrote: > Hi, > > We are in a situation where we're frequently updating a > BPF_MAP_TYPE_HASH_OF_MAPS with new data for a given key via > bpf_map_update_elem(). During profiling, we've noticed that > bpf_map_update_elem() on such maps is _very_ expensive. In our tests, > the average time is ~9ms per call, with spikes to ~45ms per call: > > Function Name: bpf_map_update_elem > Number of calls: 1213 > Total time: 11s 880ms 994µs > Maximum: 45ms 431µs > Top Quartile: 11ms 660µs > Average: 9ms 794µs > Median: 9ms 218µs > Bottom Quartile: 7ms 363µs > Minimum: 23µs > > The cause of this poor performance is the wait for the RCU grace > period when map_update_elem() is called: after the update has > completed without errors, it calls maybe_wait_bpf_programs() which in > turn calls synchronize_rcu() for BPF_MAP_TYPE_HASH_OF_MAPS (and > BPF_MAP_TYPE_ARRAY_OF_MAPS). > > As I understand from the commit that introduced this [1], the RCU GP > wait was added to ensure that user space could be guaranteed that > after the update, no BPF programs are still looking at the old value > of the map [2]. When this commit was introduced, the RCU GP wait also > covered a potential UAF when updating the outer map while a BPF > program was still looking at the old inner map. That UAF was (much) > later addressed by a different patchset [3] and the discussion in that > patchset [4] mentions that maybe_wait_bpf_programs() is not needed > anymore with the UAF fixes: > >> So, you're correct, maybe_wait_bpf_programs() is not sufficient any more, >> but we cannot delete it, since it addresses user space assumptions >> on what bpf progs see when the inner map is replaced. > Given this, while it's not possible to remove the wait entirely > without breaking user space, I was wondering if it would be > possible/acceptable to add a way to opt-out of this behavior for > programs like ours that don't care about this. One way to do so could > be to add an additional flag to the BPF_MAP_CREATE flags, perhaps > something like BPF_F_INNER_MAP_NO_SYNC. There are already map-specific > flags in there (for example, BPF_F_NO_COMMON_LRU or > BPF_F_STACK_BUILD_ID), so it would fit with that pattern; > maybe_wait_bpf_programs() could then check the map flags and only > perform the wait if the flag is not set (which is the default). > > In our case, we don't care if running BPF programs are still working > with the old map, but for the thousands of bpf_map_update_elem() calls > we're doing in certain situations, we're spending _seconds_ waiting on > the RCU GP, so adding something like this would greatly improve the > latency in our scenarios. > > If this sounds like something that would be acceptable, I'd be happy > to make the change and send a patch, of course. Any thoughts on this > are appreciated! If the time used for synchronize_rcu() is too long, maybe we could switch to synchronize_rcu_expedited() instead. Could you please check the average map update time for synchronize_rcu_expedited() ? > > [1] commit 1ae80cf31938 ("bpf: wait for running BPF programs when > updating map-in-map") > [2] https://lore.kernel.org/lkml/20181111221706.032923266@xxxxxxxxxxxxxxxxxxx/ > [3] https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@xxxxxxxxxxxxxxx/ > [4] https://lore.kernel.org/bpf/CAADnVQK=tJRhQY1zfLK2n7_tPA5+vN8+KqWmSLqjubUuh6UFAw@xxxxxxxxxxxxxx/ > > Cheers, > Ritesh > > > .