On Sat, Feb 8, 2025 at 7:22 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2/5/2025 8:58 PM, Ritesh Oedayrajsingh Varma wrote:
> > Hi,
> >
> > We are in a situation where we're frequently updating a
> > BPF_MAP_TYPE_HASH_OF_MAPS with new data for a given key via
> > bpf_map_update_elem(). During profiling, we've noticed that
> > bpf_map_update_elem() on such maps is _very_ expensive. In our
> > tests, the average time is ~9ms per call, with spikes to ~45ms
> > per call:
> >
> > Function Name: bpf_map_update_elem
> > Number of calls: 1213
> > Total time: 11s 880ms 994µs
> > Maximum: 45ms 431µs
> > Top Quartile: 11ms 660µs
> > Average: 9ms 794µs
> > Median: 9ms 218µs
> > Bottom Quartile: 7ms 363µs
> > Minimum: 23µs
> >
> > The cause of this poor performance is the wait for the RCU grace
> > period when map_update_elem() is called: after the update has
> > completed without errors, it calls maybe_wait_bpf_programs(),
> > which in turn calls synchronize_rcu() for
> > BPF_MAP_TYPE_HASH_OF_MAPS (and BPF_MAP_TYPE_ARRAY_OF_MAPS).
> >
> > As I understand from the commit that introduced this [1], the RCU
> > GP wait was added to guarantee to user space that, once the
> > update returns, no BPF programs are still looking at the old
> > value of the map [2]. When this commit was introduced, the RCU GP
> > wait also covered a potential UAF when updating the outer map
> > while a BPF program was still looking at the old inner map. That
> > UAF was (much) later addressed by a different patchset [3], and
> > the discussion in that patchset [4] mentions that
> > maybe_wait_bpf_programs() is not needed anymore with the UAF
> > fixes:
> >
> >> So, you're correct, maybe_wait_bpf_programs() is not sufficient any more,
> >> but we cannot delete it, since it addresses user space assumptions
> >> on what bpf progs see when the inner map is replaced.
> >
> > Given this, while it's not possible to remove the wait entirely
> > without breaking user space, I was wondering if it would be
> > possible/acceptable to add a way to opt out of this behavior for
> > programs like ours that don't care about it. One way to do so
> > would be to add an additional flag to the BPF_MAP_CREATE flags,
> > perhaps something like BPF_F_INNER_MAP_NO_SYNC. There are already
> > map-specific flags in there (for example, BPF_F_NO_COMMON_LRU or
> > BPF_F_STACK_BUILD_ID), so it would fit with that pattern;
> > maybe_wait_bpf_programs() could then check the map flags and only
> > perform the wait if the flag is not set (which would remain the
> > default).
> >
> > In our case, we don't care if running BPF programs are still
> > working with the old map, but for the thousands of
> > bpf_map_update_elem() calls we're doing in certain situations,
> > we're spending _seconds_ waiting on the RCU GP, so adding
> > something like this would greatly improve the latency in our
> > scenarios.
> >
> > If this sounds like something that would be acceptable, I'd be
> > happy to make the change and send a patch, of course. Any
> > thoughts on this are appreciated!
>
> If the time used for synchronize_rcu() is too long, maybe we could
> switch to synchronize_rcu_expedited() instead. Could you please
> check the average map update time for synchronize_rcu_expedited()?

I'm not very familiar with synchronize_rcu_expedited(), but does it
provide the same guarantees as synchronize_rcu()? If not, it would be
a breaking change. Reading up on it, it also seems to trade system
disturbance for latency in a way synchronize_rcu() does not (e.g. by
interrupting other CPUs to expedite the grace period).
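To make the proposal above concrete, here is roughly what I have in
mind for maybe_wait_bpf_programs() in kernel/bpf/syscall.c. This is
an untested sketch, not a patch: BPF_F_INNER_MAP_NO_SYNC is the flag
name I'm proposing, not something that exists today, and this is
incidentally also the spot where synchronize_rcu_expedited() could be
substituted:

static void maybe_wait_bpf_programs(struct bpf_map *map)
{
	/* Wait for any running BPF programs to complete so that user
	 * space, once the update returns, knows that all programs
	 * see the new inner map. Skip the wait if user space opted
	 * out at map creation time via the proposed
	 * BPF_F_INNER_MAP_NO_SYNC flag.
	 */
	if ((map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
	     map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) &&
	    !(map->map_flags & BPF_F_INNER_MAP_NO_SYNC))
		synchronize_rcu(); /* or synchronize_rcu_expedited() */
}

The flag itself would of course also need a definition in
include/uapi/linux/bpf.h and to be accepted by the map-creation flag
checks, but the core of the change is the check above.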
Either way, I can attempt to test synchronize_rcu_expedited(), but
I'll need to see when I can schedule some time for it. That all being
said, even if synchronize_rcu_expedited() is faster than
synchronize_rcu(), a flag to skip the wait entirely would still be
faster, so I think it'd make sense to make both changes.

> >
> > [1] commit 1ae80cf31938 ("bpf: wait for running BPF programs when
> > updating map-in-map")
> > [2] https://lore.kernel.org/lkml/20181111221706.032923266@xxxxxxxxxxxxxxxxxxx/
> > [3] https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@xxxxxxxxxxxxxxx/
> > [4] https://lore.kernel.org/bpf/CAADnVQK=tJRhQY1zfLK2n7_tPA5+vN8+KqWmSLqjubUuh6UFAw@xxxxxxxxxxxxxx/
> >
> > Cheers,
> > Ritesh
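P.S. From the user-space side, the opt-out would then be requested at
map creation time. A hypothetical libbpf usage, assuming the flag
gets in (the flag value below is just a placeholder and the map
parameters are made up):

#include <bpf/bpf.h>

/* Placeholder bit for the proposed flag; the real value would be
 * assigned in include/uapi/linux/bpf.h.
 */
#define BPF_F_INNER_MAP_NO_SYNC (1U << 30)

int create_outer_map(int inner_map_fd)
{
	/* Opt this outer map out of the RCU GP wait on updates. */
	LIBBPF_OPTS(bpf_map_create_opts, opts,
		    .inner_map_fd = inner_map_fd,
		    .map_flags = BPF_F_INNER_MAP_NO_SYNC);

	/* The value of a hash-of-maps entry is an inner map fd. */
	return bpf_map_create(BPF_MAP_TYPE_HASH_OF_MAPS, "outer_map",
			      sizeof(__u32), sizeof(__u32), 1024,
			      &opts);
}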