Re: Poor performance of bpf_map_update_elem() for BPF_MAP_TYPE_HASH_OF_MAPS / BPF_MAP_TYPE_ARRAY_OF_MAPS

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Sat, 8 Feb 2025 14:22:40 +0800



Hi,

On 2/5/2025 8:58 PM, Ritesh Oedayrajsingh Varma wrote:
> Hi,
>
> We are in a situation where we're frequently updating a
> BPF_MAP_TYPE_HASH_OF_MAPS with new data for a given key via
> bpf_map_update_elem(). During profiling, we've noticed that
> bpf_map_update_elem() on such maps is _very_ expensive. In our tests,
> the average time is ~9ms per call, with spikes to ~45ms per call:
>
> Function Name:   bpf_map_update_elem
> Number of calls:  1213
> Total time:            11s 880ms 994µs
> Maximum:            45ms 431µs
> Top Quartile:        11ms 660µs
> Average:              9ms 794µs
> Median:                9ms 218µs
> Bottom Quartile:   7ms 363µs
> Minimum:             23µs
>
> The cause of this poor performance is the wait for the RCU grace
> period when map_update_elem() is called: after the update has
> completed without errors, it calls maybe_wait_bpf_programs() which in
> turn calls synchronize_rcu() for BPF_MAP_TYPE_HASH_OF_MAPS (and
> BPF_MAP_TYPE_ARRAY_OF_MAPS).
>
> As I understand from the commit that introduced this [1], the RCU GP
> wait was added to ensure that user space could be guaranteed that
> after the update, no BPF programs are still looking at the old value
> of the map [2]. When this commit was introduced, the RCU GP wait also
> covered a potential UAF when updating the outer map while a BPF
> program was still looking at the old inner map. That UAF was (much)
> later addressed by a different patchset [3] and the discussion in that
> patchset [4] mentions that maybe_wait_bpf_programs() is not needed
> anymore with the UAF fixes:
>
>> So, you're correct, maybe_wait_bpf_programs() is not sufficient any more,
>> but we cannot delete it, since it addresses user space assumptions
>> on what bpf progs see when the inner map is replaced.
> Given this, while it's not possible to remove the wait entirely
> without breaking user space, I was wondering if it would be
> possible/acceptable to add a way to opt-out of this behavior for
> programs like ours that don't care about this. One way to do so could
> be to add an additional flag to the BPF_MAP_CREATE flags, perhaps
> something like BPF_F_INNER_MAP_NO_SYNC. There are already map-specific
> flags in there (for example, BPF_F_NO_COMMON_LRU or
> BPF_F_STACK_BUILD_ID), so it would fit with that pattern;
> maybe_wait_bpf_programs() could then check the map flags and only
> perform the wait if the flag is not set (which is the default).
>
> In our case, we don't care if running BPF programs are still working
> with the old map, but for the thousands of bpf_map_update_elem() calls
> we're doing in certain situations, we're spending _seconds_ waiting on
> the RCU GP, so adding something like this would greatly improve the
> latency in our scenarios.
>
> If this sounds like something that would be acceptable, I'd be happy
> to make the change and send a patch, of course. Any thoughts on this
> are appreciated!

If the time used for synchronize_rcu() is too long, maybe we could
switch to synchronize_rcu_expedited() instead. Could you please check
the average map update time for synchronize_rcu_expedited() ?
>
> [1] commit 1ae80cf31938 ("bpf: wait for running BPF programs when
> updating map-in-map")
> [2] https://lore.kernel.org/lkml/20181111221706.032923266@xxxxxxxxxxxxxxxxxxx/
> [3] https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@xxxxxxxxxxxxxxx/
> [4] https://lore.kernel.org/bpf/CAADnVQK=tJRhQY1zfLK2n7_tPA5+vN8+KqWmSLqjubUuh6UFAw@xxxxxxxxxxxxxx/
>
> Cheers,
> Ritesh
>
>
> .