On Sat, Feb 8, 2025 at 7:22 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2/5/2025 8:58 PM, Ritesh Oedayrajsingh Varma wrote:
> > Hi,
> >
> > We are in a situation where we're frequently updating a
> > BPF_MAP_TYPE_HASH_OF_MAPS with new data for a given key via
> > bpf_map_update_elem(). During profiling, we've noticed that
> > bpf_map_update_elem() on such maps is _very_ expensive. In our
> > tests, the average time is ~9ms per call, with spikes to ~45ms
> > per call:
> >
> > Function Name: bpf_map_update_elem
> > Number of calls: 1213
> > Total time: 11s 880ms 994µs
> > Maximum: 45ms 431µs
> > Top Quartile: 11ms 660µs
> > Average: 9ms 794µs
> > Median: 9ms 218µs
> > Bottom Quartile: 7ms 363µs
> > Minimum: 23µs
> >
> > The cause of this poor performance is the wait for the RCU grace
> > period when map_update_elem() is called: after the update has
> > completed without errors, it calls maybe_wait_bpf_programs(),
> > which in turn calls synchronize_rcu() for
> > BPF_MAP_TYPE_HASH_OF_MAPS (and BPF_MAP_TYPE_ARRAY_OF_MAPS).
> >
> > As I understand from the commit that introduced this [1], the RCU
> > GP wait was added to guarantee to user space that, once the
> > update returns, no BPF programs are still looking at the old
> > value of the map [2]. When this commit was introduced, the RCU GP
> > wait also covered a potential UAF when updating the outer map
> > while a BPF program was still looking at the old inner map. That
> > UAF was (much) later addressed by a different patchset [3], and
> > the discussion in that patchset [4] mentions that
> > maybe_wait_bpf_programs() is not needed anymore with the UAF
> > fixes:
> >
> >> So, you're correct, maybe_wait_bpf_programs() is not sufficient any more,
> >> but we cannot delete it, since it addresses user space assumptions
> >> on what bpf progs see when the inner map is replaced.
> >
> > Given this, while it's not possible to remove the wait entirely
> > without breaking user space, I was wondering if it would be
> > possible/acceptable to add a way to opt out of this behavior for
> > programs like ours that don't care about it. One way to do so
> > would be to add an additional flag to the BPF_MAP_CREATE flags,
> > perhaps something like BPF_F_INNER_MAP_NO_SYNC. There are already
> > map-specific flags in there (for example, BPF_F_NO_COMMON_LRU or
> > BPF_F_STACK_BUILD_ID), so it would fit with that pattern;
> > maybe_wait_bpf_programs() could then check the map flags and only
> > perform the wait if the flag is not set (which would remain the
> > default).
> >
> > In our case, we don't care if running BPF programs are still
> > working with the old map, but for the thousands of
> > bpf_map_update_elem() calls we're doing in certain situations,
> > we're spending _seconds_ waiting on the RCU GP, so adding
> > something like this would greatly improve the latency in our
> > scenarios.
> >
> > If this sounds like something that would be acceptable, I'd be
> > happy to make the change and send a patch, of course. Any
> > thoughts on this are appreciated!
>
> If the time used for synchronize_rcu() is too long, maybe we could
> switch to synchronize_rcu_expedited() instead. Could you please
> check the average map update time for synchronize_rcu_expedited()?

I'm not very familiar with synchronize_rcu_expedited(), but does it
provide the same guarantees as synchronize_rcu()? If not, it would be
a breaking change. Reading up on it, it also seems to trade system
disturbance for latency in a way synchronize_rcu() does not (e.g. by
interrupting other CPUs to expedite the grace period).
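To make the proposal above concrete, here is roughly what I have in
mind for maybe_wait_bpf_programs() in kernel/bpf/syscall.c. This is
an untested sketch, not a patch: BPF_F_INNER_MAP_NO_SYNC is the flag
name I'm proposing, not something that exists today, and this is
incidentally also the spot where synchronize_rcu_expedited() could be
substituted:

static void maybe_wait_bpf_programs(struct bpf_map *map)
{
	/* Wait for any running BPF programs to complete so that user
	 * space, once the update returns, knows that all programs
	 * see the new inner map. Skip the wait if user space opted
	 * out at map creation time via the proposed
	 * BPF_F_INNER_MAP_NO_SYNC flag.
	 */
	if ((map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
	     map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) &&
	    !(map->map_flags & BPF_F_INNER_MAP_NO_SYNC))
		synchronize_rcu(); /* or synchronize_rcu_expedited() */
}

The flag itself would of course also need a definition in
include/uapi/linux/bpf.h and to be accepted by the map-creation flag
checks, but the core of the change is the check above.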
Either way, I can attempt to test synchronize_rcu_expedited(), but
I'll need to see when I can schedule some time for it. That all being
said, even if synchronize_rcu_expedited() is faster than
synchronize_rcu(), a flag to skip the wait entirely would still be
faster, so I think it'd make sense to make both changes.

> >
> > [1] commit 1ae80cf31938 ("bpf: wait for running BPF programs when
> > updating map-in-map")
> > [2] https://lore.kernel.org/lkml/20181111221706.032923266@xxxxxxxxxxxxxxxxxxx/
> > [3] https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@xxxxxxxxxxxxxxx/
> > [4] https://lore.kernel.org/bpf/CAADnVQK=tJRhQY1zfLK2n7_tPA5+vN8+KqWmSLqjubUuh6UFAw@xxxxxxxxxxxxxx/
> >
> > Cheers,
> > Ritesh
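P.S. From the user-space side, the opt-out would then be requested at
map creation time. A hypothetical libbpf usage, assuming the flag
gets in (the flag value below is just a placeholder and the map
parameters are made up):

#include <bpf/bpf.h>

/* Placeholder bit for the proposed flag; the real value would be
 * assigned in include/uapi/linux/bpf.h.
 */
#define BPF_F_INNER_MAP_NO_SYNC (1U << 30)

int create_outer_map(int inner_map_fd)
{
	/* Opt this outer map out of the RCU GP wait on updates. */
	LIBBPF_OPTS(bpf_map_create_opts, opts,
		    .inner_map_fd = inner_map_fd,
		    .map_flags = BPF_F_INNER_MAP_NO_SYNC);

	/* The value of a hash-of-maps entry is an inner map fd. */
	return bpf_map_create(BPF_MAP_TYPE_HASH_OF_MAPS, "outer_map",
			      sizeof(__u32), sizeof(__u32), 1024,
			      &opts);
}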