Poor performance of bpf_map_update_elem() for BPF_MAP_TYPE_HASH_OF_MAPS / BPF_MAP_TYPE_ARRAY_OF_MAPS

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

We are in a situation where we're frequently updating a
BPF_MAP_TYPE_HASH_OF_MAPS with new data for a given key via
bpf_map_update_elem(). During profiling, we've noticed that
bpf_map_update_elem() on such maps is _very_ expensive. In our tests,
the average time is ~9ms per call, with spikes to ~45ms per call:

Function Name:   bpf_map_update_elem
Number of calls:  1213
Total time:            11s 880ms 994µs
Maximum:            45ms 431µs
Top Quartile:        11ms 660µs
Average:              9ms 794µs
Median:                9ms 218µs
Bottom Quartile:   7ms 363µs
Minimum:             23µs

The cause of this poor performance is the wait for the RCU grace
period when map_update_elem() is called: after the update has
completed without errors, it calls maybe_wait_bpf_programs() which in
turn calls synchronize_rcu() for BPF_MAP_TYPE_HASH_OF_MAPS (and
BPF_MAP_TYPE_ARRAY_OF_MAPS).

As I understand from the commit that introduced this [1], the RCU GP
wait was added to ensure that user space could be guaranteed that
after the update, no BPF programs are still looking at the old value
of the map [2]. When this commit was introduced, the RCU GP wait also
covered a potential UAF when updating the outer map while a BPF
program was still looking at the old inner map. That UAF was (much)
later addressed by a different patchset [3] and the discussion in that
patchset [4] mentions that maybe_wait_bpf_programs() is not needed
anymore with the UAF fixes:

> So, you're correct, maybe_wait_bpf_programs() is not sufficient any more,
> but we cannot delete it, since it addresses user space assumptions
> on what bpf progs see when the inner map is replaced.

Given this, while it's not possible to remove the wait entirely
without breaking user space, I was wondering if it would be
possible/acceptable to add a way to opt-out of this behavior for
programs like ours that don't care about this. One way to do so could
be to add an additional flag to the BPF_MAP_CREATE flags, perhaps
something like BPF_F_INNER_MAP_NO_SYNC. There are already map-specific
flags in there (for example, BPF_F_NO_COMMON_LRU or
BPF_F_STACK_BUILD_ID), so it would fit with that pattern;
maybe_wait_bpf_programs() could then check the map flags and only
perform the wait if the flag is not set (which is the default).

In our case, we don't care if running BPF programs are still working
with the old map, but for the thousands of bpf_map_update_elem() calls
we're doing in certain situations, we're spending _seconds_ waiting on
the RCU GP, so adding something like this would greatly improve the
latency in our scenarios.

If this sounds like something that would be acceptable, I'd be happy
to make the change and send a patch, of course. Any thoughts on this
are appreciated!

[1] commit 1ae80cf31938 ("bpf: wait for running BPF programs when
updating map-in-map")
[2] https://lore.kernel.org/lkml/20181111221706.032923266@xxxxxxxxxxxxxxxxxxx/
[3] https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@xxxxxxxxxxxxxxx/
[4] https://lore.kernel.org/bpf/CAADnVQK=tJRhQY1zfLK2n7_tPA5+vN8+KqWmSLqjubUuh6UFAw@xxxxxxxxxxxxxx/

Cheers,
Ritesh





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux