On Sat, Feb 8, 2025 at 4:58 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Feb 5, 2025 at 4:58 AM Ritesh Oedayrajsingh Varma
> <ritesh@xxxxxxxxxxxxxxx> wrote:
> >
> > Given this, while it's not possible to remove the wait entirely
> > without breaking user space, I was wondering if it would be
> > possible/acceptable to add a way to opt-out of this behavior for
> > programs like ours that don't care about this. One way to do so could
> > be to add an additional flag to the BPF_MAP_CREATE flags, perhaps
> > something like BPF_F_INNER_MAP_NO_SYNC.
>
> Sounds reasonable.
> The flag name is a bit cryptic, BPF_F_UPDATE_INNER_MAP_NO_WAIT
> is a bit more explicit, but I'm not sure whether it's better.

I agree the name is a bit cryptic. A related question is whether the
optimization to skip the RCU wait should apply only to updates, or also
to deletes. I think it would make sense for it to apply to all
operations. What do you think?

I also realized the flag should technically apply to the *outer* map,
since that's the map that is actually being modified and synced on, not
the inner map. So, in retrospect, I don't think "inner" should be part
of the name. Perhaps BPF_F_MAP_OF_MAPS_NO_WAIT or
BPF_F_MAP_IN_MAP_NO_WAIT? I'm slightly leaning towards the latter
because the map-of-maps support code also lives in map_in_map.c, so
that matches nicely. They're both a bit long, though. Either way, the
definition of the outer map when using this flag would become something
like:

struct {
        __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
        __uint(max_entries, 4096);
        __type(key, u64);
        __type(value, u32);
        __uint(map_flags, BPF_F_MAP_IN_MAP_NO_WAIT);
} mapInMap SEC(".maps");

> Also have you considered a batched update? There will be
> only one sync_rcu() for the whole batch.

We have, yes, but in our case these updates are the result of another
system generating events, and it is a bit hard to batch them together:
it would mean waiting for multiple events to arrive instead of
processing each event as it comes in, which by itself introduces an
additional layer of latency.

Regarding batched updates, we've also found them to be very slow when
inserting a large number of elements. In one example where we inserted
~1.2 million 16-byte elements, the batched update took 400-500
milliseconds. This is due to the overhead of generic_map_update_batch()
copying each element to be inserted from user space individually via
copy_from_user(); almost all of the time goes to __check_object_size(),
called from copy_from_user(). I suspect this one is hard to fix,
though, given how the elements of a map are laid out in memory; it
would be difficult to turn such a batched update into a single copy.
But perhaps that's something for a different thread (and we can easily
work around it on our side).

Thanks,
Ritesh
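
P.S. For concreteness, here is a rough sketch of the batched-update
path we measured, assuming a reasonably recent libbpf (the
LIBBPF_OPTS/bpf_map_update_batch API). The names map_fd, NR_ELEMS and
the key/value layout are placeholders for illustration, not our actual
code:

#include <bpf/bpf.h>
#include <errno.h>
#include <stdlib.h>

#define NR_ELEMS 4096 /* placeholder element count */

static int update_all(int map_fd)
{
        /* Placeholder key/value layout, not our real 16-byte elements. */
        __u64 *keys = calloc(NR_ELEMS, sizeof(__u64));
        __u64 *vals = calloc(NR_ELEMS, sizeof(__u64));
        __u32 count = NR_ELEMS;
        LIBBPF_OPTS(bpf_map_batch_opts, opts,
                .elem_flags = BPF_ANY);
        int err;

        if (!keys || !vals) {
                free(keys);
                free(vals);
                return -ENOMEM;
        }

        /* ... fill keys[] and vals[] from the incoming events ... */

        /* One syscall for the whole batch (and, for a map-in-map, a
         * single sync_rcu()), but generic_map_update_batch() still
         * copies each element individually via copy_from_user(). */
        err = bpf_map_update_batch(map_fd, keys, vals, &count, &opts);

        free(keys);
        free(vals);
        return err;
}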