On Sat, Feb 08, 2025 at 11:39:31AM +0100, Ritesh Oedayrajsingh Varma wrote:
> On Sat, Feb 8, 2025 at 4:58 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Wed, Feb 5, 2025 at 4:58 AM Ritesh Oedayrajsingh Varma
> > <ritesh@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Given this, while it's not possible to remove the wait entirely
> > > without breaking user space, I was wondering if it would be
> > > possible/acceptable to add a way to opt out of this behavior for
> > > programs like ours that don't care about this. One way to do so could
> > > be to add an additional flag to the BPF_MAP_CREATE flags, perhaps
> > > something like BPF_F_INNER_MAP_NO_SYNC.
> >
> > Sounds reasonable.
> > The flag name is a bit cryptic, BPF_F_UPDATE_INNER_MAP_NO_WAIT
> > is a bit more explicit, but I'm not sure whether it's better.
>
> I agree the name is a bit cryptic. A related question is whether the
> optimization to skip the RCU wait should only apply to the update, or
> also to delete. I think it would make sense for it to apply to all
> operations. What do you think?

Exposing kernel behavior to user space is not a good idea, since users
would have to understand kernel details to know how to use this flag
safely.

>
> I also realized the flag should technically apply to the *outer* map,
> since that's the map that's actually being modified and synced on, not
> the inner map. So I don't think "inner" should be part of the name in
> retrospect. Perhaps BPF_F_MAP_OF_MAPS_NO_WAIT or
> BPF_F_MAP_IN_MAP_NO_WAIT? I'm slightly leaning towards the latter
> because the map of maps support code is also located in map_in_map.c,
> so that matches nicely. They're both a bit long though. Either way,
> the definition of the outer map when using this flag would become
> something like:
>
> struct {
>         __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
>         __uint(max_entries, 4096);
>         __type(key, u64);
>         __type(value, u32);
>         __uint(map_flags, BPF_F_MAP_IN_MAP_NO_WAIT);
> } mapInMap SEC(".maps");
>
> > Also have you considered a batched update? There will be
> > only one sync_rcu() for the whole batch.
>
> We have, yes, but in our case these updates are the result of another
> system generating events, and it is a bit hard to batch them together:
> it would involve waiting for multiple events to arrive instead of
> processing events as they come in, which introduces an additional
> layer of latency by itself.
>
> Regarding batched update, we've found that it is also very slow when
> inserting a large number of elements. In one example where we inserted
> ~1.2 million 16-byte elements, we found it took 400-500 milliseconds to
> update the map via batched update. This is due to the overhead of
> generic_map_update_batch() individually copying each element to be
> inserted from user space via copy_from_user(); almost all of the time
> goes to __check_object_size() called by copy_from_user(). I suspect
> this one is hard to fix though, due to how the elements in a map are
> laid out in memory; it would be hard to change such batched updates
> into a single copy. But perhaps that's something for a different
> thread (and we can easily work around it on our side).

I think there is room to improve this:

1) As you mentioned, for hash-based or any other non-linear maps, it is
indeed hard to optimize. However, for linear ones like the array map, it
is possible to copy the whole memory from user space in one go.
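For illustration, a rough sketch of the idea, assuming the array map's
contiguous value layout. This is not actual kernel code: the helper name
is made up, locking and most error handling are omitted, and it assumes
the user buffer uses the same 8-byte-aligned element stride as the
kernel side:

#include <linux/bpf.h>
#include <linux/uaccess.h>

/* Hypothetical helper, for illustration only: an array map keeps its
 * values in one contiguous allocation (array->value), so a batch of
 * consecutive indices can be filled with a single copy_from_user()
 * instead of one call (and one __check_object_size() pass) per element.
 */
static int array_map_update_batch_contig(struct bpf_array *array,
                                         const void __user *uvalues,
                                         u32 first_idx, u32 count)
{
        u32 elem_size = array->elem_size;   /* round_up(value_size, 8) */

        if (first_idx >= array->map.max_entries ||
            count > array->map.max_entries - first_idx)
                return -E2BIG;

        /* Assumes user space lays out its buffer with the same
         * elem_size stride; otherwise a per-element copy (or a bounce
         * buffer) is still needed.
         */
        if (copy_from_user(array->value + (u64)first_idx * elem_size,
                           uvalues, (u64)count * elem_size))
                return -EFAULT;

        return 0;
}

If I read the layout right, for the 16-byte elements mentioned above the
strides already match, since 16 is a multiple of 8.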
2) There are actually two copies here: one from user space into a
temporary kernel buffer, and another from that temporary buffer into
the actual map key/value memory. I speculate it is possible to optimize
them down to one copy, at least for simple cases.

Just my two cents. Thanks!
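For reference, the batched update measured above is driven from user
space roughly as follows. This is only a minimal sketch using libbpf's
bpf_map_update_batch(); the map fd, the 16-byte value struct, and the
element count are placeholders, not the code from the report:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <bpf/bpf.h>

/* Placeholder for the ~16-byte elements mentioned above. */
struct value16 {
        __u64 a;
        __u64 b;
};

/* Push "count" key/value pairs into an already-created map with one
 * BPF_MAP_UPDATE_BATCH syscall; the kernel still copies each element
 * individually in generic_map_update_batch().
 */
static int update_all(int map_fd, const __u64 *keys,
                      const struct value16 *values, __u32 count)
{
        LIBBPF_OPTS(bpf_map_batch_opts, opts,
                .elem_flags = BPF_ANY,  /* create or update */
        );
        __u32 n = count;
        int err;

        err = bpf_map_update_batch(map_fd, keys, values, &n, &opts);
        if (err)
                fprintf(stderr, "batch update stopped after %u elements: %s\n",
                        n, strerror(errno));
        return err;
}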