Re: Poor performance of bpf_map_update_elem() for BPF_MAP_TYPE_HASH_OF_MAPS / BPF_MAP_TYPE_ARRAY_OF_MAPS

On Sat, Feb 8, 2025 at 7:33 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>
> On Sat, Feb 08, 2025 at 11:39:31AM +0100, Ritesh Oedayrajsingh Varma wrote:
> > On Sat, Feb 8, 2025 at 4:58 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Feb 5, 2025 at 4:58 AM Ritesh Oedayrajsingh Varma
> > > <ritesh@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > Given this, while it's not possible to remove the wait entirely
> > > > without breaking user space, I was wondering if it would be
> > > > possible/acceptable to add a way to opt-out of this behavior for
> > > > programs like ours that don't care about this. One way to do so could
> > > > be to add an additional flag to the BPF_MAP_CREATE flags, perhaps
> > > > something like BPF_F_INNER_MAP_NO_SYNC.
> > >
> > > Sounds reasonable.
> > > The flag name is a bit cryptic, BPF_F_UPDATE_INNER_MAP_NO_WAIT
> > > is a bit more explicit, but I'm not sure whether it's better.
> >
> > I agree the name is a bit cryptic. A related question is whether the
> > optimization to skip the RCU wait should only apply to the update, or
> > also to delete. I think it would make sense for it to apply to all
> > operations. What do you think?
>
> Exposing kernel-behavior to user-space is not a good idea, since users
> have to understand kernel details to know how to safely use this flag.

I agree it's not ideal, but I don't really see another option that
doesn't involve breaking user space, and the current brute-force sync
in maybe_wait_bpf_programs() is quite heavy-handed for a guarantee
that at least some percentage of users (like us) don't care about.
Given that it's an *opt-out* and not an opt-in, I think it's okay:
everything will continue to function as before if you don't use the
flag, and if you *do* use the flag, then presumably you've spent the
time to understand what it does and why (and we can of course add a
comment to the new flag explaining it).
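
For concreteness, here is a rough, untested sketch of what the
opt-out could look like on the kernel side, against the current
maybe_wait_bpf_programs(); the flag name is the hypothetical one from
this thread and the exact check is only meant to illustrate the shape
of the change:

static void maybe_wait_bpf_programs(struct bpf_map *map)
{
	/* Hypothetical opt-out: the outer map was created with the
	 * no-wait flag, so skip the sync entirely.
	 */
	if (map->map_flags & BPF_F_MAP_IN_MAP_NO_WAIT)
		return;

	/* Wait for any running BPF programs to complete so that
	 * userspace, when we return to it, knows that all programs
	 * that could be running use the new map value.
	 */
	if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
	    map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
		synchronize_rcu();
}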

>
> >
> > I also realized the flag should technically apply to the *outer* map,
> > since that's the map that's actually being modified and synced on, not
> > the inner map. So I don't think "inner" should be part of the name in
> > retrospect. Perhaps BPF_F_MAP_OF_MAPS_NO_WAIT or
> > BPF_F_MAP_IN_MAP_NO_WAIT? I'm slightly leaning towards the latter
> > because the map of maps support code is also located in map_in_map.c,
> > so that matches nicely. They're both a bit long though. Either way,
> > the definition of the outer map when using this flag would become
> > something like:
> >
> > struct {
> >     __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
> >     __uint(max_entries, 4096);
> >     __type(key, u64);
> >     __type(value, u32);
> >     __uint(map_flags, BPF_F_MAP_IN_MAP_NO_WAIT);
> > } mapInMap SEC(".maps");
> >
> > > Also have you considered a batched update? There will be
> > > only one sync_rcu() for the whole batch.
> >
> > We have, yes, but in our case, these updates are a result of another
> > system generating events, and it is a bit hard to batch them together:
> > it would involve waiting for multiple events to arrive, instead of
> > processing events as they come in, which introduces an additional
> > layer of latency by itself.
> >
> > Regarding batched update, we've found that it is also very slow when
> > inserting a large number of elements. In one example where we inserted
> > ~1.2 million 16-byte elements, we found it took 400-500 milliseconds to
> > update the map via batched update. This is due to the overhead of
> > generic_map_update_batch() individually copying each element to be
> > inserted from user space via copy_from_user(); almost all time is
> > going to __check_object_size() called by copy_from_user(). I suspect
> > this one is hard to fix though, due to how the elements in a map are
> > laid out in memory; it would be hard to change such batched updates
> > into a single copy. But perhaps that's something for a different
> > thread (and we can easily work around it on our side).
>
> I think there is room to improve this:
>
> 1) As you mentioned, for hash-based or any other non-linear maps, it is
> indeed hard to optimize. However, for linear ones like array map, it is
> possible to copy the whole memory from user-space once.

It is possible, yes, but memory usage would be temporarily higher
than before: the kernel would need a temporary copy of the full
array, whereas right now only a single element's worth of temporary
space is required. For arrays of the size we're talking about (the
~1.2 million 16-byte elements above would already be roughly 19 MB
of values), that could be quite a lot of memory.
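
Just to make that trade-off concrete, a hand-wavy sketch of what such
a single-copy path might look like for array maps; the function name
and structure are made up for illustration, this is not how
generic_map_update_batch() works today:

/* Hypothetical single-copy batch update for an array map: one bulk
 * copy_from_user() for all values, at the cost of a temporary kernel
 * buffer holding the entire batch.
 */
static int array_map_update_batch_onecopy(struct bpf_map *map,
					   void __user *uvalues,
					   u32 count, u32 value_size)
{
	void *buf;
	u32 i;

	buf = kvmalloc_array(count, value_size, GFP_USER);
	if (!buf)
		return -ENOMEM;

	/* One copy_from_user() for the whole batch... */
	if (copy_from_user(buf, uvalues, (size_t)count * value_size)) {
		kvfree(buf);
		return -EFAULT;
	}

	/* ...but until kvfree() the batch exists twice in kernel
	 * memory: once in this buffer and once in the map itself.
	 */
	for (i = 0; i < count; i++)
		map->ops->map_update_elem(map, &i,
					  buf + (size_t)i * value_size,
					  BPF_ANY);

	kvfree(buf);
	return 0;
}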

>
> 2) There are actually two copies here, one is from user-space to a temporary
> kernel memory, the other is from this temporary memory to the actual map
> key/value memory. _I speculate_ it is possible to optimize them down to
> one copy, at least for simple cases.
>
> Just my two cents.
>
> Thanks!

This is something I looked at, but I don't see a way to do it without
significantly restructuring the code. As mentioned in the other
thread, this is not an active problem for us since we can work around
it quite easily on our end, so it's not something we'll be pursuing
further.




