On Sat, Feb 08, 2025 at 11:39:31AM +0100, Ritesh Oedayrajsingh Varma wrote:
> On Sat, Feb 8, 2025 at 4:58 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Wed, Feb 5, 2025 at 4:58 AM Ritesh Oedayrajsingh Varma
> > <ritesh@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Given this, while it's not possible to remove the wait entirely
> > > without breaking user space, I was wondering if it would be
> > > possible/acceptable to add a way to opt out of this behavior for
> > > programs like ours that don't care about this. One way to do so could
> > > be to add an additional flag to the BPF_MAP_CREATE flags, perhaps
> > > something like BPF_F_INNER_MAP_NO_SYNC.
> >
> > Sounds reasonable.
> > The flag name is a bit cryptic, BPF_F_UPDATE_INNER_MAP_NO_WAIT
> > is a bit more explicit, but I'm not sure whether it's better.
>
> I agree the name is a bit cryptic. A related question is whether the
> optimization to skip the RCU wait should only apply to the update, or
> also to delete. I think it would make sense for it to apply to all
> operations. What do you think?

Exposing kernel behavior to user space is not a good idea, since users
would have to understand kernel details to know how to use this flag
safely.

>
> I also realized the flag should technically apply to the *outer* map,
> since that's the map that's actually being modified and synced on, not
> the inner map. So I don't think "inner" should be part of the name in
> retrospect. Perhaps BPF_F_MAP_OF_MAPS_NO_WAIT or
> BPF_F_MAP_IN_MAP_NO_WAIT? I'm slightly leaning towards the latter
> because the map of maps support code is also located in map_in_map.c,
> so that matches nicely. They're both a bit long though. Either way,
> the definition of the outer map when using this flag would become
> something like:
>
> struct {
>         __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
>         __uint(max_entries, 4096);
>         __type(key, u64);
>         __type(value, u32);
>         __uint(map_flags, BPF_F_MAP_IN_MAP_NO_WAIT);
> } mapInMap SEC(".maps");
>
> > Also have you considered a batched update? There will be
> > only one sync_rcu() for the whole batch.
>
> We have, yes, but in our case these updates are the result of another
> system generating events, and it is a bit hard to batch them together:
> it would involve waiting for multiple events to arrive instead of
> processing events as they come in, which introduces an additional
> layer of latency by itself.
>
> Regarding batched update, we've found that it is also very slow when
> inserting a large number of elements. In one example where we inserted
> ~1.2 million 16-byte elements, we found it took 400-500 milliseconds to
> update the map via batched update. This is due to the overhead of
> generic_map_update_batch() individually copying each element to be
> inserted from user space via copy_from_user(); almost all of the time
> goes to __check_object_size() called by copy_from_user(). I suspect
> this one is hard to fix though, due to how the elements in a map are
> laid out in memory; it would be hard to change such batched updates
> into a single copy. But perhaps that's something for a different
> thread (and we can easily work around it on our side).

I think there is room to improve this:

1) As you mentioned, for hash-based or any other non-linear maps, it is
indeed hard to optimize. However, for linear ones like the array map, it
is possible to copy the whole memory from user space in one go.
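For illustration, a rough sketch of the idea, assuming the array map's
contiguous value layout. This is not actual kernel code: the helper name
is made up, locking and most error handling are omitted, and it assumes
the user buffer uses the same 8-byte-aligned element stride as the
kernel side:

#include <linux/bpf.h>
#include <linux/uaccess.h>

/* Hypothetical helper, for illustration only: an array map keeps its
 * values in one contiguous allocation (array->value), so a batch of
 * consecutive indices can be filled with a single copy_from_user()
 * instead of one call (and one __check_object_size() pass) per element.
 */
static int array_map_update_batch_contig(struct bpf_array *array,
                                         const void __user *uvalues,
                                         u32 first_idx, u32 count)
{
        u32 elem_size = array->elem_size;   /* round_up(value_size, 8) */

        if (first_idx >= array->map.max_entries ||
            count > array->map.max_entries - first_idx)
                return -E2BIG;

        /* Assumes user space lays out its buffer with the same
         * elem_size stride; otherwise a per-element copy (or a bounce
         * buffer) is still needed.
         */
        if (copy_from_user(array->value + (u64)first_idx * elem_size,
                           uvalues, (u64)count * elem_size))
                return -EFAULT;

        return 0;
}

If I read the layout right, for the 16-byte elements mentioned above the
strides already match, since 16 is a multiple of 8.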
2) There are actually two copies here: one from user space into a
temporary kernel buffer, and another from that temporary buffer into
the actual map key/value memory. I speculate it is possible to optimize
them down to one copy, at least for simple cases.

Just my two cents. Thanks!
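For reference, the batched update measured above is driven from user
space roughly as follows. This is only a minimal sketch using libbpf's
bpf_map_update_batch(); the map fd, the 16-byte value struct, and the
element count are placeholders, not the code from the report:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <bpf/bpf.h>

/* Placeholder for the ~16-byte elements mentioned above. */
struct value16 {
        __u64 a;
        __u64 b;
};

/* Push "count" key/value pairs into an already-created map with one
 * BPF_MAP_UPDATE_BATCH syscall; the kernel still copies each element
 * individually in generic_map_update_batch().
 */
static int update_all(int map_fd, const __u64 *keys,
                      const struct value16 *values, __u32 count)
{
        LIBBPF_OPTS(bpf_map_batch_opts, opts,
                .elem_flags = BPF_ANY,  /* create or update */
        );
        __u32 n = count;
        int err;

        err = bpf_map_update_batch(map_fd, keys, values, &n, &opts);
        if (err)
                fprintf(stderr, "batch update stopped after %u elements: %s\n",
                        n, strerror(errno));
        return err;
}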