On Mon, Nov 13, 2023 at 08:53:06AM +0800, Hou Tao wrote: > Hi, > > On 11/10/2023 12:58 PM, Paul E. McKenney wrote: > > On Fri, Nov 10, 2023 at 11:34:03AM +0800, Hou Tao wrote: > >> Hi Martin, > >> > >> On 11/10/2023 10:48 AM, Martin KaFai Lau wrote: > >>> On 11/9/23 5:45 PM, Paul E. McKenney wrote: > >>>>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer) > >>>>>>>>>> +{ > >>>>>>>>>> + struct bpf_inner_map_element *element = ptr; > >>>>>>>>>> + > >>>>>>>>>> + /* Do bpf_map_put() after a RCU grace period and a tasks > >>>>>>>>>> trace > >>>>>>>>>> + * RCU grace period, so it is certain that the bpf program > >>>>>>>>>> which is > >>>>>>>>>> + * manipulating the map now has exited when bpf_map_put() is > >>>>>>>>>> called. > >>>>>>>>>> + */ > >>>>>>>>>> + if (need_defer) > >>>>>>>>> "need_defer" should only happen from the syscall cmd? Instead of > >>>>>>>>> adding rcu_head to each element, how about > >>>>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here? > >>>>>>>> No. I have tried the method before, but it didn't work due to > >>>>>>>> dead-lock > >>>>>>>> (will mention that in commit message in v2). The reason is that bpf > >>>>>>>> syscall program may also do map update through sys_bpf helper. > >>>>>>>> Because > >>>>>>>> bpf syscall program is running with sleep-able context and has > >>>>>>>> rcu_read_lock_trace being held, so call > >>>>>>>> synchronize_rcu_mult(call_rcu, > >>>>>>>> call_rcu_tasks) will lead to dead-lock. > >>> Need to think of a less intrusive solution instead of adding rcu_head > >>> to each element and lookup also needs an extra de-referencing. > >> I see. > >>> May be the bpf_map_{update,delete}_elem(&outer_map, ....) should not > >>> be done by the syscall program? Which selftest does it? > >> Now bpf_map_update_elem is allowed for bpf_sys_bpf helper. If I > >> remembered correctly it was map_ptr. > >>> Can the inner_map learn that it has been deleted from an outer map > >>> that is used in a sleepable prog->aux->used_maps? The > >>> bpf_map_free_deferred() will then wait for a task_trace gp? > >> Considering an inner_map may be used by multiple outer_map, the > >> following solution will be simpler: if the inner map has been deleted > >> from an outer map once, its free must be delayed after one RCU GP and > >> one tasks trace RCU GP. But I will check whether it is possible to only > >> wait for one RCU GP instead of two. > > If you are freeing a large quantity of elements at a time, one approach > > is to use a single rcu_head structure for the group. (Or, in this case, > > maybe a pair of rcu_head structures, one for call_rcu() and the other > > for call_rcu_tasks_trace().) > > > > This requires that you be able to link the elements in the group > > together somehow, which requires some per-element storage, but only > > one word per element instead of two. > > > > There are other variations on this theme, depending on what constraints > > apply here. > > Thanks for your suggestions. Although there are batch update support for > inner map, but I think inner map is updated one-by-one at most case. And > the main concern here is the extra dereference due to memory allocation, > so I think adding extra flags to indicate bpf_mem_free_deferred() to > free the map differently may be appropriate. Whatever gets the job done is of course goodness, and yes, if the freeing cannot be batched, my suggestion won't help much. ;-) Thanx, Paul