On Fri, Nov 10, 2023 at 11:34:03AM +0800, Hou Tao wrote: > Hi Martin, > > On 11/10/2023 10:48 AM, Martin KaFai Lau wrote: > > On 11/9/23 5:45 PM, Paul E. McKenney wrote: > >>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer) > >>>>>>>> +{ > >>>>>>>> + struct bpf_inner_map_element *element = ptr; > >>>>>>>> + > >>>>>>>> + /* Do bpf_map_put() after a RCU grace period and a tasks > >>>>>>>> trace > >>>>>>>> + * RCU grace period, so it is certain that the bpf program > >>>>>>>> which is > >>>>>>>> + * manipulating the map now has exited when bpf_map_put() is > >>>>>>>> called. > >>>>>>>> + */ > >>>>>>>> + if (need_defer) > >>>>>>> "need_defer" should only happen from the syscall cmd? Instead of > >>>>>>> adding rcu_head to each element, how about > >>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here? > >>>>>> No. I have tried the method before, but it didn't work due to > >>>>>> dead-lock > >>>>>> (will mention that in commit message in v2). The reason is that bpf > >>>>>> syscall program may also do map update through sys_bpf helper. > >>>>>> Because > >>>>>> bpf syscall program is running with sleep-able context and has > >>>>>> rcu_read_lock_trace being held, so call > >>>>>> synchronize_rcu_mult(call_rcu, > >>>>>> call_rcu_tasks) will lead to dead-lock. > > > > Need to think of a less intrusive solution instead of adding rcu_head > > to each element and lookup also needs an extra de-referencing. > > I see. > > > > May be the bpf_map_{update,delete}_elem(&outer_map, ....) should not > > be done by the syscall program? Which selftest does it? > > Now bpf_map_update_elem is allowed for bpf_sys_bpf helper. If I > remembered correctly it was map_ptr. > > > > Can the inner_map learn that it has been deleted from an outer map > > that is used in a sleepable prog->aux->used_maps? The > > bpf_map_free_deferred() will then wait for a task_trace gp? > > Considering an inner_map may be used by multiple outer_map, the > following solution will be simpler: if the inner map has been deleted > from an outer map once, its free must be delayed after one RCU GP and > one tasks trace RCU GP. But I will check whether it is possible to only > wait for one RCU GP instead of two. If you are freeing a large quantity of elements at a time, one approach is to use a single rcu_head structure for the group. (Or, in this case, maybe a pair of rcu_head structures, one for call_rcu() and the other for call_rcu_tasks_trace().) This requires that you be able to link the elements in the group together somehow, which requires some per-element storage, but only one word per element instead of two. There are other variations on this theme, depending on what constraints apply here. Thanx, Paul