Hi, On 11/10/2023 12:58 PM, Paul E. McKenney wrote: > On Fri, Nov 10, 2023 at 11:34:03AM +0800, Hou Tao wrote: >> Hi Martin, >> >> On 11/10/2023 10:48 AM, Martin KaFai Lau wrote: >>> On 11/9/23 5:45 PM, Paul E. McKenney wrote: >>>>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer) >>>>>>>>>> +{ >>>>>>>>>> + struct bpf_inner_map_element *element = ptr; >>>>>>>>>> + >>>>>>>>>> + /* Do bpf_map_put() after a RCU grace period and a tasks >>>>>>>>>> trace >>>>>>>>>> + * RCU grace period, so it is certain that the bpf program >>>>>>>>>> which is >>>>>>>>>> + * manipulating the map now has exited when bpf_map_put() is >>>>>>>>>> called. >>>>>>>>>> + */ >>>>>>>>>> + if (need_defer) >>>>>>>>> "need_defer" should only happen from the syscall cmd? Instead of >>>>>>>>> adding rcu_head to each element, how about >>>>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here? >>>>>>>> No. I have tried the method before, but it didn't work due to >>>>>>>> dead-lock >>>>>>>> (will mention that in commit message in v2). The reason is that bpf >>>>>>>> syscall program may also do map update through sys_bpf helper. >>>>>>>> Because >>>>>>>> bpf syscall program is running with sleep-able context and has >>>>>>>> rcu_read_lock_trace being held, so call >>>>>>>> synchronize_rcu_mult(call_rcu, >>>>>>>> call_rcu_tasks) will lead to dead-lock. >>> Need to think of a less intrusive solution instead of adding rcu_head >>> to each element and lookup also needs an extra de-referencing. >> I see. >>> May be the bpf_map_{update,delete}_elem(&outer_map, ....) should not >>> be done by the syscall program? Which selftest does it? >> Now bpf_map_update_elem is allowed for bpf_sys_bpf helper. If I >> remembered correctly it was map_ptr. >>> Can the inner_map learn that it has been deleted from an outer map >>> that is used in a sleepable prog->aux->used_maps? The >>> bpf_map_free_deferred() will then wait for a task_trace gp? >> Considering an inner_map may be used by multiple outer_map, the >> following solution will be simpler: if the inner map has been deleted >> from an outer map once, its free must be delayed after one RCU GP and >> one tasks trace RCU GP. But I will check whether it is possible to only >> wait for one RCU GP instead of two. > If you are freeing a large quantity of elements at a time, one approach > is to use a single rcu_head structure for the group. (Or, in this case, > maybe a pair of rcu_head structures, one for call_rcu() and the other > for call_rcu_tasks_trace().) > > This requires that you be able to link the elements in the group > together somehow, which requires some per-element storage, but only > one word per element instead of two. > > There are other variations on this theme, depending on what constraints > apply here. Thanks for your suggestions. Although there are batch update support for inner map, but I think inner map is updated one-by-one at most case. And the main concern here is the extra dereference due to memory allocation, so I think adding extra flags to indicate bpf_mem_free_deferred() to free the map differently may be appropriate. Regards, Tao > Thanx, Paul