Hi Alexei, On 11/27/2023 9:49 AM, Alexei Starovoitov wrote: > On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: >> From: Hou Tao <houtao1@xxxxxxxxxx> >> >> When removing the inner map from the outer map, the inner map will be >> freed after one RCU grace period and one RCU tasks trace grace >> period, so it is certain that the bpf program, which may access the >> inner map, has exited before the inner map is freed. >> >> However there is unnecessary to wait for any RCU grace period, one RCU >> grace period or one RCU tasks trace grace period if the outer map is >> only accessed by userspace, sleepable program or non-sleepable program. >> So recording the sleepable attributes of the owned bpf programs when >> adding the outer map into env->used_maps, copying the recorded >> attributes to inner map atomically when removing inner map from the >> outer map and using the recorded attributes in the inner map to decide >> which, and how many, RCU grace periods are needed when freeing the >> inner map. >> >> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx> >> >> + if (deferred) { >> + /* used_in_rcu_gp may be updated concurrently by new bpf >> + * program, so add smp_mb() to guarantee the order between >> + * used_in_rcu_gp and lookup/deletion operation of inner map. >> + * If a new bpf program finds the inner map before it is >> + * removed from outer map, reading used_in_rcu_gp below will >> + * return the newly-set bit set by the new bpf program. >> + */ >> + smp_mb(); >> + atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp); > You resent the patches before I had time to reply to the previous thread... I didn't receive the reply from last Thursday, so I though there is no new suggestions from you and sent out v3 which incorporate suggestions from Martin. I will ping next time before sending out new version if there is still pending discussion about the posted patch-set. > >> I think the main reason is that there is four possible case for the free >> of inner map: >> (1) neither call synchronize_rcu() nor synchronize_rcu_tasks_trace() >> It is true when the outer map is only being accessed in user space. >> (2) only call synchronize_rcu() >> the outer map is only being accessed by non-sleepable bpf program >> (3) only call synchronize_rcu_tasks_trace >> the outer map is only being accessed by sleepable bpf program >> (4) call both synchronize_rcu() and synchronize_rcu_tasks_trace() >> >> Only using sleepable_refcnt can not express 4 possible cases and we also >> need to atomically copy the states from outer map to inner map, because >> one inner map may be used concurrently by multiple outer map, so atomic >> or mask are chosen. > We don't care about optimizing 1, since it's rare case. > We also don't care about optimizing 3, since sync_rcu time is negligible > when we need to wait for sync_rcu_tasks_trace and also > because rcu_trace_implies_rcu_gp() for foreseeable future. I see. > >> need to atomically > we do NOT have such need. Here the atomicity means that multiple updates to inner_map->free_in_rcu_gp should not overwrite each other and it has nothing to do with the memory barrier. And using spin_lock will serve the same purpose. > There is zero need to do this atomic games and barries "just want to > be cautious". The code should not have any code at all "to be > cautious". > Every barrier has to be a real reason behind it. > Please remove them. Will remove the memory barrier. I think we can add it later if it is needed. > The concurent access to refcnt and sleepable_refcnt can be serialized > with simple spin_lock. OK. > > I also don't like >> + BPF_MAP_RCU_GP = BIT(0), >> + BPF_MAP_RCU_TT_GP = BIT(1), > the names should not describe what action should be taken. > The names should describe what the bits do. Understood. > I still think sleepable_refcnt and refcnt is more than enough to > optimize patch 3. Before posting v4, do the following code-snippets match your suggestions ? resolve_pseudo_ldimm64() if (env->prog->aux->sleepable) /* The max number of program is INT_MAX - 1, so atomic_t will be enough */ atomic_inc(&map->sleepable_refcnt); bpf_map_fd_put_ptr() if (deferred) { if (atomic_read(&map->sleepable_refcnt)) WRITE_ONCE(map->free_after_tt_rcu_gp, true); else WRITE_ONCE(map->free_after_rcu_gp, true); } bpf_map_put() if (READ_ONCE(map->free_after_tt_rcu_gp)) call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp); else if (READ_ONCE(map->free_after_rcu_gp)) call_rcu(&map->rcu, bpf_map_free_rcu_gp); else bpf_map_free_in_work(map); And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of synchronize_rcu()/synchronize_rcu_mult() ?