On Mon, Nov 20, 2023 at 10:45 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi Alexei,
>
> On 11/21/2023 1:19 PM, Alexei Starovoitov wrote:
> > On Mon, Nov 13, 2023 at 08:33:23PM +0800, Hou Tao wrote:
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index e2d2701ce2c45..5a7906f2b027e 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -694,12 +694,20 @@ static void bpf_map_free_deferred(struct work_struct *work)
> >>  {
> >>         struct bpf_map *map = container_of(work, struct bpf_map, work);
> >>         struct btf_record *rec = map->record;
> >> +       int acc_ctx;
> >>
> >>         security_bpf_map_free(map);
> >>         bpf_map_release_memcg(map);
> >>
> >> -       if (READ_ONCE(map->free_after_mult_rcu_gp))
> >> -               synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
> > The previous patch 3 is doing too much.
> > There is maybe_wait_bpf_programs() that will do synchronize_rcu()
> > when necessary.
> > Patch 3 could do synchronize_rcu_tasks_trace() only and it would solve the issue.
>
> I didn't follow how synchronize_rcu() in maybe_wait_bpf_programs() will
> help bpf_map_free_deferred() defer the free of the inner map. Could you
> please elaborate on that? In my understanding, bpf_map_update_value()
> invokes maybe_wait_bpf_programs() after the deletion of the old inner map
> from the outer map completes. If the ref-count held on the inner map by the
> outer map is the last one, bpf_map_free_deferred() will be called when the
> deletion completes, so maybe_wait_bpf_programs() will run concurrently
> with bpf_map_free_deferred().

The code was quite different back then. See commit 1ae80cf31938 ("bpf: wait
for running BPF programs when updating map-in-map"), which was added
specifically to address the case where a bpf prog is looking at the old inner
map. The commit log talks about a slightly different issue, but the end result
was the same: it prevented UAF, since the map free logic waited for a normal
RCU GP back then. See this comment:

void bpf_map_fd_put_ptr(void *ptr)
{
        /* ptr->ops->map_free() has to go through one
         * rcu grace period by itself.
         */
        bpf_map_put(ptr);
}

That code was added when map-in-map was introduced.

Also see this commit:
https://lore.kernel.org/bpf/20220218181801.2971275-1-eric.dumazet@xxxxxxxxx/
In cases of batched updates (when multiple inner maps are deleted from the
outer map) we should not call sync_rcu for every element being deleted. The
introduced latency can be bad.

I guess maybe_wait_bpf_programs() was too much brute force. It would call
sync_rcu() regardless of whether the refcnt dropped to zero; it mainly cares
about user space assumptions. Patches 3 and 4 wait for sync_rcu only when
refcnt == 0, so that should be ok.

Now that we don't have 'wait for rcu gp' in map_free, maybe_wait_bpf_programs()
is racy, as you pointed out. bpf_map_put() will drop the refcnt of the inner
map and might proceed into bpf_map_free_deferred()->*_map_free() while a bpf
prog is still observing a pointer to that map.

We need to adjust the comment in maybe_wait_bpf_programs() to say that it only
waits for non-sleepable bpf progs. Sleepable progs might still see the 'old'
inner map after the map_delete syscall returns to user space.
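Something along these lines, where only the comment changes and its wording is
just a suggestion (the map_type checks are roughly what the function already
does today):

static void maybe_wait_bpf_programs(struct bpf_map *map)
{
        /* Wait for any running non-sleepable BPF programs to complete, so
         * that userspace, when we return to it, knows that all non-sleepable
         * progs see the new inner map after an update/delete on the outer
         * map. Sleepable progs may still see the old inner map until a tasks
         * trace RCU grace period has elapsed.
         */
        if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
            map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
                synchronize_rcu();
}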
> >
> >> +       acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) & BPF_MAP_ACC_PROG_CTX_MASK;
> >> +       if (acc_ctx) {
> >> +               if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
> >> +                       synchronize_rcu();
> >> +               else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
> >> +                       synchronize_rcu_tasks_trace();
> >> +               else
> >> +                       synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
> > And this patch 4 goes too far.
> > Could you add a sleepable_refcnt, in addition to the existing refcnt, that is
> > incremented in the outer map when it's used by a sleepable prog, and when
> > sleepable_refcnt > 0 the caller of bpf_map_free_deferred sets
> > free_after_mult_rcu_gp (which should be renamed to free_after_tasks_rcu_gp)?
> > Patch 3 is simpler and patch 4 is simple too.
> > No need for atomic_or games.
> >
> > In addition I'd like to see an extra patch that demonstrates this UAF
> > when update/delete is done by a syscall bpf prog type.
> > The test case in patch 5 is doing update/delete from user space.
>
> You mean update/delete operations on the outer map, right? Because in
> patch 5 the inner map is updated from a bpf program instead of from user space.

Patch 5 does:
        bpf_map_update_elem(inner_map,...
but that's only to trigger the UAF. We need a test that does
        bpf_map_update_elem(outer_map,...
from a sleepable bpf prog, to make sure we do _not_ have code in the kernel
that synchronously waits for a RCU tasks trace GP at that time.

So, you're correct, maybe_wait_bpf_programs() is not sufficient any more, but
we cannot delete it, since it addresses user space assumptions about what bpf
progs see when the inner map is replaced.

I still don't like the atomic_or() logic and masks. Why not have a
sleepable_refcnt and

        if (sleepable_refcnt > 0)
                synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
        else
                synchronize_rcu();
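To make that concrete, a minimal sketch of bpf_map_free_deferred() with the
flag driven by sleepable_refcnt; field and flag names are illustrative, not
the final patch, and the tail of the function (btf_record handling, etc.) is
trimmed:

static void bpf_map_free_deferred(struct work_struct *work)
{
        struct bpf_map *map = container_of(work, struct bpf_map, work);

        security_bpf_map_free(map);
        bpf_map_release_memcg(map);

        /* The path that drops the last refcnt sets free_after_tasks_rcu_gp
         * when sleepable_refcnt > 0, i.e. the map may still be accessed by a
         * sleepable prog, so wait for both a normal and a tasks trace RCU GP.
         * Otherwise a normal RCU GP is enough for non-sleepable progs.
         */
        if (READ_ONCE(map->free_after_tasks_rcu_gp))
                synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
        else
                synchronize_rcu();

        map->ops->map_free(map);
}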