Hi,

On 11/22/2023 1:49 AM, Alexei Starovoitov wrote:
> On Mon, Nov 20, 2023 at 10:45 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi Alexei,
>>
>> On 11/21/2023 1:19 PM, Alexei Starovoitov wrote:
>>> On Mon, Nov 13, 2023 at 08:33:23PM +0800, Hou Tao wrote:
>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index e2d2701ce2c45..5a7906f2b027e 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -694,12 +694,20 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>>>  {
>>>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>>          struct btf_record *rec = map->record;
>>>> +        int acc_ctx;
>>>>
>>>>          security_bpf_map_free(map);
>>>>          bpf_map_release_memcg(map);
>>>>
>>>> -        if (READ_ONCE(map->free_after_mult_rcu_gp))
>>>> -                synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
>>> The previous patch 3 is doing too much.
>>> There is maybe_wait_bpf_programs() that will do synchronize_rcu()
>>> when necessary.
>>> The patch 3 could do synchronize_rcu_tasks_trace() only and it will solve the issue.
>> I didn't follow how synchronize_rcu() in maybe_wait_bpf_programs() will
>> help bpf_map_free_deferred() to defer the free of inner map. Could you
>> please elaborate on that ? In my understanding, bpf_map_update_value()
>> invokes maybe_wait_bpf_programs() after the deletion of old inner map
>> from outer map completes. If the ref-count of inner map in the outer map
>> is the last one, bpf_map_free_deferred() will be called when the
>> deletion completes, so maybe_wait_bpf_programs() will run concurrently
>> with bpf_map_free_deferred().
> The code was quite different back then.
> See commit 1ae80cf31938 ("bpf: wait for running BPF programs when
> updating map-in-map")
> that was added to specifically address the case where bpf prog is
> looking at the old inner map.
> The commit log talks about a little bit of a different issue,
> but the end result was the same. It prevented UAF since map free
> logic was waiting for normal RCU GP back then.
> See this comment:
> void bpf_map_fd_put_ptr(void *ptr)
> {
>         /* ptr->ops->map_free() has to go through one
>          * rcu grace period by itself.
>          */
>         bpf_map_put(ptr);
> }
>
> that code was added when map-in-map was introduced.

I see. Thanks for the explanation. If there is still a synchronize_rcu() in
.map_free(), using synchronize_rcu_tasks_trace() in bpf_map_free_deferred()
will be enough.
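In other words, roughly the following in bpf_map_free_deferred() (just an
untested sketch to confirm my understanding, with unrelated parts of the
function omitted; it keeps the existing free_after_mult_rcu_gp flag and
assumes the inner map's .map_free() still does synchronize_rcu()):

static void bpf_map_free_deferred(struct work_struct *work)
{
        struct bpf_map *map = container_of(work, struct bpf_map, work);

        security_bpf_map_free(map);
        bpf_map_release_memcg(map);

        /* Sleepable bpf programs may still hold a pointer to the inner
         * map here, so wait for a tasks-trace RCU grace period. The
         * normal RCU grace period is covered by the synchronize_rcu()
         * that the inner map's .map_free() already goes through.
         */
        if (READ_ONCE(map->free_after_mult_rcu_gp))
                synchronize_rcu_tasks_trace();

        map->ops->map_free(map);
}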
>
> Also see this commit:
> https://lore.kernel.org/bpf/20220218181801.2971275-1-eric.dumazet@xxxxxxxxx/
>
> In cases of batched updates (when multiple inner maps are deleted from
> outer map) we should not call sync_rcu for every element being
> deleted.
> The introduced latency can be bad.
>
> I guess maybe_wait_bpf_programs() was too much brute force.
> It would call sync_rcu() regardless whether refcnt dropped to zero.
> It mainly cares about user space assumptions.
> This patch 3 and 4 will wait for sync_rcu only when refcnt==0,
> so it should be ok.
>
> Now we don't have 'wait for rcu gp' in map_free, so
> maybe_wait_bpf_programs() is racy as you pointed out.
> bpf_map_put() will drop refcnt of inner map and it might proceed into
> bpf_map_free_deferred()->*_map_free() while bpf prog is still observing
> a pointer to that map.
>
> We need to adjust a comment in maybe_wait_bpf_programs() to say
> it will only wait for non-sleepable bpf progs.
> Sleepable might still see 'old' inner map after syscall map_delete
> returns to user space.

Could we treat update/deletion operations done through the bpf_syscall_bpf
helper as a special case and skip synchronize_rcu_tasks_trace() in
maybe_wait_bpf_programs() for them, but still do synchronize_rcu_tasks_trace()
for all other cases?

>
>
>>>> +        acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) & BPF_MAP_ACC_PROG_CTX_MASK;
>>>> +        if (acc_ctx) {
>>>> +                if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
>>>> +                        synchronize_rcu();
>>>> +                else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
>>>> +                        synchronize_rcu_tasks_trace();
>>>> +                else
>>>> +                        synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
>>> and this patch 4 goes to far.
>>> Could you add sleepable_refcnt in addition to existing refcnt that is incremented
>>> in outer map when it's used by sleepable prog and when sleepable_refcnt > 0
>>> the caller of bpf_map_free_deferred sets free_after_mult_rcu_gp.
>>> (which should be renamed to free_after_tasks_rcu_gp).
>>> Patch 3 is simpler and patch 4 is simple too.
>>> No need for atomic_or games.
>>>
>>> In addition I'd like to see an extra patch that demonstrates this UAF
>>> when update/delete is done by syscall bpf prog type.
>>> The test case in patch 5 is doing update/delete from user space.
>> Do you mean update/delete operations on outer map, right ? Because in
>> patch 5, inner map is updated from bpf program instead of user space.
> patch 5 does:
> bpf_map_update_elem(inner_map,...
>
> That's only to trigger UAF.
> We need a test that does bpf_map_update_elem(outer_map,...
> from sleepable bpf prog to make sure we do _not_ have a code in
> the kernel that synchronously waits for RCU tasks trace GP at that time.

I think we already have such a test case in prog_tests/map_ptr.c: it uses a
light skeleton to set up the map-in-map defined in progs/map_ptr_kern.c. I got
a dead-lock when trying to do synchronize_rcu_tasks_trace() in
bpf_map_fd_put_ptr(). But I could add an explicit one.

> So, you're correct, maybe_wait_bpf_programs() is not sufficient any more,
> but we cannot delete it, since it addresses user space assumptions
> on what bpf progs see when the inner map is replaced.
>
> I still don't like atomic_or() logic and masks.
> Why not to have sleepable_refcnt and
> if (sleepable_refcnt > 0)
>         synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
> else
>         synchronize_rcu();

I think the main reason is that there are four possible cases for the free of
an inner map:

(1) neither synchronize_rcu() nor synchronize_rcu_tasks_trace() is needed
    The outer map is only accessed from user space.
(2) only synchronize_rcu() is needed
    The outer map is only accessed by non-sleepable bpf programs.
(3) only synchronize_rcu_tasks_trace() is needed
    The outer map is only accessed by sleepable bpf programs.
(4) both synchronize_rcu() and synchronize_rcu_tasks_trace() are needed
    The outer map is accessed by both non-sleepable and sleepable bpf programs.

A single sleepable_refcnt cannot express these four cases, and we also need to
atomically copy the state from the outer map into the inner map, because one
inner map may be used by multiple outer maps concurrently. That is why
atomic_or() and the masks were chosen.
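To make that concrete, the intent is roughly the following. This is only a
simplified sketch, not the exact patch code: may_be_accessed_prog_ctx and the
BPF_MAP_ACC_* flags are the names from the patch, while the two helpers below
(bpf_map_copy_acc_prog_ctx() and bpf_map_wait_acc_prog_ctx()) are purely
illustrative names:

/* Illustrative helper: propagate the outer map's access context into the
 * inner map (e.g. when the inner map is installed into the outer map).
 * atomic_or() is used because one inner map may be shared by multiple
 * outer maps concurrently.
 */
static void bpf_map_copy_acc_prog_ctx(struct bpf_map *outer, struct bpf_map *inner)
{
        int ctx = atomic_read(&outer->may_be_accessed_prog_ctx) &
                  BPF_MAP_ACC_PROG_CTX_MASK;

        atomic_or(ctx, &inner->may_be_accessed_prog_ctx);
}

/* Illustrative helper: called from bpf_map_free_deferred() to wait only
 * for the grace periods that are actually needed (the four cases above).
 */
static void bpf_map_wait_acc_prog_ctx(struct bpf_map *map)
{
        int acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) &
                      BPF_MAP_ACC_PROG_CTX_MASK;

        if (!acc_ctx)
                return;                         /* case (1): user space only */

        if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
                synchronize_rcu();              /* case (2) */
        else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
                synchronize_rcu_tasks_trace();  /* case (3) */
        else
                synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace); /* case (4) */
}

Propagating the context with atomic_or() means an inner map accumulates the
access contexts of every outer map that has referenced it, so the free path
only needs to look at the inner map itself.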