Re: [PATCH bpf v2 4/5] bpf: Optimize the free of inner map

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Nov 20, 2023 at 10:45 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi Alexei,
>
> On 11/21/2023 1:19 PM, Alexei Starovoitov wrote:
> > On Mon, Nov 13, 2023 at 08:33:23PM +0800, Hou Tao wrote:
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index e2d2701ce2c45..5a7906f2b027e 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -694,12 +694,20 @@ static void bpf_map_free_deferred(struct work_struct *work)
> >>  {
> >>      struct bpf_map *map = container_of(work, struct bpf_map, work);
> >>      struct btf_record *rec = map->record;
> >> +    int acc_ctx;
> >>
> >>      security_bpf_map_free(map);
> >>      bpf_map_release_memcg(map);
> >>
> >> -    if (READ_ONCE(map->free_after_mult_rcu_gp))
> >> -            synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
> > The previous patch 3 is doing too much.
> > There is maybe_wait_bpf_programs() that will do synchronize_rcu()
> > when necessary.
> > The patch 3 could do synchronize_rcu_tasks_trace() only and it will solve the issue.
>
> I didn't follow how synchronize_rcu() in maybe_wait_bpf_programs() will
> help bpf_map_free_deferred() to defer the free of inner map. Could you
> please elaborate on that ? In my understanding, bpf_map_update_value()
> invokes maybe_wait_bpf_programs() after the deletion of old inner map
> from outer map completes. If the ref-count of inner map in the outer map
> is the last one, bpf_map_free_deferred() will be called when the
> deletion completes, so maybe_wait_bpf_programs() will run concurrently
> with bpf_map_free_deferred().

The code was quite different back then.
See commit 1ae80cf31938 ("bpf: wait for running BPF programs when
updating map-in-map")
that was added to specifically address the case where bpf prog is
looking at the old inner map.
The commit log talks about a little bit of a different issue,
but the end result was the same. It prevented UAF since map free
logic was waiting for normal RCU GP back then.
See this comment:
void bpf_map_fd_put_ptr(void *ptr)
{
        /* ptr->ops->map_free() has to go through one
         * rcu grace period by itself.
         */
        bpf_map_put(ptr);
}

that code was added when map-in-map was introduced.

Also see this commit:
https://lore.kernel.org/bpf/20220218181801.2971275-1-eric.dumazet@xxxxxxxxx/

In cases of batched updates (when multiple inner maps are deleted from
outer map) we should not call sync_rcu for every element being
deleted.
The introduced latency can be bad.

I guess maybe_wait_bpf_programs() was too much brute force.
It would call sync_rcu() regardless whether refcnt dropped to zero.
It mainly cares about user space assumptions.
This patch 3 and 4 will wait for sync_rcu only when refcnt==0,
so it should be ok.

Now we don't have 'wait for rcu gp' in map_free, so
maybe_wait_bpf_programs() is racy as you pointed out.
bpf_map_put() will drop refcnt of inner map and it might proceed into
bpf_map_free_deferred()->*_map_free() while bpf prog is still observing
a pointer to that map.

We need to adjust a comment in maybe_wait_bpf_programs() to say
it will only wait for non-sleepable bpf progs.
Sleepable might still see 'old' inner map after syscall map_delete
returns to user space.


> >
> >> +    acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) & BPF_MAP_ACC_PROG_CTX_MASK;
> >> +    if (acc_ctx) {
> >> +            if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
> >> +                    synchronize_rcu();
> >> +            else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
> >> +                    synchronize_rcu_tasks_trace();
> >> +            else
> >> +                    synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
> > and this patch 4 goes to far.
> > Could you add sleepable_refcnt in addition to existing refcnt that is incremented
> > in outer map when it's used by sleepable prog and when sleepable_refcnt > 0
> > the caller of bpf_map_free_deferred sets free_after_mult_rcu_gp.
> > (which should be renamed to free_after_tasks_rcu_gp).
> > Patch 3 is simpler and patch 4 is simple too.
> > No need for atomic_or games.
> >
> > In addition I'd like to see an extra patch that demonstrates this UAF
> > when update/delete is done by syscall bpf prog type.
> > The test case in patch 5 is doing update/delete from user space.
>
> Do you mean update/delete operations on outer map, right ? Because in
> patch 5, inner map is updated from bpf program instead of user space.

patch 5 does:
bpf_map_update_elem(inner_map,...

That's only to trigger UAF.
We need a test that does bpf_map_update_elem(outer_map,...
from sleepable bpf prog to make sure we do _not_ have a code in
the kernel that synchronously waits for RCU tasks trace GP at that time.

So, you're correct, maybe_wait_bpf_programs() is not sufficient any more,
but we cannot delete it, since it addresses user space assumptions
on what bpf progs see when the inner map is replaced.

I still don't like atomic_or() logic and masks.
Why not to have sleepable_refcnt and
if (sleepable_refcnt > 0)
  synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
else
  synchronize_rcu();





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux