Re: [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned

On Fri, Oct 20, 2023 at 12:31 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 10/20/2023 10:14 AM, Alexei Starovoitov wrote:
> > On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>
> >> When there are concurrent uref release and bpf timer init operations,
> >> the following sequence is possible and it leads to a memory leak:
> >>
> >> bpf program X
> >>
> >> bpf_timer_init()
> >>   lock timer->lock
> >>     read timer->timer as NULL
> >>     read map->usercnt != 0
> >>
> >>                 process Y
> >>
> >>                 close(map_fd)
> >>                   // put last uref
> >>                   bpf_map_put_uref()
> >>                     atomic_dec_and_test(map->usercnt)
> >>                       array_map_free_timers()
> >>                         bpf_timer_cancel_and_free()
> >>                           read timer->timer as NULL
> >>                           // just return, the timer allocated below is leaked
> >>
> >>     t = bpf_map_kmalloc_node()
> >>     timer->timer = t
> >>   unlock timer->lock
> >>
> >> Fix the problem by checking map->usercnt again after timer->timer is
> >> assigned, so that when there are concurrent uref release and bpf timer
> >> init, either bpf_timer_cancel_and_free() from the uref release reads a
> >> non-NULL timer, or the newly-added check of map->usercnt reads a zero
> >> usercnt.
> >>
> >> Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
> >> in bpf_timer_cancel_and_free() are not protected by a lock, add a
> >> memory barrier to guarantee the ordering between map->usercnt and
> >> timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the
> >> lockless read of timer->timer.
> >>
> >> Reported-by: Hsin-Wei Hung <hsinweih@xxxxxxx>
> >> Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@xxxxxxxxxxxxxx
> >> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
> >> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >> ---
> >>  kernel/bpf/helpers.c | 18 +++++++++++++++---
> >>  1 file changed, 15 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index 757b99c1e613f..a7d92c3ddc3dd 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> >> @@ -1156,7 +1156,7 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> >>            u64, flags)
> >>  {
> >>         clockid_t clockid = flags & (MAX_CLOCKS - 1);
> >> -       struct bpf_hrtimer *t;
> >> +       struct bpf_hrtimer *t, *to_free = NULL;
> >>         int ret = 0;
> >>
> >>         BUILD_BUG_ON(MAX_CLOCKS != 16);
> >> @@ -1197,9 +1197,21 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> >>         rcu_assign_pointer(t->callback_fn, NULL);
> >>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> >>         t->timer.function = bpf_timer_cb;
> >> -       timer->timer = t;
> >> +       WRITE_ONCE(timer->timer, t);
> >> +       /* Guarantee order between timer->timer and map->usercnt. So when
> >> +        * there are concurrent uref release and bpf timer init, either
> >> +        * bpf_timer_cancel_and_free() called by uref release reads a non-NULL
> >> +        * timer or atomic64_read() below reads a zero usercnt.
> >> +        */
> >> +       smp_mb();
> >> +       if (!atomic64_read(&map->usercnt)) {
> >> +               WRITE_ONCE(timer->timer, NULL);
> >> +               to_free = t;
> > just kfree(t); here.
>
> Will do. It is a slow path, so I think doing kfree() under spin-lock is
> acceptable.
> >
> >> +               ret = -EPERM;
> >> +       }
> > This will add a second atomic64_read(&map->usercnt) in the same function.
> > Let's remove the first one ?
>
> I prefer to still keep it, because it can detect the release of the map
> uref early, and the handling of that release is simpler than going
> through the second atomic64_read(). Do you have a strong preference?

I bet somebody will send a patch to remove the first one as redundant.
So let's do it now.
The only reason we do a repeated early check is to avoid taking a lock.
Here doing an extra early check to avoid kmalloc is overkill.
That check is highly unlikely to hit, while for locks it's a likely one.
Hence the extra check is justified for locks, but not here.
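
Roughly, with both suggestions folded in (the early usercnt check dropped
and kfree() done under the lock), the locked section of bpf_timer_init()
would look something like the sketch below. This is untested and not a
replacement patch; the setup of 't' is elided and the names are the ones
already used in kernel/bpf/helpers.c. Only the ordering is the point:

	__bpf_spin_lock_irqsave(&timer->lock);
	t = timer->timer;
	if (t) {
		ret = -EBUSY;
		goto out;
	}
	/* no early atomic64_read(&map->usercnt) check any more */
	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, map->numa_node);
	if (!t) {
		ret = -ENOMEM;
		goto out;
	}
	/* ... set up t as in the current code ... */
	t->timer.function = bpf_timer_cb;
	WRITE_ONCE(timer->timer, t);
	/* Order timer->timer against map->usercnt, so that a concurrent
	 * bpf_timer_cancel_and_free() either sees a non-NULL timer->timer
	 * or the read below sees a zero usercnt.
	 */
	smp_mb();
	if (!atomic64_read(&map->usercnt)) {
		WRITE_ONCE(timer->timer, NULL);
		kfree(t);	/* slow path, fine under the spinlock */
		ret = -EPERM;
	}
out:
	__bpf_spin_unlock_irqrestore(&timer->lock);
	return ret;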

Reading your other email, it looks like this patchset is incomplete anyway?




