On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 8/22/2022 11:21 AM, Hao Luo wrote:
> > Hi, Hou Tao
> >
> > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> >>> Hi Hou Tao,
> >>>
> >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>
> >>>> Per-cpu htab->map_locked is used to prohibit concurrent accesses
> >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> >>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
> >>>> is also preemptible in the !PREEMPT_RT case, so now map_locked also
> >>>> unexpectedly disallows concurrent updates from normal contexts
> >>>> (e.g. userspace processes), as shown below:
> >>>>
> >>>> process A                      process B
> >>>>
> >>>> htab_map_update_elem()
> >>>>   htab_lock_bucket()
> >>>>     migrate_disable()
> >>>>     /* return 1 */
> >>>>     __this_cpu_inc_return()
> >>>>     /* preempted by B */
> >>>>
> >>>>                                htab_map_update_elem()
> >>>>                                  /* the same bucket as A */
> >>>>                                  htab_lock_bucket()
> >>>>                                    migrate_disable()
> >>>>                                    /* return 2, so lock fails */
> >>>>                                    __this_cpu_inc_return()
> >>>>                                    return -EBUSY
> >>>>
> >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> >>>> only checking the value of map_locked for NMI context. But it would
> >>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
> >>>> re-entered through a non-tracing program (e.g. a fentry program).
> >>>>
> >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> >>>> the bucket lock is a sleepable spin-lock which doesn't work under
> >>>> preempt_disable(), so keep using migrate_disable() for the spin-lock case.
> >>>>
> >>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >>>> ---
> >>> IIUC, this patch enlarges the scope of preemption disabling to cover the
> >>> increment of map_locked. But I don't think the change is meaningful.
> >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> >> RT"), preemption was disabled before increasing map_locked in the
> >> !PREEMPT_RT case, so I don't think that the change is meaningless.
> >>> This patch only affects the case when the raw lock is used. In the case of
> >>> the raw lock, irqs are disabled for the b->raw_lock protected critical
> >>> section. A raw spin lock itself doesn't block in either RT or non-RT. So, my
> >>> understanding of this patch is that it just makes sure preemption
> >>> doesn't happen on the exact __this_cpu_inc_return. But the window is
> >>> so small that it should be really unlikely to happen.
> >> No, it can be easily reproduced by running multiple htab update processes
> >> on the same CPU. Will add a selftest to demonstrate that.
> > Can you clarify what you want to demonstrate?
> First, please enable CONFIG_PREEMPT for the running kernel and then run the
> following test program as shown below.
>

Ah, a fully preemptive kernel. It's worth mentioning that in the commit
message. Then it seems promoting migrate_disable to preempt_disable
may be the best way to solve the problem you described.

> # sudo taskset -c 2 ./update.bin
> thread nr 2
> wait for error
> update error -16
> all threads exit
>
> If there is no "update error -16", you can try to create more map update
> threads.
> For example running 16 update threads:
>
> # sudo taskset -c 2 ./update.bin 16
> thread nr 16
> wait for error
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> all threads exit
>
> The following is the source code for update.bin:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdbool.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <errno.h>
> #include <pthread.h>
>
> #include "bpf.h"
> #include "libbpf.h"
>
> static bool stop;
> static int fd;
>
> static void *update_fn(void *arg)
> {
>         while (!stop) {
>                 unsigned int key = 0, value = 1;
>                 int err;
>
>                 err = bpf_map_update_elem(fd, &key, &value, 0);
>                 if (err) {
>                         printf("update error %d\n", err);
>                         stop = true;
>                         break;
>                 }
>         }
>
>         return NULL;
> }
>
> int main(int argc, char **argv)
> {
>         LIBBPF_OPTS(bpf_map_create_opts, opts);
>         unsigned int i, nr;
>         pthread_t *tids;
>
>         nr = 2;
>         if (argc > 1)
>                 nr = atoi(argv[1]);
>         printf("thread nr %u\n", nr);
>
>         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
>         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
>         if (fd < 0) {
>                 printf("create map error %d\n", fd);
>                 return 1;
>         }
>
>         tids = malloc(nr * sizeof(*tids));
>         for (i = 0; i < nr; i++)
>                 pthread_create(&tids[i], NULL, update_fn, NULL);
>
>         printf("wait for error\n");
>         for (i = 0; i < nr; i++)
>                 pthread_join(tids[i], NULL);
>
>         printf("all threads exit\n");
>
>         free(tids);
>         close(fd);
>
>         return 0;
> }
>
> >
> > Here is my theory, but please correct me if I'm wrong, I haven't
> > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > migrate_disable. That is because very soon after migrate_disable, we
> > enter the critical section of b->raw_lock with irq disabled. In RT,
> > preemptions can happen on acquiring b->lock, that is certainly
> > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > by this patch.
> >
> >>>> kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> >>>> 1 file changed, 18 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >>>> index 6c530a5e560a..ad09da139589 100644
> >>>> --- a/kernel/bpf/hashtab.c
> >>>> +++ b/kernel/bpf/hashtab.c
> >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >>>>                                    unsigned long *pflags)
> >>>>  {
> >>>>         unsigned long flags;
> >>>> +       bool use_raw_lock;
> >>>>
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>>
> >>>> -       migrate_disable();
> >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_disable();
> >>>> +       else
> >>>> +               migrate_disable();
> >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -               migrate_enable();
> >>>> +               if (use_raw_lock)
> >>>> +                       preempt_enable();
> >>>> +               else
> >>>> +                       migrate_enable();
> >>>>                 return -EBUSY;
> >>>>         }
> >>>>
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_lock_irqsave(&b->lock, flags);
> >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> >>>>                                        struct bucket *b, u32 hash,
> >>>>                                        unsigned long flags)
> >>>>  {
> >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> >>>> +
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -       migrate_enable();
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_enable();
> >>>> +       else
> >>>> +               migrate_enable();
> >>>>  }
> >>>>
> >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> >>>> --
> >>>> 2.29.2
> >>>>
> >>> .
> > .
>
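
To double-check my reading of the change, here is roughly how the lock-side
fast path would look with the patch applied. This is hand-resolved from the
diff above, so treat it as a sketch: the trailing lines beyond the hunk
context (*pflags = flags; return 0;) are assumed from the current
kernel/bpf/hashtab.c rather than taken from this posting, and the comments
are mine.

static inline int htab_lock_bucket(const struct bpf_htab *htab,
                                   struct bucket *b, u32 hash,
                                   unsigned long *pflags)
{
        unsigned long flags;
        bool use_raw_lock;

        hash = hash & HASHTAB_MAP_LOCK_MASK;

        /* In the raw-lock (non-sleepable) case, keep preemption disabled so
         * that no other task on this CPU can run between the map_locked
         * increment and taking b->raw_lock; on CONFIG_PREEMPT kernels
         * migrate_disable() alone no longer prevents that.
         */
        use_raw_lock = htab_use_raw_lock(htab);
        if (use_raw_lock)
                preempt_disable();
        else
                migrate_disable();
        if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
                /* this CPU already holds this lock partition (e.g. NMI
                 * re-entry), so back off instead of deadlocking
                 */
                __this_cpu_dec(*(htab->map_locked[hash]));
                if (use_raw_lock)
                        preempt_enable();
                else
                        migrate_enable();
                return -EBUSY;
        }

        if (use_raw_lock)
                raw_spin_lock_irqsave(&b->raw_lock, flags);
        else
                spin_lock_irqsave(&b->lock, flags);
        *pflags = flags;

        return 0;
}

htab_unlock_bucket() mirrors this: it unlocks, decrements map_locked and then
calls preempt_enable() or migrate_enable() depending on use_raw_lock.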