Hi,

On 8/22/2022 11:21 AM, Hao Luo wrote:
> Hi, Hou Tao
>
> On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 8/22/2022 12:42 AM, Hao Luo wrote:
>>> Hi Hou Tao,
>>>
>>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>
>>>> Per-cpu htab->map_locked is used to prohibit concurrent accesses
>>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
>>>> is also preemptible in the !PREEMPT_RT case, so now map_locked also
>>>> disallows concurrent updates from normal contexts (e.g. userspace
>>>> processes) unexpectedly, as shown below:
>>>>
>>>> process A                      process B
>>>>
>>>> htab_map_update_elem()
>>>>   htab_lock_bucket()
>>>>     migrate_disable()
>>>>     /* return 1 */
>>>>     __this_cpu_inc_return()
>>>>     /* preempted by B */
>>>>
>>>>                                htab_map_update_elem()
>>>>                                /* the same bucket as A */
>>>>                                htab_lock_bucket()
>>>>                                  migrate_disable()
>>>>                                  /* return 2, so lock fails */
>>>>                                  __this_cpu_inc_return()
>>>>                                  return -EBUSY
>>>>
>>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>>>> only checking the value of map_locked in NMI context. But that would
>>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
>>>> re-entered through a non-tracing program (e.g. a fentry program).
>>>>
>>>> So fix it by using preempt_disable() instead of migrate_disable() when
>>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
>>>> the bucket lock is a sleepable spin-lock which must not be taken with
>>>> preemption disabled, so keep using migrate_disable() for the spin-lock case.
>>>>
>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>> ---
>>> IIUC, this patch enlarges the scope of preemption disable to cover inc
>>> map_locked. But I don't think the change is meaningful.
>> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
>> RT"), preemption was disabled before increasing map_locked in the !PREEMPT_RT
>> case, so I don't think the change is meaningless.
>>> This patch only affects the case when the raw lock is used. In the case of
>>> the raw lock, irqs are disabled for the b->raw_lock protected critical
>>> section. A raw spin lock itself doesn't block in either RT or non-RT. So my
>>> understanding of this patch is that it just makes sure preemption
>>> doesn't happen on the exact __this_cpu_inc_return. But the window is
>>> so small that it should be really unlikely to happen.
>> No, it can easily be reproduced by running multiple htab update processes on
>> the same CPU. Will add a selftest to demonstrate that.
> Can you clarify what you are demonstrating?
First enable CONFIG_PREEMPT for the running kernel, then run the test program
(source below) as shown:

# sudo taskset -c 2 ./update.bin
thread nr 2
wait for error
update error -16
all threads exit

If there is no "update error -16", you can try to create more map update
threads.
For example, running 16 update threads:

# sudo taskset -c 2 ./update.bin 16
thread nr 16
wait for error
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
all threads exit

The following is the source code for update.bin:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

#include "bpf.h"
#include "libbpf.h"

static bool stop;
static int fd;

/* Hammer the same key of the hash map until one update fails. */
static void *update_fn(void *arg)
{
	while (!stop) {
		unsigned int key = 0, value = 1;
		int err;

		err = bpf_map_update_elem(fd, &key, &value, 0);
		if (err) {
			printf("update error %d\n", err);
			stop = true;
			break;
		}
	}

	return NULL;
}

int main(int argc, char **argv)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts);
	unsigned int i, nr;
	pthread_t *tids;

	nr = 2;
	if (argc > 1)
		nr = atoi(argv[1]);
	printf("thread nr %u\n", nr);

	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
	/* 4-byte key/value hash map with a single entry, so all
	 * threads contend on the same bucket.
	 */
	fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
	if (fd < 0) {
		printf("create map error %d\n", fd);
		return 1;
	}

	tids = malloc(nr * sizeof(*tids));
	for (i = 0; i < nr; i++)
		pthread_create(&tids[i], NULL, update_fn, NULL);

	printf("wait for error\n");
	for (i = 0; i < nr; i++)
		pthread_join(tids[i], NULL);

	printf("all threads exit\n");

	free(tids);
	close(fd);

	return 0;
}
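For reference, one possible way to build the reproducer above, assuming the
source is saved as update.c and libbpf (with its bpf.h/libbpf.h headers) is
available somewhere; the paths and flags below are only placeholders, adjust
them to your setup:

gcc update.c -o update.bin -I <libbpf-header-dir> -L <libbpf-lib-dir> -lbpf -lpthread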
>
> Here is my theory, but please correct me if I'm wrong, I haven't
> tested it yet. In non-RT, I doubt preemptions are likely to happen
> after migrate_disable. That is because very soon after migrate_disable,
> we enter the critical section of b->raw_lock with irqs disabled. In RT,
> preemptions can happen on acquiring b->lock; that is certainly
> possible, but this is the !use_raw_lock path, which isn't side-stepped
> by this patch.
>
>>>> kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>>> 1 file changed, 18 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>> index 6c530a5e560a..ad09da139589 100644
>>>> --- a/kernel/bpf/hashtab.c
>>>> +++ b/kernel/bpf/hashtab.c
>>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>  				   unsigned long *pflags)
>>>>  {
>>>>  	unsigned long flags;
>>>> +	bool use_raw_lock;
>>>>
>>>>  	hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>
>>>> -	migrate_disable();
>>>> +	use_raw_lock = htab_use_raw_lock(htab);
>>>> +	if (use_raw_lock)
>>>> +		preempt_disable();
>>>> +	else
>>>> +		migrate_disable();
>>>>  	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>  		__this_cpu_dec(*(htab->map_locked[hash]));
>>>> -		migrate_enable();
>>>> +		if (use_raw_lock)
>>>> +			preempt_enable();
>>>> +		else
>>>> +			migrate_enable();
>>>>  		return -EBUSY;
>>>>  	}
>>>>
>>>> -	if (htab_use_raw_lock(htab))
>>>> +	if (use_raw_lock)
>>>>  		raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>  	else
>>>>  		spin_lock_irqsave(&b->lock, flags);
>>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>>>  				      struct bucket *b, u32 hash,
>>>>  				      unsigned long flags)
>>>>  {
>>>> +	bool use_raw_lock = htab_use_raw_lock(htab);
>>>> +
>>>>  	hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>> -	if (htab_use_raw_lock(htab))
>>>> +	if (use_raw_lock)
>>>>  		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>>>  	else
>>>>  		spin_unlock_irqrestore(&b->lock, flags);
>>>>  	__this_cpu_dec(*(htab->map_locked[hash]));
>>>> -	migrate_enable();
>>>> +	if (use_raw_lock)
>>>> +		preempt_enable();
>>>> +	else
>>>> +		migrate_enable();
>>>>  }
>>>>
>>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>>>> --
>>>> 2.29.2
>>>>
>>> .
> .