Hi,

On 8/22/2022 11:21 AM, Hao Luo wrote:
> Hi, Hou Tao
>
> On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 8/22/2022 12:42 AM, Hao Luo wrote:
>>> Hi Hou Tao,
>>>
>>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>
>>>> Per-cpu htab->map_locked is used to prohibit concurrent accesses
>>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
>>>> is also preemptible in the !PREEMPT_RT case, so now map_locked also
>>>> disallows concurrent updates from normal contexts (e.g. userspace
>>>> processes) unexpectedly, as shown below:
>>>>
>>>> process A                      process B
>>>>
>>>> htab_map_update_elem()
>>>>   htab_lock_bucket()
>>>>     migrate_disable()
>>>>     /* return 1 */
>>>>     __this_cpu_inc_return()
>>>>     /* preempted by B */
>>>>
>>>>                                htab_map_update_elem()
>>>>                                /* the same bucket as A */
>>>>                                htab_lock_bucket()
>>>>                                  migrate_disable()
>>>>                                  /* return 2, so lock fails */
>>>>                                  __this_cpu_inc_return()
>>>>                                  return -EBUSY
>>>>
>>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>>>> only checking the value of map_locked in NMI context. But that would
>>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
>>>> re-entered through a non-tracing program (e.g. a fentry program).
>>>>
>>>> So fix it by using preempt_disable() instead of migrate_disable() when
>>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
>>>> the bucket lock is a sleepable spin-lock which must not be taken with
>>>> preemption disabled, so keep using migrate_disable() for the spin-lock case.
>>>>
>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>> ---
>>> IIUC, this patch enlarges the scope of preemption disable to cover inc
>>> map_locked. But I don't think the change is meaningful.
>> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
>> RT"), preemption was disabled before increasing map_locked in the !PREEMPT_RT
>> case, so I don't think the change is meaningless.
>>> This patch only affects the case when the raw lock is used. In the case of
>>> the raw lock, irqs are disabled for the b->raw_lock protected critical
>>> section. A raw spin lock itself doesn't block in either RT or non-RT. So my
>>> understanding of this patch is that it just makes sure preemption
>>> doesn't happen on the exact __this_cpu_inc_return. But the window is
>>> so small that it should be really unlikely to happen.
>> No, it can easily be reproduced by running multiple htab update processes on
>> the same CPU. Will add a selftest to demonstrate that.
> Can you clarify what you are demonstrating?
First enable CONFIG_PREEMPT for the running kernel, then run the test program
(source below) as shown:

# sudo taskset -c 2 ./update.bin
thread nr 2
wait for error
update error -16
all threads exit

If there is no "update error -16", you can try to create more map update
threads.
For example, running 16 update threads:

# sudo taskset -c 2 ./update.bin 16
thread nr 16
wait for error
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
all threads exit

The following is the source code for update.bin:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

#include "bpf.h"
#include "libbpf.h"

static bool stop;
static int fd;

/* Hammer the same key of the hash map until one update fails. */
static void *update_fn(void *arg)
{
	while (!stop) {
		unsigned int key = 0, value = 1;
		int err;

		err = bpf_map_update_elem(fd, &key, &value, 0);
		if (err) {
			printf("update error %d\n", err);
			stop = true;
			break;
		}
	}

	return NULL;
}

int main(int argc, char **argv)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts);
	unsigned int i, nr;
	pthread_t *tids;

	nr = 2;
	if (argc > 1)
		nr = atoi(argv[1]);
	printf("thread nr %u\n", nr);

	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
	/* 4-byte key/value hash map with a single entry, so all
	 * threads contend on the same bucket.
	 */
	fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
	if (fd < 0) {
		printf("create map error %d\n", fd);
		return 1;
	}

	tids = malloc(nr * sizeof(*tids));
	for (i = 0; i < nr; i++)
		pthread_create(&tids[i], NULL, update_fn, NULL);

	printf("wait for error\n");
	for (i = 0; i < nr; i++)
		pthread_join(tids[i], NULL);

	printf("all threads exit\n");

	free(tids);
	close(fd);

	return 0;
}
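For reference, one possible way to build the reproducer above, assuming the
source is saved as update.c and libbpf (with its bpf.h/libbpf.h headers) is
available somewhere; the paths and flags below are only placeholders, adjust
them to your setup:

gcc update.c -o update.bin -I <libbpf-header-dir> -L <libbpf-lib-dir> -lbpf -lpthread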
>
> Here is my theory, but please correct me if I'm wrong, I haven't
> tested it yet. In non-RT, I doubt preemptions are likely to happen
> after migrate_disable. That is because very soon after migrate_disable,
> we enter the critical section of b->raw_lock with irqs disabled. In RT,
> preemptions can happen on acquiring b->lock; that is certainly
> possible, but this is the !use_raw_lock path, which isn't side-stepped
> by this patch.
>
>>>> kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>>> 1 file changed, 18 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>> index 6c530a5e560a..ad09da139589 100644
>>>> --- a/kernel/bpf/hashtab.c
>>>> +++ b/kernel/bpf/hashtab.c
>>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>  				   unsigned long *pflags)
>>>>  {
>>>>  	unsigned long flags;
>>>> +	bool use_raw_lock;
>>>>
>>>>  	hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>
>>>> -	migrate_disable();
>>>> +	use_raw_lock = htab_use_raw_lock(htab);
>>>> +	if (use_raw_lock)
>>>> +		preempt_disable();
>>>> +	else
>>>> +		migrate_disable();
>>>>  	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>  		__this_cpu_dec(*(htab->map_locked[hash]));
>>>> -		migrate_enable();
>>>> +		if (use_raw_lock)
>>>> +			preempt_enable();
>>>> +		else
>>>> +			migrate_enable();
>>>>  		return -EBUSY;
>>>>  	}
>>>>
>>>> -	if (htab_use_raw_lock(htab))
>>>> +	if (use_raw_lock)
>>>>  		raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>  	else
>>>>  		spin_lock_irqsave(&b->lock, flags);
>>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>>>  				      struct bucket *b, u32 hash,
>>>>  				      unsigned long flags)
>>>>  {
>>>> +	bool use_raw_lock = htab_use_raw_lock(htab);
>>>> +
>>>>  	hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>> -	if (htab_use_raw_lock(htab))
>>>> +	if (use_raw_lock)
>>>>  		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>>>  	else
>>>>  		spin_unlock_irqrestore(&b->lock, flags);
>>>>  	__this_cpu_dec(*(htab->map_locked[hash]));
>>>> -	migrate_enable();
>>>> +	if (use_raw_lock)
>>>> +		preempt_enable();
>>>> +	else
>>>> +		migrate_enable();
>>>>  }
>>>>
>>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>>>> --
>>>> 2.29.2
>>>>
>>> .
> .