On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 8/22/2022 11:21 AM, Hao Luo wrote:
> > Hi, Hou Tao
> >
> > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> >>> Hi Hou Tao,
> >>>
> >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>
> >>>> Per-cpu htab->map_locked is used to prohibit concurrent accesses
> >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> >>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
> >>>> is also preemptible in the !PREEMPT_RT case, so now map_locked also
> >>>> unexpectedly disallows concurrent updates from normal contexts
> >>>> (e.g. userspace processes), as shown below:
> >>>>
> >>>> process A                      process B
> >>>>
> >>>> htab_map_update_elem()
> >>>>   htab_lock_bucket()
> >>>>     migrate_disable()
> >>>>     /* return 1 */
> >>>>     __this_cpu_inc_return()
> >>>>     /* preempted by B */
> >>>>
> >>>>                                htab_map_update_elem()
> >>>>                                  /* the same bucket as A */
> >>>>                                  htab_lock_bucket()
> >>>>                                    migrate_disable()
> >>>>                                    /* return 2, so lock fails */
> >>>>                                    __this_cpu_inc_return()
> >>>>                                    return -EBUSY
> >>>>
> >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> >>>> only checking the value of map_locked for NMI context. But it would
> >>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
> >>>> re-entered through a non-tracing program (e.g. a fentry program).
> >>>>
> >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> >>>> the bucket lock is a sleepable spin-lock which doesn't work under
> >>>> preempt_disable(), so keep using migrate_disable() for the spin-lock case.
> >>>>
> >>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >>>> ---
> >>> IIUC, this patch enlarges the scope of preemption disabling to cover the
> >>> increment of map_locked. But I don't think the change is meaningful.
> >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> >> RT"), preemption was disabled before increasing map_locked in the
> >> !PREEMPT_RT case, so I don't think that the change is meaningless.
> >>> This patch only affects the case when the raw lock is used. In the case of
> >>> the raw lock, irqs are disabled for the b->raw_lock protected critical
> >>> section. A raw spin lock itself doesn't block in either RT or non-RT. So, my
> >>> understanding of this patch is that it just makes sure preemption
> >>> doesn't happen on the exact __this_cpu_inc_return. But the window is
> >>> so small that it should be really unlikely to happen.
> >> No, it can be easily reproduced by running multiple htab update processes
> >> on the same CPU. Will add a selftest to demonstrate that.
> > Can you clarify what you want to demonstrate?
> First, please enable CONFIG_PREEMPT for the running kernel and then run the
> following test program as shown below.
>

Ah, a fully preemptive kernel. It's worth mentioning that in the commit
message. Then it seems promoting migrate_disable to preempt_disable
may be the best way to solve the problem you described.

> # sudo taskset -c 2 ./update.bin
> thread nr 2
> wait for error
> update error -16
> all threads exit
>
> If there is no "update error -16", you can try to create more map update
> threads.
> For example running 16 update threads:
>
> # sudo taskset -c 2 ./update.bin 16
> thread nr 16
> wait for error
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> all threads exit
>
> The following is the source code for update.bin:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdbool.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <errno.h>
> #include <pthread.h>
>
> #include "bpf.h"
> #include "libbpf.h"
>
> static bool stop;
> static int fd;
>
> static void *update_fn(void *arg)
> {
>         while (!stop) {
>                 unsigned int key = 0, value = 1;
>                 int err;
>
>                 err = bpf_map_update_elem(fd, &key, &value, 0);
>                 if (err) {
>                         printf("update error %d\n", err);
>                         stop = true;
>                         break;
>                 }
>         }
>
>         return NULL;
> }
>
> int main(int argc, char **argv)
> {
>         LIBBPF_OPTS(bpf_map_create_opts, opts);
>         unsigned int i, nr;
>         pthread_t *tids;
>
>         nr = 2;
>         if (argc > 1)
>                 nr = atoi(argv[1]);
>         printf("thread nr %u\n", nr);
>
>         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
>         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
>         if (fd < 0) {
>                 printf("create map error %d\n", fd);
>                 return 1;
>         }
>
>         tids = malloc(nr * sizeof(*tids));
>         for (i = 0; i < nr; i++)
>                 pthread_create(&tids[i], NULL, update_fn, NULL);
>
>         printf("wait for error\n");
>         for (i = 0; i < nr; i++)
>                 pthread_join(tids[i], NULL);
>
>         printf("all threads exit\n");
>
>         free(tids);
>         close(fd);
>
>         return 0;
> }
>
> >
> > Here is my theory, but please correct me if I'm wrong, I haven't
> > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > migrate_disable. That is because very soon after migrate_disable, we
> > enter the critical section of b->raw_lock with irq disabled. In RT,
> > preemptions can happen on acquiring b->lock, that is certainly
> > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > by this patch.
> >
> >>>> kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> >>>> 1 file changed, 18 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >>>> index 6c530a5e560a..ad09da139589 100644
> >>>> --- a/kernel/bpf/hashtab.c
> >>>> +++ b/kernel/bpf/hashtab.c
> >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >>>>                                    unsigned long *pflags)
> >>>>  {
> >>>>         unsigned long flags;
> >>>> +       bool use_raw_lock;
> >>>>
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>>
> >>>> -       migrate_disable();
> >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_disable();
> >>>> +       else
> >>>> +               migrate_disable();
> >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -               migrate_enable();
> >>>> +               if (use_raw_lock)
> >>>> +                       preempt_enable();
> >>>> +               else
> >>>> +                       migrate_enable();
> >>>>                 return -EBUSY;
> >>>>         }
> >>>>
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_lock_irqsave(&b->lock, flags);
> >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> >>>>                                        struct bucket *b, u32 hash,
> >>>>                                        unsigned long flags)
> >>>>  {
> >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> >>>> +
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -       migrate_enable();
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_enable();
> >>>> +       else
> >>>> +               migrate_enable();
> >>>>  }
> >>>>
> >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> >>>> --
> >>>> 2.29.2
> >>>>
> >>> .
> > .
>
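
To double-check my reading of the change, here is roughly how the lock-side
fast path would look with the patch applied. This is hand-resolved from the
diff above, so treat it as a sketch: the trailing lines beyond the hunk
context (*pflags = flags; return 0;) are assumed from the current
kernel/bpf/hashtab.c rather than taken from this posting, and the comments
are mine.

static inline int htab_lock_bucket(const struct bpf_htab *htab,
                                   struct bucket *b, u32 hash,
                                   unsigned long *pflags)
{
        unsigned long flags;
        bool use_raw_lock;

        hash = hash & HASHTAB_MAP_LOCK_MASK;

        /* In the raw-lock (non-sleepable) case, keep preemption disabled so
         * that no other task on this CPU can run between the map_locked
         * increment and taking b->raw_lock; on CONFIG_PREEMPT kernels
         * migrate_disable() alone no longer prevents that.
         */
        use_raw_lock = htab_use_raw_lock(htab);
        if (use_raw_lock)
                preempt_disable();
        else
                migrate_disable();
        if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
                /* this CPU already holds this lock partition (e.g. NMI
                 * re-entry), so back off instead of deadlocking
                 */
                __this_cpu_dec(*(htab->map_locked[hash]));
                if (use_raw_lock)
                        preempt_enable();
                else
                        migrate_enable();
                return -EBUSY;
        }

        if (use_raw_lock)
                raw_spin_lock_irqsave(&b->raw_lock, flags);
        else
                spin_lock_irqsave(&b->lock, flags);
        *pflags = flags;

        return 0;
}

htab_unlock_bucket() mirrors this: it unlocks, decrements map_locked and then
calls preempt_enable() or migrate_enable() depending on use_raw_lock.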