On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
>
> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > Hi, Hou Tao
> > >
> > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >> Hi,
> > >>
> > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > >>> Hi Hou Tao,
> > >>>
> > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> > >>>>
> > >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> > >>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
> > >>>> is also preemptible under !PREEMPT_RT, so now map_locked also
> > >>>> disallows concurrent updates from normal contexts (e.g. userspace
> > >>>> processes) unexpectedly as shown below:
> > >>>>
> > >>>> process A                      process B
> > >>>>
> > >>>> htab_map_update_elem()
> > >>>>   htab_lock_bucket()
> > >>>>     migrate_disable()
> > >>>>     /* return 1 */
> > >>>>     __this_cpu_inc_return()
> > >>>>     /* preempted by B */
> > >>>>
> > >>>>                                htab_map_update_elem()
> > >>>>                                  /* the same bucket as A */
> > >>>>                                  htab_lock_bucket()
> > >>>>                                    migrate_disable()
> > >>>>                                    /* return 2, so lock fails */
> > >>>>                                    __this_cpu_inc_return()
> > >>>>                                    return -EBUSY
> > >>>>
> > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> > >>>> only checking the value of map_locked for the NMI context. But it would
> > >>>> re-introduce the deadlock on the bucket lock if htab_lock_bucket() is
> > >>>> re-entered through a non-tracing program (e.g. a fentry program).
> > >>>>
> > >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> > >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> > >>>> the bucket lock will be a sleepable spin-lock and that breaks
> > >>>> preempt_disable(), so still use migrate_disable() for the spin-lock case.
> > >>>>
> > >>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> > >>>> ---
> > >>> IIUC, this patch enlarges the scope of preemption disable to cover the
> > >>> increment of map_locked. But I don't think the change is meaningful.
> > >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> > >> RT"), preemption was disabled before increasing map_locked for the
> > >> !PREEMPT_RT case, so I don't think that the change is meaningless.
> > >>> This patch only affects the case when the raw lock is used. In the case of
> > >>> the raw lock, irqs are disabled for the b->raw_lock protected critical
> > >>> section. A raw spin lock itself doesn't block in either RT or non-RT. So my
> > >>> understanding of this patch is that it just makes sure preemption
> > >>> doesn't happen on the exact __this_cpu_inc_return. But the window is
> > >>> so small that it should be really unlikely to happen.
> > >> No, it can be easily reproduced by running multiple htab update processes on
> > >> the same CPU. Will add a selftest to demonstrate that.
> > > Can you clarify what you demonstrate?
> > First please enable CONFIG_PREEMPT for the running kernel and then run the
> > following test program as shown below.
> >
>
> Ah, fully preemptive kernel. It's worth mentioning that in the commit
> message. Then it seems promoting migrate_disable to preempt_disable
> may be the best way to solve the problem you described.
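To make the interleaving above easier to follow, here is a condensed sketch of
the pre-patch htab_lock_bucket() fast path, trimmed from the diff quoted
further down in this thread. The !raw_lock branch and the unlock side are
omitted and the trailing *pflags/return lines are reconstructed, so treat it
as an illustration rather than the literal kernel source:

static inline int htab_lock_bucket(const struct bpf_htab *htab,
                                   struct bucket *b, u32 hash,
                                   unsigned long *pflags)
{
        unsigned long flags;

        hash = hash & HASHTAB_MAP_LOCK_MASK;

        /* Since 74d862b682f51, migrate_disable() no longer disables
         * preemption on !PREEMPT_RT, so the task can still be preempted
         * on this CPU after this point. */
        migrate_disable();
        if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
                /* Process B lands here when it preempts process A on the
                 * same CPU while A's increment is still outstanding: the
                 * counter reads 2 and the update fails with -EBUSY. */
                __this_cpu_dec(*(htab->map_locked[hash]));
                migrate_enable();
                return -EBUSY;
        }

        /* A's preemption window: between the increment above and the
         * irq-disabling lock below (and, symmetrically, between the unlock
         * and the decrement in htab_unlock_bucket()). */
        raw_spin_lock_irqsave(&b->raw_lock, flags);
        *pflags = flags;
        return 0;
}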
>
> > # sudo taskset -c 2 ./update.bin
> > thread nr 2
> > wait for error
> > update error -16
> > all threads exit
> >
> > If there is no "update error -16", you can try to create more map update
> > threads. For example, running 16 update threads:
> >
> > # sudo taskset -c 2 ./update.bin 16
> > thread nr 16
> > wait for error
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > all threads exit
> >
> > The following is the source code for update.bin:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdbool.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <errno.h>
> > #include <pthread.h>
> >
> > #include "bpf.h"
> > #include "libbpf.h"
> >
> > static bool stop;
> > static int fd;
> >
> > static void *update_fn(void *arg)
> > {
> >         while (!stop) {
> >                 unsigned int key = 0, value = 1;
> >                 int err;
> >
> >                 err = bpf_map_update_elem(fd, &key, &value, 0);
> >                 if (err) {
> >                         printf("update error %d\n", err);
> >                         stop = true;
> >                         break;
> >                 }
> >         }
> >
> >         return NULL;
> > }
> >
> > int main(int argc, char **argv)
> > {
> >         LIBBPF_OPTS(bpf_map_create_opts, opts);
> >         unsigned int i, nr;
> >         pthread_t *tids;
> >
> >         nr = 2;
> >         if (argc > 1)
> >                 nr = atoi(argv[1]);
> >         printf("thread nr %u\n", nr);
> >
> >         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> >         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
> >         if (fd < 0) {
> >                 printf("create map error %d\n", fd);
> >                 return 1;
> >         }
> >
> >         tids = malloc(nr * sizeof(*tids));
> >         for (i = 0; i < nr; i++)
> >                 pthread_create(&tids[i], NULL, update_fn, NULL);
> >
> >         printf("wait for error\n");
> >         for (i = 0; i < nr; i++)
> >                 pthread_join(tids[i], NULL);
> >
> >         printf("all threads exit\n");
> >
> >         free(tids);
> >         close(fd);
> >
> >         return 0;
> > }
> >

Tao, thanks very much for the test. I played with it a bit and I can
confirm that map_update failures are seen under CONFIG_PREEMPT. The
failures are not present under CONFIG_PREEMPT_NONE or
CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
thinking of and they didn't work. It looks like Hou Tao's idea,
promoting migrate_disable to preempt_disable, is probably the best we
can do for the non-RT case. So,

Reviewed-by: Hao Luo <haoluo@xxxxxxxxxx>

But, I am not sure if we want to get rid of these preemption-caused
batch map update failures on preemptive kernels, or if there are better
solutions to address [0].

Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.

> > >
> > > Here is my theory, but please correct me if I'm wrong, I haven't
> > > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > > migrate_disable. That is because very soon after migrate_disable, we
> > > enter the critical section of b->raw_lock with irq disabled. In RT,
> > > preemptions can happen on acquiring b->lock, that is certainly
> > > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > > by this patch.
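(In case it helps anyone else reproducing this: the exact build command for
the update.bin source quoted above depends on where the libbpf headers and
library live on your system; the "bpf.h"/"libbpf.h" includes suggest building
against a libbpf checkout. As a rough, unverified recipe, something like

        gcc -O2 -I/path/to/libbpf/src update.c -lbpf -lpthread -o update.bin

should do, with the -I path and the library flags adjusted for your setup.)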
> > > > > >>>> kernel/bpf/hashtab.c | 23 ++++++++++++++++++----- > > >>>> 1 file changed, 18 insertions(+), 5 deletions(-) > > >>>> > > >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > > >>>> index 6c530a5e560a..ad09da139589 100644 > > >>>> --- a/kernel/bpf/hashtab.c > > >>>> +++ b/kernel/bpf/hashtab.c > > >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab, > > >>>> unsigned long *pflags) > > >>>> { > > >>>> unsigned long flags; > > >>>> + bool use_raw_lock; > > >>>> > > >>>> hash = hash & HASHTAB_MAP_LOCK_MASK; > > >>>> > > >>>> - migrate_disable(); > > >>>> + use_raw_lock = htab_use_raw_lock(htab); > > >>>> + if (use_raw_lock) > > >>>> + preempt_disable(); > > >>>> + else > > >>>> + migrate_disable(); > > >>>> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) { > > >>>> __this_cpu_dec(*(htab->map_locked[hash])); > > >>>> - migrate_enable(); > > >>>> + if (use_raw_lock) > > >>>> + preempt_enable(); > > >>>> + else > > >>>> + migrate_enable(); > > >>>> return -EBUSY; > > >>>> } > > >>>> > > >>>> - if (htab_use_raw_lock(htab)) > > >>>> + if (use_raw_lock) > > >>>> raw_spin_lock_irqsave(&b->raw_lock, flags); > > >>>> else > > >>>> spin_lock_irqsave(&b->lock, flags); > > >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, > > >>>> struct bucket *b, u32 hash, > > >>>> unsigned long flags) > > >>>> { > > >>>> + bool use_raw_lock = htab_use_raw_lock(htab); > > >>>> + > > >>>> hash = hash & HASHTAB_MAP_LOCK_MASK; > > >>>> - if (htab_use_raw_lock(htab)) > > >>>> + if (use_raw_lock) > > >>>> raw_spin_unlock_irqrestore(&b->raw_lock, flags); > > >>>> else > > >>>> spin_unlock_irqrestore(&b->lock, flags); > > >>>> __this_cpu_dec(*(htab->map_locked[hash])); > > >>>> - migrate_enable(); > > >>>> + if (use_raw_lock) > > >>>> + preempt_enable(); > > >>>> + else > > >>>> + migrate_enable(); > > >>>> } > > >>>> > > >>>> static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); > > >>>> -- > > >>>> 2.29.2 > > >>>> > > >>> . > > > . > >