Hi, On 8/22/2022 12:42 AM, Hao Luo wrote: > Hi Hou Tao, > > On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: >> From: Hou Tao <houtao1@xxxxxxxxxx> >> >> Per-cpu htab->map_locked is used to prohibit the concurrent accesses >> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched: >> Make migrate_disable/enable() independent of RT"), migrations_disable() >> is also preemptible under !PREEMPT_RT case, so now map_locked also >> disallows concurrent updates from normal contexts (e.g. userspace >> processes) unexpectedly as shown below: >> >> process A process B >> >> htab_map_update_elem() >> htab_lock_bucket() >> migrate_disable() >> /* return 1 */ >> __this_cpu_inc_return() >> /* preempted by B */ >> >> htab_map_update_elem() >> /* the same bucket as A */ >> htab_lock_bucket() >> migrate_disable() >> /* return 2, so lock fails */ >> __this_cpu_inc_return() >> return -EBUSY >> >> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and >> only checking the value of map_locked for nmi context. But it will >> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered >> through non-tracing program (e.g. fentry program). >> >> So fixing it by using disable_preempt() instead of migrate_disable() when >> increasing htab->map_locked. However when htab_use_raw_lock() is false, >> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), >> so still use migrate_disable() for spin-lock case. >> >> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx> >> --- > IIUC, this patch enlarges the scope of preemption disable to cover inc > map_locked. But I don't think the change is meaningful. Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of RT"), the preemption is disabled before increasing map_locked for !PREEMPT_RT case, so I don't think that the change is meaningless. > > This patch only affects the case when raw lock is used. In the case of > raw lock, irq is disabled for b->raw_lock protected critical section. > A raw spin lock itself doesn't block in both RT and non-RT. So, my > understanding about this patch is, it just makes sure preemption > doesn't happen on the exact __this_cpu_inc_return. But the window is > so small that it should be really unlikely to happen. No, it can be easily reproduced by running multiple htab update processes in the same CPU. Will add selftest to demonstrate that. > >> kernel/bpf/hashtab.c | 23 ++++++++++++++++++----- >> 1 file changed, 18 insertions(+), 5 deletions(-) >> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >> index 6c530a5e560a..ad09da139589 100644 >> --- a/kernel/bpf/hashtab.c >> +++ b/kernel/bpf/hashtab.c >> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab, >> unsigned long *pflags) >> { >> unsigned long flags; >> + bool use_raw_lock; >> >> hash = hash & HASHTAB_MAP_LOCK_MASK; >> >> - migrate_disable(); >> + use_raw_lock = htab_use_raw_lock(htab); >> + if (use_raw_lock) >> + preempt_disable(); >> + else >> + migrate_disable(); >> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) { >> __this_cpu_dec(*(htab->map_locked[hash])); >> - migrate_enable(); >> + if (use_raw_lock) >> + preempt_enable(); >> + else >> + migrate_enable(); >> return -EBUSY; >> } >> >> - if (htab_use_raw_lock(htab)) >> + if (use_raw_lock) >> raw_spin_lock_irqsave(&b->raw_lock, flags); >> else >> spin_lock_irqsave(&b->lock, flags); >> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, >> struct bucket *b, u32 hash, >> unsigned long flags) >> { >> + bool use_raw_lock = htab_use_raw_lock(htab); >> + >> hash = hash & HASHTAB_MAP_LOCK_MASK; >> - if (htab_use_raw_lock(htab)) >> + if (use_raw_lock) >> raw_spin_unlock_irqrestore(&b->raw_lock, flags); >> else >> spin_unlock_irqrestore(&b->lock, flags); >> __this_cpu_dec(*(htab->map_locked[hash])); >> - migrate_enable(); >> + if (use_raw_lock) >> + preempt_enable(); >> + else >> + migrate_enable(); >> } >> >> static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); >> -- >> 2.29.2 >> > .