On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@xxxxxxxxxx> wrote:
>
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > > Hi, Hou Tao
> > > >
> > > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > > >> Hi,
> > > >>
> > > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > > >>> Hi Hou Tao,
> > > >>>
> > > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > > >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> > > >>>>
> > > >>>> The per-cpu htab->map_locked is used to prohibit concurrent accesses
> > > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> > > >>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
> > > >>>> is also preemptible in the !PREEMPT_RT case, so now map_locked also
> > > >>>> unexpectedly disallows concurrent updates from normal contexts (e.g.
> > > >>>> userspace processes) as shown below:
> > > >>>>
> > > >>>> process A                      process B
> > > >>>>
> > > >>>> htab_map_update_elem()
> > > >>>>   htab_lock_bucket()
> > > >>>>     migrate_disable()
> > > >>>>     /* return 1 */
> > > >>>>     __this_cpu_inc_return()
> > > >>>>     /* preempted by B */
> > > >>>>
> > > >>>>                                htab_map_update_elem()
> > > >>>>                                /* the same bucket as A */
> > > >>>>                                htab_lock_bucket()
> > > >>>>                                  migrate_disable()
> > > >>>>                                  /* return 2, so lock fails */
> > > >>>>                                  __this_cpu_inc_return()
> > > >>>>                                  return -EBUSY
> > > >>>>
> > > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> > > >>>> checking the value of map_locked only in NMI context. But that would
> > > >>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
> > > >>>> re-entered through a non-tracing program (e.g. a fentry program).
> > > >>>>
> > > >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> > > >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> > > >>>> the bucket lock is a sleepable spin-lock, which is incompatible with
> > > >>>> preempt_disable(), so keep using migrate_disable() for the spin-lock case.
> > > >>>>
> > > >>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>
> Tao, thanks very much for the test. I played with it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case.

preempt_disable() is also faster than migrate_disable(), so patch 1
will not only fix this issue but also improve performance.

Patch 2 is too hacky, though. I think it's better to wait until my
bpf_mem_alloc patches land; the RT case won't be special anymore. We
will be able to remove the htab_use_raw_lock() helper and
unconditionally use raw_spin_lock. With bpf_mem_alloc there is no
inline memory allocation anymore.

So please address Hao's comments, add a test, and resubmit patches 1
and 3. Please also use [PATCH bpf-next] in the subject to help the BPF
CI and patchwork scripts.
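
For readers following along, a minimal sketch of the patch-1 idea as applied
to htab_lock_bucket(); the field and helper names (map_locked,
HASHTAB_MAP_LOCK_MASK, htab_use_raw_lock(), b->raw_lock/b->lock) follow
kernel/bpf/hashtab.c of that era, but this is an illustration of the
approach under discussion, not the exact diff:

/* Sketch (not the exact patch): on the raw-spinlock (!PREEMPT_RT) path,
 * preempt_disable() ensures the per-cpu map_locked counter cannot be
 * bumped twice on the same CPU by two preempting tasks, so a normal
 * process can no longer hit the spurious -EBUSY shown in the A/B
 * interleaving above. The sleepable spin-lock (PREEMPT_RT) path keeps
 * migrate_disable(), because preemption must stay enabled there.
 */
static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;
	bool use_raw_lock = htab_use_raw_lock(htab);

	hash = hash & HASHTAB_MAP_LOCK_MASK;

	if (use_raw_lock)
		preempt_disable();	/* was migrate_disable() */
	else
		migrate_disable();	/* sleepable lock: keep as-is */

	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		/* Genuine reentrancy (e.g. from NMI) on this CPU: back off. */
		__this_cpu_dec(*(htab->map_locked[hash]));
		if (use_raw_lock)
			preempt_enable();
		else
			migrate_enable();
		return -EBUSY;
	}

	if (use_raw_lock)
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	else
		spin_lock_irqsave(&b->lock, flags);
	*pflags = flags;

	return 0;
}

htab_unlock_bucket() would mirror this: drop the bucket lock, decrement
map_locked, then preempt_enable() or migrate_enable() to match.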