On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@xxxxxxxxxx> wrote:
>
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > > Hi, Hou Tao
> > > >
> > > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > > >> Hi,
> > > >>
> > > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > > >>> Hi Hou Tao,
> > > >>>
> > > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > > >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> > > >>>>
> > > >>>> The per-cpu htab->map_locked is used to prohibit concurrent accesses
> > > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> > > >>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
> > > >>>> is also preemptible in the !PREEMPT_RT case, so now map_locked also
> > > >>>> unexpectedly disallows concurrent updates from normal contexts (e.g.
> > > >>>> userspace processes) as shown below:
> > > >>>>
> > > >>>> process A                      process B
> > > >>>>
> > > >>>> htab_map_update_elem()
> > > >>>>   htab_lock_bucket()
> > > >>>>     migrate_disable()
> > > >>>>     /* return 1 */
> > > >>>>     __this_cpu_inc_return()
> > > >>>>     /* preempted by B */
> > > >>>>
> > > >>>>                                htab_map_update_elem()
> > > >>>>                                /* the same bucket as A */
> > > >>>>                                htab_lock_bucket()
> > > >>>>                                  migrate_disable()
> > > >>>>                                  /* return 2, so lock fails */
> > > >>>>                                  __this_cpu_inc_return()
> > > >>>>                                  return -EBUSY
> > > >>>>
> > > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> > > >>>> checking the value of map_locked only in NMI context. But that would
> > > >>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
> > > >>>> re-entered through a non-tracing program (e.g. a fentry program).
> > > >>>>
> > > >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> > > >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> > > >>>> the bucket lock is a sleepable spin-lock, which is incompatible with
> > > >>>> preempt_disable(), so keep using migrate_disable() for the spin-lock case.
> > > >>>>
> > > >>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>
> Tao, thanks very much for the test. I played with it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case.

preempt_disable() is also faster than migrate_disable(), so patch 1
will not only fix this issue but also improve performance.

Patch 2 is too hacky, though. I think it's better to wait until my
bpf_mem_alloc patches land; the RT case won't be special anymore. We
will be able to remove the htab_use_raw_lock() helper and
unconditionally use raw_spin_lock. With bpf_mem_alloc there is no
inline memory allocation anymore.

So please address Hao's comments, add a test, and resubmit patches 1
and 3. Please also use [PATCH bpf-next] in the subject to help the BPF
CI and patchwork scripts.
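
For readers following along, a minimal sketch of the patch-1 idea as applied
to htab_lock_bucket(); the field and helper names (map_locked,
HASHTAB_MAP_LOCK_MASK, htab_use_raw_lock(), b->raw_lock/b->lock) follow
kernel/bpf/hashtab.c of that era, but this is an illustration of the
approach under discussion, not the exact diff:

/* Sketch (not the exact patch): on the raw-spinlock (!PREEMPT_RT) path,
 * preempt_disable() ensures the per-cpu map_locked counter cannot be
 * bumped twice on the same CPU by two preempting tasks, so a normal
 * process can no longer hit the spurious -EBUSY shown in the A/B
 * interleaving above. The sleepable spin-lock (PREEMPT_RT) path keeps
 * migrate_disable(), because preemption must stay enabled there.
 */
static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;
	bool use_raw_lock = htab_use_raw_lock(htab);

	hash = hash & HASHTAB_MAP_LOCK_MASK;

	if (use_raw_lock)
		preempt_disable();	/* was migrate_disable() */
	else
		migrate_disable();	/* sleepable lock: keep as-is */

	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		/* Genuine reentrancy (e.g. from NMI) on this CPU: back off. */
		__this_cpu_dec(*(htab->map_locked[hash]));
		if (use_raw_lock)
			preempt_enable();
		else
			migrate_enable();
		return -EBUSY;
	}

	if (use_raw_lock)
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	else
		spin_lock_irqsave(&b->lock, flags);
	*pflags = flags;

	return 0;
}

htab_unlock_bucket() would mirror this: drop the bucket lock, decrement
map_locked, then preempt_enable() or migrate_enable() to match.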